Sample-based Learning Methods

This course is part of Reinforcement Learning Specialization

Instructors: Martha White, Adam White

37,949 already enrolled

5 modules

Gain insight into a topic and learn the fundamentals.

1,254 reviews

Intermediate level

Recommended experience

Flexible schedule

2 weeks at 10 hours a week

Learn at your own pace

90%

Most learners liked this course

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

5 assignments

Taught in English

Build your subject-matter expertise

This course is part of the Reinforcement Learning Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

There are 5 modules in this course

In this course, you will learn about several algorithms that can learn near-optimal policies through trial-and-error interaction with the environment, that is, by learning from the agent’s own experience. Learning from actual experience is striking because it requires no prior knowledge of the environment’s dynamics, yet can still attain optimal behavior. We will cover intuitively simple but powerful Monte Carlo methods, and temporal difference learning methods including Q-learning. We will wrap up this course by investigating how we can get the best of both worlds: algorithms that combine model-based planning (similar to dynamic programming) and temporal difference updates to radically accelerate learning.

By the end of this course you will be able to:
- Understand Temporal-Difference learning and Monte Carlo as two strategies for estimating value functions from sampled experience
- Understand the importance of exploration, when using sampled experience rather than dynamic programming sweeps within a model
- Understand the connections between Monte Carlo and Dynamic Programming and TD
- Implement and apply the TD algorithm, for estimating value functions
- Implement and apply Expected Sarsa and Q-learning (two TD methods for control)
- Understand the difference between on-policy and off-policy control
- Understand planning with simulated experience (as opposed to classic planning strategies)
- Implement a model-based approach to RL, called Dyna, which uses simulated experience
- Conduct an empirical study to see the improvements in sample efficiency when using Dyna

Welcome to the second course in the Reinforcement Learning Specialization: Sample-Based Learning Methods, brought to you by the University of Alberta, Onlea, and Coursera. In this pre-course module, you'll be introduced to your instructors, and get a flavour of what the course has in store for you. Make sure to introduce yourself to your classmates in the "Meet and Greet" section!

What's included

2 videos, 2 readings, 1 discussion prompt

This week you will learn how to estimate value functions and optimal policies, using only sampled experience from the environment. This module represents our first step toward incremental learning methods that learn from the agent’s own interaction with the world, rather than a model of the world. You will learn about on-policy and off-policy methods for prediction and control, using Monte Carlo methods, that is, methods that use sampled returns. You will also be reintroduced to the exploration problem, but more generally in RL, beyond bandits.
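As a rough illustration of Monte Carlo prediction from sampled returns, the Python sketch below estimates state values by averaging first-visit returns. The five-state random walk, the function names, and the parameter values are assumptions chosen for illustration and are not taken from the course materials.

```python
import random
from collections import defaultdict


def generate_episode(rng, n_states=5):
    """Hypothetical random walk: start in the middle, move left or right
    with equal probability. Stepping off the right end gives reward 1,
    stepping off the left end gives reward 0. Returns [(state, reward), ...]."""
    state = n_states // 2
    episode = []
    while True:
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == n_states else 0.0
        episode.append((state, reward))
        if next_state < 0 or next_state == n_states:
            return episode
        state = next_state


def first_visit_mc(num_episodes=5000, gamma=1.0, seed=0):
    """Estimate v(s) as the average return following the first visit to s."""
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        # Record the time step at which each state is first visited.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards so the return G can be accumulated incrementally.
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            g = gamma * g + r
            if first_visit[s] == t:
                returns_sum[s] += g
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in sorted(returns_sum)}


values = first_visit_mc()
for s, v in values.items():
    print(f"state {s}: estimated value {v:.3f}")
```

For this particular walk with no discounting, the true values are 1/6, 2/6, ..., 5/6, so the estimates can be checked against them. Note that an update only happens once an episode has finished, because the full return is needed.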

What's included

11 videos, 3 readings, 1 assignment, 1 programming assignment, 1 discussion prompt

11 videos Total 58 minutes

What is Monte Carlo? 7 minutes
Using Monte Carlo for Prediction 6 minutes
Using Monte Carlo for Action Values 3 minutes
Using Monte Carlo methods for generalized policy iteration 3 minutes
Solving the Blackjack Example 4 minutes
Epsilon-soft policies 5 minutes
Why does off-policy learning matter? 5 minutes
Importance Sampling 4 minutes
Off-Policy Monte Carlo Prediction 5 minutes
Emma Brunskill: Batch Reinforcement Learning 12 minutes
Week 1 Summary 4 minutes

3 readings Total 90 minutes

Module 1 Learning Objectives 10 minutes
Weekly Reading 40 minutes
Chapter Summary 40 minutes

1 assignment Total 30 minutes

Graded Quiz 30 minutes

1 programming assignment Total 5 minutes

Blackjack 5 minutes

1 discussion prompt Total 10 minutes

Comparing on-policy and off-policy learning 10 minutes

This week, you will learn about one of the most fundamental concepts in reinforcement learning: temporal difference (TD) learning. TD learning combines some of the features of both Monte Carlo and Dynamic Programming (DP) methods. TD methods are similar to Monte Carlo methods in that they can learn from the agent’s interaction with the world, and do not require knowledge of the model. TD methods are similar to DP methods in that they bootstrap, and thus can learn online, with no waiting until the end of an episode. You will see how TD can learn more efficiently than Monte Carlo, due to bootstrapping. For this module, we first focus on TD for prediction, and discuss TD for control in the next module. This week, you will implement TD to estimate the value function for a fixed policy, in a simulated domain.
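To make the bootstrapping idea concrete, here is a minimal sketch of one-step TD prediction, often written TD(0), for a fixed policy. The five-state random walk, the function name, and the step size are assumptions chosen for illustration; this is not the course's programming assignment.

```python
import random


def td0_random_walk(num_episodes=5000, alpha=0.01, gamma=1.0,
                    n_states=5, seed=0):
    """One-step TD prediction on a hypothetical random walk.

    The agent starts in the middle state and moves left or right with equal
    probability. Stepping off the right end gives reward 1, stepping off the
    left end gives reward 0, and both end the episode."""
    rng = random.Random(seed)
    v = [0.0] * n_states
    for _ in range(num_episodes):
        state = n_states // 2
        while True:
            next_state = state + rng.choice((-1, 1))
            terminal = next_state < 0 or next_state >= n_states
            reward = 1.0 if next_state >= n_states else 0.0
            # Bootstrap: use the current estimate of the next state's value
            # (0 for a terminal state) instead of waiting for the full return.
            next_value = 0.0 if terminal else v[next_state]
            td_error = reward + gamma * next_value - v[state]
            v[state] += alpha * td_error  # update happens online, every step
            if terminal:
                break
            state = next_state
    return v


td_values = td0_random_walk()
print([round(x, 3) for x in td_values])
```

Unlike a Monte Carlo estimate, each update here uses only the next reward and the current estimate of the next state, so learning proceeds during the episode. With a constant step size the estimates keep fluctuating slightly around the true values (1/6 through 5/6 in this walk).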

What's included

6 videos, 2 readings, 1 assignment, 1 programming assignment, 1 discussion prompt

6 videos Total 37 minutes

What is Temporal Difference (TD) learning? 5 minutes
Rich Sutton: The Importance of TD Learning 6 minutes
The advantages of temporal difference learning 5 minutes
Comparing TD and Monte Carlo 6 minutes
Andy Barto and Rich Sutton: More on the History of RL 12 minutes
Week 2 Summary 2 minutes

2 readings Total 50 minutes

Module 2 Learning Objectives 10 minutes
Weekly Reading 40 minutes

1 assignment Total 30 minutes

Practice Quiz 30 minutes

1 programming assignment Total 180 minutes

Policy Evaluation with Temporal Difference Learning 180 minutes

1 discussion prompt Total 10 minutes

Should we care about TD in the brain? 10 minutes

This week, you will learn about using temporal difference learning for control, as a generalized policy iteration strategy. You will see three different algorithms based on bootstrapping and Bellman equations for control: Sarsa, Q-learning and Expected Sarsa. You will see some of the differences between the methods for on-policy and off-policy control, and that Expected Sarsa is a unified algorithm for both. You will implement Expected Sarsa and Q-learning on Cliff World.
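The three control algorithms differ mainly in the target they bootstrap from. The sketch below computes each target for a single transition, assuming an epsilon-greedy policy; the helper names and the example numbers are hypothetical and chosen only for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy (ties share the greedy mass)."""
    n = len(q_values)
    best = max(q_values)
    greedy = [i for i, q in enumerate(q_values) if q == best]
    probs = [epsilon / n] * n
    for i in greedy:
        probs[i] += (1.0 - epsilon) / len(greedy)
    return probs


def sarsa_target(reward, next_q, next_action, gamma):
    """Sarsa: bootstrap from the action actually taken in the next state."""
    return reward + gamma * next_q[next_action]


def q_learning_target(reward, next_q, gamma):
    """Q-learning: bootstrap from the greedy (maximum) next action value."""
    return reward + gamma * max(next_q)


def expected_sarsa_target(reward, next_q, gamma, epsilon):
    """Expected Sarsa: bootstrap from the policy-weighted average value."""
    probs = epsilon_greedy_probs(next_q, epsilon)
    return reward + gamma * sum(p * q for p, q in zip(probs, next_q))


def td_update(q, target, alpha):
    """Shared update rule: move the estimate a fraction alpha toward the target."""
    return q + alpha * (target - q)


# Hypothetical transition: reward -1, four actions in the next state.
next_q = [1.0, 3.0, 2.0, 0.0]
reward, gamma = -1.0, 0.9

t_sarsa = sarsa_target(reward, next_q, next_action=2, gamma=gamma)
t_q = q_learning_target(reward, next_q, gamma)
t_exp = expected_sarsa_target(reward, next_q, gamma, epsilon=0.1)
t_exp_greedy = expected_sarsa_target(reward, next_q, gamma, epsilon=0.0)
print(t_sarsa, t_q, t_exp, t_exp_greedy)
```

When the expectation is taken under a greedy policy (epsilon set to 0 here), the Expected Sarsa target coincides with the Q-learning target, which is one way to see why Expected Sarsa can cover both the on-policy and the off-policy case.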

What's included

9 videos, 3 readings, 1 assignment, 1 programming assignment, 1 discussion prompt

9 videos Total 30 minutes

Sarsa: GPI with TD 4 minutes
Sarsa in the Windy Grid World 3 minutes
What is Q-learning? 3 minutes
Q-learning in the Windy Grid World 4 minutes
How is Q-learning off-policy? 5 minutes
Expected Sarsa 4 minutes
Expected Sarsa in the Cliff World 3 minutes
Generality of Expected Sarsa 2 minutes
Week 3 Summary 2 minutes

3 readings Total 90 minutes

Module 3 Learning Objectives 10 minutes
Weekly Reading 40 minutes
Chapter summary 40 minutes

1 assignment Total 30 minutes

Practice Quiz 30 minutes

1 programming assignment Total 180 minutes

Q-Learning and Expected SARSA 180 minutes

1 discussion prompt Total 10 minutes

How can we use off-policy for learning multiple goals? 10 minutes

Up until now, you might think that learning with and without a model are two distinct, and in some ways, competing strategies: planning with Dynamic Programming versus sample-based learning via TD methods. This week we unify these two strategies with the Dyna architecture. You will learn how to estimate the model from data and then use this model to generate hypothetical experience (a bit like dreaming) to dramatically improve sample efficiency compared to sample-based methods like Q-learning. In addition, you will learn how to design learning systems that are robust to inaccurate models.
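A minimal tabular sketch of this idea, in the spirit of Dyna-Q, is shown below: each real step performs a Q-learning update, stores the transition in a learned model, and then replays several remembered transitions as simulated experience. The corridor environment, the function names, and all parameter values are assumptions chosen for illustration. The sketch assumes a deterministic environment and does not include the exploration bonus used for changing environments.

```python
import random

LEFT, RIGHT = 0, 1


def step(state, action, n_states):
    """Hypothetical deterministic corridor: reward 1 on reaching the right end."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def dyna_q(n_planning, episodes=20, n_states=8, alpha=0.5, gamma=0.95,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(n_states) for a in (LEFT, RIGHT)}
    model = {}  # (state, action) -> (reward, next_state, done)
    steps_per_episode = []

    def greedy(s):
        # Break ties at random so the untrained agent does not get stuck.
        best = max(q[(s, LEFT)], q[(s, RIGHT)])
        return rng.choice([a for a in (LEFT, RIGHT) if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        # One-step Q-learning update, used for real and simulated experience.
        bootstrap = 0.0 if done else max(q[(s2, LEFT)], q[(s2, RIGHT)])
        q[(s, a)] += alpha * (r + gamma * bootstrap - q[(s, a)])

    for _ in range(episodes):
        s, done, steps = 0, False, 0
        while not done:
            explore = rng.random() < epsilon
            a = rng.choice((LEFT, RIGHT)) if explore else greedy(s)
            s2, r, done = step(s, a, n_states)
            update(s, a, r, s2, done)        # direct RL from real experience
            model[(s, a)] = (r, s2, done)    # model learning
            for _ in range(n_planning):      # planning with simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s, steps = s2, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode


q_direct, steps_direct = dyna_q(n_planning=0)   # plain Q-learning
q_dyna, steps_dyna = dyna_q(n_planning=10)      # 10 planning updates per step
print("real steps without planning:", sum(steps_direct))
print("real steps with planning:   ", sum(steps_dyna))
```

Comparing the total number of real environment steps for the two settings gives a small-scale version of a sample-efficiency study; the exact numbers depend on the seed and parameters, so several runs should be averaged before drawing conclusions.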

What's included

11 videos, 4 readings, 2 assignments, 1 programming assignment, 1 discussion prompt

11 videos Total 47 minutes

What is a Model? 5 minutes
Comparing Sample and Distribution Models 2 minutes
Random Tabular Q-planning 3 minutes
The Dyna Architecture 5 minutes
The Dyna Algorithm 5 minutes
Dyna & Q-learning in a Simple Maze 5 minutes
What if the model is inaccurate? 4 minutes
In-depth with changing environments 6 minutes
Drew Bagnell: self-driving, robotics, and Model Based RL 7 minutes
Week 4 Summary 2 minutes
Congratulations! 2 minutes

4 readings Total 130 minutes

Module 4 Learning Objectives 10 minutes
Weekly Reading 40 minutes
Chapter Summary 40 minutes
Text Book Part 1 Summary 40 minutes

2 assignments Total 90 minutes

Practice Assessment 45 minutes
Replacement Practice Assignment 45 minutes

1 programming assignment Total 180 minutes

Dyna-Q and Dyna-Q+ 180 minutes

1 discussion prompt Total 10 minutes

Compare Planning and Reasoning 10 minutes

Instructors

Instructor ratings

(223 ratings)

Martha White

University of Alberta

4 Courses 113,897 learners

Adam White

University of Alberta

4 Courses 113,897 learners

Offered by

University of Alberta

Alberta Machine Intelligence Institute

Learner reviews

5 stars: 82.31%
4 stars: 13.22%
3 stars: 2.78%
2 stars: 0.63%
1 star: 1.03%

Showing 3 of 1254

Reviewed on Feb 27, 2020

It was good in substance, but there are plenty of issues with the automated grader. You spend most time dealing with the latter, not on actual learning of the matter.

Reviewed on Mar 13, 2022

The videos are very clear and do a good job explaining the material from the textbook. The assignments are relevant and just right in terms of length and difficulty.

Reviewed on Feb 14, 2021

Excellent course that naturally extends the first specialization course. The application examples in programming are very good and I loved how RL gets closer and closer to how a living being thinks.