The term Monte Carlo is often used more broadly for any estimation method that relies on repeated random sampling. In RL, Monte Carlo methods allow us to estimate values directly from experience: from sequences of states, actions, and rewards. Learning from experience is striking because the agent can accurately estimate a value function without any prior knowledge of the environment's dynamics. By the end of this video, you will be able to understand how Monte Carlo methods can be used to estimate value functions from sampled interaction, and identify problems that can be solved using Monte Carlo methods.

Earlier we talked about how reinforcement learning is connected to dynamic programming. To use a pure dynamic programming approach, the agent needs to know the environment's transition probabilities. In some problems, we simply don't know these probabilities. Imagine a meteorologist trying to predict the weather. How the weather changes depends on a variety of environmental factors, and we simply don't know the exact probabilities of future weather patterns. Even when the transition probabilities are known, the calculations used in dynamic programming can be difficult for reasonably sized tasks. Think about predicting the outcome of 12 rolls of a die: the sheer number and complexity of the DP calculations makes them tedious and error-prone, both in terms of coding and numerical precision.

Here's where Monte Carlo methods can help. Let's try to find the average sum of 12 dice rolls. Monte Carlo methods don't exhaustively sweep through all possible outcomes like the ones shown here. In fact, you don't need to know the probability of any outcome to use Monte Carlo methods. Instead, Monte Carlo methods estimate values by averaging over a large number of random samples. Let's roll 12 dice a few times and see the result. In this case, the average is 41.57, fairly close to the true average of 42. In reinforcement learning, we want to learn a value function. Value functions represent expected returns.
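The dice example above can be sketched in a few lines of Python; the function names are illustrative, not from the lecture:

```python
import random

def roll_sum(num_dice=12):
    """Sum of num_dice fair six-sided dice."""
    return sum(random.randint(1, 6) for _ in range(num_dice))

def monte_carlo_mean(num_samples=10_000, num_dice=12):
    """Estimate the expected sum by averaging many random samples,
    without enumerating any outcomes or their probabilities."""
    return sum(roll_sum(num_dice) for _ in range(num_samples)) / num_samples

# The true expected sum is 12 * 3.5 = 42; by the law of large numbers,
# the estimate gets closer to 42 as num_samples grows.
estimate = monte_carlo_mean()
```

Note that nothing in this code knows the probability of any particular 12-roll outcome; it only needs the ability to sample.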
So a Monte Carlo method for learning a value function would first observe multiple returns from the same state. Then, it averages those observed returns to estimate the expected return from that state. As the number of samples increases, the average tends to get closer and closer to the expected return. The more returns the agent observes from a state, the more likely it is that the sample average is close to the state's value. These returns can only be observed at the end of an episode, so we will focus on Monte Carlo methods for episodic tasks.

Monte Carlo methods in reinforcement learning look a bit like bandit methods. In bandits, the value of an arm is estimated using the average payoff sampled by pulling that arm. Monte Carlo methods consider policies instead of arms. The value of a state S under a given policy is estimated using the average return sampled by following that policy from S to termination.

Now let's look at an algorithm for estimating the state value function of a policy. The Monte Carlo algorithm has to keep track of multiple observed returns, so let's introduce a list of returns, one for each state. Each list holds the returns observed from state S. Then we generate an episode by following our policy. For each state in the episode, we compute the return and store it in that state's list of returns. But how can we do that efficiently? Suppose the discount factor is 0.5, and imagine an episode ending at time step five where the reward sequence is 3, 4, 7, 1, and 2. Let's find each return, G_0 to G_5, starting by writing down the equation for each return. Notice that each return is included in the equation for the previous time step's return. That means we can avoid duplicating computations by starting at G_5 and working our way backwards. Let's do that now. The episode ends at t equal to five, so G_5 equals zero by definition. G_4 is equal to the reward at time five plus Gamma times the return at time five, which equals two.
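The backward recursion G_t = R_{t+1} + γ·G_{t+1} can be sketched directly; the reward sequence and γ = 0.5 are taken from the worked example:

```python
def returns_backward(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * G_{t+1} for every time step,
    working backwards from the terminal step where the return is zero."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns  # returns[t] is G_t

# Rewards R_1..R_5 = 3, 4, 7, 1, 2 with discount factor 0.5:
returns_backward([3, 4, 7, 1, 2], gamma=0.5)
# -> [7.0, 8.0, 8.0, 2.0, 2.0]   (G_0 through G_4; G_5 = 0)
```

Each return is computed with a single addition and multiplication because the return from the following time step is reused, rather than re-summing discounted rewards from scratch.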
G_3 is equal to the reward at time four plus Gamma times the return at time four. The reward at time four is one, and we just calculated that the return at time four is two. Solving the equation, we can see that the return at time three is also two. Continuing on, we can find that the return at time two is seven plus Gamma times two, which is eight. Finally, the first two returns, G_1 and G_0, are eight and seven. By working backwards from the terminal time step, we can efficiently compute the returns for each state encountered during the episode.

The first return computed is just the last reward, so we add the last reward to the list of returns for S_{T-1}. Then, we set the value of S_{T-1} to be the average of the returns observed from S_{T-1}. On the previous time step, T-2, we calculate the return as before, then add it to the list of returns for S_{T-2}. Finally, we update the value of S_{T-2}. If we continue this loop back to the start of the episode, we'll have updated the values for all the states visited in the current episode. Then we can repeat the whole process over many episodes and eventually learn a good estimate of the value function.

You might be wondering if we can avoid keeping all the sampled returns in a list. In fact, we can. We can incrementally update the sample-average estimate using the formula here. Recall that we discussed such incremental updates when estimating action values for bandits. We use the conceptually simpler sample average in this module to focus on the key ideas of Monte Carlo. After this module, we will switch to using the incremental update.

In this video, we talked about how Monte Carlo methods learn directly from interaction. Notably, they don't need a model of the environment's dynamics. We also showed a Monte Carlo algorithm for learning state values in episodic problems.
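The full prediction loop described above can be sketched as follows. This is a minimal sketch, not the lecture's exact pseudocode: it assumes episodes arrive as lists of (state, reward) pairs, where the reward is R_{t+1} received after leaving that state, and it uses every-visit sample averages:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma):
    """Every-visit Monte Carlo prediction with sample-average estimates.

    episodes: iterable of episodes generated by following the policy;
              each episode is a list of (state, reward) pairs.
    Returns a dict mapping each visited state to its estimated value.
    """
    returns = defaultdict(list)   # state -> list of observed returns
    V = {}                        # state -> current value estimate
    for episode in episodes:
        G = 0.0
        # Work backwards so each return reuses the one after it.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
            V[state] = sum(returns[state]) / len(returns[state])
    return V
```

The lists of returns can be dropped entirely by replacing the averaging line with the incremental update V(s) ← V(s) + (1/n)(G − V(s)), the same form used for bandit action values; the sample-average version is kept here because it matches the algorithm as presented.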