In some applications, when you take an action, the outcome is not always completely reliable. For example, if you command your Mars rover to go left, maybe there's a little bit of a rock slide, or the ground is really slippery, and so it slips and goes in the wrong direction. In practice, many robots don't always manage to do exactly what you tell them, because of wind blowing them off course, the wheels slipping, or something else. There's a generalization of the reinforcement learning framework we've talked about so far which models random, or stochastic, environments. In this optional video, we'll talk about how these reinforcement learning problems work.

Continuing with our simplified Mars rover example, let's say you take the action and command it to go left. Most of the time you'll succeed, but what if 10 percent of the time, or 0.1 of the time, it actually ends up accidentally slipping and going in the opposite direction? If you command it to go left, it has a 90 percent chance, or 0.9 chance, of correctly going in the left direction, but a 0.1 chance of actually heading to the right, so that it has a 90 percent chance of ending up in state three in this example and a 10 percent chance of ending up in state five. Conversely, if you were to command it to go right and take the action right, it has a 0.9 chance of ending up in state five and a 0.1 chance of ending up in state three. This would be an example of a stochastic environment.

Let's see what happens in this reinforcement learning problem. Say you use the policy shown here, where you go left in states 2, 3, and 4, and go right, or try to go right, in state 5. If you were to start in state 4 and follow this policy, then the actual sequence of states you visit may be random. For example, in state 4, you tell it to go left, and maybe you're lucky and it actually gets to state 3. Then you tell it to go left again, and maybe it actually gets there. You tell it to go left again, and it gets to state 1. If this is what happens, you end up with the sequence of rewards 0, 0, 0, 100.

But if you were to try this exact same policy a second time, maybe you're a little less lucky. The second time, you start in state 4 and tell it to go left, and say it succeeds, so you get a zero from state 4 and a zero from state 3. Then you tell it to go left again, but you get unlucky this time and the robot slips and ends up heading back to state 4 instead. You then keep telling it to go left, and left, and left, and eventually it gets to that reward of 100. In that case, the sequence of rewards you observe is 0, 0, 0, 0, 0, 100, because it went from state 4 to 3 to 4 to 3 to 2 and then to 1. It's even possible that, starting from state 4 and following the policy, you get unlucky on the very first step and end up going to state 5 because the robot slipped. Then from state 5 you command it to go right, it succeeds, and you end up in state 6. In this case, the sequence of rewards you see will be 0, 0, 40, because it went from state 4 to state 5 and then to state 6.

We had previously written out the return as this sum of discounted rewards. But when the reinforcement learning problem is stochastic, there isn't one sequence of rewards that you see for sure; instead, you see many different possible sequences of rewards. In a stochastic reinforcement learning problem, what we're interested in is not maximizing the return, because that's a random number. What we're interested in is maximizing the average value of the sum of discounted rewards.
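To make this concrete, here is a minimal sketch of rolling out this policy in the stochastic Mars rover environment. It assumes the layout from the earlier videos (states 1 through 6, reward 100 in state 1, reward 40 in state 6, zero elsewhere, with states 1 and 6 terminal), a misstep probability of 0.1, and a discount factor of 0.5, all carried over as assumptions from the previous examples.

```python
import random

# Assumed six-state Mars rover layout from the earlier videos.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}  # the policy from this example
MISSTEP_PROB = 0.1   # chance the rover slips and moves opposite to the command
GAMMA = 0.5          # discount factor (assumed, as in the earlier videos)

def run_episode(start_state):
    """Follow POLICY from start_state; return the reward sequence and the discounted return."""
    state, rewards = start_state, []
    while True:
        rewards.append(REWARDS[state])
        if state in TERMINAL:
            break
        intended = -1 if POLICY[state] == "left" else +1
        slipped = random.random() < MISSTEP_PROB
        state += -intended if slipped else intended
    discounted_return = sum(r * GAMMA**t for t, r in enumerate(rewards))
    return rewards, discounted_return

# Each run from state 4 can produce a different reward sequence and a different return.
for _ in range(3):
    print(run_episode(4))
```

Running this a few times produces exactly the kind of variation described above: sometimes the rewards are 0, 0, 0, 100, sometimes 0, 0, 40, and so on, so the return itself is a random number.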
By average value, I mean that if you were to take your policy and try it out a thousand times, or 100,000 times, or a million times, you would get lots of different reward sequences like these, and if you were to take the average over all of these different sequences of the sum of discounted rewards, then that's what we call the expected return. In statistics, the term expected is just another way of saying average. What this means is that we want to maximize what we expect to get, on average, in terms of the sum of discounted rewards. The mathematical notation for this is to write it with an E, which stands for the expected value, of R1 plus Gamma times R2 plus Gamma squared times R3, and so on. The job of the reinforcement learning algorithm is to choose a policy Pi to maximize the average, or expected, sum of discounted rewards. To summarize, when you have a stochastic reinforcement learning problem, or a stochastic Markov decision process, the goal is to choose a policy that tells us what action to take in state s so as to maximize the expected return.

The last way this changes what we've talked about is that it modifies the Bellman equation a little bit. Here's the Bellman equation exactly as we've written it down before. The difference now is that when you take the action a in state s, the next state s prime you get to is random. When you're in state 3 and you tell it to go left, the next state s prime could be state 2, or it could be state 4. Because s prime is now random, we also put an average operator, or expected value operator, here. We say that the total return from state s, taking action a once and then behaving optimally, is equal to the reward you get right away, also called the immediate reward, plus the discount factor Gamma times what you expect to get, on average, of the future returns.

If you want to sharpen your intuition about what happens with these stochastic reinforcement learning problems, you can go back to the optional lab that I showed you just now, where the parameter misstep_prob is the probability of your Mars rover going in the opposite direction from the one you commanded. If we set misstep_prob to 0.1 and re-execute the notebook, then these numbers up here are the optimal return if you were to take the best possible actions, following the optimal policy, while the robot steps in the wrong direction 10 percent of the time, and these are the Q values for this stochastic MDP. Notice that these values are now a little bit lower, because you can't control the robot as well as before. The Q values, as well as the optimal returns, have gone down a bit. In fact, if you were to increase the misstep probability so that, say, 40 percent of the time the robot doesn't go in the direction you command it, and only 60 percent of the time it goes where you told it to, then these values end up even lower, because your degree of control over the robot has decreased. I encourage you to play with the optional lab, change the value of the misstep probability, and see how that affects the optimal return, or the optimal expected return, as well as the Q values Q(s, a).
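Written out explicitly, the two objects described above are the expected return, which the reinforcement learning algorithm tries to maximize, and the Bellman equation with an expectation over the random next state s', using the same notation as in the earlier videos:

```latex
\text{Expected Return} = E\!\left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \cdots \right]

Q(s, a) = R(s) + \gamma \, E\!\left[ \max_{a'} Q(s', a') \right]
```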
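To see numerically how increasing the misstep probability lowers the optimal returns and the Q values, here is a small value iteration sketch of the six-state rover. This is not the optional lab's code, just a rough illustration assuming the same layout as above and a discount factor of 0.5.

```python
import numpy as np

# Assumed six-state Mars rover: reward 100 in state 1, 40 in state 6, 0 elsewhere;
# states 1 and 6 are terminal. Index 0 corresponds to state 1.
REWARDS = np.array([100, 0, 0, 0, 0, 40], dtype=float)
TERMINAL = {0, 5}
GAMMA = 0.5  # discount factor (assumed, as in the earlier videos)

def value_iteration(misstep_prob, gamma=GAMMA, iterations=100):
    """Compute Q(s, a) with repeated Bellman backups; action 0 = left, action 1 = right."""
    q = np.zeros((len(REWARDS), 2))
    for _ in range(iterations):
        v = q.max(axis=1)                # optimal return from each state
        new_q = np.zeros_like(q)
        for s in range(len(REWARDS)):
            if s in TERMINAL:
                new_q[s, :] = REWARDS[s]
                continue
            left, right = s - 1, s + 1
            # Commanded left: move left with prob 1 - misstep_prob, slip right otherwise.
            new_q[s, 0] = REWARDS[s] + gamma * ((1 - misstep_prob) * v[left] + misstep_prob * v[right])
            # Commanded right: the mirror image.
            new_q[s, 1] = REWARDS[s] + gamma * ((1 - misstep_prob) * v[right] + misstep_prob * v[left])
        q = new_q
    return q

# Optimal returns shrink as the rover becomes harder to control.
for p in (0.0, 0.1, 0.4):
    print(f"misstep_prob = {p}:", np.round(value_iteration(p).max(axis=1), 2))
```

As in the lab, the optimal returns (and the Q values they come from) drop as misstep_prob goes from 0 to 0.1 to 0.4, because your degree of control over the robot has decreased.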
Now, in everything we've done so far, we've been using a Markov decision process, this Mars rover, with just six states. For many practical applications, the number of states will be much larger. In the next video, we'll take the reinforcement learning, or Markov decision process, framework we've talked about so far and generalize it to a much richer, and maybe even more interesting, set of problems with much larger, and in particular continuous, state spaces. Let's take a look at that in the next video.