Let me summarize where we are. If you can compute the state-action value function Q(s,a), then it gives you a way to pick a good action from every state: just pick the action a that gives you the largest value of Q(s,a). The question is, how do you compute these values Q(s,a)? In reinforcement learning, there's a key equation called the Bellman equation that will help us compute the state-action value function. Let's take a look at what this equation is.

As a reminder, this is the definition of Q(s,a): it's the return if you start in state s, take the action a once, and then behave optimally after that. In order to describe the Bellman equation, I'm going to use the following notation. I'm going to use s to denote the current state. Next, I'm going to use R(s) to denote the reward of the current state. For our little MDP example, we have that R(1), the reward of state 1, is 100; the reward of state 2 is 0; and so on; the reward of state 6 is 40. I'm going to use the letter a to denote the current action, the action that you take in the state s. After you take the action a, you get to some new state. For example, if you're in state 4 and you take the action left, then you get to state 3. I'm going to use s' to denote the state you get to after taking the action a from the current state s. I'm also going to use a' to denote the action that you might take in state s', the new state you got to. The notation convention, by the way, is that s, a correspond to the current state and action, and when we add the prime, that's the next state and the next action.

The Bellman equation is the following. It says that Q(s,a), that is, the return under this set of assumptions, is equal to R(s), the reward you get for being in that state, plus the discount factor gamma times the max over all possible actions a' of Q(s', a'), where s' is the new state you just got to. There's a lot going on in this equation. Let's first take a look at some examples, and then we'll come back to see why this equation might make sense.

Let's look at an example: Q of state 2 and the action right, and apply the Bellman equation to see what value it gives us. If the current state is state 2 and the action is to go right, then the next state you get to after going right, s', is state 3. The Bellman equation says Q(2, right) = R(s), which here is R(2), just the reward 0, plus the discount factor gamma, which we set to 0.5 in this example, times the max of the Q values in state s', that is, in state 3. This is the max of 25 and 6.25, since this is the max over a' of Q(s', a'), and 25 and 6.25 are the two choices for state 3. This works out to 0 + 0.5 × 25 = 12.5, which is indeed Q(2, right).

Let's look at just one more example. Let's take state 4 and see what Q(4, left) is. In this case, the current state is 4 and the current action is to go left. If you start from state 4 and go left, you also end up at state 3, so s' is state 3 again. The Bellman equation says this is equal to R(s), the reward of state 4, which is 0, plus 0.5, the discount factor gamma, times the max over a' of Q(s', a'), where s' is state 3 again. Once again, the Q values for state 3 are 25 and 6.25, and the larger of these is 25. This works out to 0 + 0.5 × 25, which is again equal to 12.5.
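If it helps to see these updates in code, here is a minimal sketch in Python, assuming the six-state rover MDP described above: rewards of 100 and 40 in the two terminal states, 0 everywhere else, deterministic left/right moves, and gamma = 0.5. The names like next_state, REWARDS, and TERMINAL are illustrative choices, not anything from the lecture. The sketch repeatedly applies the Bellman equation until the Q values stop changing, then checks the two examples we just worked out.

```python
# A minimal sketch of applying the Bellman equation to the six-state MDP
# described above: states 1..6, R(1) = 100, R(6) = 40, all other rewards 0,
# terminal states 1 and 6, and discount factor gamma = 0.5.

GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
ACTIONS = ("left", "right")

def next_state(s, a):
    """Deterministic transitions: left moves to s - 1, right moves to s + 1."""
    return s - 1 if a == "left" else s + 1

# Initialize Q(s, a) to zero and repeatedly apply the Bellman equation
# Q(s, a) = R(s) + gamma * max_a' Q(s', a') until the values settle.
Q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
for _ in range(100):
    new_Q = {}
    for s in REWARDS:
        for a in ACTIONS:
            if s in TERMINAL:
                new_Q[(s, a)] = REWARDS[s]  # terminal state: Q(s, a) = R(s)
            else:
                s_prime = next_state(s, a)
                best_next = max(Q[(s_prime, a_prime)] for a_prime in ACTIONS)
                new_Q[(s, a)] = REWARDS[s] + GAMMA * best_next
    Q = new_Q

print(Q[(2, "right")])                   # 12.5, matching the first worked example
print(Q[(4, "left")])                    # 12.5, matching the second worked example
print(Q[(3, "left")], Q[(3, "right")])   # 25.0 and 6.25, the two choices in state 3
```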
That's why Q(4, left) is also equal to 12.5. Just one note: if you're in a terminal state, then the Bellman equation simplifies to Q(s,a) = R(s), because there's no next state s', and so that second term goes away. That is why Q(s,a) in the terminal states is just 100, 100 or 40, 40. If you wish, feel free to pause the video, apply the Bellman equation to any other state-action pair in this MDP, and check for yourself that the math works out.

Just to recap, this is how we had defined Q(s,a), and we saw earlier that the best possible return from any state s is the max over a of Q(s,a). In fact, just to rename s and a, it turns out that the best possible return from a state s' is the max over a' of Q(s', a'). I didn't really do anything other than rename s to s' and a to a', but this will make some of the intuitions a little bit easier later. For any state s', like state 3, the best possible return from, say, state 3 is the max over all possible actions a' of Q(s', a').

Here again is the Bellman equation. The intuition it captures is this: if you're starting from state s and you're going to take the action a and then act optimally after that, then you're going to see some sequence of rewards over time. In particular, the return will be the reward at the first step, plus gamma times the reward at the second step, plus gamma squared times the reward at the third step, and so on, until you get to the terminal state. What the Bellman equation says is that this sequence of rewards, weighted by the discount factor, can be broken down into two components. First, there's R(s), the reward you get right away. In the reinforcement learning literature, this is sometimes also called the immediate reward, and that's what R_1 is: the reward you get for starting out in state s. The second term is the following: after you start in state s and take the action a, you get to some new state s'. The definition of Q(s,a) assumes we're going to behave optimally after that, so after we get to s', we're going to behave optimally and get the best possible return from the state s'. That is the max over a' of Q(s', a'): the return from behaving optimally starting from the state s'. That's exactly what we had written up here as the best possible return when you start from state s'.

Another way of phrasing this is that the total return is also equal to R_1 plus, if we factor out gamma, gamma times (R_2 plus, instead of gamma squared, just gamma times R_3, plus gamma squared times R_4, plus dot dot dot). Notice that if you were starting from state s', the sequence of rewards you'd get would be R_2, R_3, R_4, and so on. That's why the expression in parentheses is the total return if you were to start from state s'. If you were to behave optimally, then this expression should be the best possible return for starting from state s', which is why this sequence of discounted rewards equals the max over a' of Q(s', a'), and we're also left with that extra discount factor gamma, which is why Q(s,a) is equal to this expression over here. In case you think this is quite complicated and you aren't following all the details, don't worry about it. So long as you apply this equation, you will manage to get the right results.
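Here is the same decomposition written out once in equation form, following the reasoning just described, where R_1 is the reward of the current state s and R_2, R_3, ... are the rewards collected from the next state s' onward while acting optimally:

```latex
\begin{aligned}
Q(s,a) &= R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \cdots \\
       &= R_1 + \gamma \left( R_2 + \gamma R_3 + \gamma^2 R_4 + \cdots \right) \\
       &= R(s) + \gamma \max_{a'} Q(s', a')
\end{aligned}
```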
But the high-level intuition I hope you take away is that the total return you get in a reinforcement learning problem has two parts. The first part is the reward that you get right away, and the second part is gamma times the return you get starting from the next state s'. Adding these two components together, R(s) plus gamma times the return from the next state, gives the total return from the current state s. That is the essence of the Bellman equation.

Just to relate this back to our earlier example, Q(4, left) is the total return for starting in state 4 and going left. If you were to go left in state 4, the rewards you get are 0 in state 4, 0 in state 3, 0 in state 2, and then 100, which is why the total return is 0 + 0.5 × 0 + 0.5² × 0 + 0.5³ × 100, which is 12.5. What the Bellman equation is saying is that we can break this up into two pieces. There is this 0, which is R(4), the reward of state 4, and then plus 0.5 times this other sequence, 0 + 0.5 × 0 + 0.5² × 100. If you look at what this sequence is, it is really the optimal return from the next state s' that you got to after taking the action left from state 4. That's why this is equal to the reward of state 4 plus 0.5 times the optimal return from state 3, because if you were to start from state 3, the rewards you'd get would be 0, followed by 0, followed by 100. So this is the optimal return from state 3, and that's why this is just R(4) plus 0.5 times the max over a' of Q(3, a'). You'll find a short numeric check of this breakdown at the end of this section.

I know the Bellman equation is a somewhat complicated equation, breaking down your total return into the reward you get right away, the immediate reward, plus gamma times the return from the next state s'. If it makes sense to you, but not fully, that's okay. Don't worry about it; you can still apply the Bellman equation to get a reinforcement learning algorithm to work correctly. But I hope that at least the high-level intuition of breaking down the return into what you get right away plus what you get in the future makes sense.

Before moving on to develop a reinforcement learning algorithm, we have coming up next an optional video on stochastic Markov decision processes, that is, on reinforcement learning applications where the actions you take can have a slightly random effect. Take a look at the optional video if you wish. Then after that, we'll start to develop a reinforcement learning algorithm.
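As promised, here is the quick numeric check of the Q(4, left) breakdown. This is only a sketch with the numbers from the example above plugged in by hand, and the variable names are illustrative, not from the lecture.

```python
# Numeric check of the Bellman decomposition for Q(4, left), using the rewards
# from the example above: 0 in states 4, 3, 2, and then 100 in the terminal state.
gamma = 0.5

# Direct computation of the return: 0 + 0.5*0 + 0.5^2*0 + 0.5^3*100
direct_return = 0 + gamma * 0 + gamma**2 * 0 + gamma**3 * 100

# Bellman split: R(4) plus gamma times the optimal return from state 3,
# where the return from state 3 is 0 + 0.5*0 + 0.5^2*100 = 25.
return_from_state_3 = 0 + gamma * 0 + gamma**2 * 100
bellman_split = 0 + gamma * return_from_state_3

print(direct_return, bellman_split)  # both are 12.5
```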