You saw in the last video what the states of a reinforcement learning application are, as well as how, depending on the actions you take, you go through different states and also get to enjoy different rewards. But how do you know if a particular set of rewards is better or worse than a different set of rewards? The return in reinforcement learning, which we'll define in this video, allows us to capture that. As we go through this, one analogy you might find helpful is to imagine you have a five-dollar bill at your feet that you can reach down and pick up, or, half an hour's walk across town, a ten-dollar bill that you can walk over and pick up. Which one would you rather go after? Ten dollars is much better than five dollars, but if you need to walk for half an hour to go and get that ten-dollar bill, then maybe it'd be more convenient to just pick up the five-dollar bill instead. The concept of a return captures the idea that rewards you can get sooner are maybe more attractive than rewards that take you a long time to get to. Let's take a look at exactly how that works.

Here's the Mars rover example. If, starting from state 4, you go to the left, we saw that the rewards you get would be zero on the first step from state 4, zero from state 3, zero from state 2, and then 100 at state 1, the terminal state. The return is defined as the sum of these rewards, but weighted by one additional factor, which is called the discount factor. The discount factor is a number a little bit less than 1. Let me pick 0.9 as the discount factor. I'm going to weight the reward on the first step as just zero, the reward on the second step by the discount factor, so 0.9 times that reward, then add the discount factor squared times the third reward, and then the discount factor cubed times the fourth reward. If you calculate this out, it turns out to be 0.729 times 100, which is 72.9.

The more general formula for the return is that if your robot goes through some sequence of states and gets reward R_1 on the first step, R_2 on the second step, R_3 on the third step, and so on, then the return is R_1 plus the discount factor Gamma (the Greek letter Gamma, which I've set to 0.9 in this example) times R_2, plus Gamma^2 times R_3, plus Gamma^3 times R_4, and so on, until you get to the terminal state. What the discount factor Gamma does is make the reinforcement learning algorithm a little bit impatient, because the return gives full credit to the first reward (it's multiplied by 1 times R_1), a little bit less credit to the reward you get at the second step (it's multiplied by 0.9), even less credit to the reward you get at the next time step, R_3, and so on. Getting rewards sooner therefore results in a higher value for the total return.

In many reinforcement learning algorithms, a common choice for the discount factor is a number pretty close to 1, like 0.9, 0.99, or even 0.999. But for illustrative purposes, in the running example I'm going to use a discount factor of 0.5. This very heavily down-weights, or, as we say, very heavily discounts, rewards in the future, because with every additional passing timestep you get only half as much credit as for rewards you would have gotten one step earlier. If Gamma were equal to 0.5, then, replacing 0.9 with 0.5 in the equation on top, the return in the example above would have been 0 plus 0.5 times 0, plus 0.5^2 times 0, plus 0.5^3 times 100; that last reward is the 100 at state 1, the terminal state. This turns out to be a return of 12.5.
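If it helps to see this arithmetic spelled out, here is a minimal Python sketch of my own (not from the lecture) that computes the return for a finite reward sequence; the reward list assumes the rover starts in state 4 and always goes left, as in the example above.

```python
def discounted_return(rewards, gamma):
    """Return R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards when starting in state 4 and always going left:
# 0 at states 4, 3, 2, then 100 at the terminal state 1.
rewards_from_state_4 = [0, 0, 0, 100]

print(discounted_return(rewards_from_state_4, gamma=0.9))  # ~72.9 (up to float round-off)
print(discounted_return(rewards_from_state_4, gamma=0.5))  # 12.5
```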
In financial applications, the discount factor also has a very natural interpretation as the interest rate, or the time value of money. A dollar today may be worth a little bit more than a dollar you could only get in the future, because you can put a dollar today in the bank, earn some interest, and end up with a little bit more money a year from now. For financial applications, the discount factor often represents how much less a dollar in the future is worth compared to a dollar today.

Let's look at some concrete examples of returns. The return you get depends on the rewards, and the rewards depend on the actions you take, and so the return depends on the actions you take. Let's use our usual example and say that, for this example, I'm going to always go to the left. We already saw that if the robot were to start off in state 4, the return is 12.5, as we worked out on the previous slide. It turns out that if it were to start off in state 3, the return would be 25, because it gets to the 100 reward one step sooner, and so that reward is discounted less. If it were to start off in state 2, the return would be 50. If it were to start off in state 1, well, it gets the reward of 100 right away, so it's not discounted at all, and the return starting from state 1 would be 100. The return starting from state 5 would be 6.25, and if you start off in state 6, which is a terminal state, you just get the reward, and thus a return, of 40.

Now, if you were to take a different set of actions, the returns would actually be different. For example, if we were to always go to the right, and those were our actions, then if you were to start in state 4, you get a reward of 0, then you get to state 5 and get a reward of 0, and then you get to state 6 and get a reward of 40. In this case, the return would be 0, plus the discount factor 0.5 times 0, plus 0.5 squared times 40. 0.5 squared is 1/4, and 1/4 of 40 is 10, so the return from state 4 is 10 if you take the actions "always go to the right." Through similar reasoning, the return from state 5 is 20, the return from state 3 is 5, the return from state 2 is 2.5, and the returns at the two terminal states are 100 and 40. By the way, if these numbers don't fully make sense, feel free to pause the video and double-check the math, and see if you can convince yourself that these are the appropriate values for the return if you start from different states and always go to the right.

We see that if you always go to the right, the return you expect to get is lower for most states, so maybe always going to the right isn't as good an idea as always going to the left. But it turns out that we don't have to always go to the left or always go to the right. We could also decide: if you're in state 2, go left; if you're in state 3, go left; if you're in state 4, go left; but if you're in state 5, you're so close to this reward, so let's go right. This would be a different way of choosing which action to take based on what state you're in. It turns out that the returns you get from the different states would then be 100, 50, 25, 12.5, 20, and 40. Just to illustrate one case: if you were to start off in state 5, here you would go to the right, and so the rewards you get would be zero first, in state 5, and then 40.
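To make these rollouts concrete, here is a small Python sketch of my own (the six-state layout, terminal rewards of 100 and 40, and discount factor of 0.5 follow the example above; the function and variable names are just illustrative) that computes the return from each starting state under the three policies we just discussed.

```python
# Six-state Mars rover example: states 1 and 6 are terminal, with rewards 100 and 40;
# the intermediate states give reward 0.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}

def rollout_return(start, policy, gamma=0.5):
    """Follow the policy from `start` until a terminal state; return the discounted reward sum."""
    state, total, discount = start, 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:
            return total
        state += 1 if policy[state] == "right" else -1
        discount *= gamma

always_left  = {s: "left"  for s in range(2, 6)}
always_right = {s: "right" for s in range(2, 6)}
mixed        = {2: "left", 3: "left", 4: "left", 5: "right"}

for name, policy in [("left", always_left), ("right", always_right), ("mixed", mixed)]:
    print(name, [rollout_return(s, policy) for s in range(1, 7)])
# left  [100.0, 50.0, 25.0, 12.5, 6.25, 40.0]
# right [100.0, 2.5, 5.0, 10.0, 20.0, 40.0]
# mixed [100.0, 50.0, 25.0, 12.5, 20.0, 40.0]
```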
The return is zero, the first reward, plus the discount factor 0.5 times 40, which is 20, which is why the return from this state is 20 if you take the actions shown here.

To summarize, the return in reinforcement learning is the sum of the rewards that the system gets, weighted by the discount factor, where rewards in the far future are weighted by the discount factor raised to a higher power. This actually has an interesting effect when you have systems with negative rewards. In the example we went through, all the rewards were zero or positive, but if any rewards are negative, then the discount factor actually incentivizes the system to push the negative rewards as far into the future as possible. Taking a financial example, if you had to pay someone $10, maybe that's a negative reward of minus 10. But if you could postpone the payment by a few years, then you're actually better off, because $10 a few years from now, because of the interest rate, is actually worth less than $10 that you had to pay today. So for systems with negative rewards, the discount factor causes the algorithm to try to push the negative rewards as far into the future as possible. For financial applications, and for other applications too, that actually turns out to be the right thing for the system to do.

You now know what the return in reinforcement learning is. Let's go on to the next video to formalize the goal of a reinforcement learning algorithm.
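If you want a quick numerical check of that point about negative rewards, here is a tiny sketch of my own, reusing the discount factor of 0.5 from the running example: the same minus-10 reward costs much less, in terms of the return, when it is pushed a few steps into the future.

```python
gamma = 0.5
print(-10 * gamma ** 0)  # -10.0  (pay the -10 reward now)
print(-10 * gamma ** 3)  # -1.25  (pay the same -10 reward three steps later)
```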