In the previous video, we discussed episodic problems. In many problems, however, the agent-environment interaction continues without end. Today, we will see how such problems can be formulated as continuing tasks. In this video, you will learn to differentiate between episodic and continuing tasks, formulate returns for continuing tasks using discounting, and describe how returns at successive time steps are related to each other.

Let's look at the differences between episodic and continuing tasks. As we discussed earlier, episodic tasks break up into episodes. Every episode in an episodic task must end in a terminal state, and the next episode begins independently of how the last episode ended. The return at time step t is the sum of rewards until termination. In contrast, continuing tasks cannot be broken up into independent episodes. The interaction goes on continually, and there are no terminal states.

To make this more concrete, consider a smart thermostat which regulates the temperature of a building. This can be formulated as a continuing task, since the thermostat never stops interacting with the environment. The state could be the current temperature, along with details of the situation like the time of day and the number of people in the building. There are just two actions: turn the heater on, or turn it off. The reward is minus one every time someone has to manually adjust the temperature, and zero otherwise. To avoid negative rewards, the thermostat would learn to anticipate the user's preferences.

So how can we formulate the return for continuing tasks? We can try to sum up all the future rewards, as we did for episodic tasks. But now we're summing over an infinite sequence, and this return might not be finite. So how can we modify this sum so that it is always finite? One solution is to discount future rewards by a factor Gamma, called the discount rate. Gamma is at least zero, but less than one. The return formulation can then be modified to include discounting.
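As a quick illustration, the discounted return can be computed in a few lines of Python. This is a minimal sketch, not part of the lecture itself; the function name and the example reward sequence are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum gamma**k * R_{t+k+1} over a (finite) sequence of future rewards."""
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

# Thermostat-style rewards: -1 whenever a user manually adjusts the temperature.
future_rewards = [0, 0, -1, 0, -1]
print(discounted_return(future_rewards, gamma=0.9))  # roughly -1.47
```

With Gamma = 0.9, the adjustment two steps away is weighted by 0.81 and the one four steps away by about 0.66, so nearer rewards dominate the sum, exactly as described above.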
The effect of discounting on the return is simple: immediate rewards contribute more to the sum, while rewards far into the future contribute less, because they are multiplied by Gamma raised to successively larger powers. Intuitively, this choice makes sense: a dollar today is worth more to you than a dollar in a year. We can concisely write this sum as the sum over k of Gamma to the power k times R_{t+k+1}, and this expression is guaranteed to be finite. Let's see why. Assume R_max is the maximum reward our agent can receive at any time step. We can upper bound the return G_t by replacing every reward with R_max. Since R_max is just a constant, we can pull it out of the summation. The second factor is then just a geometric series, which evaluates to one divided by one minus Gamma. R_max times one over one minus Gamma is finite and is an upper bound on G_t, so we know G_t is finite.

Now, let's look at the effect of the discount factor on the behavior of the agent. We can look at the two extreme cases: when Gamma equals zero, and when Gamma approaches one. When Gamma equals zero, the return is just the reward at the next time step, so the agent is shortsighted and only cares about the immediate expected reward. On the other hand, when Gamma approaches one, immediate and future rewards are weighted nearly equally in the return; the agent in this case is more farsighted.

Finally, let's discuss a simple but important property of the return: it can be written recursively. Let's factor out Gamma starting from the second term in our sum. Amazingly, the sequence in parentheses is the return at the next time step, so we can simply replace it with G_{t+1}. Now we have a recursive equation, G_t = R_{t+1} + Gamma G_{t+1}, with G_t on the left and G_{t+1} on the right. This simple equation is more powerful than it seems. In future videos, we'll exploit it to design learning algorithms.

To recap, we learned about continuing tasks, where the agent-environment interaction goes on indefinitely.
Discounting is used to ensure returns are finite, and we saw that returns can be defined recursively. See you next time.
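The recursive relationship between successive returns can also be checked numerically. Here is a minimal sketch, assuming a finite reward sequence whose return is zero after the last step; the function name is illustrative, not from the lecture:

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every t via the recursion G_t = R_{t+1} + gamma * G_{t+1},
    sweeping backwards from the end of a finite reward sequence."""
    returns = [0.0] * (len(rewards) + 1)  # the return after the final reward is 0
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns[:-1]

print(returns_from_rewards([1, 1, 1], 0.5))  # [1.75, 1.5, 1.0]
```

Each return is the next reward plus the discounted return from the following step, which is exactly the recursion the video derives and the one later learning algorithms build on.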