Up to this point, we've generally talked about a policy as something that is given. The policy specifies how an agent behaves, and given this way of behaving, we then aim to find the value function. But the goal of reinforcement learning is not just to evaluate specific policies. Ultimately, we want to find a policy that obtains as much reward as possible in the long run. In this video, we will make this notion precise with the idea of an optimal policy. By the end of this video, you will be able to define an optimal policy, understand how a policy can be at least as good as every other policy in every state, and identify an optimal policy for a given MDP.

To define an optimal policy, we first have to understand what it means for one policy to be better than another. Here we can see the value of two policies plotted across states. Note that we plot it as a continuous line for illustration only; we are still considering discrete states, and the value might not vary smoothly across states. This plot illustrates that in some states Pi one achieves a higher value, and in other states Pi two achieves a higher value. So it does not make much sense to say Pi one is better than Pi two, or that Pi two is better than Pi one.

We will say policy Pi one is as good as or better than policy Pi two if and only if the value under Pi one is greater than or equal to the value under Pi two for every state. We denote this relationship between policies with a greater-than-or-equal-to sign. In the diagram shown here, the line visualizing the value of Pi one is always above the line for Pi two, so in this case Pi one is as good as or better than Pi two.

An optimal policy is a policy which is as good as or better than all other policies. That is, an optimal policy will have the highest possible value in every state. There is always at least one optimal policy, but there may be more than one. We'll use the notation Pi star to denote any optimal policy.

It might not be clear why optimal policies must exist, that is, policies that are at least as good as all other policies in every state. Couldn't we have a situation where doing well in one state requires doing badly in another? Let's say there is such a policy Pi one, which does well in some states, while policy Pi two does well in others. We could combine these policies into a third policy Pi three, which always chooses actions according to whichever of Pi one and Pi two has the highest value in the current state. Pi three will necessarily have a value greater than or equal to both Pi one and Pi two in every state. So we will never have a situation where doing well in one state requires sacrificing value in another. Because of this, there always exists some policy which is best in every state. This is of course only an informal argument, but there is in fact a rigorous proof showing that there must always exist at least one optimal deterministic policy. Notice that in some states Pi three has a strictly greater value than either Pi one or Pi two. As an exercise, try to explain how this is possible, given that Pi three simply chooses actions according to whichever of Pi one and Pi two has a higher value in each state.

Let's look at a specific example to build some intuition about optimal policies. Consider the two-choice MDP shown here. The only decision to be made is in the top state, labeled X, where the agent can take either action A1 or A2. From state X, action A1 takes the agent to state Y. In state Y, only action A1 is available, and it takes the agent back to state X. On the other hand, action A2 in X takes the agent to state Z. From state Z, action A1 is again the only action available, and it takes the agent back to state X. The numbers show the rewards the agent receives after each action. Notice that while A1 offers an immediate reward of plus one, A2 offers a larger reward of plus two after a one-step delay.

There are only two deterministic policies in this MDP, and they are completely defined by the agent's choice in state X: take action A1, or take action A2. Let's call these Pi one and Pi two. Which of these two policies is optimal? The optimal policy will be the one for which the value of X is highest, and the answer depends on the discount factor Gamma.

Consider Gamma equals zero. In this case, the value is defined using only the immediate reward. The value of state X under Pi one is plus one, while the value under Pi two is zero, because the plus-two reward occurs after a one-step delay, which does not affect the return when Gamma is set to zero. So in this case, Pi one is the optimal policy.

What if instead Gamma equals 0.9? In this case, the value of X under each policy is an infinite sum. Pi one receives an immediate reward of one, followed by zero, then one again, and so on. Each reward in the expression for the value of state X is discounted by some power of Gamma. We can express this compactly as a geometric series. Applying the geometric series formula, we get the value shown here for Pi one, which evaluates to around 5.3. We can write the value under Pi two similarly, except we receive a reward of two on every odd step instead of one on every even step. We can again write this as a geometric series and obtain a closed-form solution. In this case, the solution evaluates to around 9.5. Since 9.5 is higher than 5.3, Pi two is optimal in this case.
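As a rough check of this arithmetic, here is a short Python sketch (not part of the lecture itself) that computes the value of state X under both policies, assuming the reward sequences described above: one on the even steps for Pi one, two on the odd steps for Pi two. It compares the geometric-series closed form against a simple truncated sum.

```python
# Rough check of the two-choice MDP values, assuming the reward sequences
# described above: pi1 earns 1, 0, 1, 0, ... from X, while pi2 earns 0, 2, 0, 2, ...

def values_closed_form(gamma):
    """Values of state X from the geometric series formula."""
    v_pi1 = 1.0 / (1.0 - gamma ** 2)           # 1 + gamma^2 + gamma^4 + ...
    v_pi2 = 2.0 * gamma / (1.0 - gamma ** 2)   # 2*gamma + 2*gamma^3 + ...
    return v_pi1, v_pi2

def values_truncated(gamma, horizon=1000):
    """The same values, approximated by summing the first `horizon` discounted rewards."""
    v_pi1 = sum(gamma ** t for t in range(0, horizon, 2))        # reward 1 on even steps
    v_pi2 = sum(2.0 * gamma ** t for t in range(1, horizon, 2))  # reward 2 on odd steps
    return v_pi1, v_pi2

for gamma in (0.0, 0.9):
    print(gamma, values_closed_form(gamma), values_truncated(gamma))
# gamma = 0.0 -> v_pi1(X) = 1.0, v_pi2(X) = 0.0, so pi1 is optimal
# gamma = 0.9 -> v_pi1(X) ~ 5.26, v_pi2(X) ~ 9.47, so pi2 is optimal
```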
In this two-choice MDP, finding the optimal policy was relatively straightforward: there were only two deterministic policies, and we simply had to compute the value function for each of them. In general, it will not be so easy. Even if we limit ourselves to deterministic policies, the number of possible policies is equal to the number of possible actions raised to the power of the number of states. We could use a brute-force search, where we compute the value function for every policy, to find the optimal policy, but it's not hard to see that this becomes intractable for even moderately large MDPs. Luckily, there's a better way to organize our search of the policy space. The solution will come in the form of yet another set of Bellman equations, called the Bellman optimality equations, which we will introduce in an upcoming video.

That's it for this video. The most important things to take away are: an optimal policy is defined as a policy with the highest value in all states; at least one optimal policy always exists, but there may be more than one; and the exponential number of possible policies makes searching for the optimal policy by brute force intractable.
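To make the brute-force idea concrete, here is a minimal Python sketch of that approach applied to the two-choice MDP from above: enumerate every deterministic policy, evaluate each one exactly, and keep whichever is at least as good in every state. The data layout and names here are illustrative assumptions, not anything defined in the lecture.

```python
# Minimal brute-force policy search on the two-choice MDP above (names and
# data layout are illustrative). Each deterministic policy is evaluated
# exactly by solving the linear system v = r_pi + gamma * P_pi v.
import itertools
import numpy as np

states = ["X", "Y", "Z"]
actions = {"X": ["A1", "A2"], "Y": ["A1"], "Z": ["A1"]}
next_state = {("X", "A1"): "Y", ("X", "A2"): "Z",
              ("Y", "A1"): "X", ("Z", "A1"): "X"}
reward = {("X", "A1"): 1.0, ("X", "A2"): 0.0,
          ("Y", "A1"): 0.0, ("Z", "A1"): 2.0}
gamma = 0.9

def evaluate(policy):
    """Exact policy evaluation for a deterministic policy in this deterministic MDP."""
    n = len(states)
    P, r = np.zeros((n, n)), np.zeros(n)
    for i, s in enumerate(states):
        a = policy[s]
        P[i, states.index(next_state[(s, a)])] = 1.0
        r[i] = reward[(s, a)]
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# Enumerate every deterministic policy (one action choice per state) and keep
# the one whose value is at least as high in every state.
best_policy, best_value = None, None
for choice in itertools.product(*(actions[s] for s in states)):
    policy = dict(zip(states, choice))
    v = evaluate(policy)
    if best_value is None or np.all(v >= best_value):
        best_policy, best_value = policy, v

print(best_policy)  # with gamma = 0.9 this picks A2 in state X, as computed above
print(dict(zip(states, np.round(best_value, 2))))
```

Even in this tiny example the loop visits every one of the deterministic policies, and the number of such policies grows exponentially with the number of states, which is exactly the blow-up mentioned above.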