We just learned how the value function computed for a given policy can be used to find a better policy. In this video, we will show how we can use this to find the optimal policy by iteratively evaluating and improving a sequence of policies. By the end of this video, you will be able to outline the policy iteration algorithm for finding the optimal policy, understand the dance of policy and value, that is, how policy iteration reaches the optimal policy by alternating between evaluating a policy and improving it, and apply policy iteration to compute optimal policies and optimal value functions.

Recall the policy improvement theorem. It tells us that we can construct a strictly better policy by acting greedily with respect to the value function of a given policy, unless the given policy was already optimal. Let's say we begin with the policy Pi 1. We can evaluate Pi 1 using iterative policy evaluation to obtain the state-value function, v Pi 1. We call this the evaluation step. Using the result of the policy improvement theorem, we can then greedify with respect to v Pi 1 to obtain a better policy, Pi 2. We call this the improvement step. We can then compute v Pi 2 and use it to obtain an even better policy, Pi 3. This gives us a sequence of ever-better policies. Each policy is guaranteed to be an improvement on the last, unless the last policy was already optimal. So when we complete an iteration and the policy remains unchanged, we know we have found the optimal policy. At that point, we can terminate the algorithm. Each policy generated in this way is deterministic, and there are only a finite number of deterministic policies, so this iterative improvement must eventually reach an optimal policy.

This method of finding an optimal policy is called policy iteration. Policy iteration consists of two distinct steps repeated over and over: evaluation and improvement. We first evaluate our current policy, Pi 1, which gives us a new value function that accurately reflects the value of Pi 1.
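The alternating sequence just described can be written compactly, with E denoting an evaluation step and I an improvement step:

```latex
\pi_1 \xrightarrow{\;E\;} v_{\pi_1}
      \xrightarrow{\;I\;} \pi_2
      \xrightarrow{\;E\;} v_{\pi_2}
      \xrightarrow{\;I\;} \pi_3
      \xrightarrow{\;E\;} \cdots
      \xrightarrow{\;I\;} \pi_*
      \xrightarrow{\;E\;} v_*
```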
The improvement step then uses v Pi 1 to produce a greedy policy, Pi 2. At this point, Pi 2 is greedy with respect to the value function of Pi 1, but v Pi 1 no longer accurately reflects the value of Pi 2. The next step, evaluation, makes our value function accurate with respect to the policy Pi 2. Once we do this, our policy is once again not greedy. This dance of policy and value proceeds back and forth until we reach the only policy which is greedy with respect to its own value function: the optimal policy. At this point, and only at this point, the policy is greedy and the value function is accurate.

We can visualize this dance as bouncing back and forth between one line, where the value function is accurate, and another, where the policy is greedy. These two lines intersect only at the optimal policy and value function. Policy iteration always makes progress towards the intersection by projecting first onto the line v equals v Pi, and then onto the line where Pi is greedy with respect to v. Of course, the real geometry of the space of policies and value functions is more complicated, but the same intuition holds.

Here's what this procedure looks like in pseudocode. We initialize v and Pi in any way we like for each state s. Next, we call iterative policy evaluation to make v reflect the value of Pi. This is the algorithm we learned earlier in this module. Then, in each state, we set Pi to select the maximizing action under the value function. If this procedure changes the selected action in any state, we note that the policy is still changing, and set policy-stable to false. After completing step 3, we check if the policy is stable. If not, we carry on and evaluate the new policy.

Let's look at how this works on a simple problem to build some intuition. Remember the four-by-four gridworld example we used to demonstrate iterative policy evaluation.
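As a rough sketch, the pseudocode above might look like this in Python for a small tabular MDP. The array names and shapes here (P for transition probabilities, R for expected rewards) are illustrative assumptions, not the course's notation:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration for a finite MDP.

    P: transition probabilities, shape (S, A, S), so P[s, a, s2]
       is the probability of moving from s to s2 under action a.
    R: expected immediate rewards, shape (S, A).
    Returns a deterministic optimal policy (action index per state)
    and its value function.
    """
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)   # arbitrary initial policy
    V = np.zeros(n_states)               # arbitrary initial values

    while True:
        # Evaluation: iterative policy evaluation for the current pi,
        # sweeping in place until the largest update is below theta.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = V[s]
                V[s] = R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break

        # Improvement: greedify with respect to V in every state.
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            q = R[s] + gamma * P[s] @ V   # action values, shape (A,)
            pi[s] = int(np.argmax(q))
            if pi[s] != old_action:
                policy_stable = False

        # If no action changed, the policy is greedy with respect to
        # its own value function, so it is optimal.
        if policy_stable:
            return pi, V
```

On a two-state MDP where action 1 in state 1 yields reward 1 and everything else yields 0, this returns the policy that moves to and stays in state 1.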
Previously, we showed that by evaluating the random policy and greedifying just once, we could find the optimal policy. This is not a very interesting case for policy iteration. Let's modify this problem a little bit to make the control task a bit harder. First, let's remove one of the terminal states, so that there's only one way to end the episode. Previously, every transition emitted a reward of minus 1. Instead, let's add some especially bad states, marked in blue. Transitioning into them gives a reward of minus 10. The optimal policy should follow the winding low-cost path, in white, to the terminal state. This additional complexity means that policy iteration takes several iterations to discover the path.

Let's see how this plays out. First, we initialize a policy and value function. As before, we choose the uniform random policy and set the value estimate to zero for all states. The first step is to use iterative policy evaluation to evaluate the uniform random policy. Since you've seen how this works before, let's skip straight to the result. The values are quite negative everywhere, though slightly less so in states near the goal. Next, we perform the improvement step. You've seen how greedification works before, so again, let's skip to the result. Notice that near the terminal state, the policy correctly follows the low-cost path toward the terminal state. In the states in the bottom row, however, the policy instead takes the more direct, but lower-value, path through the bad states.

Let's evaluate this new policy. Notice how, after just one improvement, the values are starting to look much more reasonable, but we aren't finished yet. Let's greedify again. Remember, the policy improvement theorem tells us that this new policy is an improvement on the last one, unless we have already reached the optimum. Specifically, the bottom-right state now goes straight up along the low-cost path. One more step of policy evaluation reflects this change.
The value of the bottom-right state goes from minus 15 to just minus 6. Let's keep going until we reach the optimal policy. One more step of improvement improves the action selection in yet another state. The next step of policy evaluation reflects this change. We improve again, and evaluate, and improve. Now, we can see that the policy has reached the optimum, and follows the low-cost path avoiding the blue states. Evaluating one more time gives us the optimal value function. If we try to greedify again, the policy remains unchanged. This tells us that policy iteration is complete, and the optimal policy has been found.

This example shows the power of policy iteration, in that it guarantees we can follow a sequence of increasingly better policies until we reach an optimal policy. Policy iteration cuts through the search space, which is key when the path to the optimal policy is not straightforward, in this case literally. The same complexity will come up in problems we really care about. In this week's assessment, you will have a chance to implement policy iteration on a slightly more realistic example.

That's it for this video. You should now understand how policy iteration works by alternating between policy evaluation and policy improvement, and how policy iteration follows a sequence of better policies and value functions until it reaches the optimal policy and optimal value function. See you next time.
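To make the walkthrough concrete, here is a small self-contained sketch of policy iteration on a toy four-by-four gridworld in the same spirit: one terminal state, a reward of minus 1 per step, and minus 10 for stepping into a bad state. The bad-state layout and the discount of 0.95 are assumptions for illustration (the lecture's task is undiscounted and its layout differs), so the exact values will not match those shown in the video.

```python
import numpy as np

SIZE = 4
TERMINAL = (0, 0)
BAD = {(1, 1), (2, 2)}                         # illustrative bad states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """Deterministic transition: move if in bounds, else stay put."""
    if s == TERMINAL:
        return s, 0.0
    r, c = s[0] + a[0], s[1] + a[1]
    nxt = (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else s
    return nxt, (-10.0 if nxt in BAD else -1.0)

STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

def policy_iteration(gamma=0.95, theta=1e-8):
    V = {s: 0.0 for s in STATES}
    pi = {s: 0 for s in STATES}    # start with "up" everywhere
    sweeps = 0
    while True:
        # Evaluation: in-place sweeps for the current policy.
        while True:
            delta = 0.0
            for s in STATES:
                if s == TERMINAL:
                    continue
                nxt, r = step(s, ACTIONS[pi[s]])
                v_new = r + gamma * V[nxt]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Improvement: greedify in every state.
        sweeps += 1
        stable = True
        for s in STATES:
            if s == TERMINAL:
                continue
            q = [r + gamma * V[nxt]
                 for nxt, r in (step(s, a) for a in ACTIONS)]
            best = int(np.argmax(q))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V, sweeps
```

Running this, the first improvement sweep already changes some actions, so at least two improvement sweeps are needed before the policy is stable, mirroring how the lecture's example takes several iterations to settle on the low-cost path.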