At the end of a full sweep,

we can write all the new values into V;

then we do the next iteration.

It is also possible to implement

a version with only one array,

in which case, some updates will

themselves use new values instead of old.

This single array version is

still guaranteed to converge,

and in fact, will usually converge faster.

This is because it gets to use the updated values sooner.

But for simplicity, we focus on the two array version.
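
As a rough sketch of the difference between the two versions, here is what the two sweep variants might look like in Python; the array handling and the helper name bellman_update are placeholders I'm assuming for illustration, not something from the lecture.

```python
def sweep_two_arrays(V, states, bellman_update):
    """Two-array sweep: every new value is computed from the old values in V
    and written into a separate array, V_prime."""
    V_prime = V.copy()
    for s in states:
        V_prime[s] = bellman_update(s, V)   # reads only the old values in V
    return V_prime                           # copied into V before the next sweep

def sweep_in_place(V, states, bellman_update):
    """Single-array sweep: each new value overwrites V immediately, so later
    updates in the same sweep may already see the new values."""
    for s in states:
        V[s] = bellman_update(s, V)          # may read values updated this sweep
    return V
```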

Let's look at how iterative policy evaluation

works on a particular example.

Consider the four-by-four grid world shown here.

This is an episodic MDP with the terminal state

located in the top left and bottom right corners.

The terminal state is shown in two places,

but formally it is the same state.

The reward will be minus one for every transition.

Since the problem is episodic,

let's consider the undiscounted case of gamma equals 1.

There are four possible actions in each state: up,

down, left, and right.

Each action is deterministic.

If the action would move the agent off the grid,

it instead leaves the agent in the same state.
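
Just to make the setup concrete, here is one way the grid world's dynamics could be written; the state numbering (0 for the shared terminal state, 1 through 14 for the rest) and the names ACTIONS, cell_to_state, and transition are my own choices, not fixed by the lecture.

```python
# 4x4 grid; cells (0, 0) and (3, 3) are both the single terminal state, id 0
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def cell_to_state(row, col):
    """Map a grid cell to a state id: 0 is terminal, 1 to 14 are nonterminal."""
    if (row, col) in [(0, 0), (3, 3)]:
        return 0
    return row * 4 + col

def transition(state, action):
    """Deterministic dynamics: every transition gives reward -1, and a move
    that would leave the grid leaves the agent in the same cell."""
    row, col = divmod(state, 4)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < 4 and 0 <= new_col < 4):
        new_row, new_col = row, col   # bumped into a wall: stay put
    return -1, cell_to_state(new_row, new_col)
```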

Now, let's evaluate the uniform random policy,

which selects each of the four

actions one-quarter of the time.

The value function represents

the negative of the expected number of steps

until termination from a given state.
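
Reusing the ACTIONS names from the sketch above, the uniform random policy is just equal probability on each of the four actions in every nonterminal state:

```python
# the uniform random policy: each of the four actions with probability 1/4
UNIFORM_POLICY = {s: {a: 0.25 for a in ACTIONS} for s in range(1, 15)}
```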

The order we sweep through the states is not important

since we are using the two

array version of the algorithm.

Let's assume we sweep

the states first from left to right,

and then from top to bottom.

We never update the value of

the terminal state as it is defined to be zero.

We initialize all the values in V to zero.

The initial values stored in V prime are irrelevant since

they'll always be updated before they are used.

We can now begin our first iteration

with the update to state one.

To compute the update,

we have to sum over all actions.

Consider the left action first,

which has probability one-quarter

under the uniform random policy.

The dynamics function, P,

is deterministic here, so only the reward and

value for a single next state S prime contribute to the sum.

The sum includes minus one for the reward,

and zero for the value of the terminal state.

Since we initialized all state values to zero,

and the reward for each transition is

minus one, the computation for

all the other actions will look much the same.

The result is that V prime of

state one is set to minus one.
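
In the state numbering from the sketches above, the state 1 update works out like this; every term reads the old value from V, which is still zero:

```python
gamma = 1.0
V = [0.0] * 15                 # all state values initialized to zero
# from state 1: left -> terminal 0, right -> 2, up -> wall (stay in 1), down -> 5
next_states = [0, 2, 1, 5]
v_prime_1 = sum(0.25 * (-1 + gamma * V[s_next]) for s_next in next_states)
print(v_prime_1)               # -1.0
```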

Next, we move to state two.

We first evaluate the term in

the sum for the left action.

Again the action probability is one-quarter,

and in this case,

the next state is state one.

Although we have updated the value of state one already,

the version of the algorithm we are running will

use the old value stored in

V. So the value

for state one in the update is still zero.

Again, all the other actions will look much the same.

The result is that V prime of state

two is also set to minus one.

In fact, since every state value is initialized to zero,

every state's value will be set to minus one.

After completing this full sweep,

we copy the updated values from V prime

to V. That was only one sweep.

Let's discuss now the full algorithm

for iterative policy evaluation.

Take any policy we want to evaluate,

initialize two arrays V and V prime.

We can initialize these however we like,

but let's set them to zero.

We just saw how one sweep of

iterative policy evaluation works.

Let's look at how we compute multiple sweeps,

and determine how the algorithm stops.

The outer loop continues until the change in

the approximate value function becomes small.

We track the largest change to

any state's value in a given iteration.

Let's call this delta.

The outer loop terminates when this maximum change is

less than some user-specified constant called theta.

As discussed before, once

the approximate value function stops changing,

we have converged to V Pi.

Similarly, once the change in

the approximate value function is very small,

this means we are close to V Pi.
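
Here is a sketch of that full loop, again assuming the transition helper and state numbering from earlier; the function name and defaults are mine, not the course's reference code.

```python
import numpy as np

def iterative_policy_evaluation(policy, states, actions, transition,
                                gamma=1.0, theta=0.001):
    """Two-array iterative policy evaluation with the max-change stopping rule."""
    V = np.zeros(max(states) + 1)        # one entry per state id; terminal stays 0
    while True:
        V_prime = np.zeros_like(V)
        delta = 0.0                       # largest change to any state value this sweep
        for s in states:                  # sweep the nonterminal states
            total = 0.0
            for a in actions:
                reward, s_next = transition(s, a)            # deterministic dynamics
                total += policy[s][a] * (reward + gamma * V[s_next])
            V_prime[s] = total
            delta = max(delta, abs(V_prime[s] - V[s]))
        V = V_prime                       # copy the new values into V
        if delta < theta:                 # stop once the largest change is small
            return V
```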

Let's pick up where we left

off in our grid world example.

We had just completed our first sweep.

Let's use a value of

0.001 for the stopping parameter theta.

The smaller the value we choose,

the more accurate our final value estimate will be.

We've already completed one iteration,

and the maximum change in value was 1.0.

Since this is greater than 0.001,

we carry on to the next iteration.

After the second sweep,

notice how the terminal state starts to

influence the value of the nearest states first.

Let's run one more sweep.

We see that now the influence of

the terminal state has propagated further.

Let's run a few more sweeps to see what happens.
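
Putting the earlier sketches together, running the evaluation on this grid world and watching the values settle could look roughly like this:

```python
states = range(1, 15)            # the 14 nonterminal states
actions = list(ACTIONS)          # 'up', 'down', 'left', 'right'
V = iterative_policy_evaluation(UNIFORM_POLICY, states, actions,
                                transition, gamma=1.0, theta=0.001)
# lay the values back out on the 4x4 grid (both terminal corners show 0.0)
for row in range(4):
    print([round(V[cell_to_state(row, col)], 1) for col in range(4)])
```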