Welcome back to our reinforcement learning course.

Today, we'll talk about;

policy evaluation and policy improvement.

These are two main concepts which are toolkit for finding and optimal policy,

in a model-based setup.

Or just to remind you what is what is model-setup?

That is when one know what is well of the world,

that is know all probabilities of rewards

and next states given current state and current action.

And we also are going to talk about finding

an optimal policy following value based approach.

So what is value based approach?

Mainly it consists of two phases.

And the first phase one need to build or estimate a value, value function.

And second phase is to extract a policy from the value.

This second step could be performed only on demand.

Only when you are staying inside

the state S and want to know which action to perform now.

The first technique which is very crucial for us is policy evaluation.

It's a famous saying that if you can't measure something you can't improve it.

And we will start our journey of finding an optimal policy from this policy evaluation.

From measuring of how good is a given policy in each of the state.

That is how many reward or

how great and expected return we can

get starting from state S and following N policy pie.

The main helper for this task is Bellman expectation equation.

Which you should be already familiar with.

And this Bellman expectation equation is basically a system of linear equations,

where number of unknowns is equal to number of equations and number of states.

So, what are unknowns here?

Unknowns here are a value function of each state.

So, for each state we associate

a number which is a value function of this state under a particular policy pie.

We are going to find that values for each state.

How to solve this system of linear equations,

in fact there are many methods for solving systems of linear equations.

But, many of them are designed to obtain an optimal precise solution to the system.

And this approaches could take us too much time to

compute even for million number of states,

and for problems with a million number of states.

However, we most of the time don't want to solve this system

precisely and it is sufficient for us to obtain an approximate solution.

Why is this so? We will talk about a little bit later.

But for now consider another approach an alternative one.

In fact the specificity of

a reinforcement learning task helps one to apply a very simple alternative algorithm.

This algorithm is proven to converge for any market decision process and any policy.

So the algorithm is as follows, at first,

the initialization of value function to zeros is performed.

This initialization can be also replaced with any initialization of your choice,

as long as all the terminal states are assigned the value of zero.

Next, for each state the right hand side of the Bellman expectation equation is computed.

Then the freshly computed right hand side,

is assigned to the current value of the state.

This alternative process of assigning the right hand side to the left hand side,

is performed until state values do not change anymore,

or the change is sufficiently little.

Such an approach to policy evaluation owes us to quickly obtain an approximate solution.

And this is precisely what we want.

Now, we know how to get the value of each state for a given policy pie.

So, once the value of the policy is computed,

the policy itself could and should be improved.

Intuitively speaking policy evaluation gives us

the necessary information about the direction in which we can improve the policy.

This improvement can be done by acting greedily with respect to the value function.

What does it mean to act greedily with respect to function?

In fact what we do is first we recover an action value function Q from value function

V. And this V can be computed by a policy evaluation algorithm as we already know.

Second, we find an action which maximizes this action value function Q pie,

that is we find what is the best possible action in status.

Thereafter, this action we will follow policy pie.

This improvement procedure is guaranteed to produce a better policy.

To obtain some intuition of why is it so,

think about action value function Q pie.

It tell us what expects to return one because of chief after making action A instead S,

and following policy pie thereafter.

Our new policy pie prime in current state makes

the best possible action according to the action of the function of all policy pie.

That is, it chooses an action according to the highest expected combination

of immediate reward and an old policy pies value function of next state.

But, in this next state,

new policy prime also acts greedily with respect to Q pie.

That is, it makes a better or the same decision as policy pie.

The same is true about a state,

after the next state and so on.

That is why the value of each state of a new policy pie prime

is greater or equal to the value of the same state of an old policy pie.

And thus, the new policy pie prime is considered better or equal than our old policy pie.

Well, this makes some sense,

but what if at some step

the improved new policy pie prime is actually equal to an old policy?

Well, if an improved policy pie prime is precisely equal to the old policy pie,

that is if they recommend the same action for each and every state,

then we conclude that this policy is an optimal one.

In fact, the equality of policy pie prime and

pie means that their functions are also equal.

And because during an improvement stage we assigned two policy pie

prime the action which maximizes the action value function,

we also know what is the value of the new policy pie prime.

It is equal to action value function of state S

and precisely it is the same action we assigned to the new policy.

That is, its value is equal to action value function maximized over all actions.

But, this equation is a Bellman optimality equation itself.

Well, for now we know how to evaluate policy,

how to improve it and why does this make any sense?

Let me now stress the difference of how to recover

an optimal policy if we were given an optimal value function V,

an optimal action value function Q.

Well, if Q star is known,

an optimal policy is rather straightforward to obtain.

We just need to take the action which maximizes its function.

However, things are not so simple with value function.

Pretend for the moment that V star is known,

How could we recover the optimal policy from it?

In fact, one need to recover first an action value function from this V star.

And this can be done with

environment probabilities precisely in

the same way as we did in policy improvement stage.

And by recovering this Q star we had used the problem of recovering the policy from

V star to the previous problem which we already know how to solve.

Know that in a model free setup where we don't have access to

transition probabilities we're unable to recover an optimal policy solely from V star.

And this is why in a model free setup we could only rely on Q star.

Well, by now there is one thing is not completely clear.

When we were talking about the solution to the system of Bellman expectation equations,

I said that most of the time we don't need a precise solution of this system.

It is sufficient for us to obtain a good approximation to this solution.

Why is it so? Let's look at the small greed world problem.

In this greed world problem an agent could

escape the world through either of the two internal states,

one in the top left with reward of minus 10 and one in the bottom right,

with a reward of plus 10.

For each time take an agent additionally receives a negative reward of minus one,

which motivates an agent to escape as quickly as possible.

Left plot corresponds to

current value function estimates which are all initial is to zero.

In the right plot you can see which actions are taken by

the policy which is greedy with respects to current value function estimate.

That is a policy on the right,

is an improvement over the initial random policy.

And this improvement is achieved by acting greedily

with respect to value function estimates plotted on the left.

At the very beginning, all actions are equally likely because

all state value estimates are equal to each other.

However, after the only five iterations of policy evaluation,

the value of states are sufficient equity to yield

the same new policy as a precise solution of the system of Bellman expectation equations.

This precise solution is shown in the last draw of plot.

Please note that values of the precise solution differ

a lot from the intermediate step solution obtain after the fifth iteration.

But, for policy improvement the absolute state values do not

matter as long as the order of such values is the same as in precise solution.

That means that doing all iterations of policy evaluation after

the first one is absolutely useless for a subsequent policy improvement step.

By now, we are familiar with policy evaluation, and policy improvement.

We also know what is the connection of

these procedures with Bellman expectation and optimality equations.

We may also summarize that these two procedures are essential to find an optimal policy.

Well, our next topic to discuss is

indeed how to use this procedures to obtain the optimal policy.