Okay.

Now, we're going to have a little more math than we have had so far.

So, please bear with me as we are going to dive into a bunch of

new formulas for the next few minutes and also in the next video.

And if you find yourself lost somewhere in the middle,

you can always stop and rewind or look into the original sources for more details,

and then maybe watch this video again.

But hopefully, you will not have to do it or

at least you'll not have to do it too many times.

So, let's start.

Let me start by reminding you of the standard Bellman equation for the value function.

We define the optimal value function V star, as shown in equation 27, as

the maximum over all policies pi of the expected sum of discounted future rewards.

As we discussed in our previous course,

when we work with the Bellman equation for the value function,

we normally work with two equations.

One of them is the Bellman optimality equation for V star, shown in equation 28.

The second term on the right-hand side of

this equation is an expectation of the future value function,

conditional on the current state and action.

The optimal value function V star at time t is the maximum of

the right hand side of this equation with respect to all actions a_t.

Now, a policy pi is not explicitly present in this equation.

If you already know an optimal value function V star,

then computing an optimal policy requires solving a separate optimization problem, shown in equation 29.

We called this step a policy improvement step in our previous course.
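
To make the two steps concrete, here is a minimal sketch in Python. The two-state, two-action MDP (the reward table `R`, transition table `P`, and discount `GAMMA`) is entirely hypothetical and is not taken from the lecture. The sketch first iterates the Bellman optimality equation to approximate V star, and then extracts a greedy policy from it as a separate step.

```python
GAMMA = 0.9                                   # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                  # hypothetical rewards R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]],                # hypothetical transitions P[s][a][s_next]
     [[0.5, 0.5], [0.3, 0.7]]]
STATES = range(2)
ACTIONS = range(2)


def lookahead(V, s, a):
    """One-step lookahead: R(s, a) + gamma * E[V(s_next) | s, a]."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in STATES)


def value_iteration(n_iter=1000):
    """Repeatedly apply V(s) <- max_a lookahead(V, s, a)."""
    V = [0.0 for _ in STATES]
    for _ in range(n_iter):
        V = [max(lookahead(V, s, a) for a in ACTIONS) for s in STATES]
    return V


def greedy_policy(V):
    """Separate step: pick the action that maximizes the lookahead in each state."""
    return [max(ACTIONS, key=lambda a: lookahead(V, s, a)) for s in STATES]


V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

Note that the policy never appears inside `value_iteration`; it is recovered only afterwards by `greedy_policy`.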

Now, we reformulate the Bellman optimality equation

such that the policy pi explicitly appears there.

A trick that does it is shown here;

it's sometimes called the Fenchel representation.

In this formulation, everything is the same as before except that now,

we maximize with respect to a policy pi rather than with respect to actions a_t.

Policy pi can be any policy from a set of valid distributions that we call P here.

Now, why is this formulation equivalent to the previous formulation?

This is because of a simple identity shown at the bottom of this slide.

If we have a set of numbers x_1 to x_n,

then their maximum is the same as the maximum, over a set of non-negative weights that sum to one, which

we again call pi here, of the weighted sum of those x_i.

And this is simply by construction because the optimal set of

weights will give a weight of 1 to the maximum value of x_i,

and will give 0 weights to the rest of x values.

So, this trick is very simple but very useful because now,

the policy pi enters the problem explicitly rather than implicitly.
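
The identity can be checked numerically with a short sketch. The numbers in `xs` and the helper names are made up for illustration. One-hot weights on the largest element attain the maximum, and any other weight vector gives a weighted average that cannot exceed it.

```python
import random


def weighted_value(weights, xs):
    """Return sum_i weights[i] * xs[i]."""
    return sum(w * x for w, x in zip(weights, xs))


def random_weights(n, rng):
    """Draw non-negative weights that sum to one (a point on the simplex)."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


xs = [0.3, -1.2, 2.5, 0.9]            # hypothetical numbers x_1..x_n
best = max(xs)

# Weight 1 on the maximum element, weight 0 on everything else.
one_hot = [1.0 if i == xs.index(best) else 0.0 for i in range(len(xs))]
attained = weighted_value(one_hot, xs)

# Randomly sampled weight vectors never beat the plain maximum.
rng = random.Random(0)
sampled = [weighted_value(random_weights(len(xs), rng), xs) for _ in range(1000)]
largest_sampled = max(sampled)
```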

Now, we make a very important next step.

We introduce the notion of the information cost of updating

the policy from a reference policy pi zero to a given policy pi.

This cost is given by equation 30 as the logarithm of the ratio of policy pi to policy pi zero.

I refer you to a paper by Tishby and co-workers for a

more in-depth explanation of why this quantity is called an information cost.

But one thing we can immediately notice is that

if we take an expectation of this expression with distribution pi,

we will get what is called

the Kullback-Leibler, or KL, divergence of two distributions pi and pi zero,

which is shown in equation 32.

We mentioned the KL divergence a few times in this specialization and

talked about how it serves as a measure of similarity between two distributions.

The KL divergence is always non-negative and

equals 0 only when pi equals pi zero.
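
As a small illustration, the sketch below computes the per-action information cost and its expectation under pi for one state. The two distributions are hypothetical and assumed strictly positive, so that the logarithm is defined.

```python
import math


def information_cost(pi, pi0):
    """Per-action information cost: log(pi(a) / pi0(a))."""
    return [math.log(p / q) for p, q in zip(pi, pi0)]


def kl_divergence(pi, pi0):
    """KL(pi || pi0), computed as the expectation of the information cost under pi."""
    return sum(p * g for p, g in zip(pi, information_cost(pi, pi0)))


pi0 = [0.25, 0.25, 0.25, 0.25]     # hypothetical reference policy in one state
pi = [0.7, 0.1, 0.1, 0.1]          # hypothetical updated policy in the same state

cost_of_update = kl_divergence(pi, pi0)     # strictly positive, since pi differs from pi0
cost_of_no_update = kl_divergence(pi0, pi0)  # zero, since nothing changed
```

The individual information costs can be negative for actions that pi makes less likely than pi zero, but their expectation under pi is never negative.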

Now, we can take a discounted expected sum of

all such information costs for all time steps.

And this produces the total discounted information cost for a trajectory.

That is shown in equation 33.

Now, we can define the so-called free energy F_t as the value function minus

the information cost of a trajectory multiplied by a factor 1 over beta.

If we substitute the explicit expressions for both of those here,

we get the second form of this equation.

Please note that we can see the whole expression in brackets here as

a reward corrected or regularized by an information cost term.

Parameter beta controls the strength of this regularization.

If we set beta to infinity,

then the regularization term disappears.

In the opposite limit, when beta is very low, the modified

one-step rewards will be completely determined by the second term.

Parameter beta is called an inverse temperature

parameter because it enters formulas below

in a similar way to

how the inverse temperature in physics enters formulas of statistical mechanics.

So, at the end, what is the free energy function?

As you can see from this equation,

the free energy function is just an entropy-regularized value function.

As was explained in the paper by Tishby,

such entropy regularization is very helpful when an environment is noisy.

In this case, the regularization term provides a helping hand

if a player is to find

a good and noise-tolerant optimal policy and optimal value function.

And as in finance,

the environment is typically either noisy or very noisy,

such entropy regularization is very useful for financial problems.

We can now derive a Bellman equation directly for the free energy function.

It's shown here in equation 35,

and it can be derived from the previous formula in the standard way.

This equation can be viewed as

a soft probabilistic relaxation of the original Bellman equation,

where beta serves as a regularization parameter.

If beta is set to infinity here,

it reproduces the original Bellman equation as you can easily check yourself.
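
The beta-to-infinity limit can also be checked numerically. The sketch below assumes the fixed-policy form of the recursion, F(s) = sum over a of pi(a|s) [R(s,a) - (1/beta) log(pi(a|s)/pi0(a|s)) + gamma E[F(s')]], which may differ in notation from equation 35 on the slide. The MDP and both policies are hypothetical.

```python
import math

GAMMA = 0.9                                   # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                  # hypothetical rewards R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]],                # hypothetical transitions P[s][a][s_next]
     [[0.5, 0.5], [0.3, 0.7]]]
PI0 = [[0.5, 0.5], [0.5, 0.5]]                # hypothetical reference policy pi0(a|s)
PI = [[0.9, 0.1], [0.2, 0.8]]                 # hypothetical policy pi(a|s) being evaluated
STATES = range(2)
ACTIONS = range(2)


def evaluate_free_energy(beta, n_iter=2000):
    """Iterate the free-energy recursion for the fixed policy PI until it settles."""
    F = [0.0 for _ in STATES]
    for _ in range(n_iter):
        F = [
            sum(
                PI[s][a] * (
                    R[s][a]
                    - math.log(PI[s][a] / PI0[s][a]) / beta   # information-cost penalty
                    + GAMMA * sum(P[s][a][s2] * F[s2] for s2 in STATES)
                )
                for a in ACTIONS
            )
            for s in STATES
        ]
    return F


# With beta = infinity the penalty is exactly zero: ordinary policy evaluation.
V = evaluate_free_energy(math.inf)
F_strong = evaluate_free_energy(0.5)     # strong regularization
F_weak = evaluate_free_energy(1e6)       # weak regularization, close to V
```

In this example the free energy lies below the value function in every state, and the gap shrinks as beta grows.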

Now, similarly to how we use KL divergence to regularize the value function,

we can also regularize the action-value function, the Q-function.

We will call this function a G-function following Tishby and co-workers.

A Bellman equation for the G-function is the same as for the Q-function except we use

the F-function instead of the value function on

the right-hand side, as shown in equation 36.

We can now regroup terms in this relation and get the second form shown here.

So, the G-function is an expectation of the same modified rewards as in

the F-function but this time conditioned

on the current state and action like the Q-function.

In other words, the relation between the G-function and the F-function is

the same as the relation between the Q-function and the V-function.

We can also compare the resulting expression with the formulas for the F-function,

and this gives us an explicit relation between the G-function and the F-function,

which is shown in equation 37.
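
The two relations can be sketched together in code. This assumes the forms G(s,a) = R(s,a) + gamma E[F(s') | s,a] and F(s) = sum over a of pi(a|s) [G(s,a) - (1/beta) log(pi(a|s)/pi0(a|s))], which may differ in notation from equations 36 and 37 on the slides. The MDP, the policies, and the value of `BETA` are hypothetical.

```python
import math

GAMMA = 0.9                                   # assumed discount factor
BETA = 2.0                                    # assumed inverse temperature
R = [[1.0, 0.0], [0.0, 2.0]]                  # hypothetical rewards R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]],                # hypothetical transitions P[s][a][s_next]
     [[0.5, 0.5], [0.3, 0.7]]]
PI0 = [[0.5, 0.5], [0.5, 0.5]]                # hypothetical reference policy pi0(a|s)
PI = [[0.9, 0.1], [0.2, 0.8]]                 # hypothetical policy pi(a|s)
STATES = range(2)
ACTIONS = range(2)


def g_from_f(F):
    """G(s, a) = R(s, a) + gamma * E[F(s_next) | s, a]."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][s2] * F[s2] for s2 in STATES) for a in ACTIONS]
        for s in STATES
    ]


def f_from_g(G):
    """F(s) = sum_a pi(a|s) * (G(s, a) - (1/beta) * log(pi(a|s) / pi0(a|s)))."""
    return [
        sum(
            PI[s][a] * (G[s][a] - math.log(PI[s][a] / PI0[s][a]) / BETA)
            for a in ACTIONS
        )
        for s in STATES
    ]


# Alternate the two maps until F stops changing.
F = [0.0 for _ in STATES]
for _ in range(2000):
    F = f_from_g(g_from_f(F))

G = g_from_f(F)          # G-function consistent with the converged F
F_check = f_from_g(G)    # recovering F from G should reproduce F
```

This mirrors the usual pair of relations between the Q-function and the V-function, with the information-cost penalty added when averaging over actions.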