Okay. We have defined in the previous video the problem of

dynamic portfolio optimization which we formulated as

a Markov decision process with a quadratic one-step reward function.

And as we said many times in this specialization,

as long as we have an MDP problem,

we can solve it using the model, as in the

dynamic programming approach, or we can use

reinforcement learning and solve the problem using samples from data.

In both these approaches,

we looked for an optimal policy which we defined as

a function that maps the space of states to the space of actions.

So such a function gives you one fixed number for each possible state.

And this number would be

the action prescribed by this policy.

Now, in reinforcement learning,

such policies are called deterministic policies.

We can view a deterministic policy as

a probability distribution given by a Dirac delta function, as shown in equation 24.

Here, a t star is

a deterministic optimal action that is

obtained by solving the portfolio optimization problem.

Clearly, saying that we have a deterministic policy is equivalent to

saying that we have a stochastic policy whose distribution is a delta function.
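
The equation itself is not part of this transcript. As a sketch of what the delta-function form could look like, assuming the notation a_t for the action, x_t for the state, and a_t^\star(x_t) for the optimal deterministic action:

```latex
\pi(a_t \mid x_t) = \delta\big(a_t - a_t^{\star}(x_t)\big)
```

All of the probability mass sits on the single action a_t^\star(x_t), so any other action has probability zero.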

But we could consider more interesting

probability distributions than delta functions to describe stochastic policies.

The only question is why we should do this.

And the answer to this question is that truly deterministic policies hardly exist.

And therefore, they are hardly relevant for finance.

Let me explain this statement.

First, assume that we have some deterministic policy pi,

parametrized by some parameters theta.

Now, because these parameters are found from

a finite sample of data, they are themselves random.

Therefore, any resulting deterministic policy would be random de facto.

A good example is given by the same Markowitz portfolio model that we mentioned above.

Markowitz optimal portfolio allocation depends on expected stock returns.

Because they are estimated from data,

these estimates are themselves random numbers.

Therefore, the Markowitz portfolio allocation is in fact random even

though the model itself does not explicitly state it.
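
To make this point concrete, here is a minimal sketch with simulated returns. It assumes the unconstrained mean-variance form of the allocation, w = (1/lambda) Sigma^{-1} mu. The function name, the risk aversion value, and all the numbers are hypothetical and chosen only for illustration. The weights are re-estimated on bootstrap resamples of the same data, and the spread of the results shows how random the "deterministic" allocation is.

```python
import numpy as np

def mean_variance_weights(returns, risk_aversion=5.0):
    """Unconstrained mean-variance weights w = (1/lambda) * Sigma^{-1} mu,
    with mu and Sigma estimated from the sample (an assumed, simplified form)."""
    mu = returns.mean(axis=0)
    cov = np.cov(returns, rowvar=False)
    return np.linalg.solve(risk_aversion * cov, mu)

rng = np.random.default_rng(0)

# Hypothetical "true" monthly parameters for three stocks.
true_mu = np.array([0.010, 0.008, 0.012])
vol, rho = 0.05, 0.3
true_cov = vol**2 * ((1 - rho) * np.eye(3) + rho * np.ones((3, 3)))

# One finite sample of data: 120 months of simulated returns.
returns = rng.multivariate_normal(true_mu, true_cov, size=120)
n = returns.shape[0]

# Re-estimate the allocation on 200 bootstrap resamples of the same data.
weights_by_sample = np.array([
    mean_variance_weights(returns[rng.integers(0, n, size=n)])
    for _ in range(200)
])

# Spread of each weight across resamples: a rough measure of estimation noise.
weight_std = weights_by_sample.std(axis=0)
print("mean weights:", weights_by_sample.mean(axis=0))
print("std of weights:", weight_std)
```

If the allocation were truly deterministic, the standard deviations would be zero. With a finite sample they are not.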

So because the world is random,

it would be a very good idea also to have some measure of

uncertainty or confidence in the model's recommendations regarding asset allocations.

If a model gives you just one number,

you have no idea how confident the model is regarding this number.

The other reason to work with stochastic policies is that most of the time,

real data is not exactly optimal, and sometimes may even be quite sub-optimal.

There might be many reasons why demonstrated data are sub-optimal.

For example, model misspecification,

model timing lags, human errors and so on.

If we insisted on deterministic policies,

such data would have probability zero under such models.

But because such nearly optimal or sub-optimal data are everywhere,

we had better get tools to work with

such imperfect data rather than insist that the world should match our imperfect models.

So what are stochastic policies in reinforcement learning?

A stochastic policy is any valid probability distribution pi

for the action a t, which is conditional on the current state x t and, in general,

can be parameterized by some parameter vector theta as shown in this formula.

In addition, the policy pi can depend on external predictors zt which we omit here for [inaudible].
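
The formula referred to here is not part of this transcript. A sketch of what it plausibly says, with the symbols a_t, x_t and theta as assumed notation:

```latex
\pi = \pi_\theta(a_t \mid x_t), \qquad
\pi_\theta(a_t \mid x_t) \ge 0, \qquad
\int \pi_\theta(a_t \mid x_t)\, da_t = 1
```

The last two conditions are what "valid probability distribution" means here: the policy is non-negative and integrates to one over actions for every state.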

So what are new insights that we can get

using stochastic policies instead of deterministic policies?

New insights are possible with stochastic policies precisely because they are stochastic.

Stochastic policy pi means that we have a probabilistic model for actions.

And this can be used with past data to build a probabilistic model of the data.

But again, because this is a probabilistic model,

we can also generate future data using simulations with this model.

In other words, probabilistic models are generative models.

Now, let's see how we can change our problem of

portfolio optimization if we use a stochastic policy instead of a deterministic policy.

A new optimization problem formulation is shown in these equations.

Let's go over them one by one and see

what changes are made in comparison with the previous formulation.

First, we again have an expectation of the sum of discounted future rewards, as before.

But this time, the expectation sign is different.

The expectation this time is taken with respect to the whole path probability under pi,

which is given by the third line in this equation.

We have a product of terms for each time step here.

For the first time step,

we have the action probability pi of a 0, and then for each next step,

the joint probability is decomposed into a product of an action probability

times a transition probability to the

next state x t+1, conditional on the previous state and the action.

And because pi of a t should be a valid probability distribution,

we should add the normalization constraint for pi of a t to this optimization.
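
The equations on the slide are not part of this transcript. The following is a reconstruction from the verbal description only, with assumed symbols: gamma for the discount factor, R for the one-step reward, T for the horizon, q_pi for the path probability, and P for the transition probability. The exact indexing on the slide may differ.

```latex
\begin{aligned}
&\max_{\pi}\; \mathbb{E}_{q_\pi}\!\left[\, \sum_{t'=t}^{T-1} \gamma^{\,t'-t}\, R(x_{t'}, a_{t'}) \right] \\
&\text{subject to}\quad \int \pi(a_t \mid x_t)\, da_t = 1 \\
&q_\pi(\bar{x}, \bar{a} \mid x_0) \;=\; \prod_{t=0}^{T-1} \pi(a_t \mid x_t)\, P(x_{t+1} \mid x_t, a_t)
\end{aligned}
```

The t = 0 factor of the product begins with the action probability pi(a_0 | x_0), and every later factor pairs an action probability with a transition probability to the next state.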

So what we did with this probabilistic formulation is something pretty drastic.

We replaced the optimization in a function space

by optimization in the space of probability distributions.

Now, of course, the main question is how to do such optimization in probability spaces?

We will discuss this shortly but before doing that,

we have to introduce another important concept

here which is the notion of a reference policy.

So what is a reference policy?

We can think of a reference policy as a prior policy that we

would be thinking of using before we see data.

A common approach in Bayesian statistics and Bayesian methodologies is to start

with prior distribution for data and then update it using the data.

Here, we will do the same and later,

we will build a method that modifies the prior, or reference, policy such

that a new updated posterior policy will be consistent with the data.

Now to keep things tractable,

we will use a very simple Gaussian reference policy which we will call pi zero.

This policy has the mean A hat,

which is a linear function of the state xt,

defined by two coefficients A0 and A1, and a covariance matrix sigma a.

In principle, the coefficients A0 and A1 could depend on the signals z t.

But this is not necessary because such dependence will

appear as a result of Bayesian updates as we will see later.

Therefore, we can keep things simple and take

just constant coefficients A0 and A1 for the prior.

For the covariance matrix sigma a,

we can also take a simple matrix.

For example, use the same variance and the same correlation for all stocks.

And in this case,

the whole matrix would be parameterized by just two numbers.
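
Here is a minimal sketch of such a Gaussian reference policy. It assumes the mean has the form A hat = A0 + A1 x t, with A0 a vector and A1 a matrix, and that the covariance uses one volatility and one correlation for all stocks. The function names and all numerical values are hypothetical, chosen only for illustration.

```python
import numpy as np

def equicorrelation_cov(n_assets, vol, rho):
    """Covariance with the same variance vol**2 and the same correlation rho
    for all stocks, so the whole matrix is set by just two numbers."""
    return vol**2 * ((1 - rho) * np.eye(n_assets)
                     + rho * np.ones((n_assets, n_assets)))

def reference_policy_mean(x, A0, A1):
    """Mean action of the prior policy, assumed linear in the state x."""
    return A0 + A1 @ x

def sample_reference_actions(x, A0, A1, cov, n_samples, rng):
    """Draw actions from the Gaussian reference policy pi_0(a | x)."""
    mean = reference_policy_mean(x, A0, A1)
    return rng.multivariate_normal(mean, cov, size=n_samples)

rng = np.random.default_rng(0)
n_assets = 4

# Hypothetical constant prior coefficients and current state.
A0 = np.zeros(n_assets)
A1 = 0.1 * np.eye(n_assets)
x = np.array([1.0, 2.0, 0.5, 1.5])

cov = equicorrelation_cov(n_assets, vol=0.02, rho=0.3)
actions = sample_reference_actions(x, A0, A1, cov, n_samples=1000, rng=rng)

print("prior mean action:", reference_policy_mean(x, A0, A1))
print("sample mean action:", actions.mean(axis=0))
```

Because the policy is a probability distribution, the same object can be used both to score observed actions and to simulate new ones.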

Okay. Now, we have all the tools that we need and in the next video,

we will see how we can work with stochastic policy and

how the reference or prior policy appears in this procedure.