There's one more detail about the policy-based methods that I want you to get acquainted with. Remember when we had the comparison of value-based and policy-based methods, we stated that the policy-based methods have a more direct compatibility with supervised learning. We'll use this property right now to our advantage, to actually train reinforcement learning agents to solve super complicated problems.

The issue is this: imagine you are solving a super complicated video game, or maybe a more practical problem, say machine translation or a chatbot. If you do so by starting from an initial guess of a random policy, your algorithm will probably technically converge, but it will either converge to a poor local optimum, or converge by the time you have your grand-grand-grand-grandsons. And that's not acceptable. The issue here is that you start from random.

On many complicated problems, you'll have a situation where you won't get a good reward even if you're super lucky with your random policy. You still need some kind of initial bias to shift you to the region where the policies are not that bad. You can of course do so by constraining your agent with some hand-crafted heuristics, but instead, let's try to use this supervised plus reinforcement learning combination.

When you're solving a practical problem, chances are you'll have one of these sources of prior knowledge available to you. One option: the problem is solved not only by machines, but also by humans, which is the case for machine translation, in which case you have pre-recorded human experience: the datasets of human translations. Another option: if you are doing something for a web service, you may have a previous iteration of the system, one which doesn't use policy gradient or any reinforcement learning methods, or maybe one that only relies on heuristics, but which has some kind of better-than-random decision making built into it.

What you can do is initialize your reinforcement learning agent with this prior knowledge about the problem. Here's how. One way to take advantage of the extra knowledge is to rely on supervised learning to give you a good initial guess, from which you can then continue with reinforcement learning.

The idea here is that if you have, for example, a neural network that does machine translation, you can pre-train this neural network on the existing data, say on human translations. Then you can follow up by using policy gradient to actually improve the objective you're after.

The key here is that the methods used for both supervised learning and policy gradient require very similar things. In this case, they require a probability of making a particular decision in a particular situation. This can either be the predicted probability from supervised learning, in which case you train it by likelihood maximization or [inaudible], or it can be your policy in the case of reinforcement learning.

If you just substitute the y and x with actions and states for the reinforcement learning iteration, what you get is this: basically, you have the same policy network used first in the supervised learning setting, and then in the reinforcement learning setting.

What you do is, you basically initialize your policy at random, then you take a few, say, epochs in the case of neural networks, to train your policy to maximize the probability of either the human sessions, or whatever the heuristic does, or any kind of initial, maybe imperfect, but better-than-random system that you had before.

Once you have converged to something better than random, you then allow your algorithm to train for a few more iterations in the policy gradient mode, to get an even better reward. This way, even if your initial guess was imperfect and somewhat nonsensical, you can correct it later on by training with the policy gradient.
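To make the pretraining phase concrete, here is a minimal sketch in plain numpy. Everything in it is a made-up toy setup, not anything from the lecture: a tabular softmax policy is trained with plain cross-entropy to imitate a hypothetical "expert" whose decisions play the role of the human sessions or heuristic.

```python
import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def policy(s):
    """Action probabilities pi(a|s) as a softmax over theta[s]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

# Hypothetical "expert" data: in state s, the expert picks action s % n_actions.
expert = [(s, s % n_actions) for s in range(n_states) for _ in range(20)]

# Supervised pretraining: maximize log pi(a|s) on the expert pairs
# (plain cross-entropy; a few epochs of SGD).
lr = 0.5
for epoch in range(100):
    for s, a in expert:
        p = policy(s)
        grad = -p              # d log pi(a|s) / d theta[s] for a softmax...
        grad[a] += 1.0         # ...is the one-hot of a minus the probabilities
        theta[s] += lr * grad

# After pretraining, the policy should prefer the expert's action in each state.
print(policy(0).argmax())
```

After these epochs, the policy is no longer random: it already imitates the expert, which is exactly the better-than-random starting point the policy gradient phase needs.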

By this time you've probably noticed that despite the fact that we are combining algorithms from largely different areas of machine learning, in this case supervised and reinforcement learning, those algorithms look very similar in terms of the formulas behind them. In most cases, we're following a gradient, and in most cases, the gradient is defined as an expectation over states and actions of the derivative of the log policy with respect to the parameters of this policy. In the latter case, you also multiply this by the action value, the Q term in the policy gradient.
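Written out, the two gradients being compared look like this (a reconstruction from the description above, with theta denoting the policy parameters):

```latex
% Supervised (imitation) gradient: expectation over the reference data
\nabla_\theta J_{\text{SL}}
  = \mathbb{E}_{(s,a)\sim\text{data}}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s) \right]

% Policy gradient: expectation over the policy's own sessions,
% weighted by the action value Q
\nabla_\theta J_{\text{PG}}
  = \mathbb{E}_{s,a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q(s,a) \right]
```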

Now, despite those formulas looking very similar, there is actually a huge difference in how you collect those gradients, how you obtain them in a practical environment. I want you to find this difference. This Q value, yes, of course, you can point at this Q function here, but this is not the only difference. Another super important part is that despite those expectations looking kind of similar, they sample from entirely different things. In the first case, you only allow your algorithm to train on the reference sessions. So you don't ever let it actually do something and see what happens. Instead, you train it on sessions generated by human experts, or whatever other source of data you're using. In the second case, the expectation is taken over sessions sampled from your current policy, so the agent has to act itself and observe the outcome.
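Continuing the toy sketch, here is what the sampling difference looks like in code. Unlike the supervised phase, every sample now comes from the policy's own behavior, not from a fixed dataset. The bandit-style environment, the rewards, and the learning rate are all made up for illustration; this is a bare-bones REINFORCE step, not the lecture's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 3
theta = np.zeros(n_actions)              # single-state softmax policy
true_reward = np.array([0.0, 1.0, 0.2])  # hypothetical environment rewards

def policy():
    z = np.exp(theta - theta.max())
    return z / z.sum()

# REINFORCE: the expectation is over the policy's own sessions,
# so the agent must act itself and see what happens.
lr = 0.1
for step in range(2000):
    p = policy()
    a = rng.choice(n_actions, p=p)   # sample an action from the current policy
    r = true_reward[a]               # observe the reward for that action
    grad = -p
    grad[a] += 1.0                   # d log pi(a) / d theta
    theta += lr * r * grad           # scale by the reward (the Q term)

print(policy().argmax())  # should settle on the highest-reward action
```

Swapping `rng.choice` for a fixed list of reference actions would turn this loop back into the supervised case: same formula, entirely different sampling distribution.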

We've just discussed one way you can greatly improve the performance of your policy-based agent: by pre-training it on the pre-existing supervised knowledge of how the problem is usually solved. This is cool stuff, but it doesn't conclude the list of cool stuff you can do with policy-based methods.

For one, there's more than one way you can actually define the policy gradient. You remember we had this formulation with nabla log policy times the reward, or times the Q-value, depending on what kind of process you're trying to optimize. This is one of the simplest ones. But there is also a modification called Trust Region Policy Optimization. We will not be able to cover it in detail in the scope of this lecture because it's so huge, but you can expect it in the reading section, where we will try to cover it as explicitly as possible.

In general, the intuition behind Trust Region Policy Optimization is that you can try to improve the policy not just by following the policy gradient, but by constraining each update to a narrow window in the policy space. So you don't want to make huge leaps in the probabilities of taking actions all at once, which is very easy to do if you're using a neural network. If you regularize it in this way, if you constrain your network from making large leaps in the policy space, you can expect a much more stable learning curve, one which will not have those large drops each time the algorithm has just forgotten something important.
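As a rough illustration of the trust-region idea, here is a simplified sketch with made-up numbers. This is not real TRPO, which solves a constrained optimization with a natural-gradient step; it only shows the intuition of shrinking the step until the new policy stays within a small KL distance of the old one.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

theta = np.zeros(3)
advantages = np.array([0.0, 2.0, -1.0])  # hypothetical advantage estimates

old_pi = softmax(theta)
# Gradient of E_{a~pi}[A(a)] w.r.t. theta for a softmax policy:
grad = old_pi * (advantages - old_pi @ advantages)

# Take the gradient step only if it keeps the new policy within
# a small KL ball around the old one; otherwise shrink the step.
max_kl = 0.01
lr = 1.0
while True:
    new_theta = theta + lr * grad
    if kl(old_pi, softmax(new_theta)) <= max_kl:
        theta = new_theta
        break
    lr *= 0.5  # backtracking line search on the step size

print(kl(old_pi, softmax(theta)) <= max_kl)  # True: stayed in the trust region
```

The backtracking loop is what prevents the "huge leap" in probability space: the advantageous action still gains probability, but only by a bounded amount per update.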

Another cool thing is the deterministic policy gradient. Again, you can expect a detailed description in the reading section, but for the purpose of this conclusion, the deterministic policy gradient allows you to do off-policy training of policy-based methods.
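A minimal sketch of the deterministic policy gradient idea, with a made-up quadratic Q function and a one-parameter linear policy (none of this is from the lecture). Since the policy outputs an action directly rather than probabilities, the gradient flows through Q with respect to the action via the chain rule, and the states can come from any behavior distribution, which is what makes off-policy training possible.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: Q(s, a) = -(a - 2s)^2 is maximized by a = 2s,
# and the deterministic policy is a = mu(s) = theta * s.
def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)   # gradient of Q w.r.t. the action

theta = 0.0
lr = 0.05
# Off-policy flavor: states are drawn from an arbitrary distribution
# (here just uniform noise), since we never need pi(a|s) probabilities.
for _ in range(500):
    s = rng.uniform(0.5, 1.5)
    a = theta * s                 # deterministic action
    # Chain rule: dQ/dtheta = (dQ/da) * (da/dtheta) = dq_da(s, a) * s
    theta += lr * dq_da(s, a) * s

print(round(theta, 2))  # converges toward 2.0, the optimal policy slope
```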

Finally, there's a lot of other stuff we can't even hope to mention here because there's just so much out there. We have bonus materials in the reading section this time, so we highly encourage you to go there and see if something clicks, if something is interesting to you, because all those areas are highly researched right now, and maybe you'll be able to contribute to this research with your own ideas, your own papers. And this is how reinforcement learning gets developed.