0:02

There's one more detail about policy-based methods that I want you to get acquainted with. Remember when we had the comparison of value-based and policy-based methods, we stated that policy-based methods are more directly compatible with supervised learning. We'll now use this property to our advantage to actually train reinforcement learning agents to solve super complicated problems.

The issue is this: imagine you are solving a super complicated video game, or maybe a more practical problem. Say you're trying to solve machine translation or build a chatbot. If you do so by starting from an initial guess of a random policy, your algorithm will probably technically converge, but it will either converge to a poor local optimum, or converge by the time you have your great-great-great-great-grandsons. And that's not acceptable. The issue here is that you start from random.

On many complicated problems, you'll have a situation where you won't get a good reward even if you're super lucky with your random policy. You still need some kind of initial bias to shift into the region where the policies are not that bad. You can of course do so by constraining your agent with hand-written heuristics, but instead, let's try this combination of supervised plus reinforcement learning.

When you're solving a practical problem, chances are you'll have one of these options available to you. Either the problem is solved not only by machines but also by humans, which is the case, say, for machine translation, which means you have pre-recorded human experience: the datasets for machine translation. Another option: if you are doing something for a web service, you may have some previous iteration of the system, one which doesn't use policy gradient or any reinforcement learning methods, or maybe relies only on heuristics, but which has some kind of better-than-random decision making built into it.

What you can do is initialize your reinforcement learning agent with this prior knowledge about the problem. Here's how we can do that. One way to take advantage of the fact that you have some extra knowledge is to rely on supervised learning to give you a good initial guess, from which you can then go on with reinforcement learning. The idea here is that if you have, for example, a neural network that does machine translation, you can pre-train this neural network on the existing data, say on human translations. Then you can follow up by using policy gradient to actually improve the objective you're after.

The thing here is that the methods used for both supervised learning and policy gradient require very similar things. In this case, they require a probability of making a particular decision in a particular situation. This can be either the static probability from supervised learning, in this case log-likelihood optimization or [inaudible], or it can be your policy, in the case of reinforcement learning. If you just substitute the y and x for the reinforcement learning notation, what you get is this. Basically, you're using the same policy in the supervised learning and in the reinforcement learning setting.

What you do is, you basically initialize your policy at random, then you take a few, say, epochs in the case of neural networks, to train your policy to maximize the probability of either the human sessions, or whatever the heuristic does, or whatever initial, maybe imperfect, but better-than-random system you had before. Once you have converged to something better than random, you then allow your algorithm to train for a few more iterations in the policy gradient mode to get even better reward.
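A minimal sketch of this two-phase recipe, using a tabular softmax policy on a made-up toy problem (the "expert" demonstrations and the reward here are invented purely for illustration; a real system would use a neural network and real demonstration data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))      # logits of a tabular softmax policy

def policy(s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

expert = {0: 1, 1: 2, 2: 0, 3: 1}            # made-up "human" demonstrations

def reward(s, a):                            # toy reward: the expert is good, not optimal
    if a == (expert[s] + 1) % n_actions:
        return 2.0
    return 1.0 if a == expert[s] else 0.0

def expected_reward():
    return np.mean([policy(s) @ [reward(s, a) for a in range(n_actions)]
                    for s in range(n_states)])

# Phase 1: supervised pre-training, ascend log pi(expert_action | state).
# For a softmax, grad of log pi(a|s) w.r.t. the logits is one_hot(a) - pi(.|s).
for _ in range(50):
    for s, a in expert.items():
        grad = -policy(s)
        grad[a] += 1.0
        theta[s] += 0.1 * grad

pretrained_greedy = {s: int(np.argmax(theta[s])) for s in range(n_states)}
score_after_pretraining = expected_reward()

# Phase 2: REINFORCE fine-tuning -- the *same* gradient, but now actions are
# sampled from the policy itself and the gradient is weighted by the reward.
for _ in range(3000):
    s = int(rng.integers(n_states))
    a = int(rng.choice(n_actions, p=policy(s)))
    grad = -policy(s)
    grad[a] += 1.0
    theta[s] += 0.1 * reward(s, a) * grad

score_after_finetuning = expected_reward()
```

After phase 1 the greedy policy imitates the expert; after phase 2 the same policy has been pushed by the reward signal beyond what pure imitation could reach.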

In this case, even if your initial guess was imperfect and somewhat biased, you can correct it later on by training with the policy gradient.

By this time you've probably noticed that despite the fact that we are combining algorithms from largely different areas of machine learning, in this case supervised and reinforcement learning, those algorithms look very similar in terms of the formulas behind them. In most cases, we're following a gradient, and in most cases the gradient is defined as an expectation over states and actions of the derivative of the log-policy with respect to the parameters of this policy. In the latter case, you also multiply by the action value, the Q term here in the policy gradient.
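Written out side by side (in notation that may differ slightly from the slides), the two gradients are:

```latex
% Supervised pre-training: pairs (x, y) are drawn from the reference data
\nabla_\theta J_{\text{SL}}
  = \mathbb{E}_{x,\, y \sim \text{data}}
    \left[ \nabla_\theta \log \pi_\theta(y \mid x) \right]

% Policy gradient: pairs (s, a) are sampled from the policy's own behavior
\nabla_\theta J_{\text{RL}}
  = \mathbb{E}_{s,\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q(s, a) \right]
```

Same log-policy derivative in both; the differences are the extra Q weight and, crucially, what the expectation is taken over.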

Now, despite those formulas looking very similar, there is actually a huge difference in how you collect those gradients, how you obtain them in a practical environment. I want you to find this difference. This Q value, yes, of course you can point at this Q function here, but that is not the only difference. Another super important part is that despite those expectations looking kind of similar, they sample from entirely different things. In the first case, you only allow your algorithm to train on the reference sessions. So you don't ever let it actually do something and see what happens. Instead, you train it on sessions generated by human experts, or whatever other source of data you're using.

We've just discussed one way you can greatly improve the performance of your policy-based agent: by pre-training it on the pre-existing supervised knowledge of how you usually solve the problem. This is cool stuff, but it doesn't conclude the list of cool stuff you can do with policy-based methods.

For one, there's more than one way you can actually define the policy gradient. You remember we had this formulation with nabla log policy times the reward, or times the Q-value, depending on what kind of process you're trying to optimize. This is one of the simplest ones. But there is also a modification called Trust Region Policy Optimization. We will not be able to cover it in detail in the scope of this lecture because it's so huge, but you can expect it in the reading section, and we will try to cover it as explicitly as possible.

But in general, the intuition behind Trust Region Policy Optimization is that you can try to improve the policy not just by following the policy gradient, but by constraining the improvement to a narrow window in the policy space. So you don't want to make huge leaps in the action probabilities all at once, which is very easy to do if you're using a neural network. If you regularize it in this way, if you constrain your network from making large leaps in the policy space, you can expect a much more stable learning curve, one that doesn't have those large drops each time the algorithm has just forgotten something important.
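To make that intuition concrete, here is a tiny sketch of the trust-region idea. To be clear, this is not actual TRPO, which takes a natural-gradient step solved with conjugate gradients; it only illustrates "take the proposed step, but shrink it until the new policy stays within a KL budget of the old one", for a single-state softmax policy:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):                                # KL divergence between two distributions
    return float(np.sum(p * np.log(p / q)))

def trust_region_step(theta, grad, delta=0.01, lr=1.0, max_backtracks=20):
    """Backtracking line search: halve the step until KL(old, new) <= delta."""
    old = softmax(theta)
    step = lr * grad
    for _ in range(max_backtracks):
        new = softmax(theta + step)
        if kl(old, new) <= delta:
            return theta + step
        step *= 0.5                          # shrink: stay inside the trust region
    return theta                             # no acceptable step found; keep old policy

theta = np.zeros(3)
grad = np.array([5.0, 0.0, -5.0])            # a large proposed policy-gradient step
theta_new = trust_region_step(theta, grad)
```

Without the check, the raw step would move almost all probability mass to the first action in a single update; with it, the policy still moves in the proposed direction, but only as far as the KL budget allows.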

Another cool thing is the deterministic policy gradient. Again, you can expect a detailed description in the reading section, but for the purpose of this conclusion, the deterministic policy gradient allows off-policy training of policy-based methods.
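As a taste of what that means, here is a toy sketch of the chain rule at the core of the deterministic policy gradient (this is not the full DDPG-style algorithm; the linear actor and the closed-form critic are invented for illustration). With a deterministic actor a = mu_theta(s), the actor follows dQ/da times dmu/dtheta, with no log-probabilities and no requirement that the states come from the current policy:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                                  # parameter of a linear deterministic actor

def mu(s):                                   # deterministic policy: a = theta * s
    return theta * s

def dq_da(s, a):                             # gradient of a known toy critic:
    return -2.0 * (a - 2.0 * s)              # Q(s, a) = -(a - 2s)^2, optimal at a = 2s

for _ in range(300):
    s = rng.uniform(0.5, 1.5)                # states can come from any behavior policy
    a = mu(s)
    # deterministic policy gradient: dQ/da * dmu/dtheta  (here dmu/dtheta = s)
    theta += 0.1 * dq_da(s, a) * s

# theta converges toward 2.0, the optimal gain for this toy critic
```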

Finally, there's a lot of other stuff we can't even hope to mention in this scope, because there's just so much out there. We have bonus materials in the reading section this time, so we highly encourage you to go there and see if something clicks, if something is interesting to you, because all those areas are highly researched right now, and maybe you'll be able to contribute to this research with your own ideas, your own papers. And this is how reinforcement learning gets developed.
