0:02

Now, let's make a recap. We need to learn an optimal policy.

To do so, we need to define our policy,

initialize it, maybe at random,

and then we need some kind of algorithm that improves this policy.

Let me explain the policy part.

We have to define the kind of behavior, how the algorithm takes actions.

There are two general approaches to doing this.

The first, the simplest one, is to learn an algorithm that takes your state and predicts one action.

So basically, it learns nothing but the index of the action, or maybe the value of the action if it's continuous.

The second kind is the idea that you can learn a probability distribution:

you learn to predict the probability of taking each possible action.
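Here is a minimal sketch of the two parameterizations, assuming a hypothetical linear scorer over state features (the weights and shapes are illustrative, not part of the lecture):

```python
import numpy as np

def deterministic_policy(state, weights):
    """First approach: the state maps to exactly one action index."""
    scores = state @ weights                   # one score per action
    return int(np.argmax(scores))              # always the same action for the same state

def stochastic_policy(state, weights, rng):
    """Second approach: the state maps to a probability distribution over actions."""
    scores = state @ weights
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax: non-negative, sums to one
    return int(rng.choice(len(probs), p=probs))
```

The deterministic policy always returns the same action for a given state, while the stochastic one samples a different action each call according to the learned probabilities.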

Those two approaches differ in which algorithms they work with;

the method we're presenting, for example, can only work with a stochastic policy.

Now, let's try to compare those two approaches.

Say you have two algorithms: the first one learns a deterministic policy, and the second learns a stochastic policy, some distribution over actions,

and you try to compare them across different games.

Is there maybe some case in which the stochastic policy will learn an optimal policy and the deterministic policy will fail? Any ideas?

For example, say you have some kind of game where you have an adversary,

an opponent who tries to make you lose.

Say you play rock-paper-scissors.

The idea here is that the optimal policy in rock-paper-scissors, if you're playing against a reasonable opponent,

is to pick all possible actions at random.

If you only have one action, if you're using a deterministic policy,

the opponent is going to adapt and always show the item that beats your current policy.

This way, you won't be able to win anything.

A stochastic policy, however, will be able to converge to a probability of one over three for each action, one-third,

and this way, it will fare much better than a deterministic policy.
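The rock-paper-scissors argument can be checked with a small simulation. This is a sketch under one assumption: the opponent has fully adapted and always plays the counter to the deterministic player's fixed move:

```python
import numpy as np

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # value beats key

def payoff(mine, theirs):
    """+1 if my move wins, -1 if it loses, 0 on a draw."""
    if mine == theirs:
        return 0
    return 1 if BEATS[theirs] == mine else -1

rng = np.random.default_rng(0)
n_rounds = 3000

# Deterministic policy: always play "rock". The adapted opponent counters it
# every time, so we lose every single round.
det_reward = sum(payoff("rock", BEATS["rock"]) for _ in range(n_rounds))

# Uniform stochastic policy: no fixed counter-move exploits a one-third split,
# so the total reward hovers around zero.
sto_reward = sum(payoff(ACTIONS[rng.integers(3)], BEATS["rock"]) for _ in range(n_rounds))
```

Over 3000 rounds, the deterministic player loses every round, while the uniform player's total reward stays near zero, which is the best achievable against an adversary.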

Another feature of a stochastic policy is that it kind of takes care of exploration for you.

Remember, in Q-learning, you had to pick the optimal action, but you also had to flip a coin and, with probability epsilon, pick a random action instead of the optimal one.

You did this to explore the space of possible strategies, the space of actions.

This time, you won't have to do this, because you already have a stochastic policy, which samples actions at random.
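The contrast between the two exploration schemes can be sketched like this (the function names are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Q-learning style exploration: with probability epsilon, ignore the Q-values
    and pick a uniformly random action; otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def stochastic_act(action_probs, rng):
    """With a stochastic policy, exploration is built in: just sample the distribution."""
    return int(rng.choice(len(action_probs), p=action_probs))
```

With epsilon-greedy, exploration is a bolt-on coin flip; with a stochastic policy, any action with non-zero probability gets tried eventually, with no extra mechanism.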

Now, a deterministic policy, of course, requires a separate exploration strategy.

This cannot be seen as a pure boon of stochastic policies, though, because sometimes you do want to explicitly control what kind of exploration you get,

and stochastic policy methods like the [inaudible] method don't allow you to do so explicitly.

Instead, they rely on some kind of penalties and regularization.

There's one thing we have not discussed about stochastic policies.

The idea is that if you have, say, five actions, you're solving Atari and the actions are buttons,

it's kind of simple to define the probability distribution:

you simply memorize the probability of each action and make sure they sum to one.

Now, there's a different case, where you have a continuous value for actions.

Say you are controlling a robot, and your action is what kind of voltage you want to apply to the joint, to the motor there.

In this case, you cannot simply memorize all possible outcomes and their probabilities, because there is a continuous amount of them.

How do you define a probabilistic policy in the case of continuous actions? Any ideas?

In this kind of situation, you could try some kind of parametric distribution.

For a continuous variable, a normal distribution would do.

Or maybe, if your action is bounded rather than continuing forever, you could try some kind of beta distribution or

something similar that relates to your particular problem.
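For the normal-distribution case, a minimal sketch: instead of one probability per action, the policy predicts the parameters of a Gaussian and samples the continuous action from it. The linear parameterization and the voltage interpretation are illustrative assumptions:

```python
import numpy as np

def gaussian_policy(state, w_mu, w_log_sigma, rng):
    """Continuous-action policy: predict the mean and (log) standard deviation
    of a normal distribution over, say, motor voltage, then sample from it."""
    mu = float(state @ w_mu)                    # predicted mean voltage
    sigma = float(np.exp(state @ w_log_sigma))  # exp keeps the std strictly positive
    return rng.normal(mu, sigma)
```

Predicting the log of the standard deviation is a common trick: the network output can be any real number, while the actual sigma stays positive.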

The methods we're going to study right now are so-called policy-based methods.

There are two main families of methods in reinforcement learning:

the value-based methods and the policy-based ones.

The value-based methods rely on first learning some kind of value function,

V or Q or whatever,

and then inferring the policy from that value function.

Remember, if you have all the perfect Q-values,

then you can simply pick the action that maximizes the Q-function in this particular state,

and this would be your optimal action.

However, if you don't know them exactly, if you have some error in the Q-values,

your policy may be sub-optimal.
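A tiny worked example of this fragility, with hypothetical Q-values: a small estimation error flips the ranking of the two best actions, and the greedy policy silently becomes sub-optimal.

```python
import numpy as np

# Hypothetical true Q-values for three actions in one state.
true_q = np.array([1.0, 1.1, 0.5])
best = int(np.argmax(true_q))             # with perfect Q-values, greedy picks action 1

# A small estimation error (0.15 on one action) flips the ranking.
noisy_q = true_q + np.array([0.15, -0.1, 0.0])
greedy = int(np.argmax(noisy_q))          # greedy now picks action 0 instead
```

Note that the error here is much smaller than the Q-values themselves; greedy action selection amplifies even tiny errors whenever two actions have similar values.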

Policy-based methods don't rely on this.

They try to explicitly learn a probabilistic or deterministic policy,

and they adjust it to implicitly maximize the expected reward or some other kind of objective.
