0:02

Now, this leads us to another family of reinforcement learning algorithms,

called actor-critic algorithms.

The idea behind actor-critic is that it combines pretty

much everything you have learned by this moment in the course.

It has a value-based part,

where it tries to approximate the value function,

but it also uses the policy function,

which can be approximated the same way.

The idea is that by combining the value function and the policy function,

you can obtain better learning performance and some of

the properties that, say, the REINFORCE algorithm lacks.

Now let's pull back the veil of mystery a little bit,

and describe those algorithms in more detail.

This time we're going to focus on the advantage actor-critic.

Some of you who have developed a sense of intuition have probably

noticed that it has this advantage term here,

and as your intuition suggests,

this advantage captures the idea that we are going to learn the

difference between your current Q-function and the average performance in this state.

So again, this algorithm learns both your policy and the value function,

and it actually learns the value function to improve the performance of policy training.

To understand how this is possible,

I want you to answer a question for me.

Say you've just sampled a state, action, reward

and next-state tuple from the environment,

and you've also learned the V-function,

so for every state you're able to get the exact amount of expected discounted return.

The idea here is that

you want to use this information to compute the advantage,

so you want to either produce the advantage itself or get

some kind of unbiased estimate of it. How would you do that?

Well, it turns out that,

if you remember the properties of the Q and V functions,

it's kind of easy;

it just requires three lines of math.

You start by writing out the advantage

as the difference between your action-value function,

the Q-function, and the value function.

Now, this value function is just the expectation of the Q-function over the actions.

To make this formula easier to compute,

you have to remember that the Q-function can be rewritten as the reward

plus gamma, the discount, times the value function of the next state.

This of course requires an expectation over all possible next states,

but you can take just one sample from the environment,

just like you did in Q-learning.

Now, the advantage becomes

the difference between the Q-function and the V-function, which,

as this line suggests,

is just the

reward plus gamma times the value of s prime,

minus the value of s.

This is how you can use just the V-function to estimate your advantage,

and presumably learn better.
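This one-sample estimate fits in a few lines of Python. A minimal sketch; the function name and the `done` flag handling are my additions, not from the lecture:

```python
def advantage_estimate(reward, v_s, v_s_next, gamma=0.99, done=False):
    """One-sample advantage estimate: A(s, a) ~ r + gamma * V(s') - V(s).

    Uses Q(s, a) = E[r + gamma * V(s')] with a single sampled transition,
    just like in Q-learning. `done` drops the bootstrap on terminal states.
    """
    target = reward if done else reward + gamma * v_s_next
    return target - v_s

print(advantage_estimate(1.0, v_s=0.5, v_s_next=1.0, gamma=0.9))  # about 1.4
```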

And this allows us to make

a very simple substitution in the formula,

which actually brings a lot of improvements.

We just take the Q-function and replace it with Q minus V,

also known as the advantage function.

So the formula changed ever so slightly,

but this allows us to use the idea that we want to encourage

the difference between how the agent performed now and how it usually performs.

Even in a situation where the agent got a

poor reward, but one that is way higher than what it usually gets in that state,

you would encourage this improvement rather than

discourage it, in comparison to other situations.
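As a tiny numeric illustration (the numbers are made up): a reward that is poor in absolute terms can still carry a positive advantage when it beats what the agent usually gets in that state, so the update encourages the action:

```python
gamma = 0.9
r, v_s, v_s_next = -1.0, -5.0, 0.0  # poor reward, but above this state's average
advantage = r + gamma * v_s_next - v_s
print(advantage)  # 4.0: positive, so this action is encouraged
```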

The only question remaining is how we get this V-function.

We know that if we have this V-function everything is great,

but now we have to somehow estimate this function for your particular environment.

Think Atari for now:

let's say you're playing Breakout and you want to

estimate the V-function. How do you do that?

Yeah, you would approximate it,

or use some other tricks, but those tricks are usually specific to an environment.

So what you do is you train a network that has those two outputs.

First, it has to learn a policy, because otherwise there is no point in doing anything else.

The second part is that it estimates the value function.

Speaking in deep learning language,

the policy head is a layer that has as many units as you have actions,

and it uses a softmax to return a valid probability distribution.

The value head is just a single unit:

a dense layer with one neuron and no nonlinearity,

just like you have with the Q-values in DQN, for example.

You then have to perform two kinds of updates.
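The two-headed network just described can be sketched in plain NumPy. The layer sizes and weight initialization here are arbitrary assumptions for illustration, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n_state, n_hidden, n_actions = 4, 16, 3  # hypothetical sizes

# Shared body, then two heads: a softmax policy head with one unit per
# action, and a value head that is a single dense unit with no nonlinearity.
W_body = rng.normal(scale=0.1, size=(n_state, n_hidden))
W_pi = rng.normal(scale=0.1, size=(n_hidden, n_actions))
W_v = rng.normal(scale=0.1, size=(n_hidden, 1))

def forward(state):
    h = np.tanh(state @ W_body)             # shared hidden layer
    logits = h @ W_pi
    policy = np.exp(logits - logits.max())  # softmax -> valid distribution
    policy /= policy.sum()
    value = (h @ W_v).item()                # V(s): one linear unit
    return policy, value

policy, value = forward(rng.normal(size=n_state))
print(policy.sum())  # sums to 1: a proper distribution over the 3 actions
```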

First, you have to update the policy.

In this case you believe that your V is good enough and you use

your V-function to provide a better estimate for the policy gradient.

You then ascend this policy gradient over your network parameters.

The second important task is that you have to refine your value function.

This is done in a similar way to deep Q-learning,

deep SARSA, or any other approximate value-based algorithm.

You just compute the mean squared error

over all the (s, a, r, s') tuples that you get,

and this way you can approach the expectation of the value function.

And of course, you can make some improvements based on the usual DQN enhancements,

but those are usually slightly harder here and they

don't bring as much benefit as they bring in the DQN case.

What you do is you simply learn those two functions interchangeably:

you compute the gradient of this objective J and ascend it,

then you compute the gradient of this mean squared error and you

descend it over the parameters of your value function, the critic.

Now, another important part is that

you have to refine your V-function as well.

In this case you use an algorithm which is very similar to how you trained DQN before.

You simply take your (s, a, r, s') tuples,

compute the temporal difference error, in this case

a mean squared error, and minimize it

by following backpropagation through the neural network.
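For instance, with a tabular critic this TD-based refinement looks like the sketch below. The five-state setup, learning rate, and function name are my own illustrative choices, not from the lecture:

```python
gamma, lr = 0.99, 0.1
V = [0.0] * 5  # hypothetical tabular critic over 5 states

def critic_update(s, r, s_next, done):
    # TD target, analogous to the DQN target but for V instead of Q.
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += lr * td_error  # gradient step on the squared TD error
    return td_error

# Averaging over many sampled transitions drives V toward the expectation.
for _ in range(200):
    critic_update(0, 1.0, 1, done=True)
print(V[0])  # close to the true value 1.0
```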

The deal here is that if you take a lot of

samples, you'll converge to the mathematical expectation,

and this way you'll get the kind of true V-function.

It's also important to know that in this case you're not

as reliant on the value-based part of your network

as you were in DQN, because even though

your value function is not very accurate at the beginning, you can still subtract it.

Remember, you can subtract anything which does not depend on your action,

and the V-function definitely does not,

by its definition and by the design of your neural network.

In this case,

even a poorly trained value function will bring you

some improvement in how your agent trains.

As I've already mentioned, this family of algorithms is called actor-critic algorithms.

In this case you can cast the two heads as the actor and the critic.

The first head, the blue one, is the actor.

It picks the actions:

it models the probability of picking action a

in state s, and this is your policy.

The second head is the critic.

The idea of the critic is, basically,

that it estimates how good your particular state is;

it's basically used to help train your network.

The idea is that if you train your actor and critic heads interchangeably,

you'll not only obtain this value-based head as a side quest,

which allows you to measure the value,

you'll also obtain an algorithm that improves

the convergence of your policy-based part, the actor,

by a number of measures.

We'll see how those two methods,

REINFORCE and actor-critic, compare on

practical problems later in this lecture.
