0:02

Well, now let's proceed with the loss.

Once we have defined the loss, we might want to minimize it.

One of the simplest and most widespread methods of minimizing the loss is gradient descent.

That is, we differentiate our loss with respect to the parameters w and change the parameters in the direction of the negative gradient, with some step size alpha.

Why the negative gradient?

Well, because the gradient is defined as the direction of the fastest increase of a function.

The opposite direction, that is, the negative gradient, points in the direction of the fastest decrease.

And this allows us to change the parameters w so that our loss function decreases as fast as possible.

However, to update the parameters w with gradient descent, we again have to differentiate the whole loss.

And we know that we cannot actually compute the loss; we only have a sample-based estimate of it.

In fact, what we can do is approximate the true gradient with its stochastic estimate, that is, approximate the full sum with one component of that sum.

And this leads us to stochastic gradient descent.

In SGD, stochastic gradient descent, we approximate the full gradient with its estimate at a particular state and action.

This state and action, just as we discussed previously, are sampled from rho pi or from the rho of the behavior policy, depending on whether we learn on-policy or off-policy.

To reiterate, sampling a state and action from rho is in practice done by picking a state and action from the agent's experience of interaction with the environment.

In practice, however, one replaces the one-sample estimate of the gradient with a more stable estimate that uses several samples, not only a single one.
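As a rough sketch of the update just described, here is one stochastic gradient step on a squared loss, assuming a simple linear approximator q(s, a; w) = w · phi(s, a). The feature vector, step size, and sampled target are illustrative assumptions, not values from the lecture.

```python
def q_value(w, phi):
    """Linear approximation: q(s, a; w) = dot(w, phi(s, a))."""
    return sum(wi * pi for wi, pi in zip(w, phi))

def sgd_step(w, phi, target, alpha):
    """One SGD step on the loss 0.5 * (target - q)^2.
    Its gradient w.r.t. w is -(target - q) * phi, so stepping
    against the gradient gives w += alpha * (target - q) * phi."""
    error = target - q_value(w, phi)
    return [wi + alpha * error * pi for wi, pi in zip(w, phi)]

# One sampled (state, action) pair, represented by its features.
w = [0.0, 0.0]
phi = [1.0, 2.0]   # assumed feature vector for the sampled (s, a)
w = sgd_step(w, phi, target=3.0, alpha=0.1)
```

Averaging this step over a small batch of sampled pairs gives the more stable several-sample estimate mentioned above.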

So far, we have talked about the gradient but have not shown how to compute it.

In fact, to apply SGD, we only need to know how to compute the gradient of L(s, a), which is the squared difference between the goal and our current estimate of it.

Why are we even talking about it?

Isn't the squared difference easy to differentiate? Well, it is.

What is trickier here is the dependence of the goal on the parameters w.

These goals, as I have mentioned, are simple numbers in the case of Monte Carlo targets.

But they are functions of the parameters w in the case of temporal difference targets, and then we should differentiate them with respect to w.

This is the mathematically correct way to do the learning, but it interferes with a natural understanding of the task.

If we differentiate the goals with respect to the parameters, we will in a sense reverse the natural flow of time.

That is, we will not only make the value estimate of the current state and action look more similar to our target, the cumulative reward.

We will also make the subsequent estimate of the return depend on the previous rewards.

That is not what we want in general.

Thus, we introduce the so-called semi-gradient methods, which treat the goal as fixed: for any particular type of goal, the gradient of the goal with respect to the parameters is taken to be zero.

This semi-gradient approach is very similar to what happens in ordinary supervised learning, where the goals are almost always fixed numbers.
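A minimal sketch of this idea, under the same assumed linear approximator as before: the target is computed with the current weights w, but no gradient flows through it; it is treated as a plain number, like a supervised-learning label.

```python
def q_value(w, phi):
    """Linear approximation: q(s, a; w) = dot(w, phi(s, a))."""
    return sum(wi * pi for wi, pi in zip(w, phi))

def semi_gradient_step(w, phi, phi_next, r, gamma, alpha):
    """Semi-gradient TD update: the target r + gamma * q(s', a'; w)
    depends on w, but we treat it as a constant and differentiate
    only the current estimate q(s, a; w)."""
    target = r + gamma * q_value(w, phi_next)  # treated as fixed
    error = target - q_value(w, phi)
    return [wi + alpha * error * pi for wi, pi in zip(w, phi)]

# Toy transition: features of (s, a) and of (s', a') are assumed.
w = [0.0, 1.0]
w = semi_gradient_step(w, phi=[1.0, 0.0], phi_next=[0.0, 1.0],
                       r=1.0, gamma=0.9, alpha=0.5)
```

Note that only the components of w touched by phi(s, a) change; the target's dependence on w through phi(s', a') is ignored, which is exactly the "semi" part.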

Bootstrapping methods such as temporal difference learning are in fact not instances of true gradient descent.

They include only part of the gradient.

Accordingly, we call them semi-gradient methods.

But on the other hand, they simplify the math a lot and have been shown to work well in many practical tasks.

Let me now summarize the properties of semi-gradient methods.

The essence of the semi-gradient update is that it treats the goals g(s, a) as fixed numbers.

Like a gradient update, this kind of update changes the parameters in a way that moves the estimates closer to the targets.

But unlike a gradient update, it completely ignores the effect of the update on the target.

And because of this, the semi-gradient is not a proper gradient.

It does not possess the convergence properties of stochastic gradient descent, but it converges reliably in most cases.

It is also more computationally efficient and faster than true SGD.

Having said all this, semi-gradient updates are a meaningful thing to do, because this type of parameter update corresponds to the temporal structure of the task: time always goes only forward.

Let us now return to the target definitions.

In fact, the targets are deeply connected with the algorithm names.

More than that, these targets define which estimate we learn, and thus they are very important to understand.

In SARSA, our target is the current reward, that is, the immediate r(s, a), plus gamma times our estimate of the action-value function at the next state and the next action.

This next action is simply a sample from our policy pi.
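As a sketch, the SARSA target for a tabular Q (here a simple dict of lists; the table entries and discount are made-up illustrations):

```python
def sarsa_target(r, gamma, q, s_next, a_next):
    """SARSA target: g(s, a) = r(s, a) + gamma * Q(s', a'),
    where a' is an action sampled from the policy pi in s'."""
    return r + gamma * q[s_next][a_next]

q = {"s1": [0.5, 2.0]}  # assumed action-value table Q(s, .)
g = sarsa_target(r=1.0, gamma=0.9, q=q, s_next="s1", a_next=1)
```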

Â 5:26

In Expected SARSA, we see almost the same update, that is, the same r(s, a).

But now the second term is the expected action-value function at the next state, where the expectation is taken with respect to our policy probabilities.
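The same sketch for the Expected SARSA target, where the sampled next action is replaced by an expectation over the policy's action probabilities (all numbers are illustrative assumptions):

```python
def expected_sarsa_target(r, gamma, q, s_next, pi_probs):
    """Expected SARSA target:
    g(s, a) = r(s, a) + gamma * sum_a' pi(a'|s') * Q(s', a')."""
    expectation = sum(p * qa for p, qa in zip(pi_probs, q[s_next]))
    return r + gamma * expectation

q = {"s1": [0.5, 2.0]}  # assumed action-value table Q(s, .)
g = expected_sarsa_target(r=1.0, gamma=0.9, q=q,
                          s_next="s1", pi_probs=[0.25, 0.75])
```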

In Q-learning, we see that our goal is a little bit different.

The first term is the same as in SARSA and Expected SARSA, but the second term is the maximum over all actions of the action-value estimate at the next state.
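And the corresponding sketch for the Q-learning target, with the expectation replaced by a maximum over actions (same assumed toy table):

```python
def q_learning_target(r, gamma, q, s_next):
    """Q-learning target:
    g(s, a) = r(s, a) + gamma * max_a' Q(s', a')."""
    return r + gamma * max(q[s_next])

q = {"s1": [0.5, 2.0]}  # assumed action-value table Q(s, .)
g = q_learning_target(r=1.0, gamma=0.9, q=q, s_next="s1")
```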

Note that g(s, a) in each of these cases is a random variable, because it depends on the next state s_{t+1}, and additionally on a_{t+1} in the case of SARSA.

This double stochasticity of the SARSA target is obviously not a good thing.

And this is the main reason why I suggest using Expected SARSA instead of SARSA whenever possible.

Now, let's look a bit closer at each of these targets and elaborate a bit on their origins.

You have already seen the tabular version of SARSA.

The SARSA algorithm is an example of an application of the Bellman expectation equation.

However, in the Bellman expectation equation, we do what is called a full-width backup.

That is, we account for all the possible next states we could find ourselves in after committing action a in state s.

And we also account for all possible actions our policy could take in each of these states, by taking an expectation with respect to our policy probabilities.

In the real world, we cannot take an expectation with respect to the model dynamics, because we usually do not know the transition probabilities.

However, we can estimate this expectation with Monte Carlo, by taking samples; that is, by observing which state we end up in after taking action a in state s.

We can also estimate the lower-level expectation with samples, that is, the expectation over the stochasticity of our policy in the next state.

These are the two sources of randomness we encountered on the previous slide.

Now, look at the approximate SARSA update rule.

This approximate SARSA looks pretty much the same as its tabular counterpart, the only difference being the multiplicative gradient term coming from the loss.

So, cool. Now you know what approximate SARSA is, and that it is very similar to the standard tabular update.

Well, as I have already mentioned, having two sources of randomness in the SARSA goal is usually not a good idea.

Can we move some of the stochasticity out of the goal estimate?

Well, yes.

And this leads us to SARSA's closest cousin, called Expected SARSA.

If we abandon the approximation at the bottom layer of the backup tree, we obtain the Expected SARSA algorithm.

Its scheme, that is, its backup, is precisely the same as for the SARSA algorithm.

The only thing that changes is the form of the update at the bottom level.

Now, it includes the expectation over actions coming from pi in the next state s'.

Contrast this with SARSA, where there was no expectation at the bottom level, but only a sample-based estimate of this expectation.

Expected SARSA obviously improves upon SARSA, because it gets rid of an additional approximation.

And because of this fact, I usually recommend using Expected SARSA instead of SARSA whenever possible.

The update of Expected SARSA is almost the same as for SARSA; the algorithms differ only in the form of the goals that the current estimates are regressed on.

Now, what will happen if the policy with respect to which we take an expectation at the lower level of the full-width backup is a greedy one?

That is, it assigns all the probability to the action which has the maximum Q-value at the next state.

In fact, that is precisely the way Q-learning can be derived.
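This derivation can be checked numerically: an Expected SARSA target computed under a greedy policy coincides with the Q-learning target. The numbers below are toy values for illustration only.

```python
def expected_target(r, gamma, q_next, pi_probs):
    """Expected SARSA target: r + gamma * E_pi[Q(s', .)]."""
    return r + gamma * sum(p * qa for p, qa in zip(pi_probs, q_next))

def greedy_probs(q_next):
    """Greedy policy: all probability mass on the argmax action."""
    best = max(range(len(q_next)), key=lambda a: q_next[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_next))]

q_next = [0.5, 2.0, 1.0]  # assumed Q(s', .) values
g_expected = expected_target(1.0, 0.9, q_next, greedy_probs(q_next))
g_qlearning = 1.0 + 0.9 * max(q_next)  # Q-learning target
```

Under the greedy policy, the expectation collapses onto the maximizing action, so the two targets are equal.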

Let us now revisit Q-learning a little bit.

Please note that Expected SARSA is based on the Bellman expectation equation, while Q-learning is based on the Bellman optimality equation.

Again, we witness that these two concepts are deeply interconnected, and we can seamlessly move between them by making an arbitrary policy pi greedy with respect to the current estimates of the action-value function.

On the slide, you can see the form of the full-width backup and the sample-based estimate of this backup.

Again, what we see here is that we replace the full-width backup of the Bellman optimality equation with its sample-based analog.

We do so to obtain approximate semi-gradient Q-learning.

Also, the form of the approximate Q-learning update does not differ much from the tabular setting.

Well, the scheme of replacing the full-width backup with a sample-based analog is almost the same for all the methods we have discussed so far.

And I hope you got the idea.

What we have discussed so far is the pure theory of different algorithms.

But we have not talked about applying approximate learning in practice, and practice is also very important.

In the next videos, you will get familiar with some of the most important problems, mostly coming from practical applications of reinforcement learning methods.

So, stay tuned.
