0:02

Just recall how Q-learning works, and all the cool stuff it gives us: it can be trained from partial experience, and it can learn an optimal policy even before the first environment session ends.

Now, let's see how Q-learning fails, and it does fail ever so miserably. To begin with, let's consider a simple thought experiment.

Say you have an environment where the agent starts on one side of a cliff. On the other side there is a diamond, which gives the agent a large reward once it is reached. Everything else gives the agent no reward. To finish a session, the agent must either get to the diamond or fall into the fiery cliff here.

Now, there are two roads here. The first strategy is to go forward as quickly as possible along the short path. Alternatively, the agent can pick the longer road, which runs on the other side of this great wall. So, you can either follow the shortest path or take the longer path, which is intuitively safer.

For now, let's assume the agent is always capable of moving in the exact direction it wants. So, unless it explicitly tries to fall off the cliff, it will follow the exact path it intends and will reach the diamond.

If you look at this world from above, you have the fiery abyss at the bottom, a shorter path marked with a red arrow, and a longer path marked with a green arrow.

If you apply value iteration, Q-learning, or any other value-based method to this particular world, with a gamma below one, you'll end up with a policy that prefers which path? What would be the optimal policy from Q-learning? Yes, it's intuitive that Q-learning will follow the shortest path, because it optimizes the rewards discounted with gamma. If you get the same reward, say, four steps later, it gets multiplied by gamma, which is below one, to the power of four, which makes it less appetizing.
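
This discounting argument can be sketched in a few lines. The step counts and the diamond reward below are made-up numbers for illustration, not taken from the lecture:

```python
# Why gamma < 1 favors the shorter path: the same diamond reward,
# reached a few steps later, is worth gamma^k times less.
gamma = 0.99           # discount factor (assumed value)
diamond_reward = 10.0  # reward for reaching the diamond (assumed value)

def discounted_return(steps_to_goal, reward, gamma):
    """Return the reward discounted by the number of steps needed to reach it."""
    return (gamma ** steps_to_goal) * reward

short_path = discounted_return(6, diamond_reward, gamma)   # e.g. 6 steps along the cliff
long_path = discounted_return(10, diamond_reward, gamma)   # e.g. 10 steps behind the wall

assert short_path > long_path  # the greedy value-based choice is the shorter path
```

Whatever the exact numbers, as long as both paths end at the same reward, the one with fewer steps always wins under a discount below one.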

So, we've just figured out that Q-learning will follow the lower, shorter path. But the issue is that under our current setup, something prevents Q-learning from using this path optimally and will cause the agent to sometimes fall down the cliff.

What is that? Well, technically, we've just said that we use an exploration policy to prevent our agent from converging to a local optimum. The issue is that as long as we keep epsilon-greedy, or any other exploration policy, turned on, there is some probability that the agent falls off the cliff, because there is always, say, a 0.1 probability of taking a random action. The longer path doesn't carry this risk, or maybe it does, but only on the first step, so the probability of falling off the cliff is much smaller.

But following the Q-learning rule, the algorithm will always prefer the lower path. So, technically, what happens is that your algorithm ends up with a strategy which is not optimal: it will not get the best possible reward, provided that you are still exploring. Of course, once you get rid of epsilon, you're golden, you'll find an optimal policy. But as long as you don't, the policy you find will not give an optimal reward.

Now, this happens only because we have to turn the epsilon on. But in some environments the problem is that you have to keep epsilon on; you can never get rid of it, because the conditions of the environment might change. Say you're trying to optimize an advertising strategy by showing the best possible banners to users, or to recommend the best possible movies to users on a movie portal. The problem is that your user base constantly changes, so you have to keep exploring. Otherwise, you freeze at some sub-optimal strategy.

This brings us to the conclusion that no matter how long you iterate, how long you train your Q-learning algorithm, it will always stay sub-optimal. In particular, if epsilon stays on, it will learn a policy that sometimes falls off the cliff, because it prefers the lower road: it believes the lower road is always better if you take optimal actions, which you don't.

Now, the alternative idea, the kind of solution, would be to account for the epsilon, to account for the exploration. You would reason like this: you know that you have this 10 percent probability of doing something stupid, and to prevent your agent from doing stupid things all the way to its doom, you avoid the regions where exploration is most likely to be detrimental. To do so, you basically pick the upper road, which still brings you the reward, maybe a couple of steps later, but with a substantially lower probability of falling off the cliff.

To actually incorporate this intuition into the algorithm, which is what we are going to do right now, we have to revise how the update rule for Q-learning works. Recall that it's an exponentially weighted moving average: whenever we get a piece of experience, a (state, action, reward, next state) tuple, we update our Q-function based on our current Q-function and the Bellman equation formulation of a better Q-function. In Q-learning, the Q-hat, the new improved Q-function, is obtained by adding up the immediate reward and the discount times the maximum over actions of the Q-function in the next state.
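
The update just described can be sketched like this; the state names, reward, and the table layout are illustrative, not from the lecture:

```python
# A minimal sketch of the Q-learning update: the improved estimate
# Q_hat = r + gamma * max_a' Q(s', a'), blended into the current value
# with an exponentially weighted moving average (learning rate alpha).
def q_learning_update(Q, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """Update Q[s][a] toward the Bellman target using the max over next actions."""
    q_hat = r + gamma * max(Q[s_next].values())      # Bellman target
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * q_hat  # moving-average step
    return Q[s][a]

# Toy usage: two states, two actions.
Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 1.0, "right": 2.0}}
q_learning_update(Q, "s0", "right", r=0.0, s_next="s1")
```

Note how the target always takes the best next action, regardless of what the agent actually does next — that is exactly the piece we are about to change.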

Now, this is where the problem occurs. There is one thing in this particular equation that you have to change to account for the fact that your agent explores. What is it? Well, yes: you could replace the maximization, which you don't actually get, with an expectation. This expectation should be taken over the probabilities given by the epsilon-greedy exploration policy, or whatever other exploration rule you apply, say the Boltzmann one.

What this change does is discount the values of the states on the shortest path: if you take the expectation over all possible outcomes, you get a non-zero, 0.1-divided-by-the-number-of-actions probability of falling off the cliff, which gives you, say, a minus ten reward or a zero reward, depending on how you define the immediate rewards. This immediately steers the agent towards the green road and solves our problem.

This algorithm is called SARSA. It's kind of another case of technical people being bad at naming things, and you'll shortly find out where the name SARSA comes from.

Let's see how this algorithm works. We have an MDP trajectory, a sequence of states and actions. As usual, in Q-learning you only consider a small subsection of this thing. These are exactly the slides from the previous section, where we explained how Q-learning works: you take S, A, R, S prime and use it to compute a better estimate of the Q-function, to improve your grid of Q-values.

The SARSA algorithm is going to be very similar to this one. Basically, what you have to add is one more step: you consider state, action, reward, S prime, and A prime. Now, guess why it's called SARSA?

Yes, that's where the name comes from. The difference is that instead of taking the maximum of the Q-function over all possible actions, you basically get the expectation by taking one sample: the action you actually took. You simply use the Q-function of the next state at that particular action, not the maximal one. The rest is the same: you obtain your better Q-function, the Q-hat, and you make one step of the exponential moving average. And that's the whole thing, that's SARSA.
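
Side by side with the Q-learning update above, the SARSA step differs in exactly one term; again, the state names and numbers are illustrative:

```python
# A minimal sketch of the SARSA update: same moving average, but the target
# uses the action a' the agent actually took in s', not the max over actions.
def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.99, alpha=0.1):
    """Update Q[s][a] toward r + gamma * Q(s', a') for the action actually taken."""
    q_hat = r + gamma * Q[s_next][a_next]            # sampled target, no max
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * q_hat  # moving-average step
    return Q[s][a]

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 1.0, "right": 2.0}}
# Suppose exploration made the agent take "left" in s1, even though "right" is better:
sarsa_update(Q, "s0", "right", r=0.0, s_next="s1", a_next="left")
```

Because the target uses the sampled action, the occasional exploratory blunders (like stepping off the cliff) leak into the Q-values, which is precisely what makes SARSA avoid risky regions.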

This basically means that instead of trying to compute an expectation over next actions, you take the next action from your agent: you remember what action your agent actually did, and you use this A prime to perform an update. This is, of course, slightly problematic if there are a lot of actions. There is a different way you can approach this problem, which involves computing the expectation by hand.

The name of that algorithm is expected value SARSA. The only difference is that instead of plugging the A prime into the iteration, you consider all possible actions, in this case a0, a1, a2, from the state S prime, and you compute the average over Q(S prime, a0), Q(S prime, a1), Q(S prime, a2), with the probabilities given by your policy.

So if you have an epsilon-greedy policy and the probability of a random action, the epsilon, is equal to 0.1, this means that the probability of taking any particular non-optimal action is epsilon divided by the number of actions, in this case 0.1 divided by three. The optimal action gets a probability of roughly 0.93, which is one minus epsilon, plus 0.1 divided by three, because you can take the optimal action either with the one-minus-epsilon probability of acting greedily, or by picking a random action and getting lucky.
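
These probabilities, and the expected value SARSA target built from them, can be sketched as follows; the Q-values of the next state are made-up numbers:

```python
# Expected value SARSA with the epsilon-greedy probabilities from the lecture:
# each non-greedy action gets epsilon/|A|, the greedy one gets 1 - epsilon + epsilon/|A|.
def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Return action probabilities under epsilon-greedy over the given Q-values."""
    n = len(q_values)
    best = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n        # every action gets its share of the random mass
    probs[best] += 1.0 - epsilon     # the greedy action also gets the greedy mass
    return probs

def expected_sarsa_target(r, q_next, gamma=0.99, epsilon=0.1):
    """Bellman target r + gamma * E_pi[Q(s', a')] under the exploration policy."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))

q_next = [1.0, 2.0, 0.0]  # Q(s', a0), Q(s', a1), Q(s', a2); illustrative values
probs = epsilon_greedy_probs(q_next)
target = expected_sarsa_target(0.0, q_next)
```

Here the greedy action a1 gets probability 1 - 0.1 + 0.1/3, and each other action gets 0.1/3, matching the numbers above.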

This is how the probabilities are defined, and for any other exploration policy, say the Boltzmann one, the softmax, you would just obtain different values for your probabilities, which would basically be the softmax of your Q-values in that case. You can simply enumerate all the actions and compute this expectation explicitly, unless, of course, there is a bazillion of possible actions, or maybe a continuous space of them, in which case you can probably sample. But let's consider the simpler case.

So, you have, say, three actions, and you just compute this expectation. Again, you use this expectation to get a better, more accurate version of your Q-function, and you make one step of the exponentially weighted moving average to improve it.

So, here we go: we have expected value SARSA, we have also studied the usual SARSA, which takes the next action, and Q-learning, which maximizes over possible actions. The difference is that if you consider this cliff world, the agents' performance under constant exploration will be rather different. Q-learning will converge to the shorter path, which sometimes leads to falling; therefore, it will have a larger probability of getting a sub-optimal reward. SARSA, however, will run along the longer path, because this way it can deal with its own exploration issues, and therefore it will obtain a better policy.

Of course, this is a super simple example; there is no need for Q-learning or SARSA here, you could just trace the optimal path with your finger. But if you're solving something more complicated, a problem where you have to keep exploring because the environment changes, say an online game, or trading, or anything like that, then this issue becomes very, very influential. So, you have to keep in mind the difference between Q-learning and SARSA.
