Just think about how Q-learning works.

Recall all the cool things it can do:

it is trained from partial experience, and it can learn

an optimal policy even before the first environment session ends.

Now, let's see how Q-learning fails,

and it does fail, ever so miserably.

To begin with, let's consider this simple thought experiment.

Let's say you have an environment where an agent is on one side of a cliff.

On the other side there's this diamond,

which gives the agent a lot of reward once he reaches it.

Nothing else gives the agent any reward.

And to terminate, to finish the session,

the agent should either get to the diamond or fall into this hot, glowing cliff here.

Now, the idea is that there are two roads here.

The first strategy is to go forward

as quickly as possible, using this short path.

Or alternatively, the agent can pick

the longer road, which goes around the other side of this great wall here.

So, you can either follow

the shortest path, or use the longer path, which is kind of intuitively safer.

But for now, let's assume that the agent is

always capable of going in the exact direction he wants to.

So, unless it explicitly tries to fall off the cliff,

it will go along the exact path it wants and will reach the diamond.

So, if you look at this world from above,

you have this fiery abyss here at the bottom.

And there is a shorter path marked with a red arrow, and a longer path with a green arrow.

If you try value iteration,

Q-learning, or any other value-based method on this particular world,

then with a gamma below one,

you'll end up with a policy which prefers... which path, again?

So, what would be the optimal policy from Q-learning?

Yeah. So, it's kind of intuitive that Q-learning will follow the shortest path,

because it optimizes the rewards discounted with gamma.

And if you get the same reward but, say, four steps later,

it will be multiplied by gamma, which is below one, to the power of four,

which makes it less appetizing.

Now, we've just figured out that Q-learning

will follow the lower path, the shorter one.

But the issue here is that under our current setup,

there is something that prevents Q-learning from using this path optimally,

which will cause it to sometimes fall down the cliff.

What is that? Yeah. So, technically,

we've just said that we use an exploration policy to

prevent our agent from converging to a local optimum.

And the issue here is that if we keep this

epsilon-greedy or any other exploration policy on, then as long as it's on,

there is some probability that the agent is going to fall off the cliff because, well,

there is always, let's say, a 0.1 probability of taking a random action.

Now, the longer path doesn't exhibit this problem,

or maybe it does, but only on the first step.

So, the probability of falling off the cliff is much smaller.

But following the Q-learning rule,

the algorithm will always prefer the lower path.

So, technically, what happens is your algorithm

will end up with a strategy which is not optimal,

so it will not get the best possible reward, provided that you are still exploring.

Of course, once you get rid of epsilon,

you're golden: you'll find an optimal policy.

But as long as you don't,

you will find a policy which does not give an optimal reward.

Now, this happens only because we have the epsilon turned on.

But in some environments,

the problem is that you have to keep epsilon on:

you can never get rid of it, because the conditions of the environment might change.

Say you're trying to optimize an advertising strategy:

you're trying to show the best possible banners to

users, or recommend the best possible movies to your users on a movie portal.

The problem here is that your user base constantly changes,

so you have to keep exploring.

Otherwise, you'll freeze at some sub-optimal strategy.

This leads us to the conclusion that no matter how long you iterate,

how long you train your Q-learning algorithm,

it will always stay sub-optimal.

And in this particular world, if epsilon stays on,

it will learn a policy that will sometimes fall off the cliff,

because it prefers the lower road, believing that

the lower road is always better if you take optimal actions, which you don't.

Now, the alternative idea,

the kind of solution, would be to account for the epsilon, to account for the exploration.

This way, you would reason like this:

you know that you have this 10 percent probability of doing something stupid,

and to prevent your agent from doing stupid stuff to its doom,

you will avoid the regions where exploration is most likely to be detrimental.

And to do so, you basically pick the upper road,

which still brings you the reward, just a couple more steps later,

but has a substantially lower probability of falling off the cliff.

Now, let's actually incorporate this intuition into an algorithm.

To do that, we have to revise how the update rule for Q-learning works.

We have this exponentially weighted moving average,

which means that whenever we get a piece of experience,

a state, action, reward,

next-state tuple, we can update our Q-function based on

our current Q-function and the Bellman equation formulation for a better Q-function.

In Q-learning, the Q hat,

the new, improved Q-function,

is obtained by adding up the immediate reward and the discount times

the maximum over actions of the Q-function in the next state.
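
As a rough sketch of that update rule in code (the tabular dict representation and the alpha and gamma values are my own illustrative choices, not from the slides):

```python
# Minimal tabular sketch; Q is a dict mapping state -> {action: value}.
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.99):
    """Move Q[s][a] one step towards the Bellman target
    r + gamma * max over a' of Q(s', a')."""
    q_hat = r + gamma * max(Q[s_next].values())       # best next action
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * q_hat   # moving average step
    return Q[s][a]
```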

Now, this is where the problem occurs.

There is one thing in this particular equation that you have to change to

account for the fact that your agent explores. What is this thing?

Well, yes: you could replace the maximization, which you don't actually get, with an expectation.

This expectation should be taken over

the probabilities given by your epsilon-greedy exploration policy,

or any other exploration rule that you apply, say the Boltzmann formula.

And what this change does is it discounts the value of

those states on the shortest path, because if you take the expectation over all possible outcomes,

you'll get a 10 percent divided by the number of actions,

non-zero probability of falling off the cliff, and this gives you, say, a minus 10

reward or a zero reward, depending on how you define the immediate rewards.

And this immediately

steers the agent towards the green road and solves our problem.

This algorithm is called SARSA.

It's kind of another case where technical people are bad at naming things,

and soon you'll find out where the name SARSA comes from.

Let's see how this algorithm performs.

So, we have this MDP trajectory,

a sequence of states and actions.

And as usual, for example in Q-learning,

you only consider a small subsection of this thing.

These are exactly the slides we saw

in the previous section, where we explained how Q-learning works.

You take S, A, R,

S prime, and you use it to compute

a better estimate of the Q-function, to improve your grid of Q-values.

And the SARSA algorithm is going to be very similar to this one.

Basically, what you have to add is one more step here,

so you have state, action, reward,

S prime, and A prime.

Now, guess why SARSA?

Yeah. Here's the name. And the difference here is that instead

of taking the maximization of the Q-function over all possible actions,

what you do is you basically approximate the expectation:

you take one sample,

the action that you actually took, and you

simply use the Q-function

of the next state at that particular action, not the maximum one.

And the rest is the same:

you simply obtain your better Q-function, Q hat here,

and you make one step towards it with the exponential moving average.

And that's the whole thing. That's SARSA.
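
The same sketch as before, with the only change being the target: it plugs in the action A prime the agent actually took instead of the max (names are again illustrative):

```python
# Tabular sketch; the only difference from Q-learning is that the target
# uses Q(s', a') for the actually-taken next action, not the maximum.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.99):
    q_hat = r + gamma * Q[s_next][a_next]             # no max over actions
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * q_hat   # moving average step
    return Q[s][a]
```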

Now, this basically means that instead of trying to

compute an expectation over next actions,

what you do is you take the next action from your agent.

So, you remember what action your agent took,

and you use this A prime to perform the update.

This is, of course, slightly problematic if there are a lot of actions.

And there is a different way you can approach this problem,

which actually involves computing the expectation by hand.

The name of that algorithm is expected value SARSA.

And the only difference is that instead of taking the A prime into this update,

you consider all possible actions, in this case a0, a1,

a2, from the S prime,

and you basically compute the average over Q(S prime,

a0), Q(S prime, a1), Q(S prime,

a2), with the probabilities given by your policy.

So, say you have an epsilon-greedy policy, and the probability of a random action,

the epsilon, is equal to 0.1.

This means that the probability of taking any particular non-optimal action

is basically epsilon divided by the number of actions, in this case three.

So, it's 0.1 divided by three.

The optimal action would have a probability of about 0.93, which is one minus epsilon

plus 0.1 divided by three, because you can take

the optimal action either greedily, with probability one minus epsilon,

or by picking a random action and just getting lucky.
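
This little calculation can be sketched as a hypothetical helper (with q_values as a plain list of Q-values for one state):

```python
def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Action probabilities under epsilon-greedy: every action gets
    epsilon / n, and the greedy action gets 1 - epsilon on top."""
    n = len(q_values)
    greedy = max(range(n), key=lambda i: q_values[i])  # index of best action
    probs = [epsilon / n] * n
    probs[greedy] += 1 - epsilon
    return probs
```

With three actions and epsilon equal to 0.1, this gives roughly 0.93 for the greedy action and 0.1 divided by three for each of the others.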

Now, this is how your probabilities are

defined, and for any other exploration policy, say a Boltzmann one, the softmax,

you would just obtain different values for your probabilities, which would

basically be the softmax of your Q-values in that case.

You can simply enumerate all the actions and compute this expectation explicitly,

unless, of course, there are a bazillion

possible actions, or maybe a continuous space of them,

in which case you can probably sample.

But let's now consider the simpler case.

So, you have, like, three actions, and you just compute this expectation.

Again, you use this expectation to get

a better, more accurate version of your Q-function, and you make

one step of the exponentially weighted moving average to improve your Q-function.
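
Putting it together, expected value SARSA could look like this sketch (here Q[s] is a list of action values indexed by action, and probs comes from whatever exploration policy you use; all names are illustrative):

```python
def expected_sarsa_update(Q, s, a, r, s_next, probs, alpha=0.5, gamma=0.99):
    """The target averages Q(s', .) over the policy's action
    probabilities instead of sampling a single next action."""
    expected_q = sum(p * q for p, q in zip(probs, Q[s_next]))
    q_hat = r + gamma * expected_q
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * q_hat   # moving average step
    return Q[s][a]
```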

So, here we go: we have the expected value SARSA.

We also studied the usual SARSA, which takes the next action,

and Q-learning, which maximizes over possible actions.

The difference here is that if you consider this cliff world, then

the agents' performance under constant exploration will be rather different.

Q-learning will converge to the shorter path, which sometimes leads to falling.

Therefore, it will have a larger probability of getting a sub-optimal reward.

SARSA, however, will learn the longer path, because this way

it can deal with its own issues with exploration, and therefore

it will obtain a better policy.

Of course, this is a super simple example, so, in this case,

there is no need for Q-learning or SARSA:

you could just trace the optimal path with your finger.

But if you're doing something more complicated,

if you're solving a problem where you have to keep exploring because the environment changes,

say an online game, or trading, or anything like that,

in this case, this issue would be very, very influential.

So, you have to keep in mind the difference between Q-learning and SARSA.