0:03

By this point now, you're probably asking yourself,

what the hell are we still wasting your time and talking about some weird stuff.

Of course you don't know about all those cool algorithms.

So indeed you have learned about how do you train with Monte Carlo?

You've learned about the Q learning as a particular keys of temporal defense learning.

Those two algorithms are awesome but there's

one thing into them that would prevent them from running efficiently.

This missing link comes from a problem that you have already solved in week one.

But let's recapitulate it on a different example to get as more kind of this to you.

Let's say you're still trying to take a robot for a walk,

so you want to control your robot so that it walks forward

as fast and as harmless as possible.

And you give it reward for based in

the amount of distance it covers at each particular step.

Now imagine that you are training by Q learning.

At the beginning your Q values are zero.

So what is going to happen is, first your robot is going to,

there's probably high chance is going to fall because it's really hard to,

it's really lucky situation if you'll manage to walk by a random permutations of actions.

So the first situation is your robot falls down,

and sneeze and maybe crawls forward.

It still gets the reward. So it has positive Q functions for those actions.

Now isn't it awesome?

So it's already learn something.

The bad part is that it's probably the entire thing it's capable to learn in this way.

Problem here is that, if you use your Q learning in a naive way,

in a way if you always think the action which maximizes

the Q function then you're going to end up in this vicious circle.

You see once your robot has fallen down

and it has observed that falling down gives you a sample is to failure,

the situation is that your,

falling down strategy gives you positive value,

while any other action is zero.

So by definition, you're always falling down because it's the optimal action you know of.

Now, of course this is not right, this is not right because your active,

is that your Q function is not perfect.

So basically the problem here is that your robot is kind of

greedily exploiting whatever the knowledge he's able to obtain,

and he's not trying to get more knowledge even though

some other actions might turn out to be even

more optimal that he thinks of his current optimal action.

So this issue is very kind of ancient,

and is basically comes down to the problem of exploring environment,

versus trying to utilize what it

can launch about the environment to get the highest reward.

You've actually as I've mentioned solved this two weeks ago if the memory tells me right.

And in that case, you've used a very duct tapish method

which is not going to be that far from the optimal one. What did you do?

Yeah, right. In case of the better strategy what you do is,

you simply allowed your algorithm to

sometimes run one of the best action but they aren't one one.

In this case what you have to do is for this environment,

you have to introduce some kind of probability with each to take a random action,

because if you always take optimal actions,

you're going to be stuck in this local optimal.

Now again, the problem of exploration exploitation is of course much

more complicated than the way it's postulated and has much more advanced solutions.

Well this new arrow only going to consider the bare minimum.

The same is research exploration strategies.

They you have some guarantees of conversions but we only introduce them

because that's how we can get the problem of exploration out of the way ASAP.

Now, of course they'll be much more of that stuff but

it's coming to you during the last week of our course.

The simple six exploration strategy that you've probably

already discovered is the so-called espilon greedy exploration.

The espilon here is a probability of taking random action.

So whenever an agent is tasked to pick an action,

what he does is it basically flips a coin or well and it roll certain number with

probability epsilon It takes an action

uniformly sampled from the set of all possible actions.

Otherwise, you just takes the that add marks of the Q function.

Typical values of epsilon are again, point one,

point zero five, point even less than the problem demands it.

The exploration algorithm however has a few drawbacks.

They're quite easy to distinguish once you

start to think how this guy is going to take decisions.

In fact, all the possible actions except

of small ones are equally probable for it to explore,

and they're in a way equally attractive. This of course not true.

In a way if you're trying to find a way to work forward,

you can try it another way you say,

a frowning or maybe jumping,

some other strategy of going forward to perhaps get at more reward per second.

But for espilon strategy is just as likely to try sitting

down on your back for like one hundredth time just because the probability is equal.

You can modify this logic by using another strategy,

it's called the Boltzmann exploration strategy.

The need here is that you seem to try to link your Q values,

action values, into some kind of probability distribution and samples from them.

As you probably remember from deep learning course or any of

the advance machine messaging in the course whatsoever whenever you want

to make something into probability distribution what you do

is you apply some linearity in this case softmax.

So we take an exponent of each Q function,

then divided by the sum of exponents.

You'll also multiply Q function by this factor of Tau.

In this case if Tau is very

large then all the Q values are going to be approximately equal,

and the probability is going to be almost linear form.

If the T, if the Tau however approaches zero,

you'll get the strategy which is

more and more resembling of the greedy optimal action picking strategy.

Now this does sound a little bit more sophisticated,

but in fact you can as easily think up a way

where this optimous policy is worse than the epsilon greedy one.

For example, if you have an action which is under explored,

it has zero action value because you've never tried it,

and other two actions which are inferior but they have

positive values because that's

the kind of reward you get, you'll always get positive reward.

Well, the issue here is you're probably better off you've got feelings.

So if you feel that this policy is special, this well,

softmax one or if you're feeling like you can adjust media issue of those to policies,

and the experiment shows that you're right,

then that's the optimal way you can do,

you can solve this problem so far.

Again we have a more advanced topics about this one later on in the course.

Now the huge issue with those coalition policy if it does actually

affect their a sintogic as their algorithm properties in a negative way.

The fact that under I'll say epsilon the strategy here,

you're always going to explore.

So the problem here is that if you have an epsilon of say point one,

then you'll never be able to always take

optimal action and later be able to behave optimally

because you're always forced to flip a coin and see if tails, then explore.

If you know that by design your problem is solvable,

so the solutions will eventually convert an optimal solution.

The much you can do is you can decrease

the epsilon presented at start with a large value of point 5 because the beginning,

the agent knows nothing like don't know and it's no point in trying

to listen to its false Q values any better than the random ones.

Then after the as the training goes by,

you can gradually reduce it for example by multiplying this point five

by point 99 each time you finish a full session for example.

Then by the time you finish say 100 or a few hundred sessions,

the epsilon is going to be really close to zero.

And in the Infinity mathematically speaking,

the algorithm is going to behave greedily because the epsilon coverts it to zero.

Now, this is how you technically reason away with the fact that

you're exploring all the time but also going to exploit eventually.

But this pressure is really dangerous when you apply it hazardly.

For example, imagine that you're

having an environment that changes it's properties over time.

You're having the binner ads or prediction problem,

and you're a key audience,

it changes because there are more people approaching

your new service over time and the distribution changes.

In this case, it's never a good idea to completely

throw away exploration because you have to adapt,

you have to train your network,

and without using epsilon or any other exploration means,

you have no way of getting away from the local optimum.

This is the kind of the definitive guide into exploration in two minutes or less.

And if you want to get to more about exploration

which is in neither super or complicated problem,

I encourage you to survive until the last week of the course.