0:03

By this point, you're probably asking yourself why we're still wasting your time talking about this weird stuff, when you already know all those cool algorithms. Indeed, you have learned how to train with Monte Carlo, and you've learned about Q-learning as a particular case of temporal difference learning. Those two algorithms are awesome, but there's one thing missing from them that prevents them from running efficiently.

This missing link comes from a problem that you have already solved in week one. But let's recapitulate it on a different example to build some more intuition.

Let's say you're still trying to teach a robot to walk: you want to control your robot so that it walks forward as fast and as harmlessly as possible, and you give it a reward based on the amount of distance it covers at each particular step.

Now imagine that you are training it with Q-learning. At the beginning, your Q-values are all zero. So what is going to happen first is that your robot will most probably fall, because it's a really lucky situation if you manage to walk by a random sequence of actions. So the first thing that happens is your robot falls down and maybe crawls forward a little. It still gets some reward, so it now has positive Q-values for those actions.

Now isn't that awesome? It has already learned something. The bad part is that this is probably all it's capable of learning this way.

The problem here is that if you use your Q-learning in a naive way, always taking the action which maximizes the Q-function, then you're going to end up in a vicious circle. You see, once your robot has fallen down and has observed that falling down yields a small positive reward, the situation is that the falling-down strategy has positive value, while every other action is still at zero. So by definition, the robot always falls down, because that's the optimal action it knows of.

Now, of course, this is not right, and it's not right because your Q-function is not perfect. Basically, the problem is that your robot greedily exploits whatever knowledge it has been able to obtain, and it never tries to gather more knowledge, even though some other action might turn out to be better than what it currently believes is the optimal one.

This issue is ancient, and it basically comes down to the trade-off between exploring the environment and exploiting what the agent has already learned about the environment to get the highest reward. As I've mentioned, you actually solved this two weeks ago, if memory serves. Back then, you used a very duct-tape-ish method, which is nevertheless not that far from the optimal one. What did you do?

Yeah, right. What you did was simply allow your algorithm to sometimes take not the best action, but a random one. In this environment you have to do the same: introduce some probability with which to take a random action, because if you always take the currently optimal action, you're going to get stuck in this local optimum.

Now again, the problem of exploration versus exploitation is of course much more complicated than the way it's postulated here, and it has much more advanced solutions. In this video we'll only consider the bare minimum: the simplest exploration strategies. They do come with some guarantees of convergence, but we mainly introduce them because that's how we can get the problem of exploration out of the way ASAP. Of course, there will be much more of that stuff, but it's coming during the last week of our course.

The simplest exploration strategy, which you've probably already discovered, is the so-called epsilon-greedy exploration. The epsilon here is the probability of taking a random action. So whenever the agent is asked to pick an action, it basically flips a biased coin: with probability epsilon, it takes an action sampled uniformly from the set of all possible actions; otherwise, it just takes the argmax of the Q-function. Typical values of epsilon are 0.1, 0.05, or even less if the problem demands it.
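As a sketch, the epsilon-greedy rule just described might look like this in Python (the Q-value array is a made-up example, not part of the lecture):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """With probability epsilon take a uniformly random action,
    otherwise take the argmax of the Q-values."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: any action, uniformly
    return int(np.argmax(q_values))              # exploit: current best action

# With epsilon=0 the choice is purely greedy:
q = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy_action(q, epsilon=0.0))  # -> 1
```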

This exploration strategy, however, has a few drawbacks. They're quite easy to spot once you start thinking about how this agent takes decisions. All possible actions except the optimal one are equally probable for it to explore, so they are, in a way, equally attractive. This is of course not true. If you're trying to find a way to walk forward, you might want to try another gait, say crawling or maybe jumping, some other strategy of moving forward that perhaps earns more reward per second. But the epsilon-greedy strategy is just as likely to try sitting down on its back for the hundredth time, just because the probabilities are all equal.

You can refine this logic by using another strategy, called the Boltzmann exploration strategy. The idea here is that you try to turn your Q-values, the action values, into some kind of probability distribution and sample from it. As you probably remember from a deep learning course, or any advanced machine learning course whatsoever, whenever you want to turn something into a probability distribution, you apply some nonlinearity, in this case softmax: you take the exponent of each Q-value and divide it by the sum of those exponents.

You also divide each Q-value by a temperature factor tau before exponentiating. If tau is very large, then all the scaled Q-values are going to be approximately equal, and the resulting distribution is going to be almost uniform. If tau approaches zero, however, you get a strategy which more and more resembles the greedy optimal-action-picking strategy.
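A minimal sketch of that softmax-with-temperature computation (the Q-values below are illustrative):

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    """Turn Q-values into action probabilities via softmax with
    temperature tau: p(a) is proportional to exp(Q(a) / tau)."""
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = [1.0, 2.0, 3.0]
print(boltzmann_probs(q, tau=100.0))  # large tau: almost uniform
print(boltzmann_probs(q, tau=0.01))   # tiny tau: almost all mass on the best action
```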

Now, this does sound a little more sophisticated, but in fact you can just as easily think up a case where this Boltzmann policy is worse than the epsilon-greedy one. For example, suppose you have an action which is underexplored, so it has an action value of zero because you've never tried it, and two other actions which are actually inferior but have positive values, because in this environment you always get positive rewards. The softmax will then assign the untried action a lower probability than the inferior ones, so it gets explored even less.

The practical takeaway is that you're probably better off trusting your gut feeling. If you feel that this softmax policy suits your problem better, or if you feel you can tune some mixture of those two policies, and the experiments show that you're right, then that's the best way you can solve this problem so far. Again, we'll cover more advanced approaches to this later in the course.

Now, the huge issue with those exploration policies is that they actually affect the asymptotic properties of the algorithm in a negative way. The fact is that under, say, the epsilon-greedy strategy, you're always going to explore. The problem is that if you have an epsilon of, say, 0.1, then you'll never be able to always take the optimal action, and hence never behave optimally, because you're always forced to flip a coin and, if it comes up tails, explore. This is true even if you know that by design your problem is solvable, so that the solution would otherwise eventually converge to an optimal one.

The most you can do is decrease the epsilon parameter over time. Start with a large value, say 0.5, because at the beginning the agent knows nothing, and there's no point in listening to its bogus Q-values any more than to random ones. Then, as the training goes by, you gradually reduce it, for example by multiplying this 0.5 by 0.99 each time you finish a full session. By the time you finish, say, a hundred or a few hundred sessions, the epsilon is going to be really close to zero. And in the infinity, mathematically speaking, the algorithm is going to behave greedily, because the epsilon converges to zero.

This is how you technically reconcile exploring all the time with eventually behaving greedily and exploiting.
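That decay schedule can be sketched as follows (the 0.5 start and the 0.99 factor come from the lecture; the loop body is a placeholder for your actual training session):

```python
epsilon = 0.5          # start high: early Q-values are no better than random
for session in range(300):
    # ... run one full training session with epsilon-greedy exploration ...
    epsilon *= 0.99    # shrink epsilon after each finished session

print(epsilon)         # after a few hundred sessions, epsilon is close to zero
```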

But this approach is really dangerous if you apply it carelessly. For example, imagine that you have an environment that changes its properties over time. Say you have a banner ad prediction problem, and your target audience changes because more people are discovering your new service over time, so the data distribution changes. In this case, it's never a good idea to completely throw away exploration, because you have to adapt: you have to keep training your network, and without epsilon or any other means of exploration, you have no way of getting away from the local optimum.

This was kind of the definitive guide to exploration in two minutes or less. If you want to learn more about exploration, which is by no means a simple problem, I encourage you to survive until the last week of the course.
