0:02

For the next few minutes, I want you to get distracted from all those banner ad placements and small channel banners, and consider a more intuitive notion of how those exploration strategies work. As a result, we'll finally find out how the epsilon-greedy strategy sucks not only at the theoretical level, but also practically.

Imagine you are solving this labyrinth, the one on the slide. Your initial state is here, in the bottom-left corner, while the terminal state, at which you get a positive reward, is at the top right. And let's say you're solving this Markov decision process in the discounted-reward setting, with gamma equal to 0.99. The only transition which actually gives you a reward is exiting the labyrinth: going upwards from the top-right corner gives you a reward of, say, plus 100. We could also say that you get minus one if you bump into a wall, but for now, let's stick with this single non-zero reward; all other actions are worth zero.

Of course, the optimal policy is to follow the shortest-path trajectory between the bottom-left and top-right corners, but let's imagine how hard it is to learn this trajectory by trial and error. Imagine you employ Q-learning here, using an epsilon-greedy strategy with a high epsilon, say epsilon of one. So you're always taking a random action, you want to explore as quickly as possible, and you expect it to converge to a strategy that follows the optimal path.
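As a quick illustrative sketch (my own toy code, not anything from the slides; the Q-values are made up), the epsilon-greedy rule just described looks like this:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon take a uniformly random action;
    otherwise take the greedy (argmax-Q) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 1, as in the lecture's setup, the Q-values are
# ignored entirely and exploration degenerates into a random walk.
action = epsilon_greedy([0.0, 0.0, 0.0, 0.0], epsilon=1.0)
```

With epsilon = 0 this always picks the current best action; with epsilon = 1 it never consults the Q-values at all, which is exactly the regime discussed next.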

The question is, how long does it take to learn this? How long does it take to converge to some policy that gets at least a non-zero reward?

Yes, again, too bad: it's a hell of a lot. It's going to be incomparably larger than the optimal path. Say the average random-walk path is, off the top of our heads, 50 steps long, and you have four actions; then there are roughly four to the power of 50 possible trajectories, which is a terrible, terrible, terrible number if you actually want to do this much exploration.
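To make the blow-up concrete, here is a minimal sketch (my own toy setup, not the labyrinth from the slide) that measures how long a uniform random walk takes to cross even a small open grid, where the shortest path is only 2*(n-1) moves:

```python
import random

def random_walk_steps(n, rng, max_steps=100_000):
    """Number of moves a uniform random walk needs to get from the
    bottom-left corner (0, 0) to the top-right corner (n-1, n-1)
    of an open n x n grid; bumping into a wall wastes the move."""
    x = y = 0
    for t in range(1, max_steps + 1):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        if 0 <= x + dx < n and 0 <= y + dy < n:
            x, y = x + dx, y + dy
        if (x, y) == (n - 1, n - 1):
            return t
    return max_steps

rng = random.Random(0)
mean_steps = sum(random_walk_steps(5, rng) for _ in range(200)) / 200
shortest = 2 * (5 - 1)  # only 8 moves if you walk straight there
# mean_steps comes out many times larger than the 8-move shortest path
```

Even on a wall-less 5x5 grid the random walk wastes most of its moves revisiting old cells; with actual labyrinth walls the gap only gets worse.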

And this happens because if you're exploring randomly, you have a high probability of repeating yourself: of going backward, of visiting the same corner twice, thrice, and so on. And even once you land on some trajectory that gets you to the end, your Q-learning algorithm will need a lot of repeated experience of getting there to actually learn the full Q-function, unless, of course, you use experience replay, which kind of simplifies the problem.

But nevertheless, in this simple labyrinth case, the epsilon-greedy policy's notion of exploring brings you a lot of repeated exploration that you don't really need. You can go right, then go left, then go right again, because you just rolled this epsilon again and the random action happened to be "right".

Now, this is, of course, a labyrinth; it has little to do with any practical case. So let's get a little more practical, although not entirely: let's get to some Atari video games. There's this Atari game called Montezuma's Revenge. It's known for its tremendous difficulty and unforgiving setup, in which your character, this red guy, has to walk over all those ladders, not get caught by this apparently aimless skull, get the key, and then carry it somewhere, whatever. It has to perform some huge sequence of actions, and to get the reward it has to get all those actions right. So it won't get any reward even if it goes half the way to the key.

It's kind of reasonable. So, what do humans do? Well, I guess any human who has ever held a video game controller would eventually understand what it is they have to explore, and would jump for the key right away. But not the DQN, or any other algorithm we've been using so far. In fact, the DQN score at this game is exactly zero, and this is not good: it never reached the key at all. This is bad, because humans can easily solve it but machines cannot.

This means that we're missing something in our reinforcement learning framework. It turns out there are a lot of more practical things that humans are much better at than reinforcement learning algorithms. For example, if you are going through this specialization one course at a time, you've probably learned about variational autoencoders in the span of a single session, a single lifetime, and you've learned them to the point where you can actually reproduce them, code them, and make them run, hopefully.

Or, if not autoencoders, you probably learned to write or read and to do all those complex things with only a little experience available. And, of course, you had to explore some stuff to do so. When you were a child, you probably explored all kinds of stuff, but even at this point in time you still do some exploration. For example, you've decided to take this course, which means that you're willing to explore a new kind of math.

Basically, let's imagine that you were a Q-learning algorithm. In order to learn variational autoencoders, in order to learn how to write a paper about variational autoencoders, you would have had to generate random sequences of symbols with an epsilon-greedy policy until one of them just happened to be the exact text of the variational autoencoder code, or paper, or whatever. Which is not the way you want to learn it.

The trick here is that, unlike Q-learning, unlike any other algorithm we've learned, humans don't explore with an epsilon-greedy policy. Neither do they explore with Boltzmann exploration or anything like that. We don't have those heuristics in us; we have some other heuristics. Let us see how humans explore.

Imagine a non-epsilon-greedy human, I guess: a human that has decided to explore and has two prospective options to improve his knowledge of physics and get some cool stuff done in this area. You have two options. The first one: you've just read that the day before yesterday some cool physicists announced that there's some new particle they've found at their Large Hadron Collider, and no one knows anything about it. It's a complete black box for now; they don't even know how it appears, but they have it in the data. Alternatively, you could try to solve another problem: whether you can make yourself fly if you pull your hair up real strong. That's still, technically, somewhat related to physics. Which of those are you going to prefer if you had, say, a year free for exploration?

Hopefully the first one, of course. The thing is, humans are not epsilon-greedy, and they are not indifferent to what they explore. There are a lot of different explanations of why you would pick the first one, and we're only going to cover some of them.
