
So, by this time you already know that there are some basic exploration strategies.

For example, Q-learning, DQN, Sarsa, and all those value-based algorithms exhibit epsilon-greedy exploration: they take the best action with probability one minus epsilon, or a random action with probability epsilon.

And this basically, formally, solves the exploration problem, because eventually they're going to find the optimal policy. Or will they?
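As a minimal sketch, epsilon-greedy action selection can be written as follows (the function name and the use of NumPy are my own illustrative choices, not from the lecture):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1, rng=None):
    """With probability epsilon take a uniformly random action, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:                   # explore
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))              # exploit the current (imperfect) Q
```

Note that with a fixed epsilon this keeps exploring forever, which is exactly what matters once we start measuring regret.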

Another strategy which looks a bit more advanced, although it's just as heuristic, is the so-called Boltzmann exploration strategy.

In this case, you want to take your Q-values and turn them into something that looks more like a probability distribution. From deep learning, you know that if you have some arbitrary values, you can turn them into probabilities by applying softmax, first dividing them by some temperature coefficient to account for how aggressively you want your agent to explore.
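A minimal sketch of this temperature-scaled softmax, with `tau` as the temperature coefficient (names are my own, not from the lecture):

```python
import numpy as np

def boltzmann_probs(q_values, tau=1.0):
    """Temperature-scaled softmax: high tau -> near-uniform, low tau -> near-greedy."""
    scaled = np.asarray(q_values, dtype=float) / tau
    scaled -= scaled.max()          # shift by the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()
```

Sampling actions from these probabilities gives the Boltzmann exploration strategy: a large `tau` flattens the distribution toward uniform random exploration, while a small `tau` concentrates it on the greedy action.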

Now let's look at some methods with built-in exploration. For example, the cross-entropy method and the policy gradient family, like REINFORCE or actor-critic, all learn the probability of taking each action explicitly.

And you could, of course, affect them with entropy regularization or something similar, but in general, the algorithm decides its exploration policy for itself.

So now, to see how all those exploration strategies fare against our newly introduced notion of regret, let's begin with a simple one. This time, we're going to examine the epsilon-greedy policy with epsilon equal to exactly 0.1.

This means that at any moment of time you have a 10 percent probability of behaving randomly, and a 90 percent probability of picking whichever action maximizes the Q-function you currently have, the imperfect one.

Now, why don't you answer me this question. I have this notion of regret, which can be treated as a function of the time horizon up to which you add up those per-step differences. I want you to tell me what's going to happen with the regret of this epsilon-greedy strategy.

So, what's going to be the curve which represents the accumulated difference between the epsilon-greedy policy, learning over time, and the optimal policy, which picks the action with the highest return all the time?
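For reference, the notion of regret used here can be written out explicitly (this is the standard multi-armed bandit formulation; the notation is assumed rather than taken from the lecture):

```latex
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \left( \mu^{*} - \mu_{a_t} \right),
\qquad \mu^{*} = \max_{a} \mu_{a},
```

where $\mu_a$ is the expected reward of action $a$ and $a_t$ is the action actually chosen at step $t$. If each step pays a constant expected gap, this sum grows linearly in $T$.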

Well, yes. Unfortunately, it's the worst possible case: it's going to grow linearly.

And the reason here is that exploring with a constant epsilon of 0.1 means that even if you have found the true Q-values, you still cannot get the optimal policy out of them, because for the remainder of the time you're going to explore, and some of this exploration is going to be bad for you.

So, this basically means that your regret curve is going to grow linearly, because there's going to be some constant per-step offset between the 0.1-greedy strategy and the greedy one, which is the optimal, and this offset is going to stack up and grow and grow until it reaches infinity. Well, nearly.
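A minimal sketch of this effect on a toy bandit, assuming the agent has already learned the true action values so the only cost comes from exploration itself (the setup and all names are illustrative, not from the lecture):

```python
import numpy as np

def eps_greedy_regret(true_means, epsilon, horizon, seed=0):
    """Cumulative regret of fixed-epsilon greedy when Q already equals the true means."""
    rng = np.random.default_rng(seed)
    best = max(true_means)
    greedy = int(np.argmax(true_means))
    regret = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            action = int(rng.integers(len(true_means)))   # random exploration step
        else:
            action = greedy                                # exploit the known best arm
        regret += best - true_means[action]
    return regret
```

Running this with a longer and longer horizon shows roughly proportional growth: every step pays a fixed expected exploration cost, so the regret curve is a straight line.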

So, this is a bad scenario, and surely we're going to fix it with some of the tactics we've been learning over the previous weeks. How do we tune this epsilon-greedy policy so that it converges to sub-linear regret?

Of course, there's more than one way you can do that, but the general pattern is that you have to start with some fixed epsilon and then decrease it over time. And there is the notion of a so-called greedy-in-the-limit strategy, which means that your exploration eventually converges to the optimal greedy policy after an infinite number of time steps.

For example, you can start with epsilon equal to one, or whatever number you prefer, and at every time step just multiply it by 0.99. This constant is, of course, dependent on the particular setting, but it means that at no fixed point in time will your algorithm terminate the exploration and switch to pure exploitation. Formally, though, in the limit it will asymptotically converge to a greedy policy.
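This multiplicative schedule can be sketched in one line (the function name and default constants are my own placeholders):

```python
def epsilon_at(t, eps0=1.0, decay=0.99):
    """Multiplicative schedule: eps0 * decay**t is positive at every finite t,
    but tends to zero, so the resulting policy is greedy in the limit."""
    return eps0 * decay ** t
```

At any finite step the agent still explores a little, yet the exploration probability shrinks toward zero, which is exactly the greedy-in-the-limit property described above.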

This is theoretically good, because it allows you to reason about how your algorithm's regret converges somewhere, provided, of course, that it is able to learn the optimal policy in this regime.

We've also seen some other mechanisms. For example, the classical Deep Q-Network, DQN, uses not this multiplication by a constant near one, but a linear schedule, where it subtracts some constant value from epsilon at each step until epsilon reaches zero or some small final value.

Theoretically, this is less than perfect, because with such a schedule you may reach zero epsilon before you've arrived at the optimal strategy, and your regret is going to keep growing because you have not learned the optimal policy so far. But again, in practice it doesn't matter that much.
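A sketch of this linear annealing, in the spirit of the DQN schedule; the particular start value, final value, and annealing length here are illustrative placeholders, not the paper's exact settings:

```python
def linear_epsilon(t, eps_start=1.0, eps_final=0.1, anneal_steps=50_000):
    """Linear annealing: interpolate from eps_start to eps_final over anneal_steps,
    then hold epsilon constant at eps_final."""
    frac = min(t / anneal_steps, 1.0)
    return eps_start + frac * (eps_final - eps_start)
```

Unlike the multiplicative schedule, this one actually reaches its floor at a fixed, finite step, which is the source of the theoretical imperfection just mentioned.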

This is basically how regret works in terms of the strategies you've already been using so far.

You could also take, say, the Boltzmann exploration policy, and you'd find out that if you use the same trick with tau, starting with some constant tau and then decreasing it over time, then you'll devise a strategy which is greedy in the limit in terms of this Boltzmann exploration policy.
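Combining the softmax from earlier with a decaying temperature gives a sketch like this (all names and constants are my own illustrative choices; the small floor on `tau` just avoids division by zero):

```python
import numpy as np

def boltzmann_action(q_values, t, tau0=1.0, decay=0.99, rng=None):
    """Sample from a softmax over Q with a decaying temperature tau0 * decay**t."""
    rng = rng or np.random.default_rng()
    tau = max(tau0 * decay ** t, 1e-8)      # keep tau strictly positive
    scaled = np.asarray(q_values, dtype=float) / tau
    scaled -= scaled.max()                  # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```

As `t` grows and `tau` shrinks, the sampling distribution collapses onto the greedy action, so the policy becomes greedy in the limit.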

It may be better or worse than epsilon-greedy, depending on the particular environment and the particular regime you're using.

So, this is basically a duct-tape challenge, where the winner is not so much the theoretically better algorithm as the one that just slightly better fits this particular problem.

Instead, I want you to focus on something, well, qualitatively different. I'm going to walk you through some of the algorithms that explicitly pick the actions that are most interesting to our agent. This is precisely the topic of our next section.
