0:00

Now, another huge problem with DQN is that it tries to approximate a set of values that are actually very interrelated. This is the point here. Let's watch this video, and in this video, I want you to take a closer look at how the Q-values change in the bottom-left part of the screen. I want you to find the segments where they are more or less equal, and the segments where they differ as much as possible. Now, there will be a quiz here to check that you've made it through the video.

So, what you might have noticed is that most of the time, especially when the ball in Breakout is on the opposite side of the game field, the Q-values for all actions are more or less the same. This is because in such a state, one action, even a stupid action, won't change anything. Even if you make one bad move while the ball is on the other side of the field, you'll still have plenty of time to adjust, go the right way, and fix the issue. Therefore, all Q-values are more or less the same. There is a common part which all of those action values share: this is the state value, by the definition we introduced previously.

Now, there are also situations where the Q-values differ a lot. These are the cases where one action can make it or break it. Maybe the ball is approaching your player, your bat, or whatever, your platform, and if you move, say, to the right, you'll arrive at just the right position to catch it so it bounces off. If you don't, you'll just miss the ball and lose one life. So, there are rare cases where those Q-values are highly different. The problem here is that we treat them as more or less independent predictions when we train our network.

So, let's try to introduce some of this intuition into how we train the Q-network and see if it helps. This brings us to another architecture, called the dueling deep Q-network. The first thing we have to do is decompose Q(s,a), the action value function. This time we rewrite it as a sum of the state value function V(s), which depends only on the state, and a new term, the capital A(s,a). The capital A here is the advantage function, and the intuition is that the advantage measures how much your action value differs from the state value: Q(s,a) = V(s) + A(s,a).

For example, suppose you have a state in which you have two actions: the first brings you a return of, say, plus 100, and the second plus 1. Say you are in a room and you have two doorways. The first one leads you to a large cake, and the second to a small cookie. After you take either of those actions, the other opportunity is lost; say, a door closes on the option you have not picked. Now, in this case, both action values are positive because you get a positive reward: plus 100 for the cake and plus 1 for the cookie. The advantages, on the contrary, are going to differ in sign: the advantage of the suboptimal action is going to be negative. This is because if you take this suboptimal action, you get the action value of plus 1 minus the state value of plus 100, which is minus 99. Basically, it tells you that you have just lost 99 potential units of reward in this case.
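To make the arithmetic explicit, here is a minimal Python sketch of this two-door example; the action names are made up for illustration:

```python
# Toy illustration of the advantage function using the two-door example.
# Action names are made up for this sketch.
q_values = {"cake_door": 100.0, "cookie_door": 1.0}

# Under the V-star convention, the state value is the value of the best action.
state_value = max(q_values.values())  # 100.0

# Advantage of each action: A(s, a) = Q(s, a) - V(s).
advantages = {a: q - state_value for a, q in q_values.items()}

print(advantages)  # {'cake_door': 0.0, 'cookie_door': -99.0}
```

The best action always ends up with an advantage of exactly zero under this convention, which is the constraint we exploit below.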

Now, the definition here suggests that the value function we use is V-star, the value under the optimal action. But you can substitute any other definition of the value function, so long as it is consistent with the Q-function and you understand what you're doing.

The way we're going to introduce this intuition into a neural network is basically this. We take the usual DQN, which simply tries to predict all the Q-values independently via its final layer, and then we modify it using our new decomposition. Now, the network will have one head which predicts only the state value function, which is just one number per state, and another head which predicts the set of all the advantages. To predict those advantages, we actually have to constrain them in a way that satisfies the common sense of reinforcement learning.

In the case of V-star, the maximum possible advantage value is zero, because you can never get an action value which is larger than the maximum over all possible action values from that state; that maximum is basically the definition of the state value under the optimal policy.

Now, you might think that you train those two halves separately and then just add them up to get your action values. In fact, you do the opposite thing: you train them together. You basically add them up and minimize the same temporal difference error which we used in the usual DQN, the mean squared error between the Q-value and the improved Q-value.

This basically nudges the neural network to approach the problem the right way. By right, I mean that it should have some separate neurons that only solve the problem of how good it is to be in this state, and other neurons that basically say that a particular action is better than another action.

Now, this is basically the whole idea behind the dueling DQN. The only difference is that you may define those advantages and value functions differently. The option we just covered is maximization: you impose the constraint that the maximum advantage is zero. You can also say, for example, that the average advantage should be zero by subtracting the mean instead. That would roughly correspond to decomposing the action value into your policy's expected value, the state value, plus an advantage. This technically makes some sense, but in most cases it is this mean-subtraction heuristic that proves to be slightly better on benchmark problems.
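Here is a small Python sketch of the two aggregation rules on toy numbers; this is only the combination step, since in the real architecture the value and advantage streams come out of a neural network:

```python
# Sketch of the two dueling aggregation rules on toy numbers.
# The real value and advantage streams would come out of a network.
def dueling_q(value, advantages, mode="max"):
    """Combine a scalar state value with per-action advantages into Q-values."""
    if mode == "max":
        # Force the best action's advantage to zero: Q = V + (A - max A).
        baseline = max(advantages)
    else:
        # Force the average advantage to zero: Q = V + (A - mean A).
        baseline = sum(advantages) / len(advantages)
    return [value + a - baseline for a in advantages]

v = 10.0
adv = [2.0, 0.0, -1.0]
print(dueling_q(v, adv, mode="max"))   # [10.0, 8.0, 7.0]
print(dueling_q(v, adv, mode="mean"))  # same gaps, shifted so advantages average to zero
```

Note that both rules preserve the gaps between actions; they only differ in which baseline is subtracted, which is why either one gives a well-defined Q-function.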

So, here's how the dueling DQN works. We simply introduce those two intermediate layers and then we add them up, and by introducing them separately, we hint to the network that there should be two important quantities that may not be that interdependent. So, this is the dueling DQN.

Now, the final trick for this video is going to tackle another issue that we have not yet improved on since basic Q-learning: the issue of exploration. The problem we have with the DQN is that, complicated as it is, with neural networks and all this fancy stuff, it still explores in a rather shallow way.

The problem is that if you use, for example, an epsilon-greedy policy, then the probability of taking one particular suboptimal action is epsilon divided by the number of actions, while the probability of taking, say, five or ten suboptimal actions in a row is going to be near zero, because it is basically epsilon to the power of this number of actions, five or ten. If epsilon is 0.1, you can do the math and see how small it gets.

The problem is that sometimes it takes exactly such bold steps, a few similarly suboptimal actions in a row, to discover something new: not something near the current policy, but a completely new policy in the way it approaches the entire decision process. The epsilon-greedy strategy is very unlikely to discover this. It is very prone to converging to a local optimum.

There is one possible way we can solve it. The one way which fits well with the deep Q-network architecture is the so-called bootstrapped DQN. The idea here is that you train a set of K, say five or ten, Q-value predictor heads that all share the same main body, so they all have the same convolutional layers.

The way they are trained is that at the beginning of every episode, you pick one head at random: you basically throw a die and pick one of those K heads. Then you follow its actions and train the weights of that head and the weights of the shared body. You basically do this for the entire episode.

This way, the head you used is going to get slightly better, but your shared features are going to improve as well. Then, at the beginning of the next episode, you pick another head: you throw the die again, see what happens, and then follow the policy of this new head.

Since those heads are not trained on exactly the same experience, they are not guaranteed to be the same, so the network has some way to find different strategies. So, until they all converge to one policy, you can expect the heads to differ, and this difference is going to be systematic. They won't just take some suboptimal action, say, one time out of 100; they'll be fundamentally different in the way they prioritize some actions over others.

Now, you simply repeat this process over all the episodes. So, you pick an episode, pick a head, train this head, train the body, pick another head, and so on. For a DQN, this process is, in fact, much cheaper than training, say, K separate agents from scratch.

That's because most of the heavy lifting, the feature learning, is actually done in the shared body. Since all the heads are connected to it, this feature part gets trained on all iterations, so you can expect the overall progress to be almost as fast as with the usual DQN. Maybe even faster, because better exploration usually means a better policy in less time.

Now, this whole thing has the nickname of deep exploration strategies because, again, each head is able to take a lot of correlated actions that are different from those of the other heads. But otherwise, it's still more or less a heuristic that somehow works. We'll link to explanations of this article in greater detail in the reading section, as usual.

Of course, you may expect to find dozens of similar architectures, and you may even come up with your own new DQN flavor yourself, because as of 2017, they are still getting published; the last ones I know of were from the ICML conference of this very year. The idea of those architectures is usually that they spot some problem, some issue, some way you can improve, and then address this particular issue. This shows us two things: first, that the principles of deep reinforcement learning are being developed really rapidly, and second, that DQN architectures vary in all the ways you can imagine.

Of course, we'll find an alternative solution right next week, but until then, you'll have to get a little bit more acquainted with the DQN.
