Now, one popular solution to this problem of optimism is the so-called double Q-learning.

The idea is basically that if you cannot trust one Q-function, double Q-learning lets you learn two of them and have them train one another.

You have this Q1 and Q2.

Those are two independent estimates of the actual value function.

So in Double Q-learning,

those are just two independent tables,

and in the deep Q-network case, those are two neural networks with separate sets of weights.

What happens is, you update them one after the other,

and you use one Q function to train the other one, and vice versa.

Let's take a closer look at the update rule.

We have Q1, and we update it by the following rule: you take the reward plus gamma times the maximum of Q over the next state's actions,

but this time you do this maximization in a very cunning way.

You pick the action which Q1 deems optimal, and you take the value of this action from Q2,

or you can interchange those two roles.

You can take the action which is deemed optimal by Q2

and take the action value of this action from Q1.
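The rule just described can be written as a small tabular update (a minimal sketch; function and variable names are illustrative):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One double Q-learning update of Q1, with Q2 evaluating the chosen action.

    Q1, Q2: arrays of shape (n_states, n_actions).
    """
    a_star = np.argmax(Q1[s_next])            # action deemed optimal by Q1 ...
    target = r + gamma * Q2[s_next, a_star]   # ... but valued by Q2
    Q1[s, a] += alpha * (target - Q1[s, a])   # usual TD step towards the target
```

In the full algorithm you would flip a fair coin at every step to decide which table plays the role of Q1 (the one being updated) and which plays Q2 (the evaluator).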

Now, this defeats the overoptimism, because here is what happens: your Q-functions may be overoptimistic, or pessimistic, or whatever, just because of the noise in how they are trained.

Maybe one of the Q-functions happens to get several updates with lucky next states, and therefore it becomes too optimistic because of the errors those updates accumulate.

Now, the idea here is that if some action, say action one, is deemed optimal by the first Q-function purely because of random noise, then there is no connection between this overoptimism for action one in Q1 and any overoptimism in Q2.

In fact, the estimate of the same action in Q2 is more or less independent: it can be overoptimistic, it can be pessimistic, or it can be exactly the true value.

The idea here is that, the noise in Q2 is independent of the noise in Q1.

And if you update them that way, the maximization no longer compounds the sampling error.

If all true action values are equal, for example, then the argmax of, say, Q2 is going to pick basically a random action, because of how the noise works. If you then take the expected Q1-value of that random action, you'll get exactly the maximum of the expectations, in the limit of course, if you take enough samples.
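This argument can be checked numerically. A toy sketch, assuming all true action values are zero and each estimator sees its own independent Gaussian noise (all names and constants are made up for illustration):

```python
import numpy as np

# All true action values are zero; two estimators see independent noise.
rng = np.random.default_rng(0)
n_actions, n_trials = 5, 100_000
noise1 = rng.normal(size=(n_trials, n_actions))  # Q1's estimation errors
noise2 = rng.normal(size=(n_trials, n_actions))  # Q2's estimation errors

single = noise1.max(axis=1)                      # max over one noisy estimate
a_star = noise1.argmax(axis=1)                   # argmax under Q1 ...
double = noise2[np.arange(n_trials), a_star]     # ... valued by Q2

print(single.mean())  # clearly positive, although every true value is 0
print(double.mean())  # close to 0: the upward bias is gone
```

The single-estimator average is biased upward because the max picks out whichever action got lucky noise; the double estimator evaluates that action with an independent noise draw, so the bias averages out.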

You do the same thing with Q2.

Basically, you take Q2 and use Q1 to help it update itself. So you maximize with one Q-network and take the action value of the maximizing action from the other one.

And here's how it works: you're training two networks, and since they are more or less decorrelated, they see not one shared noise but two different realizations of this noise, so the overoptimism disappears.

Now let's see how we can apply this to a DQN more efficiently.

Just as a reminder, DQN is just a convolutional neural network trained with experience replay and a target network to stabilize training.

Now, by default you could of course train two Q-networks, Q1 and Q2, but this will effectively double the training time.

So if it previously converged in one week on a GPU, it is now going to take two weeks, and that's kind of unacceptable in the scope of our course.

Instead, I want you to think of some smarter way: you could use the components DQN already has to get the same effect.

In fact, what we need is some way to maximize over one network and take the value from the other network,

and this basically requires that you have two networks that are kind of decorrelated.

Statistically speaking they are not absolutely independent, but they are expected to have different local noise here.

Now, can we find some pair of networks in DQN that lets us pull this trick without having to train another network from scratch? How do we do that?

Well, yes.

One way you could try to solve this problem, and the way suggested by the article that introduced this method, is to use the target network, the older snapshot of your network, as the other Q-network, your source of more or less independent randomness.

So you only train your Q1, but instead of using a second trained Q-network for this smart maximize-then-evaluate step, you just take the action value from your old Q-network, the target network, corresponding to the action that is optimal under the current Q-network.

Let's walk through this step by step.

In your usual DQN, you have this update rule, the first rule here, which just takes the reward plus gamma times the maximum over the target network's action values.

You can rewrite this mathematically by simply replacing the maximization over action values with taking the action value of the maximal action. So basically, substituting max with argmax here.
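Written out as an identity (a sketch in standard notation; both Q's here are still the same network):

```latex
\max_{a'} Q(s', a') \;=\; Q\!\left(s',\, \operatorname*{arg\,max}_{a'} Q(s', a')\right),
\qquad
\text{target} \;=\; r + \gamma\, Q\!\left(s',\, \operatorname*{arg\,max}_{a'} Q(s', a')\right)
```

Nothing has changed yet; this form just exposes two separate uses of Q, one to choose the action and one to evaluate it.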

What we're going to do next is assign those two Q-functions on the right-hand side to different networks.

So for the first Q-function, the one used to take the action value, we use the target Q-network, because that's where the stability comes from.

The other Q-function, the one used to choose the maximizing action, is our own trainable Q-network, the main one.

Therefore, we take the old Q-network's values for actions that are optimal under our current Q-network, and those are going to be more or less independent, provided that we update our target network rarely enough.

And of course, in our usual DQN the target updates happen every, say, 100,000 iterations, so the dependencies there are more or less negligible.
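The resulting target computation can be sketched as follows (a minimal sketch with made-up array and function names; terminal-state masking is omitted for brevity):

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, gamma=0.99):
    """Double DQN targets: pick the action with the trainable (online) network,
    but take its value from the frozen target network.

    next_q_online, next_q_target: (batch, n_actions) Q-values at next states.
    """
    a_star = next_q_online.argmax(axis=1)       # argmax by the online network
    batch = np.arange(len(a_star))
    q_eval = next_q_target[batch, a_star]       # value by the target network
    return rewards + gamma * q_eval
```

Compare with vanilla DQN, where both the argmax and the value would come from `next_q_target`; only the choice of the maximizing network changes, so no extra parameters are trained.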

Strictly speaking, this is more of a heuristic which doesn't guarantee anything, but it's very unlikely to fail.

So it's a practical algorithm which uses some duct tape and some black magic to

get efficient results without training another set of parameters from scratch.