0:00

In this video, we're going to look at a method that was developed in the late 1980s by Robbie Jacobs and then improved by a number of other people.

The idea is that each connection in the neural net should have its own adaptive learning rate, which we set empirically by observing what happens to the weight on that connection when we update it. So if the gradient on that connection keeps reversing sign, we turn down the learning rate. And if the gradient stays consistent, we turn up the learning rate.

So, let's start by thinking about why having a separate adaptive learning rate for each connection is a good idea. The problem is that in a deep multilayer net, the appropriate learning rates can vary widely between different weights, especially between weights in different layers. For example, if we start with small weights, the gradients are often much smaller in the initial layers than in the later layers.

Â 1:00

Another factor that calls for different learning rates for different weights is the fan-in of the unit. The fan-in determines the size of the overshoot effects that you get when you simultaneously change many of the different incoming weights to fix up the same error. It may be that the unit didn't get enough input, but when you change all these weights at the same time to fix up the error, it now gets too much input. Obviously, that effect is going to be bigger if there's a bigger fan-in. The net in the diagram on the right has more or less the same fan-in for both layers, but that can be very different in some nets.

Â 1:41

So, the idea is that we're going to use a global learning rate, which we set by hand, and then we're going to multiply it by a local gain that is determined empirically for each weight. A simple way to determine those local gains is to start with a gain of one for every weight. So initially, we change the weight wij by the learning rate times the gain gij, which starts at one, times the error derivative for that weight. Then what we're going to do is adapt gij.
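The update rule just described can be written as (with ε standing for the global learning rate set by hand):

```latex
\Delta w_{ij} \;=\; -\,\varepsilon \, g_{ij} \, \frac{\partial E}{\partial w_{ij}},
\qquad g_{ij} = 1 \ \text{initially}
```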

Â 2:15

We're going to increase gij if the gradient for the weight does not change sign, and we're going to use small additive increases and multiplicative decreases. So, if the gradient for the weight at time t has the same sign as the gradient at time t minus one, where t refers to weight updates, then when you take their product it'll be positive, because you either get two negative gradients or two positive gradients. And then what we're going to do is increase gij by a small additive amount. If the gradients have opposite signs, we're going to decrease gij. And because we want to damp down gij quickly if it's already big, we're going to decrease it multiplicatively. That ensures that big gains will decay very rapidly if oscillations start.

It's interesting to ask what would happen if the gradient were totally random. So, on each update of the weights, we pick a random gradient. Then you'll get an equal number of increases and decreases, because it will equally often have the same sign as the previous gradient or the opposite sign. So you'll get a bunch of additive 0.05 increases and multiplicative 0.95 decreases, and those have an equilibrium point, which is when the gain is one. If the gain is bigger than one, multiplying by 0.95 reduces it by more than adding 0.05 increases it. If the gain is smaller than one, adding 0.05 increases it by more than multiplying by 0.95 decreases it. So, with random gradients, the gain will hover around one. If the gradient is consistently in the same direction, the gain can get much bigger than one. And if the gradient is consistently in opposite directions, which means we're oscillating across a ravine, it can get much smaller than one.
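The scheme described above can be sketched in a few lines of NumPy. The 0.05 and 0.95 constants come from the lecture; the learning rate, gain clipping range, and the random-gradient demo are illustrative assumptions.

```python
import numpy as np

def adaptive_gain_update(w, grad, prev_grad, gains, eps=0.01,
                         add=0.05, mult=0.95, gmin=0.1, gmax=10.0):
    """One weight update with a per-weight adaptive gain.

    If the current gradient agrees in sign with the previous one, the
    gain grows by a small additive amount; otherwise it shrinks
    multiplicatively, so big gains decay rapidly when oscillation starts.
    """
    agree = grad * prev_grad > 0           # sign agreement, per weight
    gains = np.where(agree, gains + add, gains * mult)
    gains = np.clip(gains, gmin, gmax)     # limit the size of the gains
    w = w - eps * gains * grad             # gradient-descent step
    return w, gains

# Demo: with purely random gradients, each gain hovers around one,
# which is the equilibrium of +0.05 increases and x0.95 decreases.
rng = np.random.default_rng(0)
w = np.zeros(5)
gains = np.ones(5)
prev = rng.standard_normal(5)
for _ in range(2000):
    g = rng.standard_normal(5)
    w, gains = adaptive_gain_update(w, g, prev, gains)
    prev = g
print(gains)  # values in the vicinity of 1
```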

Â 4:11

There are a number of tricks for making adaptive learning rates work better. It's important to limit the size of the gains. A reasonable range is 0.1 to 10, or 0.1 to 100. You don't want the gains to get huge, because then you can easily get into an instability, and they won't die down fast enough, and you'll destroy all the weights.

Adaptive learning rates were designed for full-batch learning. You can also apply them with mini-batches, but they had better be pretty big mini-batches. That ensures that changes in the sign of the gradient aren't due to the sampling error of mini-batches, but are really due to crossing to the other side of the ravine.

There's nothing to prevent you combining adaptive learning rates with momentum. So, Jacobs suggests that instead of using the agreement in sign between the current gradient and the previous gradient, you use the agreement in sign between the current gradient and the velocity for that weight, that is, the accumulated gradient. If you do that, you get a nice combination of the advantages of momentum and the advantages of adaptive learning rates. Adaptive learning rates only deal with axis-aligned effects, whereas momentum doesn't care about the alignment of the axes. Momentum can deal with diagonal ellipses, moving in that diagonal direction quickly, which adaptive learning rates can't do.
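Jacobs' suggestion above can be sketched as follows. The 0.05/0.95 gain constants come from earlier in the lecture; the learning rate, momentum coefficient, clipping range, and the quadratic-bowl demo are illustrative assumptions.

```python
import numpy as np

def momentum_with_gains(w, grad, velocity, gains, eps=0.01, alpha=0.9,
                        add=0.05, mult=0.95, gmin=0.1, gmax=10.0):
    """Momentum combined with per-weight adaptive gains.

    The gain is increased when the current gradient agrees in sign with
    the velocity (the accumulated gradient). Since the velocity here
    accumulates the *negative* gradient, agreement corresponds to
    grad * velocity < 0.
    """
    agree = grad * velocity < 0
    gains = np.where(agree, gains + add, gains * mult)
    gains = np.clip(gains, gmin, gmax)
    velocity = alpha * velocity - eps * gains * grad
    return w + velocity, velocity, gains

# Demo on an elongated quadratic bowl, f(w) = 0.5 * (10*w0^2 + w1^2),
# a mild "ravine" whose minimum is at (0, 0).
w = np.array([1.0, 1.0])
v = np.zeros(2)
gains = np.ones(2)
for _ in range(500):
    grad = np.array([10.0 * w[0], w[1]])
    w, v, gains = momentum_with_gains(w, grad, v, gains)
print(w)  # close to the minimum at (0, 0)
```

Note the design choice: using the velocity rather than the previous gradient as the agreement signal means a weight that is making steady progress keeps a large gain even if individual gradients are a bit noisy.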
