0:00

In this video, I'm going to talk about the exploding and vanishing gradients problem, which is what makes it difficult to train recurrent neural networks. For many years, researchers in neural networks thought they would never be able to train these networks to model dependencies over long time periods. But at the end of this video, I'll describe four different ways in which that can now be done. To understand why it's so difficult to train recurrent neural networks, we have to understand a very important difference between the forward and backward passes in a recurrent neural net.

0:42

In the forward pass, we use squashing functions, like the logistic, to prevent the activity vectors from exploding. So, if you look at the picture on the right, each neuron is using a logistic unit, shown by that blue curve, and it can't output any value greater than one or less than zero, so that stops explosions. The backward pass, however, is completely linear. Most people find this very surprising. If you double the error derivatives at the final layer of this net, all the error derivatives will double when you backpropagate. So, if you look at the red dots that I put on the blue curves, we'll suppose those are the activity levels of the neurons on the forward pass. And so, when you backpropagate, you're using the gradients of the blue curves at those red dots. So the red lines are meant to show the tangents to the blue curves at the red dots. And, once you finish the forward pass, the slope of that tangent is fixed. You now backpropagate, and the backpropagation is like going forwards through a linear system in which the slope of the non-linearity has been fixed. Of course, each time you backpropagate, the slopes will be different, because they were determined by the forward pass. But during the backpropagation, it's a linear system, and so it suffers from a problem of linear systems, which is that when you iterate, they tend to either explode or die. So when we backpropagate through many layers, if the weights are small, the gradients will shrink and become exponentially small. And that means that when you backpropagate through time, gradients for steps many time steps before the targets arrive will be tiny. Similarly, if the weights are big, the gradients will explode. And that means, when you backpropagate through time, the gradients will get huge and wipe out all your knowledge.
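The point that the backward pass is a linear system with gains fixed by the forward pass can be sketched numerically. This is a minimal illustration, not from the lecture: a single recurrent logistic unit is run forward (with a made-up weight and input), the slope of the logistic at each step is recorded, and the backward pass is just the product of those fixed gains, so doubling the final error derivative doubles every backpropagated derivative.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

T = 50        # number of time steps to backpropagate through
w = 0.5       # a single hidden-to-hidden weight (hypothetical; try a big value to see explosion)

# Forward pass: iterate the unit and record the slope of the logistic at each step.
h = 0.0
slopes = []
for _ in range(T):
    h = logistic(w * h + 1.0)
    slopes.append(h * (1.0 - h))   # derivative of the logistic at this activity level

# Backward pass: a purely linear system whose per-step gain (slope * w)
# was fixed by the forward pass.
grad = 1.0
for s in reversed(slopes):
    grad *= s * w
print(grad)   # exponentially tiny for this small weight: vanishing gradients

# Linearity: start with double the error derivative at the end,
# and every backpropagated derivative doubles.
grad2 = 2.0
for s in reversed(slopes):
    grad2 *= s * w
print(grad2 / grad)   # exactly 2
```

With a small weight the product of gains shrinks exponentially; with a large weight the same loop explodes instead.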

3:00

But as soon as we have a recurrent neural network trained on a long sequence, for example 100 time steps, then if the gradients are growing as we backpropagate, we'll get whatever that growth rate is to the power of 100, and if they're dying, we'll get whatever that decay rate is to the power of 100, and so they'll either explode or vanish. We might be able to initialize the weights very carefully to avoid this, and more recent work shows that indeed careful initialization of the weights does make things work much better.
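The arithmetic of "to the power of 100" is worth seeing concretely. A tiny sketch, with illustrative growth and decay rates of my own choosing:

```python
# Over 100 time steps, the per-step gain on the gradient is raised to the
# power of 100, so even modest growth or decay rates become extreme.
growth_rate, decay_rate = 1.1, 0.9   # hypothetical per-step gains
print(growth_rate ** 100)            # on the order of 1e4: exploding
print(decay_rate ** 100)             # on the order of 1e-5: vanishing
```

Even gains as mild as 1.1 or 0.9 per step produce factors around ten thousand or one hundred-thousandth over 100 steps.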

3:56

Here's an example of exploding and vanishing gradients for a system that's trying to learn attractors. So suppose we try and train a recurrent neural network so that, whatever state we start in, it ends up in one of these two attractor states. So we're going to learn a blue basin of attraction and a pink basin of attraction. And if we start anywhere within the blue basin of attraction, we will end up at the same point. What that means is that small differences in our initial state make no difference to where we end up. So the derivative of the final state, with respect to changes in the initial state, is zero. That's vanishing gradients. When we backpropagate through the dynamics of this system, we will discover there's no gradient from where you start, and the same with the pink basin of attraction. If, however, we start very close to the boundary between the two attractors, then a tiny difference in where we start, putting us on the other side of the watershed, makes a huge difference to where we end up. That's the exploding gradient problem. And so, whenever you're trying to use a recurrent neural network to learn attractors like this, you're bound to get vanishing or exploding gradients. It turns out there are at least four effective ways to learn a recurrent neural network.
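The attractor picture above can be reproduced with a toy one-dimensional system (my own illustrative choice, not from the lecture): the map x ← tanh(w·x) with w > 1 has two attractors, one positive and one negative, separated by an unstable fixed point at zero, which plays the role of the watershed. Accumulating the chain rule gives the derivative of the final state with respect to the initial state.

```python
import numpy as np

def final_state_grad(x0, w=2.0, T=50):
    """Iterate x <- tanh(w*x) for T steps and accumulate d(final)/d(initial)
    via the chain rule: each step multiplies the gradient by w * tanh'(w*x)."""
    x, grad = x0, 1.0
    for _ in range(T):
        grad *= w * (1.0 - np.tanh(w * x) ** 2)
        x = np.tanh(w * x)
    return x, grad

# Well inside a basin of attraction: nearby starts converge to the same
# attractor, so the gradient vanishes.
x, g = final_state_grad(0.5)
print(x, g)   # x near the positive attractor, g essentially zero

# Exactly on the watershed (the unstable fixed point at 0): the gradient
# explodes, since each step multiplies it by w.
x, g = final_state_grad(0.0)
print(x, g)   # x stays at 0, g = w**T, astronomically large
```

Inside a basin the per-step factor is below one, so the gradient dies; on the boundary it is above one, so the gradient blows up, which is exactly the dichotomy the lecture describes.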

5:45

The second method is to use a much better optimizer that can deal with very small gradients. I'll talk about that in the next lecture. The real problem in optimization is to detect small gradients that have an even smaller curvature. Hessian-free optimization is good at doing that. The third method really kind of evades the problem. What we do is we carefully initialize the input-to-hidden weights, we very carefully initialize the hidden-to-hidden weights, and also the feedback weights from the outputs to the hidden units. And the idea of this careful initialization is to make sure that the hidden state has a huge reservoir of weakly coupled oscillators. So if you hit it with an input sequence, it will reverberate for a long time, and those reverberations are remembering what happened in the input sequence. You then try and couple those reverberations to the output you want, and so the only thing that learns in an Echo State Network is the connections between the hidden units and the outputs. And if the output units are linear, that's very easy to train. So this hasn't really learned the recurrent connections: it uses fixed, random, but carefully chosen, recurrent connections, and then just learns the hidden-to-output connections.

7:16

And the final method is to use momentum, but to use momentum with the kind of initialization that was being used for Echo State Networks, and that makes them work even better. So it was very clever to find out how to initialize these recurrent networks so they'll have interesting dynamics, but they work even better if you now modify those dynamics slightly, in the direction that will help with the task at hand.

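The lecture doesn't spell out the momentum update itself; as a minimal sketch, this is classical momentum applied to a toy one-dimensional quadratic, with an illustrative learning rate and momentum constant of my own choosing:

```python
def momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    """Classical momentum: the velocity is a decaying sum of past gradients,
    so the weights drift consistently in a helpful direction instead of being
    jerked around by each individual gradient."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Toy example: minimize f(w) = w**2, whose gradient is 2*w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
print(w)   # close to the minimum at 0
```

Applied to a recurrent net with an Echo-State-style initialization, the same update only nudges the already interesting dynamics toward the task, which is the point being made here.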