0:00

In this video, we're going to look at stochastic gradient descent learning for a neural network, particularly the mini-batch version, which is probably the most widely used learning algorithm for large neural networks.

We've seen this before, but let's start with a reminder of what the error surface looks like for a linear neuron. The error surface lies in a space where the horizontal axes correspond to the weights of the neural net and the vertical axis corresponds to the error it makes. For a linear neuron with a squared error, that surface always forms a quadratic bowl.

The vertical cross-sections are parabolas, and the horizontal cross-sections are ellipses. For multilayer, non-linear nets the error surface is much more complicated, but as long as the weights aren't too big it's a smooth surface, and locally it's well approximated by a fragment of a quadratic bowl. It might not be the bottom of the bowl, but there's a piece of a quadratic bowl that fits the local error surface very well.

If we look at the convergence speed when we do full-batch learning on a quadratic bowl, the obvious thing to do is to go downhill; that will reduce the error. But the problem is that the direction of steepest descent does not point to the place we want to go. As you can see in the ellipse, the direction of steepest descent is almost at right angles to the direction we want to go in. The gradient is very big across the ellipse, which is the direction in which we only want to travel a small distance, and very small along the ellipse, which is the direction in which we want to travel a large distance. It's precisely the wrong way around.

Now, you might think that studying linear systems like this is not a good idea if you want to optimize big non-linear nets. But even for non-linear, multilayer nets, this kind of problem arises. A very similar problem arises even though the error surfaces aren't globally quadratic bowls: locally, they have the same kind of properties. That is, they tend to be very curved in some directions and very uncurved in other directions.
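The "almost at right angles" point is easy to demonstrate on an elongated quadratic (the matrix and numbers here are illustrative choices of mine):

```python
import numpy as np

# A hypothetical elongated quadratic bowl E(w) = 0.5 * w^T H w with very
# different curvatures along the two axes (the "ellipse" from the lecture).
H = np.diag([100.0, 1.0])             # big curvature across, small along
w = np.array([1.0, 10.0])             # current weights; the minimum is at (0, 0)

grad = H @ w                          # gradient of the quadratic at w
descent = -grad                       # direction of steepest descent
to_minimum = -w                       # the direction we actually want to go

cos = (descent @ to_minimum) / (np.linalg.norm(descent) * np.linalg.norm(to_minimum))
angle = np.degrees(np.arccos(cos))
print(angle)                          # nearly a right angle, not pointing at the minimum
```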

Â 2:19

So the way learning goes wrong if you use a big learning rate is that you slosh to and fro across the directions in which the error surface is very curved. Call that sloshing across a ravine. And with a learning rate that's too big, you'll actually diverge.

What we want to achieve is to go quickly along the ravine, in the directions that have small but very consistent gradients, and to move slowly in the directions with big but very inconsistent gradients. That is, directions where, if you go a short distance, the gradient will reverse sign.
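The sloshing and the divergence can both be seen on the same toy ravine (all the numbers are illustrative assumptions of mine; for a quadratic with curvature 100, the stability threshold for the learning rate is 2/100):

```python
import numpy as np

# Gradient descent on a ravine-shaped quadratic: curvature 100 across, 1 along.
H = np.diag([100.0, 1.0])

def run(lr, steps=30):
    w = np.array([1.0, 10.0])
    history = [w.copy()]
    for _ in range(steps):
        w = w - lr * (H @ w)          # full-batch gradient descent step
        history.append(w.copy())
    return np.array(history)

# Just under the threshold, the weights slosh across the ravine but slowly
# converge; just over it, they diverge in the highly curved direction.
ok = run(lr=0.019)
bad = run(lr=0.021)
print(abs(ok[-1, 0]) < 1.0, abs(bad[-1, 0]) > 1.0)   # True True
```

Note that even the stable run has barely moved along the ravine (the second coordinate is still large): small, consistent gradients make for slow progress in exactly the direction we care about.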

Â 3:00

Before we go into how we achieve that, I need to talk a little about stochastic gradient descent and the motivation for using it. If you have a data set that's highly redundant, then if you compute the gradient for a weight on the first half of the data set, you'll get almost exactly the same answer as you get on the second half. So it's a complete waste of time to compute the gradient on the whole data set. You'd be much better off computing the gradient on a subset of the data, updating the weights, and then computing the gradient on the remaining data with the updated weights.
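To make the redundancy point concrete, here's a deliberately extreme sketch (my own construction) where the second half of the data duplicates the first, so half the set already gives the full gradient:

```python
import numpy as np

# A deliberately redundant data set: the second half duplicates the first.
rng = np.random.default_rng(1)
X_half = rng.normal(size=(100, 3))
t_half = X_half @ np.array([1.0, -2.0, 0.5])
X = np.vstack([X_half, X_half])
t = np.concatenate([t_half, t_half])

w = np.zeros(3)                        # some current weights

def grad(Xb, tb, w):
    """Mean squared-error gradient for a linear neuron on a batch."""
    return -2 * Xb.T @ (tb - Xb @ w) / len(tb)

g_full  = grad(X, t, w)
g_first = grad(X[:100], t[:100], w)
# With redundant data, computing on the whole set is wasted effort:
print(np.allclose(g_full, g_first))    # True
```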

We can take that to the extreme: compute the gradient on a single training case, update the weights, and then compute the gradient on the next training case using those new weights. That's called online learning. In general, we don't want to go quite that far. It's usually better to use small mini-batches, typically 10, 100, or even 1,000 examples.

One advantage of mini-batches is that less computation is spent on actually updating the weights, because you do that less often than with online learning. Another advantage is that you can compute the gradient for a whole bunch of cases in parallel. Most computers are very good at matrix-matrix multiplies, and a mini-batch lets you apply the weights to a whole bunch of training cases at the same time to figure out the activities going into the next layer for all of those cases. That gives you a matrix-matrix multiply, which is very efficient, especially on a graphics processing unit.
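The batched forward pass described above can be sketched like this (layer sizes, batch size, and the logistic non-linearity are illustrative assumptions on my part):

```python
import numpy as np

# A mini-batch forward pass for one layer: processing all cases at once turns
# many vector products into a single matrix-matrix multiply.
rng = np.random.default_rng(2)
W = rng.normal(size=(784, 128)) * 0.01   # small random weights: 784 inputs -> 128 units
b = np.zeros(128)
batch = rng.normal(size=(64, 784))       # mini-batch of 64 training cases

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# One matrix-matrix multiply gives the activities entering the next layer
# for the whole mini-batch at once:
hidden = logistic(batch @ W + b)

# Same answer as looping over cases one at a time, just much faster in practice.
looped = np.array([logistic(x @ W + b) for x in batch])
print(hidden.shape, np.allclose(hidden, looped))
```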

One point about using mini-batches: you wouldn't want a mini-batch in which the answer is always the same, followed by another mini-batch with a different answer that's always the same. That would cause the weights to slosh around unnecessarily. The ideal, if you have say ten classes, would be a mini-batch of ten or 100 examples with exactly the same number of examples from each class.

One way to approximate that is simply to put all your data in random order and grab random mini-batches. But you must avoid mini-batches that are very uncharacteristic of the whole data set, such as mini-batches that are all of one class.

So basically, there are two types of learning algorithm for neural nets. There are full-gradient algorithms, where you compute the gradient from all of the training cases.

Once you've done that, there are lots of clever ways to speed up learning, such as non-linear versions of a method called conjugate gradient. The optimization community has been studying the general problem of how to optimize smooth non-linear functions for many years. Multilayer neural networks are pretty untypical of the kinds of problems they study, so applying the methods they developed may need a lot of modification to make them work for multilayer neural networks.

The second type is mini-batch learning. When you have a large, highly redundant training set, it's nearly always better to use mini-batches. The mini-batches may need to be quite big, but that's not so bad, because big mini-batches are more computationally efficient.

Â 6:30

I'm now going to describe a basic mini-batch gradient descent learning algorithm. This is what most people would use when they start training a big neural net on a big, redundant data set.

You start by guessing an initial learning rate. You look to see whether the network is learning satisfactorily, or whether the error keeps getting worse or oscillates wildly; if that happens, you reduce the learning rate. You also look to see whether the error is falling too slowly. You expect the error to fluctuate a bit if you measure it on a validation set, because the gradient on a mini-batch is just a rough estimate of the overall gradient, so you don't want to reduce the learning rate every time the error rises. What you're hoping is that the error will fall fairly consistently. If it is falling fairly consistently but very slowly, you can probably increase the learning rate. Once you've got that working, you can write a simple program to automate this way of adjusting the learning rate.
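A minimal sketch of that automated recipe might look like the following. The thresholds, factors, and toy problem are my own guesses, not values from the lecture; the loop shuffles the data each epoch, takes mini-batches, halves the learning rate when the error rises, and nudges it up when the error falls consistently but very slowly:

```python
import numpy as np

# Toy redundant regression problem: a linear neuron trained with mini-batch SGD.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
t = X @ rng.normal(size=5)

w = np.zeros(5)
lr = 0.2                               # initial guess at a learning rate
errors = []
for epoch in range(20):
    perm = rng.permutation(len(X))     # fresh random order each epoch
    for i in range(0, len(X), 100):    # mini-batches of 100 cases
        idx = perm[i:i + 100]
        grad = -2 * X[idx].T @ (t[idx] - X[idx] @ w) / len(idx)
        w -= lr * grad
    err = float(np.mean((t - X @ w) ** 2))
    errors.append(err)
    if len(errors) >= 2:
        if err > errors[-2]:           # error rose: learning rate too big
            lr *= 0.5
        elif err > 0.95 * errors[-2]:  # falling, but very slowly: try a bigger step
            lr *= 1.1
print(errors[-1] < errors[0])          # True: the error falls over training
```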

One thing that nearly always helps towards the end of mini-batch learning is to turn down the learning rate. That's because you're going to get fluctuations in the weights caused by the fluctuations in the gradients that come from the mini-batches, and you'd like a final set of weights that's a good compromise. When you turn down the learning rate, you smooth away those fluctuations and get a final set of weights that's good for many mini-batches.
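The effect of turning the learning rate down late in training can be sketched on a one-dimensional quadratic with noisy gradients standing in for mini-batch noise (all constants here are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)

def final_distance(decay):
    """Run noisy SGD on E(w) = w^2; optionally turn the learning rate down late on."""
    w = 5.0
    lr = 0.1
    for step in range(2000):
        noisy_grad = 2 * w + rng.normal(scale=2.0)   # true gradient + mini-batch noise
        w -= lr * noisy_grad
        if decay and step == 1000:
            lr = 0.01                                # turn the learning rate down
    return abs(w)                                    # distance from the minimum at 0

constant = np.mean([final_distance(False) for _ in range(50)])
decayed  = np.mean([final_distance(True) for _ in range(50)])
# Decaying smooths away the gradient fluctuations, leaving the final weights
# much nearer the minimum on average:
print(decayed < constant)                            # True
```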
