0:00

In this video, we are going to look at a number of issues that arise when using stochastic gradient descent with mini-batches. There are a large number of tricks that make things work much better. These are the kind of black art of neural networks, and I'm going to go over some of the main tricks in this video.

The first issue I want to talk about is initializing the weights in a neural network. If two hidden units have exactly the same weights and the same biases, and the same incoming and outgoing connections, then they can never become different from one another, because they will always get exactly the same gradient. So, to allow them to learn different feature detectors, you need to start them off different from one another. We do this by initializing the weights with small random values.

That breaks the symmetry. Those small random weights shouldn't all necessarily be the same size as each other. If you've got a hidden unit with a very big fan-in, quite big weights will tend to saturate it, so you can afford to use much smaller weights for a hidden unit with a big fan-in. If you have a hidden unit with a very small fan-in, you want to use bigger weights. And since the weights are random, the total input to a unit scales with the square root of the number of weights. So a good principle is to make the size of the initial weights inversely proportional to the square root of the fan-in. We can also scale the learning rates for the weights in the same way.

One thing that has a surprisingly big effect on the speed with which a neural network will learn is shifting the inputs. That is, adding a constant to each of the components of the inputs. It seems surprising that that could make much difference, but when you're using steepest descent, shifting an input value by adding a constant can make a very big difference. It usually helps to shift each component of the input so that, averaged over all of the training data, it has a value of zero. That is, make sure its mean is zero.
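The fan-in heuristic above can be sketched in a few lines of NumPy (the layer sizes here are just made-up examples):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # Small random weights break the symmetry between hidden units.
    # Scaling by 1/sqrt(fan_in) keeps the total input to a unit roughly
    # the same size regardless of fan-in, so units with a big fan-in
    # don't saturate.
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

w_small_fan_in = init_weights(10, 5)    # bigger weights for a small fan-in
w_big_fan_in   = init_weights(1000, 5)  # much smaller weights for a big fan-in
```

The same 1/sqrt(fan-in) factor could also be applied to the per-weight learning rates.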

Â 2:05

So suppose we have a simple net, just a linear neuron with two weights, and suppose we have some training cases. The first training case says that when the inputs are 101 and 101, you should give an output of two. And the second one says that when they are 101 and 99, you should output a zero. And I'm using color here to indicate which training case I'm talking about. If you look at the error surface you get for those two training cases, it looks like this. The green line is the line along which the weights will satisfy the first training case, and the red line is the line along which the weights will satisfy the second training case. And what we notice is that they're almost parallel, and so when you combine them, you get a very elongated ellipse.

One way to think about what's going on here is that, because we're using a squared error measure, we get a parabolic trough along the red line. The red line is the bottom of this parabolic trough that tells us the squared error we'll be getting on the red case. And there's another parabolic trough with the green line along its bottom. And it turns out, although this may surprise your spatial intuition, that if you add together two parabolic troughs, you get a quadratic bowl. An elongated quadratic bowl, in this case. So that's where that error surface came from.

Now look what happens if we subtract a hundred from each of those two input components. We get a completely different error surface. In this case, it's a circle; it's ideal. The green line is the line along which the weights add to two: we're going to take the first weight and multiply it by one, and take the second weight and multiply it by one, and we need to get two, so the weights had better add to two. The red line is the line along which the two weights are equal, because we're going to take the first weight and multiply it by one, and take the second weight and multiply it by minus one. So if the weights are equal, we'll be able to get that zero that we need.
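The effect of subtracting 100 can be checked numerically. For a linear neuron with squared error, the curvature of the error surface is given by X^T X, and the ratio of its eigenvalues tells us how elongated the ellipse is (a ratio of 1 is a circle). A small sketch:

```python
import numpy as np

X = np.array([[101.0, 101.0],   # first training case, target output 2
              [101.0,  99.0]])  # second training case, target output 0

def elongation(X):
    # For a linear neuron with squared error, the Hessian is
    # proportional to X^T X; the ratio of its largest to smallest
    # eigenvalue measures how elongated the error surface is.
    eig = np.linalg.eigvalsh(X.T @ X)
    return eig.max() / eig.min()

print(elongation(X))          # huge: a very elongated ellipse
print(elongation(X - 100.0))  # 1.0: a circular error surface
```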

Â 4:25

If you're thinking about what happens not with the inputs but with the hidden units, it makes sense to have hidden units that are hyperbolic tangents, which go between minus one and one. The hyperbolic tangent is just a rescaled logistic: tanh(x) = 2 * logistic(2x) - 1. And the reason that makes sense is that then the activities of the hidden units are roughly mean zero, and that should make the learning faster in the next layer. Of course, that's only true if the inputs to the hyperbolic tangents are distributed sensibly around zero.
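The relationship between the two nonlinearities is easy to verify numerically; the hyperbolic tangent is a logistic rescaled to run between minus one and one:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5.0, 5.0, 101)
# tanh(x) = 2 * logistic(2x) - 1, so its outputs span (-1, 1)
# instead of (0, 1) and are roughly centred on zero
assert np.allclose(np.tanh(x), 2.0 * logistic(2.0 * x) - 1.0)
```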

Â 5:01

But in that respect, a hyperbolic tangent is better than a logistic. However, there are other respects in which a logistic is better. For example, a logistic gives you a rug to sweep things under: it gives an output of zero, and if you make the input even more negative than it was, the output is still zero. So fluctuations in big negative inputs are ignored by the logistic. For the hyperbolic tangent, you have to go out to the end of its plateaus before it can ignore anything.

Â 5:30

Another thing that makes a big difference is scaling the inputs. When we use steepest descent, scaling the input values is a very simple thing to do. We transform them so that each component of the input has unit variance over the whole training set, so it has a typical value of one or minus one. Again, if we take this simple net with two weights and look at the error surface when the first input component is very small and the second is much bigger, we get an ellipse with very high curvature in the direction where the input component is big, because small changes in that weight make a big difference to the output, and very low curvature in the direction where the input component is small, because small changes to that weight hardly make any difference to the error. The color here is indicating which axis we're using, not which training example we're using, as it did in the previous slide. If we simply change the variance of the inputs, just rescale them, making the first component ten times as big and the second component ten times as small, we now get a nice circular error surface.
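Shifting and scaling together amount to standardizing each input component. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical training inputs: the first component is tiny,
# the second is a hundred times bigger
X = rng.normal(size=(1000, 2)) * np.array([0.1, 10.0])

# shift to zero mean and scale to unit variance, per component
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~ [0, 0]
print(X_std.std(axis=0))   # [1, 1]
```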

Â 6:49

Shifting and scaling the inputs are very simple things to do. Something that's a bit more complicated actually works even better, because it's guaranteed to give you a circular error surface, at least for a linear neuron. What we do is try to decorrelate the components of the input vectors. In other words, take two components and look at how they're correlated with one another over the whole training set. Like the earlier example, where the number of portions of chips and the number of portions of ketchup might be highly correlated. We want to try to get rid of those correlations. That will make learning much easier.

There are actually many ways to decorrelate things. For those of you who know about principal components analysis, a very sensible thing to do is to apply principal components analysis, remove the components that have the smallest eigenvalues, which already achieves some dimensionality reduction, and then scale the remaining components by dividing them by the square roots of their eigenvalues. For a linear system, that will give you a circular error surface. If you don't know about principal components, we'll cover it later in the course.

Â 8:05

Once you've got a circular error surface, the gradient points straight towards the minimum, so learning is really easy.

Now let's talk about a few of the common problems that people encounter. One thing that can happen is that if you start with a learning rate that's much too big, you drive the hidden units to be either firmly on or firmly off. That is, their incoming weights become very big and positive or very big and negative, and their state no longer depends on the input. And of course, that means the error derivatives coming from the output won't affect them, because they're on the plateaus where the derivative is basically zero, and so learning will stop. Because people are expecting to see local minima, when learning stops they say, oh, I'm at a local minimum and the error's terrible, so there are these really bad local minima. Usually that's not true. Usually it's because you got stuck out on the end of a plateau.

Â 9:02

A second problem occurs if you are classifying things and you're using either a squared error or a cross-entropy error. The best guessing strategy is normally to make the output unit equal to the proportion of the time that it should be one.

Â 9:20

The network will fairly quickly find that strategy, and so the error will fall quickly. But, particularly if the network has many layers, it may take a long time before it improves much on that, because to improve over the guessing strategy it has to get sensible information from the input, through all the hidden layers, to the output, and that can take a long time to learn if you start with small weights. So again, you learn quickly and then the error stops decreasing, and it looks like a local minimum, but actually it's another plateau.

I mentioned earlier that towards the end of learning, you should turn down the learning rate.

You should also be careful about turning down the learning rate too soon. When you turn down the learning rate, you reduce the random fluctuations in the error due to the different gradients on different mini-batches, but of course you also reduce the rate of learning. So if you look at the red curve, you see that when we turn the learning rate down we get a quick win: the error falls. But after that we get slower learning, and if we do it too soon we're going to lose relative to the green curve. So don't turn down the learning rate too soon, or too much.

I'm now going to talk about four main ways to speed up mini-batch learning a lot. The previous things I talked about were a kind of bag of tricks for making things work better.

And these are four methods all explicitly designed to make the learning go much faster.

The first is the momentum method. In this method, we don't use the gradient to change the position of the weights. That is, if you think of the weights as a ball on the error surface, standard gradient descent uses the gradient to change the position of that ball: you simply multiply the gradient by a learning rate and change the position of the ball by that vector. In the momentum method, we use the gradient to accelerate the ball. That is, the gradient changes its velocity, and the velocity is what changes the position of the ball. The reason that's different is that the ball can have momentum. That is, it remembers previous gradients in its velocity.
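A minimal sketch of the momentum update (the learning rate and momentum constant here are arbitrary illustrative values):

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.01, momentum=0.9):
    # the gradient changes the velocity (accelerating the ball),
    # and the velocity changes the position of the weights;
    # the velocity remembers a decaying sum of previous gradients
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# toy example: roll down the error surface E(w) = w^2, gradient 2w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2.0 * w)
print(w)  # close to the minimum at 0
```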

Â 11:43

A second method for speeding up mini-batch learning is to use a separate adaptive learning rate for each parameter, and then slowly adjust that learning rate based on empirical measurements. The obvious empirical measurement is: do we keep making progress by changing the weight in the same direction, or does the gradient keep oscillating around, so that its sign keeps changing? If the sign of the gradient keeps changing, we're going to reduce the learning rate, and if it keeps staying the same, we're going to increase the learning rate.

Â 12:16

A third method is what I call rmsprop. In this method, we divide the learning rate for each weight by a running average of the magnitudes of the recent gradients for that weight. So if the gradients are big, you divide by a large number, and if the gradients are small, you divide by a small number. That deals very nicely with a wide range of different gradients. It's actually a mini-batch version of just using the sign of the gradient, which is a method called rprop that was designed for full-batch learning.
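A sketch of the rmsprop update (the decay rate and epsilon here are common illustrative defaults):

```python
import numpy as np

def rmsprop_step(w, mean_sq, grad, lr=0.01, decay=0.9, eps=1e-8):
    # keep a running average of the squared gradient for each weight,
    # then divide the gradient by the square root of that average,
    # so big and small gradients produce similar-sized steps
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq

# toy example: minimize E(w) = sum(w^2), gradient 2w
w, ms = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(1000):
    w, ms = rmsprop_step(w, ms, grad=2.0 * w)
print(w)  # both components close to 0
```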

The final way of speeding up learning, which is what optimization people would naturally recommend, is to use full-batch learning and a fancy method that takes curvature information into account, to adapt that method to work for neural nets, and then maybe to adapt it some more so it works with mini-batches. I'm not going to talk about that in this lecture.
