0:00

In this video we're going to look at the error surface for a linear neuron.

By understanding the shape of this error surface, we can understand a lot about what happens as a linear neuron is learning. We can get a nice geometrical understanding of what's happening when we learn the weights of a linear neuron by considering a space that's very like the weight space we use to understand perceptrons, but with one extra dimension. So we imagine a space in which all the horizontal dimensions correspond to the weights, and there's one vertical dimension that corresponds to the error. In this space, points on the horizontal plane correspond to different settings of the weights, and the height corresponds to the error you're making with that set of weights, summed over all training cases.

For a linear neuron, the errors you make for each set of weights define an error surface, and this error surface is a quadratic bowl. That is, if you take a vertical cross-section, it's always a parabola, and if you take a horizontal cross-section, it's always an ellipse. This is only true for linear systems with a squared error. As soon as we go to multilayer, nonlinear neural nets, this error surface gets more complicated.
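The quadratic-bowl claim is easy to check numerically. Here's a small sketch (not from the lecture; the dataset is made up): it evaluates the summed squared error of a linear neuron along a straight line in weight space and verifies that the slice is a parabola, i.e. that its second differences are constant.

```python
import numpy as np

# Made-up training set: each row of X is an input vector, t holds the targets.
X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [2.0, 2.0]])
t = np.array([5.0, 4.0, 6.0])

def error(w):
    """Squared error of the linear neuron y = X @ w, summed over training cases."""
    return np.sum((t - X @ w) ** 2)

# Take a vertical cross-section of the surface: vary w along a fixed direction.
w0 = np.array([0.0, 0.0])
direction = np.array([1.0, 0.5])
alphas = np.linspace(-2.0, 2.0, 9)                 # evenly spaced points on the line
slice_vals = np.array([error(w0 + a * direction) for a in alphas])

# A parabola sampled at even spacing has constant second differences.
second_diffs = np.diff(slice_vals, n=2)
print(np.allclose(second_diffs, second_diffs[0]))  # True: the slice is a parabola
```

Any direction works here, which is exactly what it means for the surface to be a quadratic bowl.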

As long as the weights aren't too big, the error surface will still be smooth, but it may have many local minima. Using this error surface, we can get a picture of what's happening as we do gradient descent learning using the delta rule. What the delta rule does is compute the derivative of the error with respect to the weights. If you change the weights in proportion to that derivative, that's equivalent to doing steepest descent on the error surface.
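A minimal sketch of that batch procedure, with a made-up dataset and learning rate: accumulate the error derivative over all training cases, then change the weights in proportion to it.

```python
import numpy as np

# Made-up training set: rows of X are inputs, t holds the targets.
X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [2.0, 2.0]])
t = np.array([5.0, 4.0, 6.0])

def error(w):
    return np.sum((t - X @ w) ** 2)

w = np.zeros(2)        # start somewhere on the bowl
learning_rate = 0.02   # small enough for the steepest curvature of this bowl

errors = [error(w)]
for _ in range(200):
    y = X @ w                      # outputs on all training cases (batch)
    grad = -2.0 * X.T @ (t - y)    # derivative of summed squared error w.r.t. w
    w = w - learning_rate * grad   # step at right angles to the contour lines
    errors.append(error(w))

print(errors[0], errors[-1])       # the error falls as we descend the bowl
```

Because the surface is a quadratic bowl, every step with this (sufficiently small) learning rate reduces the error.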

To put it another way, if we look at the error surface from above, we get elliptical contour lines, and the delta rule is going to take us at right angles to those elliptical contour lines, as shown in the picture. That's what happens with what's called batch learning, where we get the gradient summed over all training cases. But we could also do online learning, where after each training case we change the weights in proportion to the gradient for that single training case. That's much more like what we do in perceptrons. And, as you can see, the change in the weights moves us towards one of these constraint planes.

So in the picture on the right, there are two training cases. To get the first training case correct, the weights must lie on one of those blue lines, and to get the second training case correct, they must lie on the other blue line. So if we start at one of those red points and we compute the gradient on the first training case, the delta rule will move us perpendicularly towards that line. If we then consider the other training case, we'll move perpendicularly towards the other line.
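Those perpendicular moves can be sketched numerically (the two training cases below are made up for illustration). Each online delta-rule update w ← w + ε(t − y)x moves the weights along the input vector x, which is perpendicular to that case's constraint line w·x = t.

```python
import numpy as np

# Two made-up training cases; each one defines a constraint line w . x = t
# in weight space. The lines are nearly parallel, so convergence is slow.
cases = [(np.array([1.0, 2.0]), 4.0),
         (np.array([1.2, 1.8]), 3.0)]

w = np.array([0.0, 0.0])
learning_rate = 0.1

# Alternate between the two cases; each update moves w along that case's
# input vector, i.e. perpendicularly towards that case's line.
for step in range(4000):
    x, t = cases[step % 2]
    y = w @ x                              # the neuron's output on this case
    w = w + learning_rate * (t - y) * x    # online delta rule

# The intersection of the two lines gets both training cases exactly right.
A = np.array([x for x, t in cases])
b = np.array([t for x, t in cases])
w_star = np.linalg.solve(A, b)
print(w, w_star)   # w has zigzagged close to the intersection point
```

Because the two lines are nearly parallel, it takes thousands of these zigzag steps to get close to the intersection.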

And if we alternate between the two training cases, we'll zigzag backwards and forwards, moving towards the solution point, which is where those two lines intersect. That's the set of weights that is correct for both training cases.

Using this picture of the error surface, we can also understand the conditions that will make learning very slow. If the ellipse is very elongated, which is going to happen if the lines that correspond to the training cases are almost parallel, then the gradient is going to have a nasty property.

If you look at the red arrow in the picture, the gradient is big in the direction in which we don't want to move very far, and it's small in the direction in which we want to move a long way. So the gradient will quickly take us across the bottom of the ravine, corresponding to the narrow axis of the ellipse, but it will take a long time to take us along the ravine, corresponding to the long axis of the ellipse. That's just the opposite of what we want: we'd like the gradient to be small across the ravine and big along it, but that's not what we get. And so simple steepest descent, in which you change each weight in proportion to a learning rate times the error derivative, is going to have great difficulty with very elongated error surfaces like the one shown in the picture.
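That nasty property can be demonstrated on a made-up elongated bowl: the gradient is large across the ravine, where we only need to move a little, and small along it, where we need to move a long way, so steepest descent kills the across-ravine error quickly but crawls along the ravine.

```python
import numpy as np

# A made-up elongated quadratic bowl E(w) = 0.5 * w^T H w, minimum at the origin.
# Curvature is 100 across the ravine (axis 0) and 1 along it (axis 1).
H = np.diag([100.0, 1.0])

def grad(w):
    return H @ w

# Start slightly off-centre across the ravine but far out along it.
w = np.array([0.5, 10.0])
print(grad(w))        # [50. 10.]: gradient of 50 to move 0.5, but only 10 to move 10.0

learning_rate = 0.015  # must stay below 2/100 or the steep direction diverges
for _ in range(100):
    w = w - learning_rate * grad(w)

print(w)  # across-ravine coordinate is essentially gone; along-ravine is still far out
```

The steep direction caps the usable learning rate, and that same small rate is what makes progress along the shallow direction so slow.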
