0:00

Now that we have the preliminaries out of the way, we can get back to the central issue, which is how to learn multiple layers of features. So in this video, I'm finally going to describe the backpropagation algorithm, which was the main advance in the 1980s that led to an explosion of interest in neural networks. Before I describe backpropagation, I'm going to describe another, very obvious algorithm that does not work nearly as well, but is something that many people think of.

Now that we know how to learn the weights of logistic units, we're going to return to the central issue, which is how to learn the weights of hidden units. If you have neural networks without hidden units, they are very limited in the mappings they can model. If you add a layer of hand-coded features, as in a perceptron, you make the net much more powerful, but the difficult bit for a new task is designing the features. The learning won't solve the hard problem; you have to solve it by hand. What we'd like is a way of finding good features without requiring insights into the task or repeated trial and error, where we guess some features and see how well they work. In effect, what we need to do is automate the loop of designing features for a task and seeing how well they work. We'd like the computer to do that loop, instead of having a person in that loop.

So the thing that occurs to everybody who knows about evolution is to learn by perturbing the weights. You randomly perturb one weight (that's meant to be like a mutation) and you see if it improves the performance of the net. If it does, you save that change to the weight. You can think of this as a form of reinforcement learning: your action consists of making a small change, you check whether that pays off, and if it does, you keep that action.
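The perturb-and-keep idea can be sketched in a few lines. This is a made-up toy, not anything from the lecture: a single linear unit fit to three hypothetical training cases. But it shows the loop exactly as described: mutate one weight, evaluate the error over the whole training set, and keep the change only if the error goes down.

```python
import random

# Hypothetical training set: inputs and targets consistent with y = 2*x1 - 3*x2.
cases = [((1.0, 0.0), 2.0), ((0.0, 1.0), -3.0), ((1.0, 1.0), -1.0)]

def error(w):
    # Squared error over the whole representative set of training cases;
    # note that we must run *all* cases just to evaluate one candidate change.
    return sum((t - (w[0] * x1 + w[1] * x2)) ** 2 for (x1, x2), t in cases)

def perturb_and_keep(w, steps=5000, scale=0.1, seed=0):
    rng = random.Random(seed)
    w = list(w)
    for _ in range(steps):
        i = rng.randrange(len(w))            # pick one weight (the "mutation")
        delta = rng.uniform(-scale, scale)
        before = error(w)
        w[i] += delta
        if error(w) >= before:               # keep the change only if it helps
            w[i] -= delta
    return w

w = perturb_and_keep([0.0, 0.0])
```

Even on this two-weight toy, every candidate change costs a full pass over the data, which is exactly the inefficiency discussed next.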

Â 1:58

The problem is that it's very inefficient. Just to decide whether to change one weight, we need to do multiple forward passes on a representative set of training cases. We have to see if changing that weight improves things, and you can't judge that from one training case alone. Relative to this method of randomly changing weights and seeing if it helps, backpropagation is much more efficient. It's actually more efficient by a factor of the number of weights in the network, which could be millions.

Â 2:33

An additional problem with randomly changing weights and seeing if it helps is that towards the end of learning, any large change in a weight will nearly always make things worse, because the weights have to have the right relative values to work properly. So towards the end of learning, not only do you have to do a lot of work to decide whether each of these changes helps, but the changes themselves have to be very small.

Â 2:58

There are slightly better ways of using perturbations in order to learn. One thing we might try is to perturb all the weights in parallel and then correlate the performance gain with the weight changes. That actually doesn't really help at all. The problem is that we need to do lots and lots of trials with different random perturbations of all the weights in order to see the effect of changing one weight through the noise created by changing all the other weights. So it doesn't help to do it all in parallel. Something that does help is to randomly perturb the activities of the hidden units, instead of perturbing the weights.

Â 3:53

Since there are many fewer activities than weights, there is less that you're randomly exploring, and this makes the algorithm more efficient. But it's still much less efficient than backpropagation; backpropagation still wins by a factor of the number of neurons. So the idea behind backpropagation is that we don't know what the hidden units ought to be doing. They're called hidden units because nobody's telling us what their states ought to be. But we can compute how fast the error changes as we change a hidden activity on a particular training case.

Â 4:57

So that allows us to compute error derivatives for all of the hidden units efficiently at the same time. Once we've got those error derivatives for the hidden units, that is, once we know how fast the error changes as we change the hidden activity on that particular training case, it's easy to convert those error derivatives for the activities into error derivatives for the weights coming into a hidden unit.

So here's a sketch of how backpropagation works, for a single training case. First we have to define the error, and here we'll take the error to be the squared difference between the target value for output unit j and the actual value that the net produces for output unit j, summed over the output units (we're going to imagine there are several output units in this case). We differentiate that, and we get a familiar expression for how the error changes as you change the activity of an output unit j. I'll use a notation here where the index on a unit tells you which layer it's in. So the output layer has a typical index of j, and the layer in front of it, the hidden layer below it in the diagram, has a typical index of i. I won't bother to say which layer we're in, because the index will tell you.
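Written out, with target value tj and actual output yj (and the conventional factor of one half, so the derivative comes out clean), the error and its derivative with respect to an output unit are:

```latex
E = \tfrac{1}{2}\sum_{j \in \text{output}} (t_j - y_j)^2,
\qquad
\frac{\partial E}{\partial y_j} = -(t_j - y_j)
```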

Â 6:18

So once we've got the error derivative with respect to the output of one of these output units, we then want to use all those error derivatives in the output layer to compute the same quantity in the hidden layer that comes before the output layer. So the core of backpropagation is taking error derivatives in one layer and, from them, computing the error derivatives in the layer that comes before it. So we want to compute dE/dyi. Now obviously, when we change the output of unit i, it will change the activities of all of the output units it connects to (all three of them in the diagram), and so we have to sum up all those effects. So we're going to have an algorithm that takes the error derivatives we've already computed for the top layer, and combines them, using the same weights as we used in the forward pass, to get error derivatives in the layer below.

Â 7:25

So this slide is going to explain the backpropagation algorithm, and you really need to understand it. The first time you see it, you may have to study it for a long time. This is how you backpropagate the error derivative with respect to the output of a unit. So we'll consider an output unit j and a hidden unit i. The output of hidden unit i is yi; the output of output unit j is yj; and the total input received by output unit j is zj. The first thing we need to do is convert the error derivative with respect to yj into an error derivative with respect to zj. To do that we use the chain rule: dE/dzj equals dyj/dzj times dE/dyj.

Â 8:23

And, as we've seen before when we were looking at logistic units, dyj/dzj is just yj(1 − yj), so dE/dzj is yj(1 − yj) times the error derivative with respect to the output of unit j. So now we've got the error derivative with respect to the total input received by unit j.
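In symbols, that chain-rule step is:

```latex
\frac{\partial E}{\partial z_j}
  = \frac{\partial y_j}{\partial z_j}\,\frac{\partial E}{\partial y_j}
  = y_j\,(1 - y_j)\,\frac{\partial E}{\partial y_j}
```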

Â 8:43

Now we can compute the error derivative with respect to the output of unit i. It's going to be the sum, over all of the outgoing connections of unit i (three of them in the diagram), of the quantity dzj/dyi times dE/dzj. The first term there is how the total input to unit j changes as we change the output of unit i. We then multiply that by how the error changes as we change the total input to unit j, which we computed on the line above. And as we saw before when studying the logistic unit, dzj/dyi is just the weight on the connection, wij. So what we get is that the error derivative with respect to the output of unit i is the sum, over all the outgoing connections to the layer above, of the weight wij on that connection times a quantity we will have already computed, which is dE/dzj for the layer above. So you can see the computation looks very like what we do on the forward pass, but we're going in the other direction.
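Written out, that backpropagation step is:

```latex
\frac{\partial E}{\partial y_i}
  = \sum_{j} \frac{\partial z_j}{\partial y_i}\,\frac{\partial E}{\partial z_j}
  = \sum_{j} w_{ij}\,\frac{\partial E}{\partial z_j}
```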

What we do for each unit in the hidden layer that contains i is compute a sum, over the layer above, of a quantity in that layer times the weights on the connections. Once we've got dE/dzj, which we computed on the first line here, it's very easy to get the error derivatives for all the weights coming into unit j. dE/dwij is simply dE/dzj, which we computed already, times how zj changes as we change the weight on the connection, and that's simply the activity of the unit in the layer below, yi. So the rule for the weight derivative is just that you multiply this quantity you've computed at a unit, dE/dzj, by the activity coming in from the layer below, and that gives you the error derivative with respect to the weight.

So on this slide we have seen how we can start with dE/dyj and backpropagate to get dE/dyi. We've come backwards through one layer and computed the same quantity, the derivative of the error with respect to the output, in the previous layer. We can clearly do that for as many layers as we like. And after we've done that for all these layers, we can compute how the error changes as you change the weights on the connections. That's the backpropagation algorithm. It's an algorithm for taking one training case and computing, efficiently, for every weight in the network, how the error will change, on that particular training case, as you change the weight.
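The whole recipe for one training case can be sketched in plain Python. The network shape and numbers below are made-up illustration values (two hidden-unit outputs feeding two logistic output units), not anything from the lecture's slides, but the steps follow the derivation exactly: forward pass, then dE/dyj, dE/dzj, dE/dyi, and dE/dwij.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_one_case(y_i, w, t):
    """y_i: hidden-layer outputs; w[i][j]: weight from hidden unit i to
    output unit j; t: target values for the logistic output units j."""
    n_hidden, n_out = len(y_i), len(t)

    # Forward pass: total input z_j to each output unit, then logistic y_j.
    z = [sum(y_i[i] * w[i][j] for i in range(n_hidden)) for j in range(n_out)]
    y_j = [logistic(zj) for zj in z]

    # E = 1/2 * sum_j (t_j - y_j)^2, so dE/dy_j = -(t_j - y_j).
    dE_dy_j = [-(t[j] - y_j[j]) for j in range(n_out)]

    # Chain rule through the logistic: dE/dz_j = y_j (1 - y_j) * dE/dy_j.
    dE_dz_j = [y_j[j] * (1 - y_j[j]) * dE_dy_j[j] for j in range(n_out)]

    # Backpropagate with the same weights as the forward pass, but in the
    # other direction: dE/dy_i = sum_j w_ij * dE/dz_j.
    dE_dy_i = [sum(w[i][j] * dE_dz_j[j] for j in range(n_out))
               for i in range(n_hidden)]

    # Weight derivatives: dE/dw_ij = y_i * dE/dz_j.
    dE_dw = [[y_i[i] * dE_dz_j[j] for j in range(n_out)]
             for i in range(n_hidden)]
    return dE_dy_i, dE_dw
```

Stacking this layer-by-layer, using each layer's dE/dyi as the dE/dyj of the layer below it, gives the full algorithm for a deep net.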
