0:00

In the previous video, we talked about the backpropagation algorithm. To a lot of people seeing it for the first time, their first impression is often that wow, this is a really complicated algorithm, and there are all these different steps, and I'm not sure how they fit together. And it's kind of this black box of all these complicated steps. In case that's how you're feeling about backpropagation, that's actually okay. Backpropagation, maybe unfortunately, is a less mathematically clean, or less mathematically simple, algorithm compared to linear regression or logistic regression. And I've actually used backpropagation, you know, pretty successfully for many years. And even today I still sometimes don't feel like I have a very good sense of just what it's doing, or intuition about what backpropagation is doing. For those of you that are doing the programming exercises, those will at least mechanically step you through the different steps of how to implement backprop, so you'll be able to get it to work for yourself. And what I want to do in this video is look a little bit more at the mechanical steps of backpropagation, and try to give you a little more intuition about what those mechanical steps are doing, to hopefully convince you that, you know, it's at least a reasonable algorithm.

Â 1:13

In case, even after this video, backpropagation still seems very black box, and kind of like too many complicated steps and a little bit magical to you, that's actually okay. Even though I've used backprop for many years, sometimes it's a difficult algorithm to understand, but hopefully this video will help a little bit.

In order to better understand backpropagation, let's take another closer look at what forward propagation is doing. Here's a neural network with two input units (that is, not counting the bias unit), two hidden units in this layer, and two hidden units in the next layer, and then, finally, one output unit. Again, these counts, two, two, two, are not counting the bias units on top. In order to illustrate forward propagation, I'm going to draw this network a little bit differently.

Â 2:08

In particular, I'm going to draw this neural network with the nodes drawn as these very fat ellipses, so that I can write text in them. When performing forward propagation, we might have some particular example, say some example (x(i), y(i)), and it'll be this x(i) that we feed into the input layer. So x(i)1 and x(i)2 are the values we set the input layer to. And when we forward propagate to the first hidden layer here, what we do is compute z(2)1 and z(2)2, the weighted sums of inputs of the input units. Then we apply the sigmoid, or logistic, activation function to the z values, and that gives us the activation values a(2)1 and a(2)2. Then we forward propagate again to get z(3)1, apply the sigmoid, the logistic activation function, to that to get a(3)1, and similarly, like so, until we get z(4)1. Apply the activation function, and this gives us a(4)1, which is the final output value of the neural network.

Â 3:24

Let me erase this arrow to give myself some more space. If you look at what this computation really is doing, focusing on this hidden unit, let's say, we have these weights. Shown in magenta there is my weight theta(2)10 (the indexing is not important); this weight here, which I'm highlighting in red, that is theta(2)11; and this weight here, which I'm drawing in cyan, is theta(2)12. So the way we compute this value z(3)1 is: z(3)1 is equal to this magenta weight times this value, so that's theta(2)10 times 1 (the bias unit's output), plus this red weight times this value, so that's theta(2)11 times a(2)1, and finally this cyan weight times this value, which is therefore plus theta(2)12 times a(2)2. And so that's forward propagation.
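The forward pass just described can be sketched in a few lines of code. This is only an illustrative sketch, not code from the course: the weight values are made up, and `forward_prop` simply repeats the weighted-sum-then-sigmoid step for each layer of the 2-2-2-1 network.

```python
import numpy as np

def sigmoid(z):
    # The sigmoid (logistic) activation function g(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def forward_prop(x, Thetas):
    # Each Theta maps one layer to the next; column 0 multiplies the
    # bias unit, whose output is always +1.
    a = np.asarray(x, dtype=float)
    activations = [a]
    for Theta in Thetas:
        # e.g. z(3)1 = theta(2)10*1 + theta(2)11*a(2)1 + theta(2)12*a(2)2
        z = Theta @ np.concatenate(([1.0], a))
        a = sigmoid(z)
        activations.append(a)
    return activations

# Made-up weights for a 2-2-2-1 network like the one in the video.
Theta1 = np.array([[0.1, 0.3, -0.2],
                   [0.2, -0.4, 0.1]])   # input layer -> first hidden layer
Theta2 = np.array([[-0.3, 0.5, 0.2],
                   [0.4, -0.1, 0.3]])   # first hidden -> second hidden
Theta3 = np.array([[0.2, 0.6, -0.5]])   # second hidden -> output layer

acts = forward_prop([1.0, 0.5], [Theta1, Theta2, Theta3])
h = acts[-1][0]  # a(4)1, the network's final output h(x)
```

Because every unit applies the sigmoid, the output `h` always lands strictly between 0 and 1, regardless of the weight values.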

And it turns out that, as we'll see later in this video, what backpropagation is doing is a process very similar to this, except that instead of the computations flowing from the left to the right of the network, the computations instead flow from the right to the left of the network, using a very similar computation to this. I'll say in two slides exactly what I mean by that.

To better understand what backpropagation is doing, let's look at the cost function. This is just the cost function that we have when we have only one output unit. If we have more than one output unit, we just have a summation over the output units, indexed by k there; if you have only one output unit, then this is the cost function. And we do forward propagation and backpropagation on one example at a time, so let's just focus on a single example (x(i), y(i)), and focus on the case of having one output unit, so y(i) here is just a real number. And let's ignore regularization, so lambda equals 0, and this final term, the regularization term, goes away. Now if you look inside the summation, you find the cost term associated with the training example, that is, the cost associated with the training example (x(i), y(i)). That's going to be given by this expression; so the cost of example i is written as follows. And what this cost function does is play a role similar to the squared error. So rather than looking at this complicated expression, if you want, you can think of cost(i) as being approximately the squared difference between what the neural network outputs and what the actual value is. Just as in logistic regression, we actually prefer to use the slightly more complicated cost function using the log, but for the purpose of intuition, feel free to think of the cost function as being sort of the squared error cost function. And so this cost(i) measures how well the network is doing on correctly predicting example i, that is, how close the output is to the actual observed label y(i).
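As a small illustration (not code from the course), here is the per-example logistic cost alongside the squared-error approximation used for intuition; both shrink as the output h approaches the label y.

```python
import numpy as np

def cost_i(h, y):
    # Per-example logistic cost for a single output unit, y in {0, 1}:
    # cost(i) = -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

def squared_error_i(h, y):
    # The simpler intuition: cost(i) is roughly (h - y)^2.
    return (h - y) ** 2

# A confident, correct prediction costs little; a confident, wrong one costs a lot.
good = cost_i(0.9, 1)   # output 0.9 when the label is 1: small cost
bad = cost_i(0.9, 0)    # output 0.9 when the label is 0: large cost
```

The squared-error version ranks these two cases the same way, which is why it works as an intuition for the log-based cost.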

Now let's look at what backpropagation is doing. One useful intuition is that backpropagation is computing these delta superscript l subscript j terms, and we can think of these as the quote-unquote "error" of the activation value that we got for unit j in the lth layer.

Â 8:06

And so they're a measure of how much we would like to change the neural network's weights in order to affect these intermediate values of the computation, so as to affect the final output of the neural network h(x), and therefore affect the overall cost. More formally, the delta terms can be thought of as partial derivatives of the cost with respect to these intermediate values. In case that partial derivative intuition doesn't make sense, don't worry about it; we can do without really talking about partial derivatives. But let's look in more detail at what backpropagation is doing. For the output layer, we first set this delta term, delta(4)1, to y(i) minus a(4)1, if we're doing forward propagation and backpropagation on this training example i. So this is really the error, right? It's the difference between the actual value of y and the value that was predicted, and so we're going to compute delta(4)1 like so. Next we're going to propagate these values backwards (I'll explain this in a second) and end up computing the delta terms for the previous layer: we're going to end up with delta(3)1 and delta(3)2. Then we're going to propagate this further backward, and end up computing delta(2)1 and delta(2)2.

Now the backpropagation calculation is a lot like running the forward propagation algorithm, but doing it backwards. So here's what I mean. Let's look at how we end up with this value of delta(2)2. So we have delta(2)2, and similar to forward propagation, let me label a couple of the weights. This weight, which I'm going to draw in cyan, let's say that weight is theta(2)12, and this one down here, which I'll highlight in red, that is going to be, let's say, theta(2)22. If we look at how delta(2)2 is computed, how it's computed at this node, it turns out that what we're going to do is take this value and multiply it by this weight, and add it to this value multiplied by that weight. So it's really a weighted sum of these delta values, weighted by the corresponding edge strengths. So concretely, let me fill this in: this delta(2)2 is going to be equal to theta(2)12, that's the cyan weight, times delta(3)1, plus the thing I had in red, that's theta(2)22 times delta(3)2. So it's really literally this cyan weight times this value, plus this red weight times this value. And that's how we wind up with that value of delta.

And just as another example, let's look at this value. How do we get that value? Well, it's a similar process. If this weight, which I'm going to highlight in green, if this weight is equal to, say, theta(3)12, then we have that delta(3)2 is going to be equal to that green weight, theta(3)12, times delta(4)1.
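The backward pass just described can be sketched the same way. Note that this is a sketch of the simplified weighted-sum intuition from the video, with made-up weight values; the full backpropagation algorithm also multiplies each hidden delta by the derivative of the activation function, which this intuition omits. The bias columns (column 0 of each Theta) are dropped, since the bias units' deltas are not needed.

```python
import numpy as np

# Made-up weights for the same 2-2-2-1 network; column 0 is the bias column.
Theta2 = np.array([[-0.3, 0.5, 0.2],
                   [0.4, -0.1, 0.3]])   # maps layer 2 -> layer 3
Theta3 = np.array([[0.2, 0.6, -0.5]])   # maps layer 3 -> layer 4 (output)
a4 = np.array([0.7])                     # network output a(4)1 for this example
y = 1.0                                  # the label y(i)

# Output layer: the "error" is label minus prediction.
delta4 = np.array([y - a4[0]])           # delta(4)1 = y(i) - a(4)1

# Each hidden delta is a weighted sum of the next layer's deltas, weighted
# by the connecting edge strengths (bias column dropped).
delta3 = Theta3[:, 1:].T @ delta4        # e.g. delta(3)2 = theta(3)12 * delta(4)1
delta2 = Theta2[:, 1:].T @ delta3        # e.g. delta(2)2 = theta(2)12*delta(3)1 + theta(2)22*delta(3)2
```

Notice that each step is the transpose of the corresponding forward-pass multiplication: the same weights, applied right to left.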

And by the way, so far I've been writing the delta values only for the hidden units, excluding the bias units. Depending on how you define the backpropagation algorithm, or depending on how you implement it, you know, you may end up implementing something that computes delta values for these bias units as well. The bias units always output the value of plus one, and they are just what they are; there's no way for us to change their value. And so, depending on your implementation of backprop, the way I usually implement it, I do end up computing these delta values, but we just discard them; we don't use them, because they don't end up being part of the calculation needed to compute a derivative.

So hopefully that gives you a little better intuition about what backpropagation is doing. In case all of this still seems sort of magical, sort of black box, in a later video, the "putting it together" video, I'll try to give a little bit more intuition about what backpropagation is doing. But unfortunately this is a difficult algorithm to try to visualize and understand what it is really doing. Fortunately, though, I have been, and I guess many people have been, using it very successfully for many years, and if you implement the algorithm, you can have a very effective learning algorithm, even though the inner workings of exactly how it works can be harder to visualize.
