
In the last video, you saw the equations for backpropagation. In this video, let's go over some intuition using the computation graph for how those equations were derived. This video is completely optional, so feel free to watch it or not. You should be able to do the homework either way.

So, recall that when we talk about logistic regression, we had this forward pass where we compute Z, then A, and then the loss. And then to take the derivatives, we had this backward pass where we could first compute DA, then go on to compute DZ, and then go on to compute DW and DB.

So, the definition for the loss was L(A, Y) = -Y log A - (1 - Y) log(1 - A). So, if you are familiar with calculus and you take the derivative of this with respect to A, that would give you the formula for DA. So, DA is equal to that. And if we actually work through the calculus, you could show that this is -Y/A + (1 - Y)/(1 - A). You can just derive that from calculus by taking the derivative of this.

It turns out that when you take another step backwards to compute DZ, we did work out that DZ is equal to A - Y. I did explain why previously, but it turns out that from the chain rule of calculus, DZ is equal to DA times G prime of Z.

Here, G(Z) = sigmoid(Z) is our activation function for this output unit in logistic regression, right? So, just remember this is still logistic regression, where we have X1, X2, X3, and then just one sigmoid unit, and that gives us A, which gives us Y hat. So here, the activation function was a sigmoid function.

And as an aside, only for those of you familiar with the chain rule from calculus: the reason for this is that A is equal to sigmoid of Z, and so the partial of L with respect to Z is equal to the partial of L with respect to A times dA/dZ. Since A is equal to sigmoid of Z, dA/dZ is equal to d/dZ of G(Z), which is equal to G prime of Z. So, that's why this expression, which is DZ in our code, is equal to this expression, which is DA in our code, times G prime of Z. And so this is just that.

So, that last derivation would make sense only if you're familiar with calculus, and specifically the chain rule from calculus. But if not, don't worry about it. I'll try to explain the intuition wherever it's needed.

And then finally, having computed DZ for logistic regression, we will compute DW, which turns out to be DZ times X, and DB, which is just DZ, when you have a single training example. So, that was logistic regression.
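As a minimal sketch, those logistic regression steps might look like this in NumPy; the feature values, weights, and label are made up for illustration and are not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single example with three features (values are made up).
x = np.array([0.5, -1.2, 2.0])
y = 1.0
w = np.array([0.1, 0.2, -0.1])
b = 0.0

# Forward pass: Z, then A, then the loss L(A, Y).
z = np.dot(w, x) + b
a = sigmoid(z)
loss = -y * np.log(a) - (1 - y) * np.log(1 - a)

# Backward pass: DA, then DZ, then DW and DB.
da = -y / a + (1 - y) / (1 - a)   # dL/da
dz = da * a * (1 - a)             # chain rule: da * g'(z), with g'(z) = a(1 - a)
dw = dz * x                       # dL/dw
db = dz                          # dL/db

# dz collapses to a - y, as stated in the video.
assert np.isclose(dz, a - y)
```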

So, what we're going to do when computing backpropagation for a neural network is a calculation a lot like this, only we'll do it twice, because now we have not X going to an output unit, but X going to a hidden layer and then going to an output unit. And so, instead of this computation being sort of one step as we have here, we'll have two steps in this kind of neural network with two layers.

So, in this two-layer neural network, that is, we have the input layer, a hidden layer, and then an output layer. Remember the steps of the computation: first you compute Z1 using this equation, then compute A1, and then you compute Z2. And notice that Z2 also depends on the parameters W2 and B2. And then, based on Z2, compute A2, and then finally that gives you the loss.
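As a sketch, that forward pass might look like this in NumPy; the layer sizes, the random initialization, and the choice of tanh for the hidden activation G1 are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_1, n_2 = 3, 4, 1              # assumed layer sizes: input, hidden, output
x = rng.standard_normal((n_x, 1))    # one training example, as a column vector
y = 1.0                              # assumed label

W1 = rng.standard_normal((n_1, n_x)) * 0.01
b1 = np.zeros((n_1, 1))
W2 = rng.standard_normal((n_2, n_1)) * 0.01
b2 = np.zeros((n_2, 1))

# The forward steps from the video: Z1, A1, Z2, A2, then the loss.
Z1 = W1 @ x + b1
A1 = np.tanh(Z1)                     # hidden activation G1 (tanh assumed here)
Z2 = W2 @ A1 + b2                    # Z2 depends on the parameters W2 and b2
A2 = sigmoid(Z2)                     # output activation: sigmoid
loss = (-y * np.log(A2) - (1 - y) * np.log(1 - A2)).item()
```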

What backpropagation does is go backward to compute DA2 and then DZ2, then go back to compute DW2 and DB2, then go backwards to compute DA1, DZ1, and so on.

We don't need to take derivatives with respect to the input X, since the input X for supervised learning is fixed. We're not trying to optimize X, so we won't bother to take derivatives with respect to X, at least for supervised learning. I'm going to skip explicitly computing DA2.

If you want, you can actually compute DA2 and then use that to compute DZ2, but in practice, you could collapse both of these steps into one step, so you end up with DZ2 = A2 - Y, same as before. And I'm also going to write DW2 and DB2 down here below. You have that DW2 = DZ2 * A1 transpose, and DB2 = DZ2.

This step is quite similar to logistic regression, where we had that DW = DZ * X, except that now A1 plays the role of X, and there's an extra transpose there because of the relationship between the capital matrix W and our individual parameters. There's a transpose there, right? Because W = [---] in the case of logistic regression with a single output, DW2 is like that, whereas W here was a column vector. That's why there's an extra transpose for A1, whereas we didn't have one for X for logistic regression.

This completes half of backpropagation.

Then, again, you can compute DA1 if you wish, although in practice the computations for DA1 and DZ1 are usually collapsed into one step. And so, what you'll actually implement is DZ1 = W2 transpose * DZ2, and then times, as an element-wise product, G1 prime of Z1.

And just to do a check on the dimensions, right? If you have a neural network that looks like this, outputting Y hat: if you have N0 = NX input features, N1 hidden units, and N2 output units (N2, in our case, is just one output unit), then the matrix W2 is (N2, N1) dimensional, and Z2, and therefore DZ2, are going to be N2 by one dimensional. This really is going to be one by one when we are doing binary classification, and Z1, and therefore also DZ1, are going to be N1 by one dimensional, right?

Note that for any variable, foo and D foo always have the same dimension. That's why W and DW always have the same dimension, and similarly for B and DB, Z and DZ, and so on.

To make sure that the dimensions of this all match up, we have that DZ1 = W2 transpose times DZ2, and then this is an element-wise product with G1 prime of Z1.

Matching the dimensions from above, this is going to be N1 by one = W2 transpose; when we take the transpose of this, it's going to be N1 by N2 dimensional. DZ2 is going to be N2 by one dimensional, and then this, which is the same dimension as Z1, is also N1 by one dimensional, so it's an element-wise product. The dimensions do make sense, right? An N1 by one dimensional vector can be obtained as an N1 by N2 dimensional matrix times an N2 by one, because the product of these two things gives you an N1 by one dimensional matrix, and so this becomes the element-wise product of two N1 by one dimensional vectors, and so the dimensions do match.

One tip when implementing back prop: if you just make sure that the dimensions of your matrices match up, that is, you think through what the dimensions of the various matrices are, including W1, W2, Z1, Z2, A1, A2, and so on, and just make sure that the dimensions of these matrix operations match up, sometimes that will already eliminate quite a lot of bugs in back prop.
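One way to act on that tip is to record the expected shape of each quantity and assert that every gradient has the same shape as its variable. This is just a hypothetical helper, not code from the course; the layer sizes are assumed for illustration:

```python
import numpy as np

# Assumed layer sizes: n0 inputs, n1 hidden units, n2 outputs, m examples.
n0, n1, n2, m = 3, 4, 1, 1

# Expected shapes of every quantity in the two-layer network.
expected = {
    "W1": (n1, n0), "b1": (n1, 1),
    "W2": (n2, n1), "b2": (n2, 1),
    "Z1": (n1, m),  "A1": (n1, m),
    "Z2": (n2, m),  "A2": (n2, m),
}

def check_shapes(values, grads):
    """Assert each value has its expected shape, and that each gradient
    d_foo has the same shape as foo."""
    for name, value in values.items():
        assert value.shape == expected[name], name
        if "d" + name in grads:
            assert grads["d" + name].shape == value.shape, "d" + name
```

Running a check like this after each backprop step catches mismatched matrix operations early.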

All right. This gives us DZ1, and then finally, just to wrap up, DW1 and DB1. We should write them here, I guess, but since I'm running out of space on the right of the slide, DW1 and DB1 are given by the following formulas: DW1 is going to be equal to DZ1 times X transpose, and DB1 is going to be equal to DZ1.

You might notice a similarity between these equations and these equations, which is really no coincidence, because X plays the role of A0, so X transpose is A0 transpose. Those equations are actually very similar.

That gives a sense for how backpropagation is derived. We have six key equations here, for DZ2, DW2, DB2, DZ1, DW1, and DB1. Let me just take these six equations and copy them over to the next slide. Here they are.
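As a sketch, the six equations for a single training example might be implemented like this; the function name and the g1_prime argument (the derivative of the hidden activation) are illustrative assumptions:

```python
import numpy as np

def backprop_single_example(x, y, Z1, A1, A2, W2, g1_prime):
    """The six backprop equations from the video, for one training example.

    x is (n0, 1), y is a scalar label, and g1_prime is the derivative of
    the hidden activation, applied element-wise to Z1 (names are assumed).
    """
    dZ2 = A2 - y                         # (n2, 1)
    dW2 = dZ2 @ A1.T                     # (n2, n1): note the transpose on A1
    db2 = dZ2                            # (n2, 1)
    dZ1 = (W2.T @ dZ2) * g1_prime(Z1)    # (n1, 1): element-wise product
    dW1 = dZ1 @ x.T                      # (n1, n0)
    db1 = dZ1                            # (n1, 1)
    return dZ2, dW2, db2, dZ1, dW1, db1
```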

So far, we have derived backpropagation for if you are training on a single training example at a time, but it should come as no surprise that, rather than working on a single example at a time, we would like to vectorize across different training examples.

Remember that for forward propagation, when we're operating on one example at a time, we had equations like this, as well as, say, A1 = G1(Z1). In order to vectorize, we took, say, the Zs and stacked them up in columns like this, up through Z1 for the m-th example, and called this capital Z1.

Then we found that by stacking things up in columns and defining the capital, uppercase version of this, we then just had Z1 = W1 X + B1 and A1 = G1(Z1), right?

We defined the notation very carefully in this course to make sure that stacking examples into different columns of a matrix makes all this work out. It turns out that if you go through the math carefully, the same trick also works for backpropagation, so the vectorized equations are as follows.

First, if you take these DZs for different training examples and stack them as the different columns of a matrix, and the same for this, and the same for this, then this is the vectorized implementation. And then here's the definition for, or here's how you can compute, DW2. There is this extra 1/M because the cost function J is this 1/M of the sum, for I equals one through M, of the losses. When computing the derivatives, we have that extra 1/M term, just as we did when we were computing the weight updates for logistic regression.

Â That's the update you get for DB2.

Â Again, some of the DZs and then with a 1/M and then DZ1 is computed as follows.

Â Once again, this is an element Y's product only whereas previously,

Â we saw on the previous slide that this was an N1 by one dimensional vector.

Â Now, this is a N1 by M dimensional matrix.

Â Both of these are also N1 by M dimensional.

Â That's why that asterisk is element Y's product and then finally,

Â the remaining two updates.

Â Perhaps it shouldn't look too surprising.
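A minimal sketch of these vectorized equations, assuming the m examples are stacked as the columns of X, Y, Z1, A1, and A2; the function name and the g1_prime argument are illustrative assumptions:

```python
import numpy as np

def backprop_vectorized(X, Y, Z1, A1, A2, W2, g1_prime):
    """Vectorized backprop over m examples stacked as columns.

    X is (n0, m), Y is (1, m), Z1 and A1 are (n1, m), A2 is (n2, m).
    """
    m = X.shape[1]
    dZ2 = A2 - Y                                        # (n2, m)
    dW2 = (1 / m) * dZ2 @ A1.T                          # (n2, n1), with the 1/m
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # sum of the dZ columns
    dZ1 = (W2.T @ dZ2) * g1_prime(Z1)                   # (n1, m), element-wise product
    dW1 = (1 / m) * dZ1 @ X.T                           # (n1, n0)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW2, db2, dW1, db1
```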

I hope that gives you some intuition for how the backpropagation algorithm is derived. In all of machine learning, I think the derivation of the backpropagation algorithm is actually one of the most complicated pieces of math I've seen, and it requires knowing both linear algebra as well as derivatives of matrices to re-derive it from scratch, from first principles. If you are an expert in matrix calculus, using this process, you might try to derive the whole algorithm yourself. But I think there are actually plenty of deep learning practitioners that have seen the derivation at about the level you've seen in this video, already have all the right intuitions, and are able to implement this algorithm very effectively.

If you are an expert in calculus, do see if you can derive the whole thing from scratch. It is one of the very hardest pieces of math, one of the very hardest derivations, that I've seen in all of machine learning. Either way, if you implement this, it will work, and I think you have enough intuition to tune it and get it to work.

There's just one last detail I want to share with you before you implement your neural network, which is how to initialize the weights of your neural network. It turns out that initializing your parameters randomly, rather than to all zeros, is very important for training your neural network. In the next video, you'll see why.
