Now that we know what a neural network looks like, all we have to do is find these derivatives in order to do gradient descent. Recall that the goal is to adjust each one of the highlighted weights and biases in order to reduce the loss function L(y, y hat). So basically, we need to see how each one of these weights affects the loss, and that is what the partial derivative of L with respect to Wij tells us. Same thing with the biases: how do they affect the loss? Del L over del bi tells us that. Likewise, this partial derivative tells us how the weights W1 and W2 affect the loss, and this one tells us how the bias b affects the loss. In other words, these partial derivatives tell us exactly in what direction to move each one of the weights and biases in order to reduce the loss. So that's exactly what we're going to do: calculate each one of these in order to reduce the log loss function.

Let's simplify and only look at these ones over here: del L over del W11, del L over del W21, and del L over del b1. And let's also look at del L over del W1 and del L over del b. Recall the way we calculated the output of the red node. First, z1 is this summation. Then we applied sigmoid to get a1. Then we took the summation of the inputs a1 and a2 times the weights W1 and W2, plus the bias b, to get z, and we applied sigmoid to this z to get y hat. And then with y hat, we found the log loss L(y, y hat).

So there are a lot of variables in between L and W11, and we're just going to keep track of all of them to do a humongous chain rule. In order to reduce this log loss, we need del L over del y hat, because the loss L depends on y hat. Now y hat depends on z, so we need this derivative too, del y hat over del z. Now z depends on a1, so we need del z over del a1. And a1 depends on z1, so we need del a1 over del z1. Finally, z1 depends on W11, so we need del z1 over del W11. That's a huge chain rule.
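The forward computation recalled above can be sketched in a few lines of Python. This is only an illustration, not the lecture's own code: the names `sigmoid`, `forward`, and `log_loss`, the dictionary keys, and the numeric values are all made up for the example. It also assumes the index convention suggested by grouping W11, W21, and b1 together, namely that Wij connects input xi to hidden node j, and it assumes the log loss has its usual form.

```python
import math

def sigmoid(t):
    """Logistic function (sigma in the lecture)."""
    return 1.0 / (1.0 + math.exp(-t))

def forward(p, x1, x2):
    """Forward pass of a 2-input, 2-hidden-node, 1-output network.

    p is a dict of parameters; the key names are illustrative assumptions.
    """
    z1 = p["W11"] * x1 + p["W21"] * x2 + p["b1"]  # summation at hidden node 1
    a1 = sigmoid(z1)
    z2 = p["W12"] * x1 + p["W22"] * x2 + p["b2"]  # summation at hidden node 2
    a2 = sigmoid(z2)
    z = p["W1"] * a1 + p["W2"] * a2 + p["b"]      # summation at the output node
    y_hat = sigmoid(z)
    return {"z1": z1, "a1": a1, "z2": z2, "a2": a2, "z": z, "y_hat": y_hat}

def log_loss(y, y_hat):
    """Log loss in its usual form (assumed here): -y ln(y_hat) - (1 - y) ln(1 - y_hat)."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# Arbitrary illustrative parameter values and a single data point.
params = {"W11": 0.4, "W21": -0.3, "b1": 0.1,
          "W12": 0.8, "W22": 0.2, "b2": -0.5,
          "W1": 1.2, "W2": -0.7, "b": 0.05}
out = forward(params, x1=1.5, x2=-2.0)
loss = log_loss(1.0, out["y_hat"])
```

Every intermediate value (z1, a1, z, y hat) is returned on purpose, since each one shows up as a link in the chain rule.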
Here is how that chain rule looks: del L over del W11, which is the one we want to find, is equal to del z1 over del W11 × del a1 over del z1 × del z over del a1 × del y hat over del z × del L over del y hat. The product of all these is a really long chain rule, but that's what gives us del L over del W11, the one that we really want. Now let's calculate each one of these separately; each one of these terms is actually really easy. What's del z1 over del W11, the one over here? Well, it's simply x1, because W11 is the variable, x1 is the constant accompanying it, and everything else is a constant, so it has derivative 0. Now let's move to del a1 over del z1. That one is easy too, because it's a sigmoid, and the derivative of sigmoid is sigmoid times 1 - sigmoid. So this is a1 × (1 - a1). Now what is del z over del a1? Again, this is a linear equation, so the derivative is W1. And what's del y hat over del z? This is again a sigmoid, so it's y hat × (1 - y hat). And finally, what's del L over del y hat? That's the harder one, but we've already calculated this one plenty of times. So our derivative del L over del W11 is the product of all these things. These two cancel out, the y hat × (1 - y hat) from the sigmoid and the denominator of del L over del y hat, and so we get this expression over here.

So in order to find the optimal value of W11 that gives the least error, we perform gradient descent with this formula. And what's del L over del W11? It's simply this over here. So that is how we change W11. And as you can imagine, that's how you change any Wij; you just have to change the subscripts here. I'll tell you how to do it in a minute. But for the sake of redundancy, and to really nail this down, let's do the bias. We're going to go a little faster this time. In order to reduce the loss again, we need del L over del y hat. And y hat depends on z, which in turn depends on a1, which in turn depends on z1, which in turn depends on b1.
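The W11 chain rule worked out above can be checked numerically. The sketch below multiplies the five factors together and compares the product with a central finite-difference estimate of the same derivative. The function names `loss_for` and `grad_W11`, the dictionary keys, and all numeric values are hypothetical. The log loss is assumed to have its usual form, and Wij is assumed to connect input xi to hidden node j.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def hidden_and_output(W11, x1, x2, p):
    """Return (a1, y_hat) for a given W11, with the other parameters taken from p."""
    a1 = sigmoid(W11 * x1 + p["W21"] * x2 + p["b1"])
    a2 = sigmoid(p["W12"] * x1 + p["W22"] * x2 + p["b2"])
    y_hat = sigmoid(p["W1"] * a1 + p["W2"] * a2 + p["b"])
    return a1, y_hat

def loss_for(W11, x1, x2, y, p):
    """Log loss (usual form, assumed) viewed as a function of W11 only."""
    _, y_hat = hidden_and_output(W11, x1, x2, p)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def grad_W11(W11, x1, x2, y, p):
    """del L / del W11 as the product of the five chain-rule factors."""
    a1, y_hat = hidden_and_output(W11, x1, x2, p)
    dz1_dW11 = x1                                     # z1 is linear in W11
    da1_dz1 = a1 * (1 - a1)                           # derivative of sigmoid
    dz_da1 = p["W1"]                                  # z is linear in a1
    dyhat_dz = y_hat * (1 - y_hat)                    # derivative of sigmoid
    dL_dyhat = -(y - y_hat) / (y_hat * (1 - y_hat))   # derivative of log loss
    return dz1_dW11 * da1_dz1 * dz_da1 * dyhat_dz * dL_dyhat

# Arbitrary illustrative values.
p = {"W21": -0.3, "b1": 0.1, "W12": 0.8, "W22": 0.2, "b2": -0.5,
     "W1": 1.2, "W2": -0.7, "b": 0.05}
W11, x1, x2, y = 0.4, 1.5, -2.0, 1.0

analytic = grad_W11(W11, x1, x2, y, p)

# Central finite difference: nudge W11 slightly in both directions.
eps = 1e-6
numeric = (loss_for(W11 + eps, x1, x2, y, p)
           - loss_for(W11 - eps, x1, x2, y, p)) / (2 * eps)

# After the cancellation, the product reduces to this shorter expression.
a1, y_hat = hidden_and_output(W11, x1, x2, p)
simplified = -x1 * a1 * (1 - a1) * p["W1"] * (y - y_hat)
```

If the chain rule has been applied correctly, `analytic`, `numeric`, and `simplified` should agree up to small numerical error.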
So the long chain rule is this one over here: the product of all these derivatives is equal to del L over del b1. Let's calculate each one of these separately, and quickly, because we've already done most of them before. Del L over del b1 is the product that starts with del z1 over del b1, which is easy now: it's 1. Why is it 1? Because b1 is the only variable here, so this may as well be b1 plus some constant, and the derivative of that with respect to b1 is simply 1. And the rest we already know: this is the sigmoid function, this is a linear function, this is again the sigmoid function, and this one we've already calculated before. So things cancel out. In order to find the optimal value of b1, we perform gradient descent with this del L over del b1, which we've calculated already, and it's this.

So let me summarize. The way we update W11 is like this. For W12, it's slightly different. For b1, it's this. And then for the other three weights and biases, it's this. In short, if you do these updates for a learning rate alpha, you get better weights and biases. So that's how you update the first layer of the neural network.

Now let's update the second layer. This one is much easier because we have to keep track of fewer things. Here is how you do it. You want to minimize L(y, y hat), so you want to decrease it. Del L over del y hat is the first derivative that you need, because L depends on y hat. Y hat depends on z, and z depends on W1. Therefore you need this chain rule over here in order to calculate del L over del W1. Let's calculate each term separately again. What's del z over del W1? It's the derivative of this expression over here, which is a linear function, so it's a1. This one over here is a sigmoid, so the derivative is y hat × (1 - y hat). And this one over here we calculated already. So things cancel out, and we get something very simple.
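Written out, the two chain rules just described look roughly as follows. This is a sketch that assumes the log loss has its usual form, L(y, ŷ) = -y ln ŷ - (1 - y) ln(1 - ŷ); the slides the speaker points at are not reproduced here, so the exact layout of the lecture's formulas may differ.

```latex
\begin{align*}
\frac{\partial L}{\partial \hat{y}}
  &= \frac{-(y - \hat{y})}{\hat{y}\,(1 - \hat{y})} \\[6pt]
\frac{\partial L}{\partial b_1}
  &= \underbrace{1}_{\partial z_1 / \partial b_1}
     \cdot \underbrace{a_1 (1 - a_1)}_{\partial a_1 / \partial z_1}
     \cdot \underbrace{W_1}_{\partial z / \partial a_1}
     \cdot \underbrace{\hat{y}\,(1 - \hat{y})}_{\partial \hat{y} / \partial z}
     \cdot \frac{-(y - \hat{y})}{\hat{y}\,(1 - \hat{y})}
   = -\,a_1 (1 - a_1)\, W_1\, (y - \hat{y}) \\[6pt]
\frac{\partial L}{\partial W_1}
  &= \underbrace{a_1}_{\partial z / \partial W_1}
     \cdot \underbrace{\hat{y}\,(1 - \hat{y})}_{\partial \hat{y} / \partial z}
     \cdot \frac{-(y - \hat{y})}{\hat{y}\,(1 - \hat{y})}
   = -\,a_1\, (y - \hat{y})
\end{align*}
```

In both cases the factor ŷ(1 - ŷ) coming from the output sigmoid cancels against the denominator of ∂L/∂ŷ, which is why the final expressions are so short.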
To find the optimal value of W1, you perform gradient descent with this formula over here, and since you've already calculated the derivative, you get this. This is how you update W1. I'm going to rewrite it like this, by turning a double negative into a positive. And this is how you update W2, and this is how you update the bias. So in short, we've seen how to update each one of the weights and biases of the neural network. When we do this, we actually get a much better set of weights and biases. And that's how you train a neural network.
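Putting the pieces together, one gradient descent step over all nine parameters might be sketched as below. This is an illustrative reconstruction under stated assumptions, not the lecture's code: the log loss is taken in its usual form, Wij is assumed to connect input xi to hidden node j, and the names `predict` and `gradient_step`, the starting values, the single training point, and the learning rate are all invented for the example. Each update is written with a plus sign, following the double-negative rewrite mentioned above.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(y, y_hat):
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def predict(p, x1, x2):
    """Forward pass; returns the hidden activations and the prediction."""
    a1 = sigmoid(p["W11"] * x1 + p["W21"] * x2 + p["b1"])
    a2 = sigmoid(p["W12"] * x1 + p["W22"] * x2 + p["b2"])
    y_hat = sigmoid(p["W1"] * a1 + p["W2"] * a2 + p["b"])
    return a1, a2, y_hat

def gradient_step(p, x1, x2, y, alpha):
    """Return new parameters after one gradient descent step on one data point."""
    a1, a2, y_hat = predict(p, x1, x2)
    err = y - y_hat                        # left over after the cancellation
    d1 = a1 * (1 - a1) * p["W1"] * err     # shared factor for hidden node 1
    d2 = a2 * (1 - a2) * p["W2"] * err     # shared factor for hidden node 2
    return {
        # First layer: weights pick up the matching input, biases pick up 1.
        "W11": p["W11"] + alpha * x1 * d1,
        "W21": p["W21"] + alpha * x2 * d1,
        "b1":  p["b1"]  + alpha * d1,
        "W12": p["W12"] + alpha * x1 * d2,
        "W22": p["W22"] + alpha * x2 * d2,
        "b2":  p["b2"]  + alpha * d2,
        # Second layer: much shorter chain rule.
        "W1":  p["W1"]  + alpha * a1 * err,
        "W2":  p["W2"]  + alpha * a2 * err,
        "b":   p["b"]   + alpha * err,
    }

# Arbitrary starting point and a single training example.
params = {"W11": 0.4, "W21": -0.3, "b1": 0.1,
          "W12": 0.8, "W22": 0.2, "b2": -0.5,
          "W1": 1.2, "W2": -0.7, "b": 0.05}
x1, x2, y = 1.5, -2.0, 1.0

losses = []
for _ in range(50):
    losses.append(log_loss(y, predict(params, x1, x2)[2]))
    params = gradient_step(params, x1, x2, y, alpha=0.1)
```

All nine updates are computed from the old parameters before any of them is overwritten, which is why `gradient_step` returns a fresh dictionary instead of modifying `p` in place. In practice you would loop over many data points rather than one; a single point is used here only to keep the sketch short.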