Let's start with the method under consideration: the optimizer. We have batch and mini-batch gradient descent. We already discussed gradient descent during the time we were discussing model calibration, but I will go through it briefly again.

The difference between batch and mini-batch is this: in batch gradient descent, we use the entire dataset, meaning every iteration works with the same full dataset. In mini-batch, at each iteration we compute the loss function on some subset of the total dataset, which speeds things up. If you have four years of data, instead of using all four years every time, you might use the first month in the first iteration, the second month in the second iteration, and so on to the end, then cycle back to the first month again. Sometimes this actually gives better results, and it definitely makes the computation less expensive, because instead of the entire dataset we are using only a portion of it.

We have an objective function we are trying to minimize. The way to look at this is that you start from some initial condition and evaluate the gradient there. The gradient points toward the steepest increase, but we want to head toward the optimal minimum, so we always take the negative sign. To control how fast we move, we also multiply by the so-called Gamma. That gives exactly the update rule

θ_{n+1} = θ_n − γ ∇L(θ_n).

Starting from θ_0, the initial point, this is how you set it up: you evaluate the gradient, the step takes you downhill and gives you a new point, then you evaluate the gradient again at that new point, and that gives you the next point. You keep going through this iteration.

One more time, do not forget that this gradient is nothing but the vector of partial derivatives, one component per parameter. Let me write this down to make it clear; I am sure you know it, but it does not hurt. If the parameters are, for example, a_0, a_1, a_2, a_3, the general form is

∇L = (∂L/∂a_0, ∂L/∂a_1, ∂L/∂a_2, ∂L/∂a_3).

The first component is the derivative with respect to a_0, the second with respect to a_1, the third with respect to a_2, and the last with respect to a_3. That is how it is done; you can actually do it by hand, it is very simple. For mini-batch, you do exactly the same thing, but as you saw, the evaluation uses only part of the data each time.

Now, I have an implementation of this for you. I am going to go through it, and you will have the code to play with yourself. Purely for visualization purposes, I keep the model to two parameters; with five, ten, or thirty parameters you cannot visualize the surface in two dimensions. For this case I set up my loss function, or objective function, whose expression you remember:

L(a_0, a_1) = Σ_{i=1}^{t} (ŷ_i − y_i)²,

where ŷ_i is nothing but the line a_0 + a_1 x_i.
To visualize this space, I set up a grid over a_0 and a_1. What you are seeing are the contours of the loss function: the loss surface viewed from the top. From this picture it is easy to see which point is the optimal minimum, wherever that point is.

Why did I draw the contours? Because by looking at them you can see that the closer together the contours are, the steeper the slope, and that is exactly what matters for controlling the speed. Writing the update rule one more time, you need that Gamma to control your speed: you are going to see that if Gamma is too small, it takes a long time to converge, and if Gamma is too large, then even for a very simple problem you will not converge at all. You need to make sure you control it.

The direction of steepest descent is always perpendicular to the contours; that direction is given by the gradient, exactly as I mentioned earlier. By evaluating the gradient at your current position, you find the direction of steepest descent and take a step in that direction. That is exactly what we did here; I just want to make sure you know where the formula comes from. Keep doing this and, hopefully, we find the global minimum. Why do I say hopefully? Because if the surface is not smooth, you may not be able to get there. Our assumption is that the surface is smooth, and this picture shows that it is: the contours line up so nicely.

A bit more on this: at each step, moving perpendicular to the contour, we need to determine how far to go before we calculate a new direction. That is what we do by introducing Gamma. Far from the minimum we can go a bit faster, but as we get closer, the surface gets flatter and we are supposed to move more slowly. I keep saying this because I want to make sure the iteration does not blow up.

In general, as I said, we multiply the gradient by the factor Gamma, which we call the learning rate. But the question is: what Gamma, and what N? By N I mean the number of iterations; are you going to iterate forever, or do you stop at some point? What matters are Gamma and N. Do not forget: we never find the minimum if Gamma is too large, even for a very simple problem, and if Gamma is too small, we need a much larger N. There is always a compromise; you need to be careful with your Gamma.
Here is what I do for this very simple problem, and I will go through the code as well: I try five different learning rates.

I start with a very small one, Gamma = 0.005. Here is the data, and you want to find a line going through it. As you see, it does not find the line. As a matter of fact, this was a premature stop: my maximum number of iterations was 300, and it used all 300 without converging, because Gamma was too small. You can see the path starting from (−100, −100); that is why I start from there. It was on its way toward the minimum but never arrived; it stopped prematurely.

Let's go to the next one. This time I make Gamma 0.1. In this case it starts from the same point and converges in just 260 iterations; it does not even reach 300. We stop because we meet the convergence condition, whatever condition we set up. As you see, this line seems to be the best one, and this is the same number we got when we did the linear regression for this case.

Next, Gamma = 0.2. For Gamma 0.2 we stop at 140 iterations, and it actually converges nicely. I get the same line, and the numbers are roughly the same as before, off by an amount you could easily ignore. As you see, the path overshoots, comes back, and eventually converges.

In the next slide I increase Gamma further, to 0.3. I am maybe pushing my luck a bit, but it still converges: 95 iterations were enough. The path bounces up and down, but eventually it converges, and you see this is the best possible line it can get. Not a bad job.

Intentionally, in the next slide, I make it too large. And by too large I mean just going from 0.3 to 0.33: you will see that it starts diverging and cannot find the line at all; the data is over here and the line ends up way over there. The terminal point has diverged; Gamma was too large.

The whole idea behind this exercise was to show that even for a very simple problem with a very simple objective function, if you try to use gradient-based optimization, you need to be very careful with the choice of learning rate. We did not have this problem with linear regression, where the solution comes out in closed form; we do have it with gradient-based methods. Having said this, in many cases we cannot use something like linear regression, because the problem is simply not a linear regression problem. And gradient-free optimization would in many cases be the worst choice, because it becomes very expensive. So gradient-based methods are good approaches as long as you are smart about the choice of learning rate, and that is what I was trying to show in the last five slides with the various learning rates: if you are not careful, even for a very simple problem, either it will not converge because Gamma is too small, or it will diverge because Gamma is too large. (A small sketch of such a learning-rate sweep is attached at the end.)

Now, in the next two lectures, I am moving, as I said, from this data-driven analysis to those liquid instruments, LIBOR and swap rates: assuming a model for the evolution of the interest rate, doing the so-called model calibration, finding the zero-coupon bond curve for every maturity we have, and then seeing how well we can do that job. Thank you.