Now we know the gradient is an increasing direction. Given this, for a minimization problem we will move along its opposite direction. That's why this method is called gradient descent. Its general strategy is: "I want to minimize the function, so at a given point I want to know which direction I should move in. We first find the gradient, because its opposite direction is guaranteed to give us some improvement as long as we move by a small enough step size." Given a current solution x, in one iteration we update it by moving in the opposite direction of the gradient at that particular value x. We move by some value a, which is positive; a is called the step size. We stop when the gradient at the current solution is zero, because in that case gradient descent gives you no more improvement; we just stop at x. Now the question is: how may we choose an appropriate value of a? Your a should be small enough that you are really guaranteed to get some improvement; if you move too far, you may get no improvement at all. Let's take a look again at a one-dimensional example. Suppose you are here and you are asking yourself how to improve. Your gradient points to the right, so you say, "Then I'm going to move to the left to do minimization." When a is small enough, that's good; but if you choose a too-large step size and move too far, up to here, then maybe eventually you end up even higher. So a needs some consideration, and of course you don't want to choose a very small step size either, because that would make your algorithm very slow. Maybe we want more discussion, but let's see an example first.
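The one-dimensional picture above can be sketched in a few lines. This is a minimal illustration, assuming the example function f(x) = x^2 and starting point x = 1, which are illustrative choices and not from the lecture:

```python
# One gradient-descent step on f(x) = x^2, showing how the step size a
# decides whether we actually improve: a small step improves f, while a
# too-large step overshoots the minimum and makes f worse.

def f(x):
    return x ** 2

def grad_f(x):
    return 2 * x          # derivative of x^2

x0 = 1.0
for a in (0.1, 0.4, 1.5):                 # small, moderate, too large
    x1 = x0 - a * grad_f(x0)              # move opposite to the gradient
    print(a, x1, f(x1) < f(x0))           # did this step improve f?
```

With a = 0.1 or a = 0.4 the step improves f, but with a = 1.5 we jump from x = 1 all the way to x = -2, where f is larger than where we started.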
Again, let's try to solve our favorite function, x_1 squared plus x_2 squared. Suppose we start at x_0, which is (1, 1); (1, 1) is here. For this particular function we all know the optimal solution: there is no way to do better than (0, 0), because the function itself has a natural lower bound of zero. So (0, 0) is optimal, and the optimal point is here. On the graph I depicted two circles; they are isoquant lines. I hope you all agree that x_1 squared plus x_2 squared is a paraboloid in three-dimensional space, and its minimum point is here at (0, 0). All the points on one circle get the same objective value, and all the points on the next circle again share a common objective value. These circles form a kind of contour for your particular problem; you may call this a contour map. Now suppose we start at (1, 1) and try to find the gradient. For the gradient, we differentiate with respect to x_1, which gives 2x_1, and with respect to x_2, which gives 2x_2. Plugging in x_0, the gradient is (2, 2), which points in this direction. We will move in the opposite direction. Let's take a look. If we choose a to be 0.5, we move from x_0 to x_1 = (1, 1) − 0.5 · (2, 2) = (0, 0), which is optimal. So if we move in this direction and choose the right step size, we reach an optimal solution. But on the contrary, if we set a to be 1, then we move from (1, 1) to (1, 1) − (2, 2) = (−1, −1). We move to here. The gradient there points in pretty much the opposite direction, so we do the same thing again and go back to (1, 1). Okay? If your step size is fixed and is always 1, then in this particular example you move back and forth: x_0, x_1, x_0, x_1, and so on. Your algorithm will not converge in that particular way.
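The two step-size choices above can be checked directly. This is a small sketch of the lecture's example f(x_1, x_2) = x_1^2 + x_2^2 starting from (1, 1):

```python
# Gradient descent on f(x1, x2) = x1^2 + x2^2 from x0 = (1, 1):
# with step size a = 0.5 one step reaches the optimum (0, 0), while
# with a = 1 the iterates bounce between (1, 1) and (-1, -1) forever.

def grad(x):
    return (2 * x[0], 2 * x[1])           # gradient of x1^2 + x2^2

def step(x, a):
    g = grad(x)
    return (x[0] - a * g[0], x[1] - a * g[1])

x0 = (1.0, 1.0)
print(step(x0, 0.5))                      # (0.0, 0.0): optimal in one step
print(step(step(x0, 1.0), 1.0))           # (1.0, 1.0): back where we started
```

Two steps with a = 1 return exactly to the starting point, so the iteration never converges; one step with a = 0.5 lands on the optimum.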
I hope this somehow convinces you that choosing a step size is very important. A bad step size can actually be very bad: your algorithm may move in the right direction, but with a bad step size it may even fail to converge. That is critical, and we really need some way to deal with it. How to choose the step size? There are many different ways, and many scholars are doing research on that. One general idea is that if you want to pre-define your step sizes, they should be decreasing, to make sure you get some convergence result. Here we want to introduce another strategy: we may instead look for the largest improvement. What does that mean? Along our improving direction, which is the negative gradient, we look for an a, a step size, that minimizes the outcome, okay? That means we want to see how far we should go to reach the lowest point along this direction. We know we are here, and we know we want to move along this direction. If that's the case, we ask how far we should go to reach the minimum point along it, and then we move there, okay? That's another strategy that many people use. We may now describe our gradient descent algorithm. Don't forget, this is just one way to implement the gradient descent algorithm; there are other ways to choose the step size, but here we just want to introduce this one. Step 0 is to choose a starting point x_0 (don't forget, x_0 is a vector) and a precision parameter epsilon, which we will use later. Then in each iteration k + 1, we find the gradient at x_k. For example, starting at x_0, we find its gradient and then move along the negative direction. Here we need to find a_k, our step size, which is the value that minimizes the objective along that direction.
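The "largest improvement" strategy means choosing a to minimize f(x − a · ∇f(x)). A minimal sketch of that line search for the running example, using a simple grid scan over candidate step sizes (the grid and the interval [0, 1] are illustrative choices, not part of the lecture):

```python
# Exact-line-search sketch for f(x1, x2) = x1^2 + x2^2: scan candidate
# step sizes a and keep the one minimizing f(x - a * grad f(x)).
# For this function f(x - a * grad f(x)) = (1 - 2a)^2 * f(x), so the
# best step size is a = 0.5 regardless of the current point x.

def f(x):
    return x[0] ** 2 + x[1] ** 2

def grad(x):
    return (2 * x[0], 2 * x[1])

def line_search(x, num=1001):
    g = grad(x)
    best_a, best_val = 0.0, f(x)
    for i in range(num):
        a = i / (num - 1)                 # candidates a in [0, 1]
        trial = (x[0] - a * g[0], x[1] - a * g[1])
        if f(trial) < best_val:
            best_a, best_val = a, f(trial)
    return best_a

print(line_search((1.0, 1.0)))            # 0.5 for this quadratic
```

A grid scan is the bluntest possible line search; in practice one would use a closed form (available for quadratics) or a one-dimensional solver, but the idea of "find the lowest point along the improving direction" is the same.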
Along that direction, we want to find the best we may do and move to that point. We update our current solution from x_k to x_{k+1} by moving with that particular step size a_k, okay? Then we reach the lowest point we may reach along our improving direction. Then we ask whether we have reached a stopping condition. Pretty much the most popular stopping condition is this: for our new gradient, the gradient of f at x_{k+1}, we look at its norm, which is calculated in this way, okay? Again, it's from your calculus textbook: we take each element of the vector, sum the squares, and then take the square root. That's the most popular way to measure the magnitude of a vector. If that norm is smaller than epsilon, what does that mean? It means all the elements of your gradient are small. If that's the case, it somehow means the improvement you may get from the next iteration would be too small; you are pretty much done. If your epsilon is small enough, like 10 to the power of −6 or 10 to the power of −8, then this is pretty much saying that your gradient is approximately zero. If that's the case, then stop and report your solution x_{k+1}. Otherwise, let k be k + 1 and continue, okay? This is an iterative process: at a point, find the gradient, move along the negative gradient, find the best we may do along that direction, move there, and then ask yourself whether you should keep going by checking whether your gradient is close enough to zero. That's the idea. Later, let's see some examples.
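The full iteration described above (find the gradient, line-search the step size, update, test the gradient norm against epsilon) can be sketched as follows. The grid-based line search and the example function are illustrative assumptions, not the only way to implement the algorithm:

```python
import math

# Gradient descent with a line search and the stopping condition
# ||grad f(x_{k+1})|| < epsilon, sketched on f(x1, x2) = x1^2 + x2^2.

def f(x):
    return x[0] ** 2 + x[1] ** 2

def grad(x):
    return [2 * x[0], 2 * x[1]]

def norm(v):
    # Euclidean norm: square root of the sum of squared elements.
    return math.sqrt(sum(vi ** 2 for vi in v))

def gradient_descent(x0, eps=1e-6, max_iter=1000):
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if norm(g) < eps:                 # gradient approximately zero: stop
            break
        # line search: pick the a in [0, 1] minimizing f(x - a * g)
        a = min((i / 1000 for i in range(1001)),
                key=lambda a: f([xi - a * gi for xi, gi in zip(x, g)]))
        x = [xi - a * gi for xi, gi in zip(x, g)]
    return x

print(gradient_descent([1.0, 1.0]))       # [0.0, 0.0]
```

Starting from (1, 1), the line search picks a = 0.5, the first update lands on (0, 0), and the next gradient-norm check triggers the stop, matching the hand calculation earlier in the lecture.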