Now we are ready to talk about gradient descent. The idea is the following. Suppose we want to minimize a nonlinear function like this one. You'll probably need a little bit of imagination, but in this example we want to minimize a function of x_0 and x_1 (it doesn't really matter how we name the variables): x_0 squared plus x_1 squared. It looks like a bowl, B-O-W-L, with an upward curvature. In that particular case, what we want to do is the following. We're going to start at whatever point we have; this is an unconstrained problem, so we may start at any point. At that point we consider the gradient, which is something we know how to calculate. The gradient is an n-dimensional vector, and this is sometimes misunderstood. The gradient collects the partial derivatives, and in many cases, when you think about a partial derivative, you may think about some kind of slope. That interpretation may be fine when you have a single decision variable, a one-dimensional problem. But when you have multiple decision variables in n-dimensional space, the gradient is actually a direction: if you move along that direction, you may improve your current solution. Later we will make this precise, but for a one-dimensional problem, the gradient at a point pretty much tells you whether to move to your left or to your right. When the gradient is three, for example, moving along the gradient means moving to your right, because the gradient is positive; if it is negative six, it means moving to your left. In a two-dimensional problem, if your gradient points in some direction, that means you want to move along that direction. That is something we want to emphasize here.
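To make the bowl example concrete, here is a minimal sketch (in Python with NumPy, my choice rather than the lecture's) of the function x_0 squared plus x_1 squared and its gradient of partial derivatives:

```python
import numpy as np

# The lecture's example: f(x0, x1) = x0^2 + x1^2, a bowl curving upward.
def f(x):
    return x[0] ** 2 + x[1] ** 2

# Its gradient collects the partial derivatives: (2*x0, 2*x1).
def grad_f(x):
    return np.array([2.0 * x[0], 2.0 * x[1]])

x = np.array([1.5, -2.0])
print(grad_f(x))  # the gradient at x: [ 3. -4.]
```

Note that the gradient here is a vector, not a single slope: at the point (1.5, -2.0) it points in the direction (3, -4).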
The gradient is a direction to move; it is not the slope. We want to know whether the gradient gives us an improving direction. Here, improving means we want to minimize the function: we want to improve by getting to a lower point, whether the function is one-dimensional, two-dimensional, or whatever. At a point, we have the following proposition. For a twice differentiable function, its gradient is an increasing direction. That means what? Suppose you are at x. You know x is a vector; it means some location, so in two-dimensional space your x may be here. You have a gradient, which is another vector, say pointing in this direction. If you move from x along that direction with a small step size a, you arrive at a not-so-far point, x + a∇f(x). At that point you are going to see that the functional value is greater than at the original point: f(x + a∇f(x)) > f(x). What does that mean? If you move along the gradient, we certainly don't guarantee that you will keep improving forever. But if you don't move too far, then for all a that are small enough, you will see an increase. In the one-dimensional example, if you are here, your gradient tells you that moving in this direction gives an increase. If you are here instead, your gradient is again some positive value, and it guarantees that if you move a short distance, you will see an increase. It is not true for all a, but it must be true for all small enough a. That's the idea of the gradient. This actually can be proved. How? Recall the following. If you consider this limit, it says that from x, I want to move along a direction d with a small step size.
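The proposition can be checked numerically. A minimal sketch, again with NumPy; the function, the point x, and the step sizes tried are illustrative choices:

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[1] ** 2

def grad_f(x):
    return np.array([2.0 * x[0], 2.0 * x[1]])

x = np.array([1.0, 2.0])
g = grad_f(x)

# For all small enough step sizes a, moving along the gradient increases f:
# f(x + a * grad_f(x)) > f(x).
for a in [0.1, 0.01, 0.001]:
    assert f(x + a * g) > f(x)
print("f increased along the gradient for every small step tried")
```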
As a goes to zero, the limit compares the resulting functional value with the initial functional value: it is the limit of (f(x + ad) − f(x)) / a as a goes to zero. In a one-dimensional picture, suppose it is like this: this is our x, and this red line segment is our f(x). If I move a small distance to x + ad, that gives my second functional value, and their difference is this quantity. The difference shrinks as you make a smaller, and when a really goes to zero, the two points coincide, and that gives you the slope in some sense. That means the left-hand side is the slope of f at the point x along the direction d, the directional derivative. From your calculus textbook, you will have learned that this limit may be calculated in two steps: first you find the gradient, and then you take its inner product with the direction you want to move, so the limit equals ∇f(x) transpose times d. This somehow says: if I know my gradient at a point, and I also know the direction I want to move, then that slope is determined by their product. Now take the direction to be the gradient itself, d = ∇f(x). What we have is that this limit becomes the gradient times the gradient, ∇f(x) transpose times ∇f(x). If that's the case, then this is an improving direction, because that product is a sum of squares and therefore non-negative; and if your gradient is not zero, it is guaranteed to be positive. Don't forget that this d is a vector, so here you should have a transpose. Your gradient is also a vector, and originally it's a column vector.
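The limit identity can be sanity-checked with a finite difference. A sketch under the same illustrative choices of f, the point x, and an arbitrary direction d:

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[1] ** 2

def grad_f(x):
    return np.array([2.0 * x[0], 2.0 * x[1]])

x = np.array([1.0, 2.0])
d = np.array([0.6, -0.8])  # an arbitrary direction to move along

# (f(x + a*d) - f(x)) / a for a small a approximates the directional derivative,
a = 1e-6
finite_diff = (f(x + a * d) - f(x)) / a

# which the calculus identity says equals the inner product: gradient transpose times d.
inner = grad_f(x) @ d
print(finite_diff, inner)  # both close to -2.0
```

Plugging in d = grad_f(x) instead would make `inner` the squared norm of the gradient, which is why the gradient itself is an increasing direction.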
That's why you take a transpose to make it a row vector, and then a row vector may be multiplied with a column vector. Since this directional derivative is positive, as long as your a is small enough, f(x + a∇f(x)) is going to be greater than your original value f(x). This proves that the gradient gives you an increasing direction. In fact, the gradient gives you the fastest increasing direction: at a point, if you want to increase the function as much as possible, moving along the gradient is at least a locally optimal strategy. Later, when we want to minimize a function, we will first find the gradient like this, and then we will move along the opposite direction. That's going to help us minimize our function as quickly as possible.
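Putting the pieces together, here is a minimal gradient descent sketch for the bowl function. The starting point and the fixed step size 0.1 are my illustrative choices; the lecture has not yet discussed how to pick the step size:

```python
import numpy as np

# Gradient of f(x0, x1) = x0^2 + x1^2.
def grad_f(x):
    return np.array([2.0 * x[0], 2.0 * x[1]])

x = np.array([3.0, -4.0])  # an arbitrary starting point
a = 0.1                    # a fixed small step size (illustrative choice)

for _ in range(100):
    x = x - a * grad_f(x)  # move opposite to the gradient to decrease f

print(x)  # approaches the minimizer [0, 0]
```

Each iteration moves against the gradient, so by the proposition above the functional value decreases for a small enough step, and the iterates drift down the bowl toward its bottom.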