Welcome back for the second part of this demo on gradient descent. As we mentioned in the last video, when we're working with other problems, such as neural networks, we're not going to have the analytical solution we just found. Instead, we're going to have to move toward the optimal value using gradient descent. To see this in practice, we're going to pick a learning rate as well as a number of iterations, run the code, and plot the trajectory as we move toward the optimal value. If you recall from lecture, that means we're going to pick how big each of our steps is going to be, and then watch, step by step, how we move closer to the optimal value for each of the different Thetas we're trying to estimate. Then, using that, we'll look at some examples where the learning rate is too high, too low, or just right. We're going to start off with a learning rate of 1 times 10 to the negative 3, so 0.001, 10,000 iterations, so 10,000 different steps, and we're going to initialize Theta with a value of 3, 3, 3. To actually perform gradient descent, we pass in the learning rate, the number of iterations, and the initial Theta, all defined up above. For the initialization steps, we set our Theta equal to that initial Theta, which at this point is 3, 3, 3. We then set the Theta path equal to an array of zeros with number-of-iterations-plus-one rows: if we're doing 10,000 iterations, there will be 10,001 rows, each with three columns. So for each of our different parameters, which are b, Theta 1, and Theta 2, it records how we move closer and closer to the optimum at each iteration. Finally, we set the first row, Theta path at index 0, for all the columns, equal to that initial Theta.
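The initialization just described can be sketched as follows; this is a minimal sketch, and the variable names are my assumptions rather than the notebook's exact code:

```python
import numpy as np

# Hyperparameters from the walkthrough: learning rate 1e-3, 10,000 iterations,
# and all three parameters (b, Theta 1, Theta 2) initialized to 3.
learning_rate = 1e-3
num_iters = 10_000
theta_initial = np.array([3.0, 3.0, 3.0])

# theta_path records one row per iteration plus the starting point:
# (num_iters + 1) rows, 3 columns.
theta_path = np.zeros((num_iters + 1, 3))
theta_path[0, :] = theta_initial  # first row is the initial Theta

# One loss value per iteration.
loss_vector = np.zeros(num_iters)
```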
Then, just to start off, we're going to set the loss vector equal to np.zeros, one entry per iteration, so we can check that the loss keeps decreasing as we move through the steps. Next comes the main gradient descent loop, which is what we discussed in lecture: start at the initial point, compute the gradient, and use that gradient to move closer and closer toward the optimal value. We set our prediction equal to the dot product of our x matrix and our Thetas. If we think about taking the dot product of the x matrix and Theta, with Theta initialized to 3, 3, 3, all we're doing is multiplying 3 times the first feature value, 3 times the second, and 3 times the third, adding those all together, and getting our first prediction for each of our different y values. For the loss vector we defined up above, the loss at each step is the sum of the squares of the actual values minus the predictions, so we get the squared error. Then there's the gradient vector, which we didn't go through in lecture, but it's worth knowing what the gradient actually looks like when we take those partial derivatives: it's the error of the prediction, y minus y pred, taken as a dot product with the x matrix. That's how we come up with the gradient vector, and it has exactly the shape we need. Note that, strictly speaking, this is the negative of the gradient, which is why later on we add it rather than subtract it: you'll see that we add on the "gradient vector," but it's actually equal to the negative of the gradient.
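One pass through the loop body might look like this; the synthetic data and variable names here are illustrative assumptions, with the true coefficients matching the 1.5, 2, and 5 that come up later in the video:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an x matrix whose first column is ones (for the
# intercept b) plus two features, and targets built from coefficients 1.5, 2, 5.
n = 200
x_mat = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = x_mat @ np.array([1.5, 2.0, 5.0])

theta = np.array([3.0, 3.0, 3.0])

# One gradient-descent step, as described above:
y_pred = x_mat @ theta                 # prediction: dot product of x and Theta
loss = np.sum((y - y_pred) ** 2)       # squared error for this step
grad = (y - y_pred) @ x_mat            # the negative of the gradient, so we add it
theta = theta + 1e-3 * grad / n        # step, scaled by the learning rate
```

After one step the squared error should already be slightly lower than `loss`.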
We divide that by the number of observations that we have, and use it to take a step closer toward the solution. So at first, our Theta is 3, 3, 3. We compute that gradient vector, and we set Theta equal to 3, 3, 3 plus the learning rate multiplied by that gradient vector, which has the same shape as Theta: a vector with three values. Then we set Theta path at index i plus 1 equal to the Theta we just found. So we update Theta and go back through the for loop using that new Theta, coming up with our new prediction, our new error, our new gradients, and then our new Theta values. After we go through that entire for loop, we return the entire Theta path across all the iterations, as well as the full loss vector: how far off we are from the actual solution at each step. Recall that the loss vector is just the sum of squared errors. So we have our gradient descent function, and next we're going to plot the results. I'm going to walk through this quickly; I won't go through every single line of code, but I do want you to get some intuition as to what we're plotting here. We have our true coefficients, which are just the b, Theta 1, and Theta 2 that we defined earlier, and then a function, plot i j, which plots two of these values at a time: either b versus Theta 1, b versus Theta 2, or Theta 1 versus Theta 2. Those are going to be our three plots. In plot i j, we first plot the true coefficients: if i and j correspond to b and Theta 1, we take the 0th value, which is b, and the first value, which is Theta 1, plot those actual values, and mark them as the true coefficients. We then plot the Theta path; again, we're only using two of the dimensions at a time.
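Putting the pieces together, the full loop might be sketched like this; it's a reconstruction under assumptions, and the function and variable names are mine rather than necessarily the notebook's:

```python
import numpy as np

def gradient_descent(x_mat, y, learning_rate, num_iters, theta_initial):
    """Batch gradient descent for linear regression, returning the full
    path of Theta values and the loss at every iteration."""
    n = len(y)
    theta = np.asarray(theta_initial, dtype=float).copy()
    theta_path = np.zeros((num_iters + 1, theta.size))
    theta_path[0] = theta
    loss_vector = np.zeros(num_iters)
    for i in range(num_iters):
        y_pred = x_mat @ theta                      # current predictions
        loss_vector[i] = np.sum((y - y_pred) ** 2)  # sum of squared errors
        grad = (y - y_pred) @ x_mat / n             # negative gradient, divided by n
        theta = theta + learning_rate * grad        # add on the (negative) gradient step
        theta_path[i + 1] = theta
    return theta_path, loss_vector
```

With well-scaled features, 10,000 iterations at a learning rate of 0.001 is enough to get very close to the true coefficients.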
Let's say again we're working with b and Theta 1. With i set equal to 0, we plot the first column of the Theta path, which holds the different values for b, against the second column, the different values for Theta 1, with j set equal to 1. Then we can mark the initial value by plotting Theta path at row 0, columns i and j, and label that as the start. Also note that we draw each step we take from start to finish with dashed lines, using a triangle marker for each step. Finally, we plot Theta path at row negative 1, columns i and j, to get the final value, and label that as the finish. That handles one subplot; then we have a function we call plot all. What that does is take each of the different pairs of axes we can form: b versus Theta 1, b versus Theta 2, and Theta 1 versus Theta 2. As we see here, we call plot i j with 0, 1, then 0, 2, then 1, 2, each on a different axis. On top of that, we also plot our loss vector to see how the loss function decreases as we iterate and move closer and closer to the true values. With that, we run gradient descent to output both the Theta path and the loss vector, and then pass those into the plot all function we defined, along with our learning rate, our number of iterations, and our initial Theta. We run this (after first running everything above, of course), and it plots out the actual steps we take. Here we see the start, and we see each of these triangles moving closer and closer. Looking at the top-left plot, the x-axis is Theta 0 and the y-axis is Theta 1; we want Theta 0 to move toward 1.5, which is our b, and Theta 1 to move toward 2.
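The plotting helpers could be sketched roughly as below, assuming matplotlib; the names plot_ij and plot_all follow the video, but the details are my reconstruction:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt

def plot_ij(ax, theta_path, true_coef, i, j):
    # The true coefficients for this pair of parameters.
    ax.plot(true_coef[i], true_coef[j], "bo", label="true coefficient")
    # The descent path: dashed lines with a triangle marker at every step.
    ax.plot(theta_path[:, i], theta_path[:, j], "k--", marker="^")
    ax.plot(theta_path[0, i], theta_path[0, j], "gs", label="start")
    ax.plot(theta_path[-1, i], theta_path[-1, j], "rs", label="finish")
    ax.legend()

def plot_all(theta_path, loss_vector, true_coef):
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    plot_ij(axes[0, 0], theta_path, true_coef, 0, 1)  # b vs Theta 1
    plot_ij(axes[0, 1], theta_path, true_coef, 0, 2)  # b vs Theta 2
    plot_ij(axes[1, 0], theta_path, true_coef, 1, 2)  # Theta 1 vs Theta 2
    axes[1, 1].plot(loss_vector)                      # loss per iteration
    axes[1, 1].set_xlabel("iteration")
    return fig
```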
Then if we look at the actual values, it looks like Theta 0 stopped at around 1.9, and Theta 1 also stopped at around 1.9. If we look at the top-right graph, Theta 0 is still the x-axis, so also still at 1.9, but now the y-axis is Theta 2, and that's actually already pretty close to 5. We can see each step it takes along the way. Then here we can see Theta 1 versus Theta 2; since all the parameters start at 3, 3, it should have moved toward the 2 and 5 that we'd like. In the bottom-right graph, we see the loss by iteration, and we see that pretty quickly, once it gets to maybe one or two hundred iterations, the error drops off to a very low value, and after that it continues to decrease only very gradually. That slow tail is why, at this learning rate, we haven't gotten all the way to our optimal values, and why we have the gap between the true values and the endpoints we got using gradient descent. Now I quickly want to show you what this looks like if we decrease the number of iterations. If we decrease this to, say, 100 and rerun everything, we see that the path stops a lot earlier along the way; the gradient descent didn't get to finish. Now, if we keep this at 10,000 and increase the learning rate, which allows for larger steps, we can see that it actually reaches the optimal solution. Increasing the learning rate allowed for those larger steps, and we were actually able to get to the optimal values. Finally, I want to show you what happens if you set that learning rate too large. If we run this, it may be a little difficult to see, but in the top-left corner of the top-left plot, we're now talking about massive numbers: on the order of 0.2 times 10 to the 305. We have missed our value by a long shot; we completely overshot the optimal value.
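The blow-up with an oversized learning rate is easy to reproduce; the synthetic data and the rate of 2.5 below are illustrative assumptions, not the video's exact numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x_mat = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = x_mat @ np.array([1.5, 2.0, 5.0])
theta = np.array([3.0, 3.0, 3.0])

# With a learning rate this large, every step overshoots the minimum,
# so the error grows instead of shrinking.
losses = []
for _ in range(50):
    y_pred = x_mat @ theta
    losses.append(np.sum((y - y_pred) ** 2))
    theta = theta + 2.5 * (y - y_pred) @ x_mat / n  # lr = 2.5: too big here

# losses[-1] is now many orders of magnitude larger than losses[0].
```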
If we look at the actual error, we see that as the number of iterations increases, the error shoots way up as we completely overshoot the optimal value. That closes out our video on vanilla gradient descent. In the next video, we'll go through the same steps and briefly walk through how you can do this using stochastic gradient descent.