In the earlier courses, courses one and two of the specialization, you saw a lot of supervised learning algorithms as taking a training set, posing a cost function, and then using gradient descent or some other algorithm to optimize that cost function. It turns out that the K-means algorithm that you saw in the last video is also optimizing a specific cost function, although the optimization algorithm it uses is not gradient descent; it's actually the algorithm you already saw in the last video. Let's take a look at what all this means.

Let's take a look at what the cost function for K-means is. To get started, as a reminder, this is the notation we've been using, where c^(i) is the index of the cluster. So c^(i) is some number from 1 to K, the index of the cluster to which training example x^(i) is currently assigned, and mu_k is the location of cluster centroid k. Let me introduce one more piece of notation: when lowercase k equals c^(i), mu subscript c^(i) is the cluster centroid of the cluster to which example x^(i) has been assigned. So for example, if I were to look at some training example, say training example 10, and I were to ask, what's the location of the cluster centroid to which the 10th training example has been assigned? Well, I would then look up c^(10). This will give me a number from 1 to K that tells me whether example 10 was assigned to the red or the blue or some other cluster centroid, and then mu subscript c^(10) is the location of the cluster centroid to which x^(10) has been assigned.

So armed with this notation, let me now write out the cost function that K-means turns out to be minimizing. The cost function J, which is a function of c^(1) through c^(m), these are all the assignments of points to cluster centroids, as well as mu_1 through mu_K, these are the locations of all the cluster centroids, is defined as this expression on the right. It is the average, so 1/m times the sum from i equals 1 to m of the squared distance between every training example x^(i), as i goes from 1 through m, and mu subscript c^(i). In other words, the cost function for K-means is the average squared distance between every training example x^(i) and the location of the cluster centroid to which that training example has been assigned. So for the example up here, we'd be measuring the distance between x^(10) and mu subscript c^(10), the cluster centroid to which x^(10) has been assigned, and taking the square of that distance, and that would be one of the terms over here that we're averaging over. And it turns out that what the K-means algorithm is doing is trying to find assignments of points to cluster centroids, as well as locations of cluster centroids, that minimize the squared distance.

Visually, here's what you saw partway into the run of K-means in the earlier video. At this step, if you were to compute the cost function, you would look at every one of the blue points and measure these distances and compute their squares, and then similarly look at every one of the red points and compute these distances and compute their squares. The average of the squares of all of these distances for the red and the blue points is the value of the cost function J at this particular configuration of the parameters for K-means. And what K-means will do on every step is try to update the cluster assignments c^(1) through c^(30) in this example, or update the positions of the cluster centroids mu_1 and mu_2, in order to keep on reducing this cost function J.
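To make the formula concrete, here is a minimal NumPy sketch that computes the distortion J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2 for given assignments and centroid locations. The function and variable names (compute_distortion, X, idx, centroids) are my own for illustration, not names used in the lecture.

```python
import numpy as np

def compute_distortion(X, idx, centroids):
    """Average squared distance between each example and its assigned centroid.

    X:         (m, n) array of training examples x^(1)..x^(m)
    idx:       (m,) array of cluster indices c^(i), each in 0..K-1
    centroids: (K, n) array of centroid locations mu_1..mu_K
    """
    # mu_{c^(i)} for every example, gathered by indexing with the assignments
    assigned = centroids[idx]
    # J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2
    return np.mean(np.sum((X - assigned) ** 2, axis=1))
```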
By the way, this cost function J also has a name in the literature: it's called the distortion function. I don't know that this is a great name, but if you hear someone talk about the K-means algorithm and the distortion, or the distortion cost function, that's just what this formula J is computing.

Let's now take a deeper look at the algorithm and why the algorithm is trying to minimize this cost function J, or why it is trying to minimize the distortion. Here on top I've copied over the cost function from the previous slide. It turns out that the first part of K-means, where you assign points to cluster centroids, is trying to update c^(1) through c^(m) to minimize the cost function J as much as possible while holding mu_1 through mu_K fixed. And the second step, in contrast, where you move the cluster centroids, is trying to leave c^(1) through c^(m) fixed, but to update mu_1 through mu_K to minimize the cost function, or the distortion, as much as possible.

Let's take a look at why this is the case. During the first step, if you want to choose the values of c^(1) through c^(m), or say a particular value of c^(i), to minimize this, well, what would make x^(i) minus mu_{c^(i)} as small as possible? This is the distance, or the squared distance, between a training example x^(i) and the location of the cluster centroid to which it has been assigned. So if you want to minimize this distance, or this squared distance, what you should do is assign x^(i) to the closest centroid. To take a simplified example, if you have two cluster centroids, say cluster centroids one and two, and just a single training example x^(i): if you were to assign it to cluster centroid one, this squared distance here would be this large distance, squared, and if you were to assign it to cluster centroid two, then this squared distance would be the square of this much smaller distance. So if you want to minimize this term, you would take x^(i) and assign it to the closer centroid, which is exactly what the algorithm is doing up here. So that's why the step where you assign points to cluster centroids is choosing the values of c^(i) to try to minimize J, without changing mu_1 through mu_K for now, but just choosing the values of c^(1) through c^(m) to make these terms as small as possible.

How about the second step of the K-means algorithm, that is, moving the cluster centroids? It turns out that choosing mu_k to be the average, the mean, of the points assigned to it is the choice of mu that minimizes this expression. To take a simplified example, say you have a cluster with just two points assigned to it, shown as follows. With the cluster centroid here, the average of the squared distances would be this distance of one, squared, plus this distance here, which is nine, squared, and then you take the average of these two numbers. So that turns out to be one half of 1 plus 81, which is 41. But if you were to take the average of these two points, so (1 + 11) / 2, that's equal to 6, and if you were to move the cluster centroid over here to the middle, then these two distances each turn out to be five. So you end up with one half of 5 squared plus 5 squared, which is equal to 25, and this is a much smaller average squared distance than 41. And in fact, you can play around with the location of this cluster centroid and maybe convince yourself that taking this mean location, this average location in the middle of these two training examples, is really the value that minimizes the squared distance.
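Here is a minimal sketch of those two steps in NumPy, building on the compute_distortion sketch above. The function names and the assumption that points are stored as rows of an array X are my own; each call can only lower the distortion or leave it unchanged.

```python
import numpy as np

def assign_clusters(X, centroids):
    # Step 1: set each c^(i) to the index of the closest centroid.
    # This minimizes J over c^(1)..c^(m) while mu_1..mu_K stay fixed.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def move_centroids(X, idx, K):
    # Step 2: set each mu_k to the mean of the points assigned to it.
    # This minimizes J over mu_1..mu_K while the assignments stay fixed.
    # (This sketch assumes every cluster has at least one point assigned.)
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])
```

On the two-point example above, reading the points as 1 and 11 with the initial centroid at 2 (so the two distances are 1 and 9, which is my reading of the slide), compute_distortion gives 41; after move_centroids places the centroid at the mean, 6, it gives 25, matching the numbers in the lecture.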
So the fact that the K-means algorithm is optimizing a cost function J means that it is guaranteed to converge. That is, on every single iteration, the distortion cost function should go down or stay the same. If it ever fails to go down or stay the same, or in the worst case if it ever goes up, that means there's a bug in the code. It should never go up, because every single step of K-means is setting the values c^(i) and mu_k to try to reduce the cost function. Also, if the cost function ever stops going down, that gives you one way to test whether K-means has converged: once there's a single iteration where it stays the same, that usually means K-means has converged, and you should just stop running the algorithm any further. Or, in some rare cases, you will run K-means for a long time and the cost function, or the distortion, is just going down very, very slowly. That's a bit like gradient descent, where maybe running even longer might help a bit, but if the rate at which the cost function is going down has become very, very slow, you might also just say this is good enough, it's close enough to convergence, and not spend even more compute cycles running the algorithm for even longer. So these are some of the ways that computing the cost function helps you figure out whether the algorithm has converged (see the sketch below). It turns out that there's one other very useful way to take advantage of the cost function, which is to use multiple different random initializations of the cluster centroids. It turns out that if you do this, you can often find much better clusters using K-means. Let's take a look in the next video at how to do that.
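As a rough sketch of how you might use the distortion to decide when to stop, here is a loop that calls the helper functions sketched above and halts once J stops improving, or improves only negligibly. The loop structure, tolerance, and iteration cap are illustrative choices of mine, not values prescribed in the lecture.

```python
import numpy as np

def run_kmeans(X, centroids, max_iters=100, tol=1e-6):
    """Iterate K-means, tracking the distortion J to decide when to stop.

    Uses assign_clusters, move_centroids, and compute_distortion from the
    sketches above. tol and max_iters are illustrative, not from the lecture.
    """
    K = centroids.shape[0]
    prev_J = np.inf
    for _ in range(max_iters):
        idx = assign_clusters(X, centroids)
        centroids = move_centroids(X, idx, K)
        J = compute_distortion(X, idx, centroids)
        if J > prev_J:          # should never happen; would indicate a bug
            raise RuntimeError("distortion increased")
        if prev_J - J < tol:    # little or no improvement: treat as converged
            break
        prev_J = J
    return centroids, idx, J
```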