So let's talk about the k-means clustering algorithm. This algorithm has some weaknesses, but it's very, very popular, in part because it's so simple to understand, so it's a really good one to be familiar with. The k in k-means refers to the number of clusters you're looking for, and that's the first, and perhaps the most prominent, weakness: you have to know this number upfront. Okay. So imagine you have two-dimensional data scattered as in the slide, and you're trying to find two clusters. The way you'd begin is to take the centroid of each cluster and drop it into this space randomly, as an initial guess as to where that cluster would be centered. So maybe we say here and here. Okay. And then the algorithm proceeds as follows: for each point in the data set, figure out which of these two centroids it's closer to. This one is closer to this one, this one is about halfway, but probably this one. This one's about halfway, but we'll say it's over here. So this is our initial guess at the clustering. We think that all these points belong to this cluster, and all these points belong to this cluster. So now you take the average of all the positions in each cluster to compute a new centroid value, and move the centroid there. So for example here, this one will shift this way, and this one will shift sort of this way. All right. So in the second iteration, our two centroids might look like this, and then we just repeat the process. For every data point, figure out which centroid it's closer to. This one's about halfway; how about we put it there. And this is our guess at time equals two for the clusters. Now once again, average all the positions within each cluster to find the new centroid. So here, these two points will pull a little bit in that direction, but most of the points are this way, so it will probably shift that way a little bit. And similarly with this one, it'll shift this way.
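The loop just walked through, assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop moving, can be sketched in plain Python. This is a minimal sketch for illustration; the function and parameter names are mine, not anything standard:

```python
import math
import random

def kmeans(points, k, threshold=1e-4, seed=0):
    rng = random.Random(seed)
    # Initial guess: drop k centroids at randomly chosen data points.
    centroids = rng.sample(points, k)
    while True:
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(xs) / len(cluster) for xs in zip(*cluster)))
            else:
                new_centroids.append(old)  # empty cluster: keep old centroid
        # Stop once no centroid moved more than the threshold.
        if all(math.dist(a, b) <= threshold
               for a, b in zip(centroids, new_centroids)):
            return new_centroids, clusters
        centroids = new_centroids
```

Note that the result depends on the random initial placement, which is exactly the sensitivity to starting conditions discussed below.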
So at time three, we have a centroid there, and we have this centroid here. And now, once again, assign the points to the closest centroid, and re-compute the new centroid values. So this one will finally shift into the middle here, and this one will finally shift perhaps a little bit this way. And then at time equals four, we can repeat once more, and find that the centroids in this case don't move that much on the next iteration. And so once the movement of all the centroids is below a certain threshold (they haven't moved much, things don't change much, things have settled down), we stop the algorithm, and that's our clustering. And so in this case, all of these points will be assigned to this cluster, and all of these points will be assigned to this cluster. So you can parallelize this algorithm by splitting the data items across multiple machines. And one of the key ideas here is that the number of centroids is pretty small, or at least small enough to fit in memory, right? We're not typically looking for billions of clusters, and so you can broadcast those to every map task in, say, a MapReduce setup. So in the Map phase, the mappers look at the data points, and they have access to all the centroids, so they can figure out which one each data point is closest to, and then send it to the Reduce phase according to the cluster that it was assigned to. Okay, so all that can happen in parallel. And then on the Reduce side, there's one reduce task per cluster, and it can compute the new centroid value. So there's a bit of a weakness here, because one reduce task could have a lot of work to do, because it might have most of the data points, and that could be a problem. But in principle, things are balanced, and you might get a pretty good parallel speedup. And the other weakness with this MapReduce implementation is that there's no support for the iteration of k-means.
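One iteration of this MapReduce scheme might look something like the following Python sketch, with the map and reduce phases written as ordinary functions. The names `map_phase` and `reduce_phase` are illustrative, not from any real framework; the key detail is that the small list of centroids is available to every mapper:

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Emit (cluster_index, point) pairs. The broadcast centroid list
    lets each mapper assign its points independently, in parallel."""
    for p in points:
        i = min(range(len(centroids)),
                key=lambda c: math.dist(p, centroids[c]))
        yield i, p

def reduce_phase(pairs):
    """One logical reduce task per cluster: average the points
    assigned to each centroid to produce the new centroid."""
    groups = defaultdict(list)
    for i, p in pairs:  # the shuffle: group pairs by cluster index
        groups[i].append(p)
    return {i: tuple(sum(xs) / len(pts) for xs in zip(*pts))
            for i, pts in groups.items()}
```

The load-balance weakness mentioned above is visible here: if most points land in one cluster, one entry of `groups` dominates the reduce work.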
We have to repeat this over and over again, and so we need some kind of external driver program that will keep kicking off MapReduce jobs one at a time. And we'll talk a little more about this in the final week of lectures. So, summarizing the weaknesses of k-means: first of all, you have to know the number of clusters up front. You also have a fair amount of sensitivity to the initial starting conditions, in that there's no unique solution; depending on the starting conditions, you may get a different answer. And there's also some sensitivity to the stopping threshold. As is always the case, there's been a lot of work on repairing these various problems. But when someone just says k-means, unqualified, they're typically talking about the algorithm we just described, which is sensitive to these issues.
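The external-driver pattern mentioned above, a plain loop that keeps launching one map-plus-reduce pass until the centroids settle below the stopping threshold, could look roughly like this. Here `run_one_pass` is a hypothetical stand-in for whatever actually submits one MapReduce job:

```python
import math

def drive_kmeans(points, centroids, run_one_pass,
                 threshold=1e-4, max_iters=100):
    """Driver loop: launch one pass per iteration until convergence.
    `run_one_pass` is a placeholder for a job-submission call."""
    for _ in range(max_iters):
        new = run_one_pass(points, centroids)  # one full map+reduce pass
        moved = max(math.dist(a, b) for a, b in zip(centroids, new))
        centroids = new
        if moved <= threshold:  # centroids barely moved: stop
            break
    return centroids
```

The `max_iters` cap is a common safety net, since the threshold alone decides when "things have settled down".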