[MUSIC] Let's now turn to applying MapReduce to k-means. First, let's recall the first two steps of k-means. There's one step, which we'll now refer to as the classify step, where we assign each data point to a cluster. Then we have what we'll now call the recenter step, where we take all the data points that were assigned to a given cluster and use them to update our cluster center.

Let's think about these steps within the context of MapReduce. What we see is that the first step is a data-parallel operation: for every data point, once you give me the cluster centers, I can independently assign that data point to a cluster center, and this assignment does not depend at all on the other data points. So in particular, the mapper is going to emit (cluster label, data point) pairs, where the cluster label serves as the key and the data point serves as the value in this key-value pair. Then when we turn to the recenter step, we see that this is a step where we're performing an aggregation, and this aggregation is independent across different cluster labels, so across different keys. In particular, for every cluster, I simply look at the data assigned to that cluster, sum all the values, and normalize. So that can be implemented using a reduce step.

So let's dig into each one of these steps in a little bit more detail, starting with the classify step and the mapper for this step. The mapper takes in the set of cluster centers and a single data point. It then computes the distance between that data point and each one of the cluster centers, and returns the index of the cluster center that's closest to that data point. In particular, it emits a (cluster label, data point) pair, where again the cluster label serves as the key and the data point serves as the value.
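The classify-step mapper just described can be sketched as follows. This is an illustrative sketch rather than any particular MapReduce library's API (the function and variable names are mine), assuming squared Euclidean distance as the metric:

```python
def classify_mapper(centers, x):
    """Emit a (cluster label, data point) key-value pair for one point.

    centers: list of cluster-center vectors (lists of numbers)
    x: a single data-point vector

    The mapper finds the index of the center closest to x (here using
    squared Euclidean distance) and uses that index as the key.
    """
    label = min(
        range(len(centers)),
        key=lambda j: sum((xi - cj) ** 2 for xi, cj in zip(x, centers[j])),
    )
    return (label, x)
```

Because the mapper only needs the current centers and one point, this step is data parallel: any number of mapper calls can run at once, one per data point.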
So for example, maybe we would emit (2, [17, 0, 1, 7, 0, 0, 5]) if this data point, which might have been a word count vector with counts 17, 0, 1, 7, 0, 0, 5 over some set of words in a vocabulary, is assigned to cluster 2. So this is if z_i = 2. Okay, so hopefully from this it's easy to see how we can do this classify step using a map phase.

Then for the recenter step, we see that we're just summing all the data points within a given cluster, dividing by the total number of data points in that cluster, and computing this as the new mean for that cluster. So our reducer is going to take in a given cluster label, which remember is our key, and a list of values associated with that key. Here, these are all the data points assigned to cluster j, so they have key j. What this reducer does is it starts with the total mass in the cluster being 0, and the total number of observations in the cluster also initialized to 0. Then, for every value x in this list, we simply increase the total mass by an amount x, and increase the total count of observations in the cluster by 1. Finally, we return a key-value pair where the key is the cluster label and the value is the new mean, which is the total mass divided by the total number of observations in the cluster. So again, we see how our recenter step fits very nicely within the reduce phase.

So that's MapReduce for k-means. But there are a couple of things worth highlighting. One is the fact that k-means is an iterative algorithm: you run this classify, recenter, classify, recenter, again and again until convergence. So this actually requires an iterative version of MapReduce, which is a non-standard formulation, but it's possible. The other thing to remember is that the mapper gets the entire set of cluster means, and there's lots of data to plow through.
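The recenter-step reducer just walked through can be sketched in the same hypothetical style, accumulating a running total mass and a count and returning their ratio as the new mean:

```python
def recenter_reducer(j, points):
    """Reduce step: compute the new mean for cluster j.

    j: the cluster label (the key)
    points: list of data-point vectors assigned to cluster j (the values)
    """
    total_mass = [0.0] * len(points[0])  # running sum of the points, starts at 0
    count = 0                            # number of observations in the cluster
    for x in points:
        total_mass = [t + xi for t, xi in zip(total_mass, x)]
        count += 1
    new_mean = [t / count for t in total_mass]
    return (j, new_mean)
```

Since each reducer call only touches the values sharing one key, the recenter step is data parallel over cluster labels: one reducer call per cluster.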
And remember, in the very general formulation, or the naive implementation, every mapper call gets just a single data point, and there are lots and lots of data points, so this is a lot of data to be passing around. A more efficient implementation is, per mapper call, to give a whole set of data points to classify, a whole set of points to plow through. That would be a much more efficient implementation of MapReduce for k-means.

To sum up, the steps involved in k-means fit really nicely within this MapReduce framework. The map step corresponds to our classify step, which is data parallel over our data points, just assigning data points to cluster labels. And the reduce step corresponds to the recenter step in k-means, which is data parallel over the cluster labels, which represent the keys in this example: for every cluster, we simply aggregate all the data within that cluster and divide by the total number of observations, which is equivalent to computing the mean. [MUSIC]
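The batching idea mentioned above, one mapper call per block of points rather than per point, can be sketched like this (again an illustrative sketch with made-up names, not a real library API, assuming squared Euclidean distance):

```python
def classify_mapper_batched(centers, points):
    """Emit (cluster label, data point) pairs for a whole block of points.

    Handing each mapper call a block of points means the full set of
    cluster centers is shipped to the mapper once per block instead of
    once per individual data point.
    """
    pairs = []
    for x in points:
        label = min(
            range(len(centers)),
            key=lambda j: sum((xi - cj) ** 2 for xi, cj in zip(x, centers[j])),
        )
        pairs.append((label, x))
    return pairs
```

The per-point logic is unchanged; only the granularity of each mapper call differs, which cuts the overhead of repeatedly distributing the cluster centers.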