0:00

This lecture is going to be about K-means clustering.

K-means is actually a relatively old technique, developed quite a while ago, but it remains very useful for summarizing high-dimensional data and getting a sense of what patterns your data shows: which observations are similar to each other, and which observations are very different from each other. So we're going to talk a little about how K-means works, and show what it can do for you.

0:24

The basic principle behind K-means clustering is to first define what it means for things to be similar to each other, and what it means for things to be different from each other. So in some sense you want to define: what does it mean to be close? How do we group things together, and how do we visualize this grouping? And then once we visualize the grouping, how do we interpret what we see?

The most important thing is defining what we mean by close. We need a distance metric to define what it means for two things to be close together, because depending on the context, two things could seem close but not actually be very close.

1:00

And in a different context, you could have a totally different meaning of distance. So this is a very important step, and if you don't do it well, you end up with nonsense at the end: the classic garbage in, garbage out.

1:12

A couple of the traditional distance metrics in statistics involve a notion of continuous distance, such as Euclidean distance, which is the straight-line distance between two points.

1:23

Another continuous measure is a correlation similarity between, say, two variables, so you can see how correlated they are, or how similar they are. Highly correlated points would be considered similar.

1:34

And then another distance is called the Manhattan distance, which we'll talk a little more about later on. So you need to pick a distance or a similarity metric that makes sense for your problem.
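To make these three measures concrete, here is a minimal sketch in Python with NumPy (an assumption; the lecture itself doesn't show code or name a library):

```python
import numpy as np

# Two example points/variables in three dimensions (made-up values).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Euclidean distance: the straight-line distance between two points.
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: the sum of absolute coordinate differences.
manhattan = np.sum(np.abs(x - y))

# Correlation similarity: highly correlated variables count as "close".
correlation = np.corrcoef(x, y)[0, 1]

print(euclidean, manhattan, correlation)
```

Note that Euclidean and Manhattan are distances (smaller means closer), while correlation is a similarity (larger means closer), so clustering code that expects a distance often uses something like one minus the correlation.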

1:49

K-means clustering is a way of partitioning a group of observations into a fixed number of clusters. The idea is that you have to say beforehand how many clusters of points there are. So for example, if you have 100 observations in your data set, you might think that they could be neatly divided into, say, four different groups. So you have to have a sense of how many clusters there are first. And then each of these groups is going to have a centroid: a center of gravity around which each group gathers.

2:22

And then once you have these centroids, we assign each of the observations to one of them, and that's how we do the group assignment. So the basic algorithm for K-means is to pick some starting centroids, assign all the points to the centroids, then recalculate the centroids and reassign all the points. You iterate back and forth until you reach a solution.

So the things that you need are: a distance metric; a number of clusters, fixed and specified beforehand; and an initial guess as to where the centroids are. Often you might just pick some random points to start the algorithm. The K-means clustering algorithm will produce a final estimate of where the cluster centroids are, and it will tell you which centroid each observation is assigned to.
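The whole loop just described can be sketched in a few lines. This is a minimal Python/NumPy version (an assumption; the lecture shows no code), with the initial centroids passed in explicitly, playing the role of the random starting guesses:

```python
import numpy as np

def kmeans(points, init_centroids, n_iter=10):
    """Minimal sketch of the K-means iteration described above.

    Assumes Euclidean distance; it does not handle the edge case of a
    centroid losing all of its points, which real implementations must.
    """
    centroids = np.asarray(init_centroids, dtype=float)
    k = len(centroids)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(k)])
    return centroids, labels
```

In practice you would use a library routine (for example, `kmeans` in R or `sklearn.cluster.KMeans` in Python) rather than rolling your own, but the core loop is exactly this assign-then-update iteration.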

Here's a quick example of how you might use the K-means clustering algorithm. I've generated some random data in two dimensions so it's easy to visualize. The x coordinates and the y coordinates all come from normal distributions with different means, so I specifically created three different clusters for these twelve observations, with four observations in each cluster. When I plot the data, it's very obvious that there are three clusters, and I've put labels on each of the points.
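A sketch of how such data might be simulated; the specific cluster means and spreads below are assumptions, since the lecture doesn't state them:

```python
import numpy as np

# Twelve observations in two dimensions: three clusters of four points,
# each drawn from a normal distribution around a different (made-up) mean.
rng = np.random.default_rng(1)
centers = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]  # assumed cluster means
points = np.vstack([rng.normal(loc=c, scale=0.15, size=(4, 2)) for c in centers])
print(points.shape)  # 12 points, 2 coordinates each
```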

3:47

So here we see that the algorithm has started, and I've chosen three random points to be the centroids. You can see there's a red plus, a purple plus, and an orange plus down at the bottom; those are the random starting points for the centroids.

The first thing we're going to do is take each of our data points and assign it to the closest centroid. You can see, for example, that for point number eight at the very top, the closest centroid is the red plus in the upper left, so it gets assigned to that cluster. Point number four, down in the lower left, gets assigned to the red cluster also. The three points one, two, and three get assigned to the orange plus, in the orange cluster. And then the rest of the points are closest to the purple plus, so they get assigned to the purple cluster. So that's the initial grouping of the points into the three different clusters.

4:40

And then the next thing we're going to do is recalculate the centroids. Now that every point is assigned to a cluster, we can recalculate the centroids, for example by taking the mean of each cluster. So now you can see that the purple plus has moved slightly to be in the middle of its cluster, the red plus has moved a little bit to be in the middle of its cluster, and the orange one is now in the middle of the three orange dots. So the cluster centroids get recalculated.

And then you repeat the process again: you take each data point and ask which cluster centroid it is closest to now. You can see that point number four, for example, used to be in the red cluster, but now it's closer to the orange centroid, so it gets reassigned. And point number seven is now closer to the red plus, so it's in the red cluster. Then we recalculate the centroids from this, and you see the centroids keep moving toward the points. So you can see how the clusters form as you go step by step, but the two processes are to assign the points to each cluster, and then to update the cluster centroid locations.
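Those two alternating steps can also be written as separate functions. A minimal Python/NumPy sketch (again an assumption, with Euclidean distance and made-up example coordinates) showing one round of assign-then-update:

```python
import numpy as np

def assign(points, centroids):
    # Assignment step: label each point with the index of its closest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update(points, labels, k):
    # Update step: move each centroid to the mean of the points assigned to it.
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])

# One round of the iteration on a few hypothetical points:
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centroids = np.array([[0.0, 0.5], [4.0, 4.0]])  # made-up starting guesses
labels = assign(points, centroids)      # first two points -> centroid 0, last two -> centroid 1
centroids = update(points, labels, 2)   # centroids move to the cluster means
```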