0:00

So just to illustrate how hierarchical clustering works, I'm going to simulate a very simple data set here. I always set the seed so that the data are reproducible, and you can run this code to simulate the data for yourself and take a look at it. I've plotted a total of 12 points, and there are three clusters we can see very clearly from the plot. I've labelled each of the points using the text function, so you can see which point is which. Now I'm going to run the hierarchical clustering algorithm to see how the points get merged together.
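The simulation itself isn't shown in the transcript; a minimal sketch of the kind of code being described (the seed, cluster means, and spread are assumptions) is:

```r
# Simulate 12 points in three clusters of four (assumed parameters),
# plot them, and label each point with its index via text().
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))
```

Setting the seed first means anyone running the same code sees the same 12 points.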

Â 0:34

So the first thing you need to do to run a hierarchical clustering algorithm is to calculate the distance between all of the different points. That is, you need to calculate all the pairwise distances between all the points, so you can figure out which two points are closest together. The easiest way to do this is with the dist function in R. The dist function takes a matrix or data frame; here, this data frame has two columns, where the first column is the x-coordinate and the second column is the y-coordinate.

Â 1:00

So it's basically a matrix or a data frame of points. What the dist function does is calculate the distance between all the different rows in the data frame, and it returns what's called a distance matrix, which gives you all the pairwise distances. If you call dist without any other arguments, it defaults to the Euclidean distance metric, but you can specify other distance metrics as options. Down below you can see most of the distance matrix that is returned by dist, and you can see that it's a lower triangular matrix.
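The dist call being described can be sketched as follows (the simulation parameters are assumptions carried over from above):

```r
# Build the two-column data frame of coordinates and compute all
# pairwise Euclidean distances between its rows.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame)  # Euclidean by default; method = "..." for others
print(distxy)              # prints as a lower-triangular matrix
```

For 12 points there are choose(12, 2) = 66 pairwise distances in the object.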

Â 1:35

And it gives you all the pairwise distances. So, for example, the distance between point 1 and point 2 is 0.34. Of course, the actual distances here are meaningless, because I just simulated the data, so the numbers aren't particularly meaningful. But you can see that some points are farther apart than others. For example, the distance between point 3 and point 1 is 0.574, and the distance between point 3 and point 2 is 0.24, so point 3 is closer to point 2 than it is to point 1. And you can go down the line like this and see how far apart each of the various points is from the others.
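Scanning the distance matrix for the smallest entry, as the algorithm does at its first step, can be sketched like this (seed and simulation parameters are assumptions):

```r
# Find the pair of points with the smallest pairwise distance.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dxy <- as.matrix(dist(data.frame(x = x, y = y)))
diag(dxy) <- Inf                                   # ignore zero self-distances
closest <- which(dxy == min(dxy), arr.ind = TRUE)[1, ]
closest   # row/column indices of the first pair to be merged
```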


So the idea of the hierarchical clustering algorithm here is that we start by taking the two points that are closest to each other. That happens to be points five and six, which I've colored in orange, and the idea is that because points five and six are the closest together, we're going to group them together and merge them into a single cluster. So here I'm going to create a single point, and the little plus sign in the middle is the new location of this merged set of points.
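Marking that merged location can be sketched by plotting a "+" at the mean of the two merged points (seed and parameters are assumptions; hclust's linkage choice, not this midpoint, determines the actual merge distances):

```r
# Plot the simulated points, then overlay a plus sign at the mean
# of points 5 and 6 to mark the merged cluster's location.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
points(mean(x[5:6]), mean(y[5:6]), pch = 3, cex = 2)  # pch = 3 is "+"
```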

Â 2:52

Now the next two points that are closest together are 10 and 11, down in the lower right in red. So I'm going to take those two points and merge them together to create a new kind of super point.

Â 3:10

And eventually we get a little picture here called the dendrogram, and it shows us how the various points got clustered together. So you can see on the very right side there are points five and six grouped together, and in the middle there you've got points ten and 11. The points that are farther down the tree are the ones that got clustered first, and the points that are farther up got clustered later. So you can see that points five and six, once they got merged together, then got clustered with point seven. And then points five, six, and seven, all merged together into a single super point, got merged with point 8, and so on.
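The dendrogram described above comes from running hclust on the distance matrix and plotting the result; a minimal sketch (seed and simulation parameters assumed) is:

```r
# Hierarchical clustering of the simulated points and the dendrogram.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)
hc <- hclust(dist(dataFrame))  # complete linkage by default
plot(hc)                       # draws the dendrogram
```

With 12 points there are always 11 merges, one per row of hc$merge.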

One of the things about the dendrogram produced by the clustering algorithm is that it doesn't actually tell you how many clusters there are, right? You'll notice that there's no specific label on the plot that tells you there are two clusters or three clusters or whatever. So what you have to do is cut this tree at a certain point to determine how many clusters there are. So, for example, suppose I were to cut it at the point that's labeled 2.0 on the y-axis. For example, if I were to draw a horizontal line

Â 4:24

at the level of 2.0, the question is how many branches I would run into. So if I were to draw a horizontal line at 2.0, I would run into two branches, and that would indicate to me that there are roughly two clusters. However, if I were to draw a horizontal line at the height of 1.0, you'll see that you run into three branches there, so that would tell you that there are roughly three clusters. So depending on where you draw this horizontal line across the tree, you'll get more or fewer clusters in your clustering. Of course, in the extreme case, if you were to cut the tree all the way at the bottom, you'd just get 12 clusters, which equals the number of data points. And so you have to cut the tree at a place that's convenient for you. At this point we don't really have a rule for where to cut it, but once you do cut it, you can get the cluster assignments.
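Cutting the tree and reading off the cluster assignments is what cutree does: you can cut at a given height, or ask for a fixed number of clusters. A sketch under the same assumed simulation:

```r
# Cut the dendrogram to get cluster assignments.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
hc <- hclust(dist(data.frame(x = x, y = y)))
cutree(hc, h = 1.0)  # labels from cutting at height 1.0
cutree(hc, k = 3)    # or request exactly three clusters
```

Cutting at height 0 recovers the degenerate case from the transcript: every point is its own cluster, so 12 clusters in all.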

Â