0:00

Hierarchical clustering is kind of a bread and butter technique

Â when it comes to visualizing a high dimensional or multidimensional data.

Â It's very simple to use.

Â The ideas are fairly intuitive for most people,

Â and it kind of, can serve as a really

Â quick way to get a sense of what's going on in a very high dimensional data set.

Â So, of course, like, like most clustering techniques,

Â 0:22

for example, like, k-means for example the basic issue

Â that you have to deal with, you know, when you define the method is,

Â how do we define when things are close together and when things are far apart?

Â Because all clustering methods in some sense try to tell you

Â when one thing is closer to another thing, versus something else.

Â And so, for example, things that are, you know, cluster are

Â closer to each other than they are to elements of another cluster.

Â And so, we have to define what it means

Â to be close what it means to group things together.

Â How do we group things together?

Â And then once we've got, we have a distance

Â notion, we have, and we have a way of grouping

Â things together, how do we visualize the, you know,

Â the processing that we've done, the calculation that we've done?

Â And how do we interpret you know, how the grouping is put together?

Â Cluster analysis is a really important and widely used technique.

Â It's a, if you just type cluster analysis into

Â Google, there are literally millions of results that come back.

Â And it's a widely applied method in all,

Â in many different areas of science and other applications.

Â And so it's useful to know kind of how these techniques work.

Â 1:25

So, hierarchical clustering as, as is denoted by the

Â title involves organizing your data into a kind of hierarchy.

Â And, and the common approach is

Â what's called an aglomer, and aglomerative approach.

Â And the idea, it was just kind of a bottom up approach, so you

Â start with the data as kind of individual data points, and you start lumping

Â them together into clusters until eventually you have, you

Â know, your entire data set is just one big cluster.

Â So you can imagine there's all these little particles floating around,

Â and eventually you start kind of grouping them together into little balls.

Â And then the balls get grouped up into bigger balls, and

Â the bigger balls get grouped together into one big massive cluster.

Â And so that's the kind of agglomerative approach to clut,

Â to hierarchical clustering, and that's what we're going to talk about here.

Â And so the basic idea is that you gotta find the,

Â the two points in your data set that are the closest together.

Â And then, you kind of merge them together to

Â create one, you know, super-point, so that the merged point

Â is not an actual point in your data set, but

Â you can think of it as being a new point.

Â And then, you kind of remove your original two data points,

Â and you substitute with that, or that kind of, merged point.

Â And you kind of repeat this over and over again.

Â So, then, you find the next two closest things, and you put them together.

Â 2:35

And then you, as you kind of, and then once

Â you kind of put things together, you get like a

Â little tree that you can show that kind of, that

Â shows the ordering in which things were put together.

Â And so this approach requires two important things.

Â One is a distance metric.

Â So what is how do you calculate the distance between two points?

Â 2:53

And, the other thing that is required is an approach for merging points.

Â So once you, once you've figured out, okay well, these two points are the closest

Â how do you merge them together?

Â And so the nice thing that hierarchical clustering produces is a, is a

Â tree which is sometimes called the dendrogram

Â that shows how things are merged together.

Â And so the most important, arguably the most

Â important question to really, to kind of resolve

Â in a, in a hierarchical clustering approach is

Â to define what do we mean by close.

Â Because if you have a definition that doesn't

Â make sense for your problem then you just get

Â garbage in, garbage out, right?

Â So you have a, a distance metric that doesn't make

Â sense then the results that you get will be relatively meaningless.

Â So there are a few examples of distance metrics that are used out there.

Â Probably the most kind of familiar distance metric is the Euclidean distance,

Â which is just kind of the straight-line distance between any two points.

Â 3:46

A kind of analogous distance metric is the correlation between two points, or the

Â correlation similarity.

Â And then another metric is the binary distance or, or the Manhattan distance,

Â and I'll, we'll talk a little bit about what that means in a second.

Â So you have to make, you have to pick a distance metric that's, makes

Â sense for your problem so that you can produce results that also make sense.

Â 4:08

And so, here's Euclidean distance, a simple diagram.

Â So if you take two cities, for example, Baltimore and D.C., or Washington D.C.,

Â then if you map, put them on a, on a, on a map for example.

Â And there's, there's, so, you can imagine that the center

Â of each city has an X and a Y coordinate.

Â 4:24

And you want to say map the distance between the two centers of the two cities.

Â well, you can draw a straight diagonal line between the two cities.

Â And then, the you can calculate the distance, kind of, in the usual way.

Â And the distance is going to be a function of the difference

Â in the x coordinates and the difference in the y coordinates.

Â So to calculate the Euclidian distance you take the distance you know, in the x

Â coordinates, you square it, you take the difference

Â in the y coordinates and you square it.

Â You add the two squares together, you take the square root.

Â That's the classical definition of Euclidian distance.

Â And so you can imagine, you know, in real life, it'd be like

Â if a bird were to fly from DC to Baltimore, they would just fly

Â straight from one city to another, they could, because they can fly in

Â the air and they're not impeded by things like roads or mountains, or whatever.

Â And so

Â that's the straight line distance between two cities.

Â Whether that makes sense for you depends on

Â you know, whether you're a bird or something else.

Â And so you have to think about kind of that in the context of your problem.

Â Now Euclidean distance is easily generalizable to higher dimensions.

Â But even if you have, you know, instead here we

Â just have two dimension, but if you have 100 dimensions, you

Â can easily take the difference between 100, each of the dimensions,

Â square it, sum it together and then take the square root.

Â So it's a very nice type

Â kind of, it, it extends very naturally to very high dimensions problems.

Â [BLANK_AUDIO]

Â The Manhattan distance gets its name from the idea that, you know, you can, you

Â can look at points on a kind of a grid or a city block grid.

Â Think or imagine you're in the city of Manhattan in New York.

Â And you want to get from one point to another.

Â So you can look at the two black circles on this diagram.

Â And if you want to get from one point to another,

Â if you're in a city, you know, in a city

Â block, you can't just go directly from one point to

Â another, because there'd be all these buildings in the way, or

Â you'd have to follow the streets.

Â 6:11

And so in the streets, because they're in a grid,

Â you can either go up or down, or left or right.

Â The green line here in that case would

Â represent the Euclidean distance which would be like if,

Â you know, if you were a bird, and you

Â wanted to fly over everything, across the two points.

Â But as a person walking on the ground, you have to take either the red line,

Â the blue line, or the yellow line, or

Â any other, kind of, kind of pattern in between.

Â The point is that you'd have to travel along

Â the city blocks.

Â And so, the distance that you would travel along the city blocks here, in the

Â gray, kind of, you can imagine the gray

Â lines as the streets, that's the Manhattan distance.

Â And it, it can written formulaically as the,

Â as the absolute sum of all the different coordinates.

Â So if you have two coordinates, you know, and you

Â sum the distance that you travel in the X dimension,

Â and then you sum, and then the distance, the absolute

Â value of the distance you travel in the Y direction.

Â And so the Manhattan distance can be useful in some contexts, and, and so and,

Â because it more, it may, in some contexts,

Â more accurately represent the distance between two points.

Â In particular, in a city like Manhattan, two points may be very far apart.

Â