And then 3 merged with those, and 4 and 5, and so

we get the same sort of cluster that we had in this example.

So this is how we would walk through a single link clustering example with

the similarity matrix.

So remember, with the similarity matrix, we're trying to find the max similarity.

If we had this as a distance matrix instead, it would be: find the minimum distance.

So it just depends on whether you're defining distance or similarity.
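As a quick sketch of that selection rule, here is a tiny hypothetical 3x3 similarity matrix (made-up values, not the lecture's actual example) showing that taking the maximum similarity and taking the minimum distance pick out the same pair:

```python
# Hypothetical similarity matrix for points 0, 1, 2 (made-up values).
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

n = len(sim)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

# With a similarity matrix, merge the pair with MAXIMUM similarity...
best_sim_pair = max(pairs, key=lambda p: sim[p[0]][p[1]])
print(best_sim_pair)   # (0, 1)

# ...with a distance matrix (here just 1 - similarity for illustration),
# merge the pair with MINIMUM distance. Same pair either way.
dist = [[1.0 - s for s in row] for row in sim]
best_dist_pair = min(pairs, key=lambda p: dist[p[0]][p[1]])
print(best_dist_pair)  # (0, 1)
```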

Now, with single-link clustering,

there's a whole bunch of different ways this can look.

So here's another example, and with single-link clustering,

we're going to wind up with these nested sorts of structures, where 2 and

5 are the closest to each other, then 3 and 6 are the next closest.

Then we might end up getting 4, then number 1, so we get this sort of structure

where the closest things merge, and you get this sort of nested structure out.

Now this is great if you expect some sort of nested structure in your data set, but

it doesn't always mean that this is the best sort of choice.
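The nested merge order can be sketched as a bare-bones single-link loop. The five 1-D point values here are made up (so the merge order differs from the lecture's figure), but the idea is the same: the closest pair merges first, then merges keep nesting outward:

```python
# Made-up 1-D points; single link merges nearest clusters first.
points = {1: 0.0, 2: 1.0, 3: 1.1, 4: 3.0, 5: 6.0}

# Every point starts in its own cluster.
clusters = {k: [k] for k in points}

def single_link_dist(a, b):
    # Single link: cluster distance = closest pair of members.
    return min(abs(points[i] - points[j]) for i in a for j in b)

merges = []
while len(clusters) > 1:
    # Find the pair of clusters at minimum single-link distance.
    a, b = min(
        ((a, b) for a in clusters for b in clusters if a < b),
        key=lambda p: single_link_dist(clusters[p[0]], clusters[p[1]]),
    )
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]
    merges.append((a, b))

# Nested structure: 2-3 merge first, then 1 joins, then 4, then 5.
print(merges)  # [(2, 3), (1, 2), (1, 4), (1, 5)]
```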

So the strengths of single-link clustering are, if these are our original points,

it's going to wind up splitting these into two nice clusters.

So we got a blue cluster and a red cluster.

So single-link clustering will work really well with this if we have sort of this split,

and it can handle non-elliptical shapes.
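To illustrate that strength, here's a small sketch (with made-up points) of single link correctly separating two elongated, non-elliptical strips: because merging always follows the nearest neighbour, each strip chains together before the two strips ever get close:

```python
# Two made-up parallel strips of points, 2.0 apart; neighbours within a
# strip are only 0.5 apart, so single link chains along each strip.
line_a = [(x * 0.5, 0.0) for x in range(8)]
line_b = [(x * 0.5, 2.0) for x in range(8)]
points = line_a + line_b

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

clusters = [[p] for p in points]
# Merge until two clusters remain, always joining the pair of clusters
# whose CLOSEST members are nearest (the single-link rule).
while len(clusters) > 2:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] += clusters[j]
    del clusters[j]

# Each strip ends up as its own cluster of 8 points.
print(sorted(len(c) for c in clusters))  # [8, 8]
```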

The limitations, though, are that

if we have any sort of overlap here, we're going to get these weird,

elongated sorts of clusters, and it becomes sensitive to noise and outliers.

If you look really closely here, these points all the way out here become blue,

and so we have a blue cluster that stretches sort of all the way over here,

and a red cluster that goes to here.

So we have a whole bunch of weird overlap that occurs due to this noise and

outliers that can produce these long, elongated clusters.

So it's not good at handling this sort of noisy data structure.

So if you think your data has some clear separations, it will work pretty well.

If you have some noise and overlap,

single-link agglomerative clustering will have problems.

Now we don't have to do single-link though,

we can actually do what's called complete-link.

With complete link, the distance between clusters is the maximum

distance between any two objects, or equivalently the smallest similarity between them.

So distance is defined by the two most dissimilar objects, and so

we can walk through that exact same dataset now with complete-link clustering.
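The definition is easy to sketch: for two small made-up clusters, complete link takes the farthest cross-cluster pair, where single link would take the nearest:

```python
# Two made-up clusters of 2-D points.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (5.0, 0.0)]

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# Complete link: distance between the two MOST dissimilar members.
complete = max(dist(p, q) for p in cluster_a for q in cluster_b)
# Single link, for contrast: distance between the two closest members.
single = min(dist(p, q) for p in cluster_a for q in cluster_b)

print(complete)  # 5.0 -> pair (0,0) and (5,0)
print(single)    # 3.0 -> pair (1,0) and (4,0)
```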

So, where students get confused is in the selection choice.

So with single link clustering, we're always taking the maximum similarity.

In complete-link clustering we're taking the minimum similarity.

But that minimum is only used to fill out the updated distance matrix.

Once we fill out the distance matrix,

we are still taking the highest similarity here.

So, the highest similarity is still between 1 and 2.

So 1 and 2 get merged first no matter what in this example.

So now, our distance matrix becomes (I1, I2), I3, I4, and I5.

But where the choice between complete and single link comes in

is going to be in this step, in how we fill in the values here.
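That update step can be sketched directly. The similarity values below are made up (not the lecture's actual matrix), but they show the one place the two methods differ: after merging clusters 1 and 2, single link fills each new entry with the MAX of the two old similarities, and complete link fills it with the MIN:

```python
# Made-up pairwise similarities before the merge of clusters 1 and 2.
sim = {
    (1, 2): 0.90, (1, 3): 0.10, (1, 4): 0.65, (1, 5): 0.20,
    (2, 3): 0.70, (2, 4): 0.60, (2, 5): 0.50,
}

# Fill in the merged cluster's row under each rule.
merged = {}
for k in (3, 4, 5):
    single = max(sim[(1, k)], sim[(2, k)])    # single-link update
    complete = min(sim[(1, k)], sim[(2, k)])  # complete-link update
    merged[k] = (single, complete)

# Same merge, different filled-in values -> different later merges.
print(merged)  # {3: (0.7, 0.1), 4: (0.65, 0.6), 5: (0.5, 0.2)}
```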