0:56

So how do you find this? You can probably just look at this graph and see: if you set k as three, or even as two, you can probably find quite reasonable clusters. But if you set it as four or five, no matter how smart you are, the clusters you find probably will not be very stable.

So, to test cluster stability, we introduce a method called the bootstrapping approach, which finds the best value of k judged by stability.

How can we do this? Bootstrapping basically says: from D, you take t samples of size n, where the sampling is done with replacement. So each time you sample from essentially the same D, and you repeat this t times. Then you get the samples D sub 1, D sub 2, ..., D sub j, ..., D sub t. Okay.

Then you run the same clustering algorithm on each sample, with the k value going from two up to the maximum k you like. Okay.

Then you can compare the distance between all pairs of clusterings. For example, for each k, you compare the clusterings obtained from D sub i and D sub j, two different sample datasets. Once you have all the sample clusterings, you compute the expected pairwise distance for each value of k. That means, suppose you take ten samples: you get ten clusterings, and you look at their pairwise distances. Okay.

Then what you look for is the value of k that exhibits the least deviation between clusterings. That means when you cluster the samples D sub i and D sub j, the results have the least deviation. Okay. That k should be the most stable one, so you may want to choose that k value. That's one application of testing cluster stability: finding the best k.
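The bootstrapping procedure above can be sketched in code. This is a minimal illustration under stated assumptions, not the lecture's exact recipe: it assumes scikit-learn, and since the lecture leaves the distance between two clusterings unspecified, it uses one common choice, 1 minus the adjusted Rand index between the label assignments that each sample's centroids induce on the full dataset D.

```python
# Bootstrap stability for choosing k -- a minimal sketch.
# Assumption: the "distance" between two clusterings is taken as
# 1 - adjusted Rand index of their labelings of the full dataset.
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def bootstrap_stability(D, k, t=10, seed=0):
    """Expected pairwise clustering distance over t bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(D)
    labelings = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)            # sample with replacement
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(D[idx])
        # Label ALL of D with this sample's centroids so that clusterings
        # from different bootstrap samples are directly comparable.
        labelings.append(km.predict(D))
    dists = [1 - adjusted_rand_score(a, b)
             for a, b in combinations(labelings, 2)]
    return float(np.mean(dists))

# Three well-separated blobs: k = 3 should show the least deviation.
D, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
scores = {k: bootstrap_stability(D, k, t=8) for k in range(2, 6)}
best_k = min(scores, key=scores.get)    # least deviation -> most stable k
```

Predicting on the full D is one way to compare clusterings built on different samples; other stability measures (e.g. matching clusters on the overlap of two samples) are also used in practice.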

Actually, there are many methods for finding k, the appropriate number of clusters. One is called the empirical method: people simply give you an empirical formula.

For example, for a total dataset of n points, you may take the square root of half of n, that is, k ≈ √(n/2). Okay. For example, if n is 200, the expected number of clusters you need should be 10.

Of course, this may work out for a small number of points. If you have a really large number of points, this formula will give you a very large k, and you probably do not want that many clusters just because you have many points.
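The rule of thumb above is a one-liner; here is a small sketch (plain Python, no assumptions beyond the formula itself), which also shows why it breaks down for very large n:

```python
# Empirical rule of thumb from the lecture: k ~ sqrt(n / 2).
import math

def empirical_k(n):
    """Rule-of-thumb number of clusters for n data points."""
    return round(math.sqrt(n / 2))

k_small = empirical_k(200)        # sqrt(200 / 2) = sqrt(100) = 10
k_large = empirical_k(1_000_000)  # hundreds of clusters -- usually too many
```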

Another method people use is called the elbow method. The elbow method means that as the number of clusters goes one, two, three, four, and so on, for each number of clusters you compute the sum of the within-cluster squared variance. That is, you look at the within-cluster variance as a function of k and find the elbow point, the turning point: that indicates the best number of clusters. So that's one method.
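A sketch of the elbow computation, assuming scikit-learn, whose KMeans exposes the within-cluster sum of squares as `inertia_`; the synthetic three-blob data is just for illustration:

```python
# Elbow method sketch: within-cluster SSE for k = 1..7, then look
# for the turning point where the drop in SSE levels off.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

D, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=1)

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(D)
    sse[k] = km.inertia_   # sum of squared distances to closest centroid

# SSE always shrinks as k grows; the elbow is where the drop flattens.
drops = {k: sse[k - 1] - sse[k] for k in range(2, 8)}
```

With three true blobs, the drop from k = 2 to k = 3 is large and the drops after that are small, so the curve "turns" at k = 3.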

Another method of finding the best number k is to use cross-validation. That means you divide a given dataset into m parts. You take one part as the test data and use the remaining parts to do the clustering. Okay. You do this m times, so you can check the overall quality of the clustering.

Then how do we test the quality? Okay. What we do is: suppose we use these m minus 1 parts to do the clustering, and we find k clusters. Then for each point in the test set, you find its closest centroid. Based on this, over all the points in the test set, you compute the sum of squared distances, that is, the sum of squared errors, SSE, as we introduced.

Then you check how well you fit the test dataset, that is, whether you get the smallest sum of squared distances. Since for cross-validation you repeat this m times, you can compare the overall quality measure with respect to different k's and find which number fits the data best. That means the k for which the overall quality measure, the SSE, is lowest. Usually, this is the right number of clusters.
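The cross-validation loop above can be sketched as follows, assuming scikit-learn; the use of KFold and the synthetic blob data are illustration choices, not the lecture's:

```python
# Cross-validation sketch for choosing k: cluster on m-1 folds,
# score the held-out fold by its SSE to the nearest learned centroid,
# and average over the m folds.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold

def cv_sse(D, k, m=5):
    """Average held-out SSE over m folds for a given k."""
    total = 0.0
    for train_idx, test_idx in KFold(n_splits=m, shuffle=True,
                                     random_state=0).split(D):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(D[train_idx])
        # Squared distance of each test point to every learned centroid.
        d2 = ((D[test_idx, None, :] - km.cluster_centers_[None, :, :]) ** 2
              ).sum(axis=2)
        total += d2.min(axis=1).sum()   # nearest-centroid SSE of this fold
    return total / m

D, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=2)
scores = {k: cv_sse(D, k) for k in range(2, 7)}
```

One practical caveat: held-out SSE typically keeps shrinking slowly as k grows, so in practice you look for the k beyond which it stops improving appreciably, rather than blindly taking the absolute minimum.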

[MUSIC]
