In this session, we are going to study another issue in clustering assessment, called cluster stability. Cluster stability basically says: if you have chosen the right parameters, you should get very stable clustering results. That means if you take the dataset D, draw several sample datasets from it, and run the clustering on each sample, a good clustering should give you very similar results every time. Usually, you can use this to find good parameter values for a given clustering algorithm, for example, a good value of k, the number of clusters.

How do you find this? If you look at this graph, you can probably see that if you set k to three, or even to two, you can find quite reasonable clusters. But if you set it to four or five, no matter how smart you are, the clusters you find will probably not be very stable.

To test cluster stability, we introduce one method, called the bootstrapping approach, to find the best value of k judged by stability. How can we do this? From D, you take t samples of size n, where the sampling is done with replacement, so every sample is drawn from essentially the same D. Taking t samples gives you D_1, D_2, ..., D_t. Then you run the same clustering algorithm on each sample, with k ranging from two up to the maximum value you like. For each k, you compare the distance between all pairs of clusterings: for every pair of sample datasets D_i and D_j, you compute the distance between their clusterings, and then the expected pairwise distance over all pairs. For example, if k is ten, each sample gives you ten clusters, and you measure how far apart those clusterings are. The value of k that exhibits the least deviation between clusterings, that is, the k for which the clusterings of the different samples agree the most, is the most stable one, and that is the k you want. That is one application of cluster stability: testing it to find the best k. A small sketch of this procedure follows.
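To make this concrete, here is a minimal Python sketch of the bootstrapping procedure, assuming scikit-learn is available and using k-means as the clustering algorithm. The lecture leaves the distance between two clusterings unspecified, so this sketch substitutes the adjusted Rand index, a common agreement measure between clusterings; since it measures agreement rather than distance, the most stable k is the one with the highest average pairwise score. The function name, the default t = 20, and the range of k tried are all illustrative choices, not from the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def bootstrap_stability(X, k, t=20):
    """Estimate clustering stability for a given k.

    Draw t bootstrap samples (sampling with replacement), cluster each
    with k-means, then compare every pair of clusterings on the points
    the two samples share. Higher mean agreement = more stable k.
    """
    n = len(X)
    samples = [rng.choice(n, size=n, replace=True) for _ in range(t)]
    labelings = []
    for idx in samples:
        model = KMeans(n_clusters=k, n_init=10).fit(X[idx])
        # Record each original point's label (-1 = not drawn in this
        # sample; the last occurrence wins if a point is drawn twice).
        labels = np.full(n, -1)
        labels[idx] = model.labels_
        labelings.append(labels)

    scores = []
    for i in range(t):
        for j in range(i + 1, t):
            # Compare the two clusterings only on points both samples drew.
            common = (labelings[i] >= 0) & (labelings[j] >= 0)
            scores.append(adjusted_rand_score(labelings[i][common],
                                              labelings[j][common]))
    return np.mean(scores)

# Pick the k whose bootstrap clusterings agree the most.
# X = ...  # your n-by-d data matrix
# best_k = max(range(2, 11), key=lambda k: bootstrap_stability(X, k))
```

One reason to compare labelings through the adjusted Rand index is that it is invariant to how the clusters are numbered, which matters because two k-means runs will generally assign cluster IDs in different orders.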
Actually, there are many methods for finding an appropriate number of clusters. One is called the empirical method, which gives you an empirical formula: for a dataset of n points, take k to be about the square root of n/2. For example, if n is 200, the suggested number of clusters is the square root of 100, which is 10. Of course, this may work for a small number of points. If you have a really big number of points, say two million, this formula gives you k around a thousand, and you probably do not want a thousand clusters even for that many points.

Another method people use is called the elbow method. With the elbow method, you try the number of clusters k = 1, 2, 3, 4, and so on, and for each k you compute the sum of the within-cluster squared distances, that is, the within-cluster variance. You then look at this curve as a function of k and find the elbow point, the turning point where the curve stops dropping sharply; that is the k you want. So that is one method, and a sketch of it appears at the end of this session.

Another method for finding the best number k is to use cross validation. That means you divide the given dataset into m parts. You take one part as the test data and use the remaining m - 1 parts to do the clustering. You can do this m times, so you can check the overall quality of the clustering. Then how do we test the quality on the test examples? Suppose we use the m - 1 parts to do the clustering and we find k clusters. Then for each point in the test set, you find its closest centroid. Based on this, over all the points in the test set, you compute the sum of the squared distances, that is, the sum of squared errors (SSE) as we introduced before. The k that fits the test dataset best is the one that gives the smallest sum of squared distances. Since cross validation repeats this m times, you can compare the overall quality measure with respect to different k's and find which number fits the data best: the k with the lowest overall SSE is usually the right number of clusters. A sketch of this approach also appears below.
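Here is the promised sketch of the elbow method, again a minimal version assuming scikit-learn. KMeans exposes the within-cluster sum of squared distances as its inertia_ attribute, which is exactly the quantity the elbow method plots against k; the function name and the k_max default are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10):
    """Within-cluster sum of squared distances (SSE) for k = 1..k_max.

    Plot the returned values against k; the 'elbow' (the turning
    point where the curve stops dropping sharply) suggests a good k.
    """
    sse = []
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10).fit(X)
        sse.append(model.inertia_)  # sum of squared distances to centroids
    return sse

# Example: print the curve and eyeball the turning point.
# X = ...  # your data matrix
# for k, s in enumerate(elbow_curve(X), start=1):
#     print(k, s)
```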
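Finally, a minimal sketch of the cross-validation approach, assuming scikit-learn's KFold and KMeans. The helper name cv_sse and the choice of m = 5 folds are assumptions for illustration; the lecture only specifies dividing the data into m parts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_sse(X, k, m=5):
    """m-fold cross-validated SSE for a given number of clusters k.

    For each fold: cluster the other m-1 parts, assign every held-out
    point to its closest centroid, and add up the squared distances.
    A lower average SSE over the m folds means a better fit on unseen data.
    """
    total = 0.0
    for train_idx, test_idx in KFold(n_splits=m, shuffle=True,
                                     random_state=0).split(X):
        model = KMeans(n_clusters=k, n_init=10).fit(X[train_idx])
        # Squared distances from each test point to every centroid ...
        d2 = ((X[test_idx, None, :] -
               model.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
        # ... keep only the distance to the closest centroid, and sum.
        total += d2.min(axis=1).sum()
    return total / m

# Compare k values and pick the one with the lowest held-out SSE.
# X = ...  # your data matrix
# best_k = min(range(2, 11), key=lambda k: cv_sse(X, k))
```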