Welcome back. This is Huan Liu, a Professor of Computer Science and Engineering. We are going to discuss Unsupervised Learning Evaluation. So, let's assume we have many unsupervised learning algorithms. We would like to compare them. So, how can we do that? Or even for one algorithm, if I apply the algorithm to different applications, which application is more suitable for this algorithm? So, we have to answer these questions. Now, the simplest case is this. I already know I have two clusters. The red one, red cross should be in one cluster, and the blue plus should belong to another. So, we have ground truth, but that defeats the purpose. If we already know the two clusters, why should we apply any clustering algorithm? Only under certain circumstances, we really want to verify if this clustering is suitable or not. For example, I have some data sets. I know I have ground truth, but these data sets are similar to those data sets I'm not familiar with. So, I can always try my clustering algorithms on the data sets I have ground truth. Only in this case, it's useful. So, to evaluate is not easy. Why? Because in this case, I have five crosses and a one plus, but I cannot have two pluses and four crosses. So, many combinations, which one's the best? When we have ground truth, actually it's very simple. We can use the accuracy to keep it. In this case, for example, one plus is wrongly classified and two crosses are wrongly clustered in the cluster two. So, I have one plus two divided by six plus eight, equals to three over 14. This is the error rate. But then I also have five plus six divided by six plus eight, 11 over 14. So, you see all together should be 14, add them together. Or in the way, 11 plus 14 should be what? This is the error rate. This is the accuracy. So, as we said, if we already have, we know this kind of thing, already know the ground truth. So, what's the use of clustering? That means we really need to evaluate the cases where ground truth is not available. What can we do when the ground truth is not available? So, we have to go back to our first principles. What is the first principle in clustering case? We define cohesiveness and separateness, these two measures. One is all cohesiveness means all the instances within one cluster should be closer to each other, and separateness means the instances from different clusters should be as much as separate as possible. So, that's why these are two issues. One is more cohesiveness. You have to be close within a cluster instances, within a cluster should be close to each other, and instances that are not in the same cluster should be far away from each other. Cohesiveness. How do we calculate this? Being close to the centroid of the cluster. This is the one way. So, now, let's look at this. This is the cluster, one cluster. This is another cluster. I only have four data points, one, two, one, two, but this is the second cluster, this is the first cluster, first cluster, second cluster. So, for the whole data set, the centroid is zero when we get a very simple case here for calculation. Now, for this cluster, cluster one, the centroid is this one plus this one divided by two. So, that's why it's minus 7.5. Likewise, this centroid is positive 7.5. So, how do we calculate the cohesiveness? So, we calculate the cohesiveness this way. How far is it away from the centroid? So, that's why you see minus 10 minus negative 7.5. Then, again, we need to square it. This is negative five minus negative 7.5 squared. We do the same here. So, it's five minus 7.5 squared and 10 minus 7.5 squared. So, you calculate this, and it becomes 25. Now, we discuss separateness. So, clusters centroids being far from the mean of entire data set. This is one way to calculate. So, now, for this data set, we have this centroid. For this data set, we have this centroid, cluster one, cluster two. Each cluster has its centroid. So, separateness is, I just need to calculate this centroid with this. The entire data is mean and this centroid. With this entire data as me. So, this negative 7.25 minus 0 squared plus 7.5 minus 0 squared. Separateness should be the larger the better. Okay. Because that means the centroids are moving far away from each other. Now, we discuss another measure, it's called silhouette index. So, we are interested in clusters that are both cohesive and separate. So, silhouette index, it compares average distance value between instances in the same cluster to average distance value between instances in the same cluster. This is very much similar to the one we already talked about. But in the well clustered data set, average distance between instances in the same cluster should be small, and the average distance between instances in different clusters should be large. So, we still go with cohesiveness and separateness. How do we do that? So, for any instance x, that is a member of cluster C any instance x. So, we try to find members in the same cluster, but it's not x, we calculate their distance, we add them together, and divide by the number of total number of instances in the cluster C minus one, this is called cohesion. We computed the within cluster average distance, this is the within-cluster for each given cluster. Then computed the average distance between x and C, and instance is in clust G, that's another cluster, okay? G is the closest to 2x in terms of the average distance between x and C, and members of G. Now, why we need to be careful about to this closest to x, because we cannot just assume we only have two clusters, C and G. What if we have more than two clusters? In this such a case, we only need to consider G and C. It's a very smart way to avoid unnecessary calculation. So, now what do we do is this, for x, so G is not equal to C, but at closest to x, then for every instance in G, we calculate the distance. Then we sum them up, this minimum is we find that the closest to G. So, now we have this ax and bx. So, we'll use an example to show you how to calculate this. For silhouette index, our interest is clustering's where a of x should be smaller than b of x. So, basically, this one says instances within the cluster, the value should be smaller than those they are not in the same cluster. Because b calculates the average distance with those are not in C. A is about within cluster C, and B is about C and G. So, silhouette, can take values between one and -101. The best case happens when all x, for all x this equals to 0, and this is greater than, this is just the same as here. This is our goal. So, that's why we have this S for every x because I have many data points. So, I have n data points. So, for every data point, I have to calculate to this, b of x minus a of x, divided by max of this b of x and choose the bigger one in the denominator. So, then we add them altogether. For each instance, we should have this S value, we add them altogether divided by number of total number of instances, that's the silhouette value. Now, let's look at the example, for your convenience I include a of x here definition, and b of x also here, then S here and the silhouette here, then let us see how we calculate. We have four data points, so x11 is here, x11, this is one set, this is second set calculation. Now, let's look at this. So, this one for x11 minus because we only have two data points, right? So, negative 10 minus negative five and squared is 25. For B x11, how do we calculate? That's what we do, B we have to divide the number of data points in G, right? This two, so now that's what we do, negative 10 minus five, then this is the distance squared -10 minus 10 n squared, this is the value, but now for the silhouette what do we do is silhouette is b of x minus a of x, b of x minus a of x, this one minus this divided by the largest amount between the two, so we have this, so the value and 0.29. Now, let's look at one more example. So, let's say we pick this value, x22. We want to work on this x22. So, now, look at this, because we only have one value, so there is no such a thing here, so it's 10 minus five, 10 minus five, 10 minus five here, 10 minus five squared 25. Now, for bx22, we have to calculate with the G, this is this case. So, likewise we have to do is one, two. So, 10 minus negative five. This should be here to next to five, and 10 minus five. These two cases, one and two, two. So, we have 10 minus five, 10 minus negative five,10 minus negative 10, 10 minus negative five, 10 minus negative 10. So, we have these values, and then we calculate we have this decimal. It's a symmetrical so we know these calculations are correct. Then likewise we calculate x 21, x 12, so we have these values, then we add them up, divide it by how many numbers of various instances do we have, we have n equals four, so we divide it by four. So, we can see the value should be somewhere between 0.92 and 0.84. Now, we have a silhouette index to measure the clustering result. That's the end of clustering evaluation or the evaluation of unsupervised learning or both. So, this method can be applied to any clustering algorithms, not necessarily just for k-means. Thank you.