The first specific clustering algorithm that I will discuss is K-means clustering. This is a very popular, commonly applied clustering method due to its simplicity. One of the characteristics of this approach is that we start by specifying the number, K, of clusters that we want to identify. The formulation of this algorithm is to find the partitioning of the objects that minimizes the total squared Euclidean distance between the objects and the centers of the clusters they belong to. Hopefully I can make that a little clearer with a specific example.

The algorithm is an iterative algorithm, which means it repeats a set of steps until some notion of convergence has occurred. The algorithm is initiated by first choosing K, the parameter that defines the number of clusters we expect, and placing K random cluster centers. Then each of our objects is assigned to the cluster center it is closest to, the one it is most similar to. Then we recompute the cluster centers: we shift each one by taking every object associated with it and averaging them, and this average gives the new position of that cluster center. We then add up the distances of all objects to their cluster centers. If this total falls below a particular threshold, we stop; otherwise we repeat the whole process. Once we stop, each object is assigned to its closest cluster center, and this defines our map from object to cluster.

Now I can make this a little clearer with some images. Let's go back to our point objects from the beginning. Here is our two-dimensional space, and we have our 50 points; these are our objects. Let's say we have some reason to believe that we may have three clusters. We initiate the K-means algorithm by randomly assigning three cluster centers. A very common way to do this is to randomly choose three of your objects, and these are indicated as large red dots. So these are our cluster centers. The next step in the K-means algorithm is to assign each object to its closest cluster center; I've indicated a few of these assignments here with arrows. So each object becomes associated with a cluster center. Then all the objects that are associated with the same cluster center are averaged to get a new point, and this point becomes the new cluster center. So each cluster center is updated; it shifts to a new position. And we can repeat the whole process: we again assign each object to its closest cluster center, and so on. We do that until there is some notion of convergence, that is, the cluster centers stop shifting very much and the total distance of objects to their cluster centers falls below a reasonable threshold. Then, at the very end, every object is associated with its closest cluster center, and all the objects associated with the same cluster center define the K-means notion of a cluster.
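To make those steps concrete, here is a minimal sketch of the K-means loop in Python with NumPy. The function name, the tolerance tol, the iteration cap and the random seed are illustrative choices of mine rather than anything fixed by the algorithm itself, and it assumes the objects are the rows of a numeric array.

```python
import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    """Sketch of K-means: X is an (n_objects, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialise the K cluster centers by randomly choosing K of the objects.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    prev_total = np.inf
    for _ in range(max_iter):
        # Assign each object to its closest cluster center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the average of the objects associated with it
        # (keeping the old center if a cluster happens to end up empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
        # Add up the distances of all objects to their centers; stop once this
        # total stops improving by more than the threshold.
        total = dists[np.arange(len(X)), labels].sum()
        if prev_total - total < tol:
            break
        prev_total = total
    return labels, centers

# For example: 50 two-dimensional points, clustered into K = 3 groups.
X = np.random.default_rng(1).normal(size=(50, 2))
labels, centers = kmeans(X, k=3)
```

Initialising the centers by picking K of the objects at random mirrors the "choose three of your objects" initialisation described above; other initialisation schemes are possible but not covered here.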
Now I'm going to describe a clustering algorithm from the second class of clustering algorithms: hierarchical clustering. This stands in contrast to K-means clustering in that we don't initially prescribe a number of clusters. Instead, we attempt to discover a whole hierarchical organization of clusters, in which each level of the hierarchy is composed of a grouping of the clusters at lower levels of the hierarchy.

In order for this algorithm to work, we need some notion of the dissimilarity between groupings of objects at each level of the hierarchy, and this dissimilarity measure is referred to as the linkage. Agglomerative approaches to hierarchical clustering start with each individual object in its own group, and then proceed to join the groups together until everything is joined into one unit. The result of this whole process is a hierarchical tree of groupings of clusters.

Here I'll go into a little more detail about what I mean by the group dissimilarity, or linkage. At each level of the hierarchical clustering algorithm, we have objects collected into groups or clusters. In order to agglomerate these further at the next level, we need some notion of the similarity or dissimilarity between these groups. There are a number of possible ways that this can be defined; for concreteness, I'll give you three very simple definitions. The first is single linkage, which simply takes the minimum distance between a pair of objects, one from each group, and this becomes the measure of the dissimilarity between the groups. The second, complete linkage, takes the maximum distance between pairs of objects, one from each group, and this maximum distance becomes the measure of the dissimilarity between the groups. An alternative, average linkage, takes the mean dissimilarity between the objects of the two groups, and again this becomes the measure of the dissimilarity between the groups.

So with the linkage under our belt, we can set about the hierarchical agglomerative clustering algorithm. We start by setting a counter, let's call it i, to 1, and we assign each object to its own group. Then we merge the two groups that have the smallest dissimilarity between them and increment our counter. If our counter is equal to N, the number of objects that we have, then we stop; otherwise, we go back to step two: having recomputed the group dissimilarities, we again merge the two groups that have the smallest dissimilarity, and we repeat the whole process over and over again until everything is merged together.

Here I can give you a visual example of this process. On the right I have a selection of our points from earlier, our point objects x1, x2, x3 and x4. At step 0, each object forms its own group. At the next increment of the hierarchical clustering algorithm, we merge the two objects that are closest to each other: x1 and x3, the objects that are least dissimilar. You'll notice here that we've swapped the order, so x1 and x3 sit next to each other, and I've indicated their grouping with this bracket. Now, we haven't grouped all of our objects together yet, so we continue the iteration and recompute the dissimilarities. Our objects now are x2, x4, and the group of x1 and x3. Now, x2 and x4 are much closer to each other than either is to that group, so the next step is to join x2 and x4, and that's indicated here with this bracket. Again, everything isn't joined together yet, so we carry on the iterative process and join the two remaining groups, which is indicated with this bracket. So this is a tree structure that describes the clustering of our objects, and this is the result of hierarchical clustering.
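To make the agglomerative procedure concrete, here is a minimal sketch in Python with NumPy. It assumes plain Euclidean distance between the point objects and uses single linkage by default; the function names and the way the merges are recorded are my own illustrative choices, not part of any standard interface.

```python
import numpy as np

def single_linkage(d, group_a, group_b):
    # Single linkage: minimum distance between a pair of objects, one from each group.
    return min(d[i, j] for i in group_a for j in group_b)

def agglomerative(X, linkage=single_linkage):
    """Sketch of agglomerative hierarchical clustering; returns the sequence of merges."""
    # Pairwise Euclidean distances between all objects.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Step 0: each object starts in its own group.
    groups = [[i] for i in range(len(X))]
    merges = []
    while len(groups) > 1:
        # Find the pair of groups with the smallest dissimilarity (linkage).
        a, b = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda pair: linkage(d, groups[pair[0]], groups[pair[1]]),
        )
        # Merge the two groups and repeat until everything is joined together.
        merged = groups[a] + groups[b]
        merges.append((groups[a], groups[b]))
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return merges

# For example: four two-dimensional points, as in the x1..x4 illustration.
points = np.random.default_rng(2).normal(size=(4, 2))
print(agglomerative(points))
```

In practice one would typically reach for a library routine such as SciPy's scipy.cluster.hierarchy.linkage, which builds the same kind of tree far more efficiently; the sketch above is only meant to mirror the counter-and-merge loop described in the lecture.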
So having performed a cluster analysis on our data, we might be interested in gauging the quality of the clusters that we found. There are many different clustering approaches that we can apply, and different notions of dissimilarity, and we might be interested in gauging the quality of each of these as well. There are two broad categories of approaches to this cluster validation. The first is internal, which bases the judgement completely on the data at hand. The second is external, in that it brings in prior knowledge of the objects in order to gauge the quality of the clustering.

I'm going to give you an example of an internal cluster validation technique called the silhouette index. In the silhouette index, the quality of a cluster is deemed to be high if it is both compact and isolated. The compactness is defined, for each datum, for each object i, as the value a sub i. This is simply the average dissimilarity of that object to every other member of the cluster to which it belongs. If this value is small, that indicates that the cluster is compact. The isolation, in terms of the same datum i, is given by b sub i, which is the minimum mean dissimilarity: for this datum, we calculate the mean dissimilarity to the objects of each of the other clusters, and we take the minimum of those mean dissimilarities. If this value is large, that indicates that the cluster is very isolated from the other clusters. Then the silhouette index value for a given datum is given by the difference between the isolation b sub i and the compactness a sub i, scaled by the maximum of the two values. What this gives us, for each datum, for each object, is a number between -1 and 1. A negative value indicates that the object has probably been assigned to the wrong cluster, but a value close to 1 indicates that the object is a member of a strong cluster. So a cluster whose objects have silhouette index values close to 1 is a very good, strong cluster, and this is one way to measure the quality of a cluster.

Here's an example of how one might use the silhouette index. Let's say in this case we have 16 clusters, and for each cluster we plot the mean silhouette index value of the objects within that cluster. We might, for example, apply a threshold and only accept as real, high-quality clusters those that have a mean silhouette index value above that threshold. In this way, we can narrow down these 16 clusters to four high-quality, compact and isolated clusters.
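As a rough illustration of the calculation, here is a minimal silhouette sketch in Python with NumPy. It assumes Euclidean dissimilarity, labels produced by some earlier clustering, and that every cluster has at least two members; the function name and the per-cluster averaging at the end are illustrative choices of mine.

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette value s_i = (b_i - a_i) / max(a_i, b_i) for each object i."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        # a_i: compactness, the mean dissimilarity to the other members of i's own cluster.
        a_i = d[i, own].mean()
        # b_i: isolation, the smallest mean dissimilarity to the members of any other cluster.
        b_i = min(d[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s

# For example: 50 two-dimensional points with an arbitrary 3-way labelling.
X = np.random.default_rng(3).normal(size=(50, 2))
labels = np.arange(50) % 3
s = silhouette(X, labels)
# Mean silhouette value per cluster; thresholding these means is one way to
# keep only the compact, isolated clusters, as in the 16-cluster example above.
cluster_means = {c: s[labels == c].mean() for c in set(labels)}
```

If a library routine is preferred, scikit-learn's sklearn.metrics.silhouette_samples computes the same kind of per-object values, and silhouette_score their overall mean.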