In this lecture we'll venture into the topic area of unsupervised learning and talk about two different unsupervised learning or clustering methods: k-means clustering and different types of hierarchical clustering. In unsupervised learning, in contrast to supervised learning, no outcome, target, or dependent variable is present; we just have a collection of data points. The goal is to explore and understand that data, finding patterns or groups rather than trying to make explicit predictions against labels. There are a number of different types of unsupervised learning, including clustering, which attempts to find similarity or structure among groups of data points; principal components analysis, which performs dimensionality reduction and also has applications to clustering; and association rules, which attempt to find associations between groups or pairs of data points in a dataset.

Let's take a look at a taxonomy of unsupervised learning. Two broad types of unsupervised learning are dimensionality reduction and clustering. There are different ways to perform dimensionality reduction; two popular ones are non-negative matrix factorization and principal components analysis. Principal components analysis is a widely used technique, and non-negative matrix factorization is a related technique that performs dimensionality reduction under the assumption that all elements of the matrix are non-negative. There are different variants of principal components analysis, or PCA, including kernel PCA and spectral clustering, which is an application of PCA to clustering. There are also different types of clustering, including hierarchical clustering, and within hierarchical clustering there is single linkage clustering, complete linkage clustering, and average linkage clustering. There's also a type of hierarchical clustering called centroid clustering that is sometimes used in genomics. Hierarchical clustering tries to take data points and form groups or clusters by pulling points together into groups hierarchically. Another way of performing clustering is called partitional clustering, which essentially performs the opposite type of operation: taking an entire dataset and trying to divide it, or characterize it, in terms of smaller groups. There are different types of partitional clustering. One type is based on distributions or mixtures. Another tries to select modes or means and find the centers of groups of data points. A third tries to identify clusters based on high-density areas within a multidimensional space. Finally, graph-based clustering tries to represent the data in terms of a graph and forms clusters based on tightly connected components within that graph. Distribution- or mixture-based clustering includes expectation-maximization, or EM, and Gaussian mixture models; EM is in fact a generalization of something we'll look at in this lecture called k-means. Generally we can think of k-means as more of a mode-seeking type of clustering, but as I mentioned it does have a generalization in expectation-maximization. There are different density-based clustering algorithms; a popular one is called mean shift, and another one that we'll look at in a subsequent lecture is called DBSCAN. As I mentioned, these density-based clustering algorithms try to identify groups of points that are densely connected to one another and may not be easy to separate with clean, linear, or even convex boundaries.
K-means clustering proceeds in a sequence of steps. First, one has to choose the value of k. Based on that value, one chooses k data points, or seeds, to be the initial centroids or cluster centers. Importantly, you have to know the value of k to get started. Once one has chosen the locations of these initial centroids, each data point can be assigned to the closest centroid based on some distance or similarity measure. Finally, after each point is assigned to a centroid, we can re-compute the centroids and move them based on the current membership of each data point in a cluster. We then repeat the assignment and re-computation steps (steps 2 and 3) until the clusters have converged.

Let's take a look at how this works with a simple example. Step 1 is picking the k cluster centers at random. Step 2 assigns each point to the closest cluster, and we can see here that we've assigned each of these diamond points to either green, red, or orange. Then we compute the centroid locations based on this current data point membership and move each of those cluster centers, or centroids, to the new center of each cluster. Next we reassign each of the points to a new cluster center if the closest center has changed. Here you can see at this step that a few points have been reassigned to new clusters. We then re-compute the cluster means, move those means again, and repeat until we've converged. K-means will stop, or converge, when there are no (or only a minimal number of) re-assignments of data points to different clusters, when there is no change in the centroids, or when the decrease in the sum of squared error after each iteration falls below some minimum. The sum of squared error is computed by summing, over all data points, the squared distance between each point and its assigned centroid.

Now k-means can fail in a number of different scenarios. Consider clusters that might look like this, where we have a red cluster and a blue cluster that surrounds it. Intuitively, with our eyes, we can see how we might want to cluster these points, but k-means isn't going to be able to divide them cleanly. Similarly, suppose we have points along a line in one group and a couple of points off that line in a different group. Again, intuitively we can see how we might like to divide those points into different clusters, but in trying to find two centroids, k-means might split the points in a way that intuitively is not how we'd like to divide them. K-means has some advantages and disadvantages. On the one hand it's simple and efficient, but on the other hand you have to know the value of k a priori, it's easy to get stuck in local minima, it's sensitive to outliers, and it requires features that have continuous properties. It's also only able to group points that are close together in the feature space; we've seen some specific example failure scenarios already in that regard.

Now it's worth noting that whereas k-means deterministically assigns each point to a cluster, there is a probabilistic version of it as well called expectation-maximization. This is an iterative method for learning a probabilistic clustering from unsupervised data: initially we assume a random assignment of examples to categories, learn an initial probabilistic model by estimating the model parameters from this randomly labeled data, and then iterate the following two steps until convergence.
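Before turning to the details of those two steps, here is a minimal sketch of the basic k-means procedure described above, written in Python with NumPy. The function name, the random seeding, and the convergence tolerance are illustrative choices rather than anything prescribed in the lecture, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: random seeds, nearest-centroid assignment, centroid updates."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random to serve as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of the points assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the sum of squared error barely decreases between iterations.
        sse = ((X - centroids[labels]) ** 2).sum()
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centroids, sse
```

In practice one would more likely reach for a library implementation such as scikit-learn's KMeans, which also runs multiple random restarts to reduce the chance of getting stuck in a poor local minimum.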
In the first step, the expectation step, we compute, for each example, the probability of that data point belonging to each cluster given the current set of probability distributions for the clusters, and then probabilistically re-label each example based on those estimates. After that re-labeling, in the maximization step, we re-estimate the model parameters, which might include, for example, a mean and standard deviation, from the probabilistically re-labeled data. Then we repeat the expectation step, and so forth, just as we did in k-means. In k-means we can think of the expectation step as assigning points to centroids and the maximization step as re-computing the centroids based on those assignments.

Another type of clustering is called hierarchical clustering. In contrast to k-means clustering, with hierarchical clustering there's no need to know the number of clusters k in advance, but you do need to tell the clustering algorithm when to terminate. Hierarchical clustering is often represented using a data structure called a dendrogram. All the points in the dataset are represented at the bottom as leaves, and closely related points, based on a predefined distance metric, are grouped together as shown. By adjusting the height of the horizontal line where we make this particular cut, we can control the number of clusters that result from a hierarchical clustering algorithm. If we drew that dotted line all the way at the bottom, we'd have n clusters with one point in each cluster, and if we drew it all the way at the top we'd have one giant cluster with all the points. Varying where that cut occurs in the dendrogram controls how many clusters we have and which points are assigned to each cluster. This is a specific type of hierarchical clustering called agglomerative, or bottom-up, clustering, where you start with each data point as its own cluster, merge the closest pairs of clusters, and eventually, as you get to the top, end up with all points in the same cluster; you just need to figure out when to stop. There is a less popular form of hierarchical clustering that starts at the top and works down, where all data points begin in the same cluster and eventually each point forms a cluster of its own. One of the properties of hierarchical clustering is that the clusters that result from cutting the dendrogram at any given height are nested within the clusters obtained by cutting at a greater height.

Now, not all data necessarily clusters in a hierarchical fashion. If we consider a group of people, male and female, with three nationalities, then if you wanted two clusters the best split might be based on gender, whereas if you wanted three clusters the best split might be based on nationality. In this scenario, a clustering algorithm like k-means might actually produce a better set of clusters than a hierarchical clustering approach.

There are different types of hierarchical clustering algorithms, and generally clustering decisions are based on a notion called linkage, which is the dissimilarity between two groups of observations. We can compute the dissimilarity between different clusters in a number of different ways, as illustrated in the sketch below. Complete linkage computes the maximum of the pairwise dissimilarities between any pair of data points in the two clusters. Single linkage computes the minimum of the pairwise dissimilarities between any pair of data points in the two clusters. Average linkage takes the average of the pairwise dissimilarities between all pairs of data points in the two clusters. Centroid linkage computes the dissimilarity between the centroids of the two clusters; this is less common, but it is sometimes used in genomics.
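As a concrete illustration of the dendrogram and these four linkage criteria, here is a short sketch using SciPy's hierarchical clustering routines. The synthetic two-dimensional data and the choice of cutting the tree into three clusters are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from three loose groups (illustrative only).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

# Agglomerative (bottom-up) clustering under the four linkage criteria.
Z_single   = linkage(X, method="single")    # minimum pairwise dissimilarity
Z_complete = linkage(X, method="complete")  # maximum pairwise dissimilarity
Z_average  = linkage(X, method="average")   # mean pairwise dissimilarity
Z_centroid = linkage(X, method="centroid")  # dissimilarity between cluster centroids

# "Cutting" the dendrogram: request three clusters from the average-linkage tree.
labels = fcluster(Z_average, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z_average) would draw the tree itself (plotting requires matplotlib).
```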
Clustering methods are a little bit more difficult to evaluate than supervised learning techniques because there's no ground truth or labels that we can compare against; we're really just comparing the data against itself and trying to figure out how good each of those clusters is. There are two types of evaluation that are commonly used. One is internal evaluation, where the clusters are compared against themselves; in an internal evaluation, a good clustering method generally produces clusters in which the points within a cluster are more similar to each other than they are to points in other clusters. The other is comparison against independent or known values: if you have known values associated with your data, you can see how your data clusters with respect to some aspect of the data that you know about, maybe geography or income or some other feature, and see whether the clusters that the algorithm produces make sense. But there's no generally accepted technique for the best way to evaluate a clustering algorithm; ultimately you know it's working if it yields some structure or pattern that helps you better understand the nature of your data. (A small sketch of one common internal measure appears at the end of this section.)

In summary, in contrast to supervised learning, in unsupervised learning we only have the feature set and no labels. The purpose of unsupervised learning is to discover structure or patterns in that data. Applications of unsupervised learning include clustering, which we looked at in this lecture, as well as dimensionality reduction, which we'll look at in the future. Generally, unsupervised learning is less well understood than supervised learning in terms of both methods and evaluation, but a wide variety of methods exist and are applied in a very broad range of application domains.
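As promised above, here is a small sketch of one common internal evaluation measure, the silhouette score, which rewards clusterings whose points are closer to their own cluster than to other clusters. The silhouette score is not named in the lecture; it, the synthetic data, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from three loose groups (illustrative only).
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(40, 2))
               for c in ([0, 0], [5, 0], [2, 4])])

# Internal evaluation: points should be more similar to their own cluster than to others.
# Higher silhouette scores indicate tighter, better-separated clusters.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```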