In the last session of this lecture, we are going to introduce probabilistic hierarchical clustering. What is probabilistic hierarchical clustering? Let's first examine the problems of algorithmic hierarchical clustering. The first problem is that it is nontrivial to choose a good distance measure. For example, we have already seen the single link, the complete link, the average link, and the centroid link. They are all different measures, but it is hard to choose a good one based on the application problem. The second problem is that it is hard to handle missing attribute values, because when an attribute value is missing, it may hurt the quality of the clustering. The third problem is that the optimization goal of algorithmic hierarchical clustering is not so clear, because it is a heuristic algorithm that relies mostly on local search.

Now, there is another approach called probabilistic hierarchical clustering. This method essentially uses probabilistic models to measure the distance between clusters. It is largely a generative model, which means it regards the set of data objects to be clustered as a sample from the underlying data generation mechanism to be analyzed, and then tries to find the parameters, that is, to find out how this data set was generated. It is easy to understand, it has the same efficiency as the algorithmic agglomerative clustering method, and it can handle partially observed data. In practice, we can assume the generative model adopts common distribution functions, for example a Gaussian distribution or a Bernoulli distribution, governed by a set of parameters.

Let's look at the generative model. Given a set of one-dimensional points X, we can assume they are generated by a Gaussian distribution. That means, with different mu and sigma, you generate different distributions according to the usual normal, or Gaussian, density formula. Then the probability that a single point x_i in X is generated by the model, that is, the probability of x_i under the condition of mu and sigma squared, P(x_i | mu, sigma^2), is given by that formula. The likelihood that the whole set X is generated, that is, the likelihood for the normal distribution to generate this set of points, is the probability of X under the condition of mu and sigma squared: multiplying these probabilities together for i from 1 to N, P(X | mu, sigma^2) = P(x_1 | mu, sigma^2) x ... x P(x_N | mu, sigma^2), gives the probability of generating the data set X. The task of learning this generative model is to find the parameters, mu and sigma, for this particular distribution that maximize the likelihood that this set X was generated.

The Gaussian distribution is familiar to most people. For example, in a bean machine, if you drop many, many balls through the pins, you will find that the resulting distribution is a Gaussian distribution. So if we have many points, we can assume these points were actually generated by a Gaussian distribution like this. For a one-dimensional Gaussian, different means and different variances generate different distributions, as you can probably see. For a two-dimensional Gaussian, with the density rendered as a surface in 3-D space, you can clearly see the shape of the 2-D Gaussian.
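To make this concrete, here is a minimal sketch in Python (using numpy) of the one-dimensional Gaussian generative model just described: the density P(x_i | mu, sigma^2), the log of the product likelihood over the whole set X, and the maximum-likelihood fit of mu and sigma^2. The function names and the synthetic data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, i.e. P(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def log_likelihood(points, mu, sigma2):
    """Log of prod_i P(x_i | mu, sigma^2); working in logs avoids underflow."""
    return np.sum(np.log(gaussian_pdf(points, mu, sigma2)))

def fit_gaussian(points):
    """Maximum-likelihood estimates of mu and sigma^2 for a 1-D sample."""
    mu = np.mean(points)
    sigma2 = np.var(points)  # ML estimate divides by N, not N - 1
    return mu, sigma2

# Illustrative example: points drawn from N(2, 0.5^2); the fit should recover mu near 2.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)
mu_hat, sigma2_hat = fit_gaussian(x)
print(mu_hat, sigma2_hat, log_likelihood(x, mu_hat, sigma2_hat))
```

The log-likelihood is used here only for numerical convenience; maximizing it is equivalent to maximizing the product of the individual probabilities described above.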
Then a probabilistic hierarchical clustering algorithm works as follows. Suppose we partition the whole set of points into m clusters, C_1 to C_m. The quality of the clustering can be measured by a quality function that is the product of the probabilities of all the clusters, P(C_1) x ... x P(C_m). Then, if we want to merge two clusters, C_j1 and C_j2, the change in the quality of the overall clustering is the criterion. You can see that the merged cluster is the newly generated cluster, and the original two clusters disappear from the original set of clusters. So you compare the new clustering against the original clustering; note that when you merge two clusters, the sum of squared errors generally increases, which is why you try to minimize that increase. That is, you take the product of the probabilities P(C_i) of the new clustering, which now includes the probability of the two clusters merged together, and subtract the product of the original probabilities, for i from 1 to m. The distance between clusters C_1 and C_2 then compares the probability of C_1 and C_2 together as one cluster with the probabilities of the two independent clusters: dist(C_1, C_2) = -log( P(C_1 and C_2 merged) / (P(C_1) P(C_2)) ). If this distance is less than zero, we should merge these two clusters. So that's the idea, or the essence, of the probabilistic hierarchical clustering algorithm.
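Below is a minimal sketch of this agglomerative procedure in Python, assuming one-dimensional points and a Gaussian model per cluster, with P(C) taken as the maximized likelihood of the cluster's points. Two simplifications are my own assumptions, not part of the lecture: the variance is floored at a constant so that singleton clusters have a well-defined likelihood, and the loop merges the closest pair (smallest probabilistic distance) until a target number of clusters k remains, rather than applying the "merge while the distance is negative" stopping rule directly.

```python
import numpy as np

SIGMA2_MIN = 0.25  # variance floor so singleton clusters are well defined (illustrative choice)

def cluster_log_p(points):
    """log P(C): log-likelihood of the cluster's points under its ML Gaussian fit."""
    mu = np.mean(points)
    sigma2 = max(np.var(points), SIGMA2_MIN)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (points - mu) ** 2 / (2 * sigma2))

def dist(c1, c2):
    """dist(C1, C2) = -log( P(C1 u C2) / (P(C1) * P(C2)) ), computed in log space."""
    merged = np.concatenate([c1, c2])
    return -(cluster_log_p(merged) - cluster_log_p(c1) - cluster_log_p(c2))

def probabilistic_agglomerative(points, k):
    """Greedy agglomeration: repeatedly merge the pair of clusters with the
    smallest probabilistic distance until only k clusters remain."""
    clusters = [np.array([p], dtype=float) for p in points]
    while len(clusters) > k:
        _, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merged = np.concatenate([clusters[i], clusters[j]])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Two well-separated groups of 1-D points; we expect them to come out as the two clusters.
data = [1.0, 1.1, 1.2, 8.0, 8.1]
for c in probabilistic_agglomerative(data, k=2):
    print(sorted(c.tolist()))
```

Working with log-probabilities keeps the product of many small densities from underflowing, and the same dist function could equally be evaluated against the lecture's criterion by checking its sign for a candidate merge.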