0:00


Hi, in this session we are going to provide a brief overview of typical clustering methodologies.

There are many clustering methods.

They can be roughly categorized into a few clustering methodologies.

The first category is called distance-based methods.

Essentially there are two kinds of methods: one is called partitioning algorithms, the other we call hierarchical algorithms.

Partitioning algorithms essentially partition the data in a high-dimensional space into multiple clusters.

Typical methods include K-Means, K-Medoids, and K-Medians, which we are going to introduce.
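As a concrete sketch of the partitioning idea (the lecture gives no code; the 1-D data and function name here are made up for illustration), K-Means alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points:

```python
# A minimal 1-D K-Means sketch (Lloyd's algorithm). Real data is usually
# high-dimensional, and in practice a library implementation would be used.
def kmeans_1d(points, centers, n_iter=10):
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = {c: [] for c in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        centers = [sum(pts) / len(pts) if pts else centers[c]
                   for c, pts in clusters.items()]
    return centers

# Two obvious groups, around 1 and around 10; centers converge near 1.0 and 10.0.
centers = sorted(kmeans_1d([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], [0.0, 5.0]))
```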

For hierarchical algorithms, we can either proceed agglomeratively, also called bottom-up: merging points into small clusters, then merging small clusters into bigger ones, thus forming a hierarchy of clusters.

Or we can use divisive methods: start with all the data in one big cluster, then split it top-down into smaller and smaller clusters.
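The bottom-up (agglomerative) direction can be sketched as follows (an illustrative toy, not from the lecture; it uses single linkage, i.e. the distance between two clusters is their closest pair of points):

```python
# A minimal agglomerative sketch on 1-D points with single linkage:
# start with every point as its own cluster, then repeatedly merge
# the two closest clusters until the desired number remains.
def agglomerative_1d(points, n_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-linkage distance: closest pair across the two clusters.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return sorted(sorted(c) for c in clusters)

print(agglomerative_1d([1, 2, 10, 11, 20], n_clusters=3))
# → [[1, 2], [10, 11], [20]]
```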

Then the second methodology is called density-based methods, and the third is called grid-based methods.

Of course, they have some linkages as well.

Density-based methods essentially look at the data space at a fine granularity: using a certain grid, certain structures, or the k nearest neighbors of each point, we can estimate a density.

Then the dense small regions can be merged into bigger regions.

In this way, we will be able to find clusters, that is, dense regions, of arbitrary shape.
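The density idea can be sketched in the spirit of DBSCAN (the lecture does not name a specific algorithm here; this toy and its parameter names are illustrative assumptions): a point is "dense" if enough neighbors fall within a radius, and touching dense neighborhoods grow into one cluster, so cluster shape is arbitrary.

```python
# A simplified density-based sketch on distinct 1-D points.
# A point is dense if it has at least min_pts neighbors within eps;
# dense neighborhoods that overlap are merged into a single cluster.
def density_cluster_1d(points, eps, min_pts):
    labels = {p: None for p in points}   # None = noise / unassigned
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        if len([q for q in points if abs(q - p) <= eps]) < min_pts:
            continue                     # not dense enough to seed a cluster
        frontier = [p]                   # grow the cluster from this seed
        while frontier:
            q = frontier.pop()
            if labels[q] is None:
                labels[q] = cluster
                near = [r for r in points if abs(r - q) <= eps]
                if len(near) >= min_pts:
                    frontier.extend(near)
        cluster += 1
    return labels

# Two dense regions (around 1 and around 5); 9.0 stays unlabeled as noise.
labels = density_cluster_1d([1.0, 1.2, 1.4, 5.0, 5.2, 5.4, 9.0],
                            eps=0.5, min_pts=2)
```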

Â 2:12

The grid-based method essentially partitions the data space into grid-like structures.

Each cell of the grid stores a summary of the characteristics of the data in the lower-level grid cells.
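A minimal grid-based sketch (an assumed toy illustration, not from the lecture): points are binned into fixed-width cells, each cell keeps only a summary (here just a count), and runs of adjacent dense cells form clusters.

```python
# Bin 1-D points into cells of width cell_width, keep per-cell counts as
# the summary, then merge adjacent dense cells into clusters of cells.
def grid_clusters_1d(points, cell_width, density_threshold):
    counts = {}
    for p in points:
        cell = int(p // cell_width)          # which grid cell the point falls in
        counts[cell] = counts.get(cell, 0) + 1
    dense = sorted(c for c, n in counts.items() if n >= density_threshold)
    clusters, current = [], []
    for c in dense:                          # merge runs of adjacent dense cells
        if current and c == current[-1] + 1:
            current.append(c)
        else:
            if current:
                clusters.append(current)
            current = [c]
    if current:
        clusters.append(current)
    return clusters

print(grid_clusters_1d([0.1, 0.2, 1.1, 1.2, 5.0, 5.1],
                       cell_width=1.0, density_threshold=2))
# → [[0, 1], [5]]
```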

Then the next methodology is called probabilistic and generative models, which means we model the data as coming from a generative process.

We can assume that underneath there are certain distributions, for example, a mixture of Gaussians.

Then we can think of the observed data points as generated from this underlying generative model, or underlying generative mechanism.

We can then estimate the model parameters using the Expectation-Maximization (EM) algorithm, which we are going to introduce.

Based on the available data set, we try to find the maximum-likelihood fit.
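A bare-bones EM sketch for a two-component 1-D Gaussian mixture (an illustrative toy, not the lecture's derivation; variances and mixing weights are held fixed and equal so only the means are estimated): the E-step computes each point's responsibility for component 1, and the M-step re-estimates the means as responsibility-weighted averages.

```python
import math

def em_two_gaussians(xs, mu1, mu2, sigma=1.0, n_iter=20):
    # Unnormalized Gaussian density; the normalizing constant is the
    # same for both components, so it cancels in the responsibilities.
    def pdf(x, mu):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1.
        r = [pdf(x, mu1) / (pdf(x, mu1) + pdf(x, mu2)) for x in xs]
        # M-step: maximum-likelihood means given the responsibilities.
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu1, mu2

# Data drawn near 1 and near 8; the estimated means converge close to 1.0 and 8.0.
mu1, mu2 = em_two_gaussians([0.8, 1.0, 1.2, 7.8, 8.0, 8.2], mu1=0.0, mu2=5.0)
```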

Â 3:49

High-dimensional clustering methods can be categorized into bottom-up methods, top-down methods, or correlation-based methods.

We will also introduce an interesting method called delta clustering.

Then, for high-dimensional clustering, many methods have been developed based on dimensionality reduction.

Essentially, we can think of the high-dimensional data in a tabular form, containing lots of columns.

When we perform clustering, not only the rows but also the columns can be clustered; when rows and columns are clustered together, we call it co-clustering.

There are several typical methods.

One is Probabilistic Latent Semantic Indexing.

Later, people developed another method called Latent Dirichlet Allocation.

PLSI and LDA are typical topic modeling methods for text data.

Essentially, the text can be clustered into multiple topics.

Each topic is a cluster: each topic is associated with a set of words, which you can think of as the columns or dimensions, and simultaneously with a set of documents, which you can think of as the rows.
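The row/column structure these topic models operate on is a document-term matrix. A tiny made-up example (the documents and words are invented for illustration): rows are documents, columns are words, and entries are word counts.

```python
# Build a document-term matrix: one row per document, one column per word.
docs = ["apple banana apple", "banana fruit", "car road car"]
vocab = sorted({w for d in docs for w in d.split()})
matrix = [[d.split().count(w) for w in vocab] for d in docs]

print(vocab)   # → ['apple', 'banana', 'car', 'fruit', 'road']
print(matrix)  # → [[2, 1, 0, 0, 0], [0, 1, 0, 1, 0], [0, 0, 2, 0, 1]]
```

Topic models such as PLSI and LDA take exactly this kind of matrix and group the rows (documents) and columns (words) into topics simultaneously.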

The second popularly studied method is called Nonnegative Matrix Factorization, NMF.

This is a kind of co-clustering method.

We get a nonnegative matrix because word frequencies in documents are nonnegative: the values in the matrix are zero or more.

We can approximately factorize this nonnegative matrix into two nonnegative lower-rank matrices, typically called U and V, which reduces the number of dimensions.
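One standard way to compute such a factorization X ≈ U · V is the Lee-Seung multiplicative update rule; the lecture does not specify the algorithm, so the sketch below is an illustrative assumption (and the toy matrix is made up):

```python
import numpy as np

# Minimal NMF via Lee-Seung multiplicative updates: X ≈ U @ V with U, V >= 0.
# Not optimized; a library implementation would be used in practice.
def nmf(X, rank, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, rank))
    V = rng.random((rank, m))
    eps = 1e-9                                   # guard against division by zero
    for _ in range(n_iter):
        V *= (U.T @ X) / (U.T @ U @ V + eps)     # update the right factor
        U *= (X @ V.T) / (U @ V @ V.T + eps)     # update the left factor
    return U, V

# A nonnegative matrix with an obvious rank-2 block structure.
X = np.array([[2., 2., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 3., 3.],
              [0., 0., 3., 3.]])
U, V = nmf(X, rank=2)
# U @ V approximately reconstructs X, and both factors stay nonnegative.
```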

Another very interesting method we are going to study is called spectral clustering.

That means we use the spectrum of the similarity matrix of the data to perform dimensionality reduction: the high-dimensional data is reduced to fewer dimensions, and then we perform clustering in that lower-dimensional space.

Those are spectral clustering methods.
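A compact sketch of the idea (an illustrative assumption; the lecture does not fix a particular variant): build a similarity matrix with a Gaussian kernel, form the graph Laplacian L = D - A, and use the eigenvector of the second-smallest eigenvalue (the Fiedler vector) as a 1-D embedding that splits the points into two clusters.

```python
import numpy as np

# Spectral bipartition of 1-D points: similarity matrix -> Laplacian ->
# sign of the Fiedler vector gives the two-way split.
def spectral_bipartition(points, sigma=1.0):
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None] - pts[None, :]
    A = np.exp(-diff ** 2 / (2 * sigma ** 2))   # Gaussian-kernel similarity
    np.fill_diagonal(A, 0.0)
    D = np.diag(A.sum(axis=1))                  # degree matrix
    L = D - A                                   # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    fiedler = vecs[:, 1]                        # second-smallest eigenvector
    return (fiedler > 0).astype(int)            # sign gives the 2-way split

# Points near 1 get one label, points near 4 the other (labels may be swapped).
labels = spectral_bipartition([1.0, 1.2, 1.4, 4.0, 4.2, 4.4])
```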
