[MUSIC] In this video, we are going to discuss a latent variable model for clustering. So what is clustering? Imagine that you own a bank, you have a bunch of customers, and each of them has some income and some debt. You can then represent each of your customers as a point on a two-dimensional plane, and from this data you want to decompose your customers into, say, three different clusters. Why? Well, for example, you may want to find the people who spend money on cars and offer them some promotion, like a car-related loan. This can be useful for retail companies, banks, and similar businesses: finding meaningful subsets of customers to work with. And this is an unsupervised problem, so we don't have any labels, just the raw data, the raw x's. Usually clustering is done in a hard way, so each data point is assigned a color: this data point is orange, so it belongs to the orange cluster, and this one is blue. Sometimes people do soft clustering instead. Rather than assigning each data point to a particular cluster, we assign each data point a probability distribution over clusters. So for example, the orange points at the top of this picture are certainly orange: they have a probability distribution like almost 100% of belonging to the orange cluster and almost 0% of belonging to the rest. But the points on the border between orange and blue are not settled. They may have, for example, a 40% probability of belonging to the blue cluster, 60% to the orange cluster, and 0% to the green one, and we don't know which cluster these points actually belong to. So instead of assigning each data point to a particular cluster, we assume that each data point belongs to every cluster, but with different probabilities. And to build a clustering method with this property, we will treat everything probabilistically. Why would we want that? Well, there are several reasons.
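As a minimal sketch of the hard-versus-soft distinction, here is what the two kinds of assignment look like with scikit-learn's `GaussianMixture` (the GMM mentioned later in the lecture). The synthetic "income vs debt" data and the choice of three components are illustrative assumptions, not from the lecture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the customers: points around three centers.
centers = np.array([[1.0, 1.0], [5.0, 1.0], [3.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard = gmm.predict(X)        # hard clustering: one cluster index per point
soft = gmm.predict_proba(X)  # soft clustering: one distribution per point

print(soft.shape)                          # (300, 3)
print(np.allclose(soft.sum(axis=1), 1.0))  # True: each row is a distribution
```

Points deep inside a cluster get a row like `[0.99, 0.01, 0.00]`, while border points get something closer to `[0.4, 0.6, 0.0]`, exactly the picture described above.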
First of all, we may want to handle missing data naturally, again. And another reason we may want to consider clustering in a probabilistic way is to tune hyperparameters. Usually, when you want to tune hyperparameters, you make a plot like this. You consider a bunch of different values for the hyperparameter, here the number of clusters. So in the previous image we had three clusters, but we can try some different amount, like 4 or 5. And for each of these particular values of the number of clusters, we may train our clustering model, which is called a GMM, and we'll discuss it later in detail. So we plot the training performance here, like on the blue line, and here I'm plotting the log-likelihood, so the higher the better. And we can see that whenever we increase the number of clusters, the performance on the training set actually improves, which is the usual thing with hyperparameters: the more clusters you have, the better the model thinks it is, but it's actually not. For example, if you put one cluster per data point, the model's loss will be optimal, but it's not a meaningful solution to the problem at all. But if you consider the validation performance of your model, it increases as you start to increase the number of clusters, then it stagnates, and then it starts to decrease. And this is the usual picture when tuning hyperparameters: you train a bunch of models and you choose the one that performs best on the validation set. So this works for a probabilistic model of clustering, but it turns out that you can't do this for hard-assignment clustering. Well, at least it's not obvious how to do it. If you train one of the popular hard clustering algorithms, k-means, it will think that the more clusters you have, the better, on both the training and the validation loss. So it doesn't give us any meaningful way to decide which number of clusters we want by judging the performance on the validation set.
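The model-selection procedure described above can be sketched in a few lines: fit a GMM for each candidate number of clusters and compare the held-out log-likelihood (`GaussianMixture.score` returns the mean per-sample log-likelihood, so higher is better). The synthetic three-cluster data and the candidate range are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic data with three well-separated clusters.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(200, 2)) for c in centers])
X_train, X_val = train_test_split(X, test_size=0.3, random_state=1)

val_scores = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=1).fit(X_train)
    val_scores[k] = gmm.score(X_val)  # mean held-out log-likelihood

best_k = max(val_scores, key=val_scores.get)
print(val_scores)
print(best_k)
```

On data like this, the validation log-likelihood jumps sharply up to the true number of clusters and then flattens or dips, which is the shape of the curve from the lecture. Note that k-means' objective (inertia) would keep improving with every extra cluster, so this comparison is only meaningful for the probabilistic model.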
The probabilistic way of dealing with clustering is also not ideal. For example, here we're not sure whether we want 20 clusters, or 60, or 80, but it gives us at least something: you have some boundaries on what a reasonable value of this hyperparameter is. So this was the first reason why we may want to consider a probabilistic approach to clustering. And the second one is that we may want to build a generative model of our data. If we treat everything probabilistically, we can sample new data points from our model of the data. In the case of the customers, that means sampling new points on the two-dimensional plane that look like the points we had in the training set. And if your points are, for example, images of celebrity faces, then sampling new images from the same probability distribution means generating fake celebrity images from scratch. And this is kind of a fun application of building a probabilistic model of data. So to summarize, we want to build a probabilistic model for clustering, and this may help us in two ways: first, it may allow us to tune hyperparameters, and second, it may give us a generative model of the data. So in the next video, we'll build a latent variable model for clustering. [MUSIC]
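As a small sketch of the generative side, a fitted `GaussianMixture` can draw fresh points from the learned density via its `sample` method. The two-component toy data below is an assumption for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Toy training data around two centers.
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(150, 2)) for c in centers])

gmm = GaussianMixture(n_components=2, random_state=2).fit(X)

# Draw 10 fresh "customers"; sample() also returns which component
# each new point was drawn from.
X_new, components = gmm.sample(10)
print(X_new.shape)  # (10, 2)
```

The sampled points land in the same regions as the training data, which is exactly the "new points that look like the old ones" idea from the lecture; for images of faces, the same principle requires far richer generative models than a GMM.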