[MUSIC] In the previous video we decided to solve the clustering problem by building a probabilistic model of our data. But how? How can we model our data probabilistically? Well, we don't know that many distributions so far, but we do know Gaussians, right? In week one we learned how to fit the parameters of a Gaussian distribution to a data set, and we could do that here; it's a reasonable thing to try. But it turns out that a single Gaussian is not the best model for this kind of data. Recall that we decided our data consists of several clusters, or groups, which may be far away from each other. Even in this simple example, we can see that a Gaussian has to model all the data points as one big circle, or maybe an ellipse, and in that case it has to assign high probability to the center of the circle; that is just how Gaussians work. And you can see here that the center of the Gaussian falls into the region between the clusters, where there are not that many data points. This is the problem with modeling this kind of data with one Gaussian: we have to model everything with one big circle, and the Gaussian has to assign probability to the regions between the clusters, even though there are few data points there.

So what else can we do? Are there any better probabilistic models we can use here? Well, if one Gaussian doesn't work, let's use several of them, say three. In this case we are assuming that each data point came from one of the three Gaussians, and each Gaussian explains one cluster of the data points. To put it more formally, the density of each data point equals a weighted sum of three Gaussian densities, where the weights pi are non-negative numbers that sum to 1, so that the whole thing is an actual probability distribution. So the parameters theta here are the weights pi and the three groups of parameters of the three Gaussians: their locations, the mu vectors, and their shapes, the covariance matrices sigma.

Note that if we succeed in fitting this kind of model to our data, we will solve the clustering problem, because for each data point we can now find which Gaussian it came from, and this is exactly the alternative to finding the cluster index. So we may say that all the points that came from one Gaussian are the points of one particular cluster. This model is called the Gaussian mixture model, or GMM for short.

So great, we have found a model we would like to use. What are some positive and negative features of this model, compared to just a plain single Gaussian? Well, obviously a single Gaussian is much less flexible. The Gaussian mixture model allows us to fit our complicated dataset, and it actually turns out that you can approximate almost any probability distribution with a Gaussian mixture model to arbitrarily high accuracy. Of course, as always happens with this kind of theorem, in the worst case you may have to use exponentially many Gaussians, so it's not a very practical theorem, but anyway the Gaussian mixture model is a very powerful and flexible model. The downside is obviously the number of parameters: as you increase the number of Gaussians you use for your data, you increase the number of parameters by the same factor, right? Okay, great. We decided to use this kind of Gaussian mixture model. But how can we fit it? How can we find its parameters: the weights pi, the mean vectors mu, and the covariance matrices sigma?
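To make the "weighted sum of Gaussian densities" concrete, here is a minimal Python sketch (not from the lecture) that evaluates the density of a three-component mixture at a single point; the particular weights, means, and covariances are made up purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters theta = (pi, mu, sigma) for a 3-component GMM in 2D.
weights = np.array([0.5, 0.3, 0.2])             # pi_k: non-negative, sum to 1
means = [np.array([0.0, 0.0]),
         np.array([4.0, 4.0]),
         np.array([-3.0, 5.0])]                 # mu_k: locations of the Gaussians
covs = [np.eye(2), 2.0 * np.eye(2), np.eye(2)]  # sigma_k: shapes (covariance matrices)

def gmm_density(x):
    """p(x | theta) = sum_k pi_k * N(x | mu_k, sigma_k)."""
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([1.0, 1.0])))  # density of one data point under the mixture
```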
Well, the simplest way to fit a probability distribution is to use maximum likelihood estimation, right? We want to find the values of the parameters that maximize the likelihood, which is the density of our data set given the parameters, and we want to maximize this thing with respect to the parameters. As usual in machine learning, we will assume that the data set consists of N data points which are independent given the parameters, which basically means that we can factorize the likelihood: the likelihood equals the product of the likelihoods of the individual objects. One more thing we can do to simplify this expression, to understand how to maximize it, is to substitute the definition of the marginal likelihood of x_i given the parameters, using the definition from a few slides before: each data point's density is a mixture of Gaussian densities.

Okay, so now we have this optimization problem. One more thing we forgot here is that we have some constraints: the weights pi have to be non-negative and sum up to 1, because otherwise it will not be an actual probability distribution. But now it seems like we're good to go. We can use your favorite stochastic optimization algorithm from TensorFlow, like Adam or whatever you would like to use, and optimize this thing to find the optimal parameters, right?

Well, it turns out that we forgot one important set of constraints here. The covariance matrices sigma cannot be arbitrary. Imagine that your optimization algorithm proposes a covariance matrix of all zeros. It just doesn't work; it doesn't define a proper Gaussian distribution, because in the definition of the Gaussian you have to invert this matrix, and you have to compute its determinant and divide by it. So if you have a matrix which is all zeros, you will have lots of problems like division by zero. It's not a good idea to assume that the covariance matrix can be anything. The set of valid covariance matrices is the set of so-called positive semi-definite matrices. Don't worry if you don't know what that is; it's not that important right now. The important part is that it's a really hard constraint to follow, so it's hard to adapt your favorite stochastic gradient descent algorithm to always maintain this property, that the matrices stay positive semi-definite. I don't know how to do it efficiently, so we have a problem here: we don't know how to optimize this thing, at least with stochastic gradient descent.

Well, it turns out that even if you get rid of this constraint, for example by considering a simpler model where all the covariance matrices are diagonal, which means that the ellipsoids corresponding to the Gaussians cannot be rotated and have to be aligned with the axes, then it is much easier to maintain the constraint, and you can actually use some stochastic optimization here. For example, in this experiment I used Adam and tuned its learning rate to optimize this thing, and you can see that it's doing a reasonable job. The blue curve is the performance of Adam: on the x-axis we see epochs, and on the y-axis we see the log-likelihood, which we are trying to maximize. Adam is doing a good job; in about 10 epochs it optimized this thing to something reasonable. And the green line here is the ground truth: I know from which probability distribution I generated this data, so I know the optimal value of the log-likelihood.
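Here is a minimal sketch (not the lecture's actual code) of fitting a diagonal-covariance mixture by maximum likelihood with Adam in TensorFlow. The constraints are handled by reparameterization: a softmax keeps the weights pi non-negative and summing to 1, and an exponential keeps the diagonal variances strictly positive. The toy data, learning rate, and number of steps are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Toy data (made up for illustration): two well-separated 2D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-3, 0], 0.5, size=(200, 2)),
               rng.normal([3, 0], 0.5, size=(200, 2))]).astype(np.float32)

K, D = 2, X.shape[1]

# Unconstrained variables; constraints are enforced by reparameterization.
logits = tf.Variable(tf.zeros(K))            # softmax(logits) = pi_k
mu = tf.Variable(tf.random.normal([K, D]))   # component means
log_var = tf.Variable(tf.zeros([K, D]))      # exp(log_var) = diagonal variances

opt = tf.keras.optimizers.Adam(learning_rate=0.1)

def neg_log_likelihood():
    log_pi = tf.nn.log_softmax(logits)                        # log pi_k
    var = tf.exp(log_var)                                     # diagonal sigma_k
    diff = X[:, None, :] - mu[None, :, :]                     # shape (N, K, D)
    log_norm = -0.5 * tf.reduce_sum(
        diff ** 2 / var + tf.math.log(2.0 * np.pi * var), axis=-1)  # log N(x_i | mu_k, sigma_k)
    # log p(x_i) = logsumexp_k [ log pi_k + log N(x_i | mu_k, sigma_k) ]
    log_px = tf.reduce_logsumexp(log_pi + log_norm, axis=-1)
    return -tf.reduce_mean(log_px)

for step in range(200):
    with tf.GradientTape() as tape:
        loss = neg_log_likelihood()
    grads = tape.gradient(loss, [logits, mu, log_var])
    opt.apply_gradients(zip(grads, [logits, mu, log_var]))

print("weights:", tf.nn.softmax(logits).numpy())
print("means:", mu.numpy())
```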
But it turns out that even in this case, where we don't have these very complicated constraints, you can do much better by exploiting the structure of your problem. This is something we're going to discuss in the rest of the week: the expectation maximization (EM) algorithm. If you apply it here, it works much better; in a few iterations it finds a value which is even better than the ground truth, which probably means some overfitting, but anyway it works well on the test set too.

So, to summarize, we have two reasons not to use stochastic gradient descent here. First, it may be hard to follow some constraints you care about, like keeping the covariance matrices positive semi-definite. Second, the expectation maximization algorithm, which can exploit the structure of your problem, is sometimes much faster and more efficient. As a general summary: we discussed that the Gaussian mixture model is a flexible probability distribution which can solve the clustering problem for you if you fit it to your data. It is sometimes hard to optimize with stochastic gradient descent, but there is an alternative, which we'll talk about in the next video. [MUSIC]
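As a practical footnote (not part of the lecture): once a GMM is fit, clustering really does fall out of it, since each point can be assigned to the Gaussian most likely to have generated it. Below is a minimal sketch using scikit-learn's GaussianMixture, which is fit with the EM algorithm mentioned above; the toy data and settings are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data (made up for illustration): three 2D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.6, size=(100, 2)),
               rng.normal([5, 5], 0.6, size=(100, 2)),
               rng.normal([-4, 6], 0.6, size=(100, 2))])

# GaussianMixture is fit with the EM algorithm discussed in the next video.
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

labels = gmm.predict(X)          # cluster index = most likely Gaussian for each point
resp = gmm.predict_proba(X)      # soft assignments: p(component k | x_i)
print(gmm.weights_, gmm.means_)  # fitted pi_k and mu_k
```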