In the previous video, we decided to use Gaussian Mixture Model to fit our dataset and to solve the clustering problem, but how? How can we do better than the generals that castigate and descent? In this video, we will discuss some intuition that leads to expectation maximization algorithm for this particular case. To recall that the density of each data point in our Gaussian Mixture Model is a weighted sum of three or, in general, as many as you want, Gaussians divisions. To start with, we will need to introduce a latent variable here because it will make the reasoning about our model much easier. What can we correlate more variable here? Well, we can do something like this. We can say that each data point was generated by using some information from latent variable T, which exists. It's like we have one latent variable T for each data point X and it causes X. It explains something about X, and the reasonable thing to assume about T here is that it takes 3 radius, 1, 2 or 3 and it shows us from which Gaussian this particular data point came from. We actually don't know for any data point, we don't know from which Gaussian came from. It's a latent variable, right? We doesn't observe this, never nor in training nor in testing, but these can be helpful so if we know the latent variables, it can be helpful to understand something about our model. Later, then with fit the Gaussian mixture model into our data. We may find the distribution on the latent variable even the data. We may ask how will the question, how do you think, what is the value for latent variable T for these particular data point? The answer to this question will be basically the clustering, right? It gives us the beliefs from which Gaussian these data point came from. If we decided to use this kind of latent variable model, then it is reasonable to assume that latent variable T has prior distribution pi, so it's exactly the weights of our Gaussians. The latent variable T equals to some cluster number, for example one, with probability pi one and the likelihood. The density of the data point X, given that we know from which Gaussian came from, is just Gaussian distribution because it came from the Gaussian. Each data point, if we know that it came from the Gaussian number C, it's density just these Gaussian things do with the parameters of the Gaussian number C. When we introduce this kind of latent variable model, we may now look at what will P of X given theta represent in this model. We change our model and we now have to check that it still gives you the same general results as our original model. If you write down P of X given theta, given parameters and this latent variable model, we will get this. It's just a rule of sum for probabilities and it says that a P of X given theta equals to sum with respect to T. So we are marginalizing T, with the respect to all possible values for T, from 1-3, and were assigning the join distribution P of X and T, which equals to the likelihood times the prior. If you substitute the definition of the prior and likelihood into this summation, you can understand that this last formula on this slide is exactly equal to the first formula. We introduced some latent variable in such a way that we when we marginalized out this variable, we get exactly what we had before. Which means that we now can use this latent variable model, train it with observed data X, and it will gives us exactly the same results as we would get if we use the original model. Let's now try to build some intuition about how to train this latent variable model. So say we have a dataset and now, say it's one dimensional. Each data point from 1 to N is just one number for sublist of illustration. How could our goal for finding the maximum likelihood estimation, is finding the parameters, right? How can we find the parameters? Well, it turns out that if we know the sources, if we know the values of the latent variables for each data point, then finding the parameters sigma is easy. Because you're going to say that all the blue data points here is the data points for which we know that they came from the first Gaussian and we can look at them separately from the older orange data points, and just fit them into one Gaussian, which we already know how to do because it's just fitting some data set of blue point into a Gaussian distribution. You can see the formulas on the bottom of the slide, but it's something we have already done in week one, which means that if we know the sources, if we know the values of the latent variables, it's easy to estimate the parameters data. Actually, if we don't have this hard assignments, but rather have soft assignments so some posterior distribution on T, which means that for each data point. We don't assume that it belongs to just one cluster, but rather we assume that it belongs to all clusters simultaneously, all Gaussian simultaneously. What will some different probabilities being posterior P of T, given X and parameters. If we have these probabilities, it is also easy to estimate the parameters. So instead of just signing with respect to all blue points, and averaging their position to get the location of the blue Gaussian, we'll have to sum all the points but with weights. If the posterior P of T = 1, is 1 for some data point? It means that this particular data point is completely blue. It certainly belongs to the blue Gaussian, and it will be used as just blue data point in the averaging with weight 1. If for some data point, P of T = 1 is 0, it means that these data point is certainly orange, and we will just don't use it in the computing the blue Gaussian mean at all, but if the data point is for example, P of T = 1 is 0.8, it just means that this data point is kind of not certain. It thinks that it belongs to the blue Gaussian but it's not sure. It will highly affect the position of the blue Gaussian and a little bit affect the position of the orange Gaussian. We'll direct this kind of formulas later from more strict considerations, but now, just believe me that if you know the posterior probably on T, you could easily estimate the parameters this way. The bottom line here is that if we know the sources, no matter soft segments or hard segments, then we can easily estimate the parameters of our Gaussian mixture model. But on practice, we don't, right? We don't know the sources, so how can we estimate the sources? Well, it turns out that if we know the parameters, so the Gaussians, their locations and their variances, then we can easily estimate the sources because we can use just the Bayes' rule to do it. And by the Bayes' rule, the soft assignment, the posterior probability of T equals to, for example, blue Gaussian, for some particle data point, is just proportional to the join probability, which is likelihood times prior. And if we know the parameters theta, we can easily compute these likelihood and prior at any given point so we can easily to compute this posterior probability, which is basically sources, which is basically assignments for each data point for the clusters, soft assignments, and we think that the normalization constant here can be problematic, but it turns out that we have just two values here. Two probabilities. The distribution P of T given some data and theta, can think just two possible values. We can explicitly normalize this think by signing with respect to two unnormalized probabilities. It's no big deal. To summarize, we have kind of a chicken and egg problem. If we know the Gaussian parameters, we can easily estimate the sources. If we know the sources, we can easily estimate and Gaussian parameters. What can we do in this case? Well, the expectation maximisation algorithm, in this particular case, suggests us to do a very natural thing. Let's, first of all, internalize the parameter somehow randomly and then in iterations in the loop, let's repeat two steps until convergence. First of all, let's fix the parameters. Assume that they are the true ones, and they estimate the sources. On the next sub-step, let's use the sources to re-estimate the parameters, to update the beliefs about the parameters. This way, after repeating this two steps for long enough, we will hopefully obtain a reasonable solution to our probability distribution fitting problem. We will be able to fit Gaussian mixture model into our data.