0:00
[MUSIC]
In the previous video we decided to solve the clustering
problem by making a probabilistic model of our data.
But how? How can we model our
data probabilistically?
Well, we haven't seen that many distributions so far.
But we know Gaussians, right?
So in week one we learned how to fit the parameters of the Gaussian
distribution to the data set.
And we can do it here, it's a reasonable thing to do.
But it turns out that a single Gaussian is not the best model for this kind of data.
Recall that we decided that our data consist of several clusters or
groups which may be far away from each other.
And even in this simple example, we can see that the Gaussian has
to model all the data points as one big circle, or maybe an ellipse.
And in this case, it just has to assign high probability to the center of
the circle; that's the way a Gaussian works.
And you can see here that the center of the Gaussian kind of falls into
the region in between the clusters, where there are not that many data points.
And this is the problem with modeling this kind of data with one Gaussian.
We have to model everything with one big circle, and
we have some regions in between the clusters where there are few data points.
But the Gaussian just has to assign high probability to these regions.
So, what else can we do?
Are there any better probabilistic models we can use here?
Well, if one Gaussian doesn't work, let's use several of them, like three for
example.
So in this case we are kind of assuming that each data point came from one of
the three Gaussians, and each Gaussian explains one cluster of the data points.
Or to put it more formally, the density of each data point
equals the weighted sum of three Gaussian densities, where
the weights pi are non-negative numbers which sum
up to 1, so that this is an actual probability distribution.
So the parameters theta here are the weights pi,
and the three groups of parameters of the three Gaussians:
their locations, the mu vectors, and their shapes, or
their covariance matrices sigma.
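In symbols (just restating the description above, with c indexing the three Gaussians):

$$p(x \mid \theta) = \sum_{c=1}^{3} \pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c), \qquad \pi_c \ge 0, \quad \sum_{c=1}^{3} \pi_c = 1,$$

so theta collects the weights pi_c, the means mu_c, and the covariance matrices Sigma_c.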
Notice that if we succeed in fitting this kind of model to our data,
we will solve the clustering problem.
Because for each data point,
we may now find which Gaussian this data point came from.
And this is essentially the same as finding the cluster index.
So we may say that all the points that came from one
Gaussian are the points of one particular cluster.
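As a minimal sketch (not code from the lecture), assuming the parameters pi, mu and sigma have already been fitted somehow, the cluster assignment could look like this:

```python
# A minimal sketch, assuming hypothetical, already-fitted GMM parameters:
# assign each point to the Gaussian it most likely came from.
import numpy as np
from scipy.stats import multivariate_normal

def assign_clusters(X, pi, mu, sigma):
    """X: (n, d) data; pi: (C,) weights; mu: (C, d) means; sigma: (C, d, d) covariances."""
    # Unnormalized posterior p(c | x_i) is proportional to pi_c * N(x_i | mu_c, sigma_c).
    weighted = np.stack([
        pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=sigma[c])
        for c in range(len(pi))
    ], axis=1)                      # shape (n, C)
    return weighted.argmax(axis=1)  # index of the most probable Gaussian per point
```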
This model is sometimes called the Gaussian Mixture Model, or GMM for short.
So great, we have found a model which we would like to use.
What are some positive and negative features of this model,
compared to just a single plain Gaussian?
Well, obviously, a single Gaussian is much less flexible.
So the Gaussian Mixture Model allows us to fit our more complicated dataset, and
it actually turns out that you can approximate almost any probability
distribution with a Gaussian Mixture Model with arbitrarily high accuracy.
Of course, as always happens with this kind of theorem, in the worst case you
may have to use exponentially many Gaussians, so it's not a very practical
theorem, but anyway the Gaussian Mixture Model is a very powerful and flexible model.
The downside is obviously the number of parameters.
So, as you increase the number of Gaussians you use for
your data, you increase the number of parameters by the same factor, right?
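To make that concrete: for a mixture of C Gaussians in d dimensions, the parameter count is

$$\underbrace{C-1}_{\text{weights}} + \underbrace{Cd}_{\text{means}} + \underbrace{C\,\tfrac{d(d+1)}{2}}_{\text{covariances}},$$

so every extra Gaussian brings a full extra set of mean and covariance parameters.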
Okay great.
We decided to use this kind of Gaussian mixture model.
But how can we fit it?
How can we find its parameters?
So its pi weights, mu vectors and sigma matrices.
Well the simplest way to fit a probability distribution is to use
maximum likelihood estimation right?
So we want to find the values of parameters which maximise the likelihood,
which is the density of our data set given the parameters.
And we want to maximise this thing with respect to the parameters.
As is usual in machine learning, we will assume that the data set consists
of n data points, which are independent given our parameters.
Which basically means that we can factorize the likelihood.
So the likelihood equals the product of the likelihoods of the individual objects.
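In symbols:

$$p(X \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$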
5:02
Well, one more thing we can do to
simplify this expression, to understand how to maximize it,
is to substitute the definition of the marginal likelihood of x_i given
the parameters, using the definition from a few slides before.
So each data point density is a mixture of Gaussian densities, right?
Okay, so now we have this kind of optimization problem.
And just one more thing we forgot here is, we have some constraints, right?
We have to say that the weights pi are non-negative, and that they sum up to 1.
Because otherwise, it will not be an actual probability distribution.
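So the full problem (in practice one maximizes the logarithm of the product, the log-likelihood) is

$$\max_{\pi,\mu,\Sigma} \; \sum_{i=1}^{n} \log \sum_{c=1}^{3} \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c) \quad \text{subject to} \quad \pi_c \ge 0, \;\; \sum_{c=1}^{3} \pi_c = 1.$$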
But now it seems like we're good to go.
Now we may use your favorite stochastic optimization algorithm from TensorFlow,
like Adam or whatever you would like to use,
and we can optimize this thing to find the optimal parameters, right?
Well, it turns out that we kind of forgot one important set of constraints here.
The covariance matrices sigma cannot be arbitrary.
Imagine that your optimization algorithm proposed to use a covariance matrix with all zeros.
It just doesn't work: it doesn't define a proper Gaussian
distribution.
Because in the definition of the Gaussian distribution you have to invert this matrix,
and you have to compute its determinant and divide by it.
So if you have a matrix which is all zeros,
you will have lots of problems like division by zero, and stuff like that.
So it's not a good idea to assume that the covariance matrix can be anything.
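A tiny illustration (not from the lecture) of why the all-zeros matrix breaks the density:

```python
# An all-zero "covariance" matrix has determinant 0 and no inverse, so the
# Gaussian density formula (which needs both) cannot be evaluated.
import numpy as np

sigma = np.zeros((2, 2))
print(np.linalg.det(sigma))   # 0.0 -> division by zero in the Gaussian normalizer
try:
    np.linalg.inv(sigma)      # the inverse appears in the exponent of the density
except np.linalg.LinAlgError as err:
    print("cannot invert:", err)  # "Singular matrix"
```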
And actually, the set of valid covariance matrices is the set of positive semi-definite matrices.
Don't worry if you don't know what that means, it's not that important right now.
The important part is that it's a really hard constraint to follow,
so it's hard to adapt your favorite stochastic gradient descent to
always respect this kind of constraint.
I don't know how to efficiently maintain the property that the matrices are
always positive semi-definite, so we have a problem here:
we don't know how to optimize this thing, at least with stochastic gradient descent.
Well, it turns out that even if you get rid of this constraint, so
if you consider a simpler model, for
example one where all the covariance matrices are diagonal, which
means that the ellipsoids that correspond to the Gaussians cannot be rotated,
they have to be aligned with the axes.
In this case it's much easier to maintain this constraint.
And you can actually use some stochastic optimization here.
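Here is a minimal sketch (not the code used in the lecture) of what fitting such a diagonal-covariance mixture with a stochastic optimizer might look like in TensorFlow; the data and all variable names are made up for illustration:

```python
# A minimal sketch, assuming toy 2-D data and a diagonal-covariance GMM:
# maximize the log-likelihood with Adam (TensorFlow 2).
import numpy as np
import tensorflow as tf

# Hypothetical data: three well-separated blobs.
X = tf.constant(np.concatenate([
    np.random.randn(100, 2) + [0.0, 0.0],
    np.random.randn(100, 2) + [5.0, 5.0],
    np.random.randn(100, 2) + [0.0, 5.0],
]), dtype=tf.float32)

C, d = 3, 2
# Unconstrained parameters: softmax keeps the weights non-negative and summing
# to 1, exp keeps the diagonal variances strictly positive.
logits = tf.Variable(tf.zeros(C))
mu = tf.Variable(tf.random.normal([C, d]))
log_var = tf.Variable(tf.zeros([C, d]))

def log_likelihood(x):
    pi = tf.nn.softmax(logits)                 # mixture weights
    var = tf.exp(log_var)                      # diagonal covariance entries
    diff = x[:, None, :] - mu[None, :, :]      # shape (n, C, d)
    # log N(x_i | mu_c, diag(var_c)) for every point and every component
    log_norm = -0.5 * tf.reduce_sum(
        tf.math.log(2.0 * np.pi * var) + diff ** 2 / var, axis=-1)
    # log p(x_i) = log sum_c pi_c N(...); log-sum-exp for numerical stability
    return tf.reduce_sum(tf.reduce_logsumexp(tf.math.log(pi) + log_norm, axis=-1))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
for step in range(300):
    with tf.GradientTape() as tape:
        loss = -log_likelihood(X)              # maximize likelihood = minimize its negative
    grads = tape.gradient(loss, [logits, mu, log_var])
    optimizer.apply_gradients(zip(grads, [logits, mu, log_var]))
```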
So for example, here I used Adam and
I tuned its learning rate to optimize this thing.
And you can see that it's doing a reasonable job here.
So the blue curve is the performance of Adam here.
On the x-axis we see epochs, and on the y-axis we see log likelihood,
which we are trying to maximize.
And so, Adam is doing a good job, right?
In like ten epochs it optimizes this thing to something reasonable.
I think the red line here is the ground truth, which I know
because I know from which probability distribution I generated this data, so
I know the optimal value of the log-likelihood.
But it turns out that even in this case, where we don't have these very complicated
constraints, you can do so much better by exploiting the structure of your problem.
And this is something we're going to discuss in the rest of the week:
something called expectation maximization.
And if you apply it here, it just works so much better.
In just a few iterations it found a value
which is better than the ground truth, which probably means it is overfitting,
but anyway it works well on the test set also.
So to summarize, we may have two reasons not
to use stochastic gradient descent here.
First of all, it may be hard to follow some constraints which you may care about,
like positive semidefinite covariance matrices.
And second of all, the expectation maximization algorithm,
which can exploit the structure of your problem,
is sometimes much faster and more efficient.
So as a general summary, we discussed that the Gaussian mixture model is a flexible
probability distribution, which can solve the clustering problem for
you if you fit this Gaussian mixture model to your data.
And sometimes it's hard to optimize with stochastic gradient descent, but
there is an alternative, which we'll talk about in the next video.