Hello, and welcome to lesson four in module 15.

This lesson will introduce Gaussian mixture models.

Gaussian mixture modeling is a form of clustering

that attempts to use Gaussian distributions to represent or model a data set.

This mixture model approach has other benefits beyond

finding clusters including being able to serve as a generative model.

In other words, we construct a parametric representation of our data and we can use

this parameterized model to generate

new artificial data that follow

the same distribution as the data set that we're modeling.

By the end of this lesson, I want you to understand the basic idea behind mixture models.

I also want you to be able to explain how mixture models actually find clusters.

And of course, I want you to be able to apply

a Gaussian mixture model to data by using the scikit-learn library.

This particular lesson has a reading and a notebook.

The reading is a notebook from Jake VanderPlas' book Python Data Science Handbook.

He introduces the idea of a Gaussian mixture model.

He gives a number of motivating reasons why it might be interesting and important.

The main thing here with the Gaussian mixture model is that it's

very similar to k-means in that

we are finding clusters in data by applying a model.

One of the main differences is

that we're not restricted to the circular or

spherical clusters that we typically get with k-means.

A Gaussian mixture model fits a parameterized Gaussian to each cluster,

which means that we can change the shape based on certain hyper-parameters,

so that's certainly a benefit.

We can also do it very quickly, and just like k-means,

the underlying approach

uses an expectation-maximization algorithm.

So, this just walks you through what k-means might do.

We get circular clusters.

But what if our data is not circular?

The Gaussian mixture model actually can recover

that non-circularity, if you will.
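
To make that comparison concrete, here is a minimal sketch, not the code from the reading. It builds four round blobs, stretches them with an arbitrary matrix (the `stretch` values are made up for illustration), and clusters the result with both k-means and a full-covariance Gaussian mixture. Scoring with the adjusted Rand index is an assumption on my part.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Four round blobs, then a linear stretch to make them elongated.
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)
stretch = np.array([[0.6, -0.6], [-0.4, 0.8]])  # arbitrary illustrative matrix
X_stretched = X @ stretch

# k-means assumes roughly round clusters.
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_stretched)

# A full-covariance mixture can fit one ellipse per cluster.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm_labels = gmm.fit_predict(X_stretched)

# Agreement with the labels used to generate the blobs (higher is better).
print("k-means ARI:", adjusted_rand_score(y_true, km_labels))
print("GMM ARI:    ", adjusted_rand_score(y_true, gmm_labels))
```

How large the gap is depends on the data and on the stretch you pick, so treat the printed numbers as something to explore rather than a fixed result.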

And he walks through how this all works.

And you can stop when you get to the example section.

And then in the mixture model notebook,

we're going to do something similar.

We're going to talk a little bit here about what a parametric model is and why it's nice.

And then, we step into what the GMM is.

We apply it to the Iris data set.

We talk about the number of components and

the covariance, and then we do the same thing but applied to the digit data.

First, the GMM requires four things.

One is the number of components or

clusters that you're going to use to build your mixture model.

The second is the fraction of the total data that will

belong probabilistically to each component in the model.

That's not something we have to specify.

That's something that will actually come out of the model.

We also have an array that will specify the mean value for

each component, and then we also get a covariance matrix for each component.
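
As a rough illustration of those four things, the sketch below fits scikit-learn's `GaussianMixture` to synthetic blobs (the data here is made up, it is not the notebook's data) and prints the shape of each fitted attribute. Only the number of components is passed in; the weights, means, and covariances come out of the fit.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# The number of components is the one thing we specify up front.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.weights_.shape)      # (3,): fraction of the data per component
print(gmm.means_.shape)        # (3, 2): one mean vector per component
print(gmm.covariances_.shape)  # (3, 2, 2): one covariance matrix per component
print(gmm.weights_.sum())      # the fractions add up to one
```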

Fitting a Gaussian mixture model

uses something called the expectation-maximization algorithm.

It's a very important algorithm,

so I encourage you to read this and try to understand how it actually works.
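
If it helps to see the two steps written out, here is a minimal NumPy sketch of expectation-maximization for a two-component mixture in one dimension. It is a teaching sketch under simple assumptions (toy data, a fixed 100 iterations, no convergence check), not what scikit-learn does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two Gaussians (hypothetical example data).
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

# Initial guesses for the weights, means, and variances of two components.
w = np.array([0.5, 0.5])
mu = np.array([x.min(), x.max()])
var = np.array([x.var(), x.var()])

for _ in range(100):
    # E-step: how responsible is each component for each point?
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibility-weighted points.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("weights:", w)
print("means:  ", mu)
print("stds:   ", np.sqrt(var))
```

The fitted means should land near the values used to generate the toy data, and the weights near the 200 to 300 split.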

Of the hyper-parameters that we're going to use,

typically the most important

is n_components,

the number of components or clusters that the mixture model will try to fit,

as well as the covariance type,

and that will be best explained visually when I show you that.

So, now we actually want to apply this to the Iris data set.

We're going to read the data in, and we're going to scale it

so that it's all properly normalized,

so that each feature will be treated the same way.

We're then going to create our Gaussian mixture model.

We're going to say there are three components.

We're cheating a little bit, since we know there are three clusters,

but we're also going to specify a full covariance type.

We then get a score, which in this case

is quite high; it's 0.9.

We can also compute where our clusters are located.
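
The steps just described might look roughly like the sketch below. It is not the notebook's exact code: the use of `StandardScaler` and the adjusted Rand index as the score are assumptions, since the lesson does not say which scaler or score it uses, so the number printed here need not match the one quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Read the data and scale it so each feature is treated the same way.
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

# Three components with a full covariance matrix per component.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

# Assumed score: agreement between clusters and the known species.
ari = adjusted_rand_score(iris.target, labels)
print("adjusted Rand index:", ari)

# Cluster centers, in the scaled feature space.
print(gmm.means_)
```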

We can then show the data.

Take a typical data value,

and this is one of the things I wanted to get across.

You can say, here's a new data element.

What is the probability that it belongs to each of the mixture components?

And that's what these values here are, and you can see one of them is quite high.

This data instance belongs to the second cluster.

This data element belongs to the third cluster, and this data element

belongs to the first cluster, because the other values are extremely small.

That's the idea here: we have a probabilistic representation

of the likelihood that a data point belongs to each of the clusters,

not just one but all of them.
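
Here is a small sketch of how you could ask for those per-cluster probabilities with `predict_proba`. The three row indices are arbitrary picks for illustration, and the component numbering you get will not necessarily match the notebook's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

# Three arbitrary rows standing in for "new" data elements.
sample = X[[0, 75, 140]]

# One row per data element, one column per mixture component.
probs = gmm.predict_proba(sample)
print(np.round(probs, 3))

# The hard cluster label is just the most probable component.
print(probs.argmax(axis=1))
```

Each row sums to one, so a single value near one means the other components are very unlikely for that data element.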

We can then plot the cluster centers, and you can see, not surprisingly,

this one's done quite well,

and then these two are indicated here.

But we don't have just that; we actually have

a probabilistic representation of the space.

So now, we can actually say, "What's

the probability that the data belongs to a given cluster?"

And these two clusters, you can see, are sort of combined together.

The only way to really pull them apart,

it's like two mountains that are right next to each other,

is to really do a lot of contours and that

would make it very hard to view. So, I didn't do that.

But you can see that this gives an idea of

the probability space here of belonging to an individual cluster.

We can also say, "Well, how do we change the number of

components or figure out what the right number of components is?"

And with this being a probabilistic technique,

we can apply information theory and, in particular,

the Akaike information criterion or the Bayesian information criterion.

Both of these, the AIC and the BIC, can provide insight.

So we compute both of them, and we plot them.

Here is the BIC,

the Bayesian information criterion.

And you can see that lower is better.

And so, these are the two lowest values.

Typically, the reason they start rising, or stop decreasing very rapidly,

is that they get penalized for model complexity.

The BIC penalizes complex models more heavily.

The AIC penalizes complex models less heavily.

And again, we see that there's this very sharp drop and then it starts leveling off.

And so, this tells us that we're not getting much by going to

more components, and this one says we're actually losing.

And so, that sort of gives us some indication that three is a really good fit.

Two also does reasonably well, and knowing how those two clusters are sort of merged,

it kind of makes some sense

why two might also give us a reasonable value.
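
A sketch of that model-selection loop, assuming the same scaled Iris data and a full covariance type, is below. The range of one to six components is my choice for illustration. `GaussianMixture` has `aic` and `bic` methods, and for both of them lower is better.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

ks = range(1, 7)  # candidate numbers of components (assumed range)
aic, bic = [], []
for k in ks:
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    aic.append(gmm.aic(X))
    bic.append(gmm.bic(X))

for k, a, b in zip(ks, aic, bic):
    print(f"{k} components: AIC={a:.1f}  BIC={b:.1f}")
```

With 150 data points the BIC penalty per parameter (the log of the sample size) is larger than the AIC penalty (two), which is why the BIC curve turns upward sooner.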

We can then say, "Well, what's the effect of the covariance terms?"

You can change that hyper-parameter to have one of four values.

Spherical, which looks like a k-means clustering.

Diagonal, which means we actually can have

axis-aligned ellipsoidal shapes, but a different ellipsoidal shape for each cluster.

We can also do a tied covariance,

which means each cluster has the exact same covariance. Or, for complete generality, there is

full, which means each cluster gets its very own covariance matrix.

And you can think of these ellipses as

sort of representing the shape of the cluster that the mixture finds.
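
One way to see what each covariance type stores is to look at the shape of the fitted `covariances_` array. This sketch assumes the scaled Iris data (four features) and three components.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)  # 4 features

shapes = {}
for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])

# spherical (3,)       one variance per cluster
# diag      (3, 4)     one variance per feature, per cluster
# tied      (4, 4)     one full matrix shared by all clusters
# full      (3, 4, 4)  one full matrix per cluster
```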

Next, we switch to the digit data.

We perform much of the same analysis that you've been seeing.

So, I don't want to spend too much time on it.

We can recover the images that represent the centers of each of the uncovered clusters.

We can also use the fact that the mixture model can generate new data.

So, we can actually do that here.

These are not actual data from our actual sample.

These are generated images. We say,

"Can you generate probabilistic values of

the data from our model?" and then we turn them into images.

And that's what this code cell does.
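
I can't reproduce the notebook's cell here, but a hedged sketch of the same idea follows. The PCA step, the 20 retained dimensions, and the 10 components are all assumptions chosen to keep the fit small and stable; the notebook may do this differently.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

digits = load_digits()  # 1797 images, 64 pixels each

# Assumed preprocessing: reduce 64 pixels to 20 whitened dimensions.
pca = PCA(n_components=20, whiten=True, random_state=0)
X_reduced = pca.fit_transform(digits.data)

gmm = GaussianMixture(n_components=10, covariance_type="full", random_state=0)
gmm.fit(X_reduced)

# Cluster centers mapped back to 8x8 images.
center_images = pca.inverse_transform(gmm.means_).reshape(-1, 8, 8)

# Draw new points from the fitted model and map them back to 8x8 images.
X_new, component_ids = gmm.sample(16)
new_images = pca.inverse_transform(X_new).reshape(-1, 8, 8)

print(center_images.shape)  # (10, 8, 8)
print(new_images.shape)     # (16, 8, 8)
```

Each entry of `new_images` can be shown with matplotlib's `imshow` to view it as a digit-like image.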

Again, we can use the AIC and BIC to figure out what's

the right number of clusters or components to use for this particular example.

And you can see that the AIC drops and has this drop right here,

right around 10, and the BIC is rising up again right here.

So, the BIC might suggest a little bit lower value.

But with the AIC, we see that, taken together, 10 seems to be a very good value here.

So, that should give you a pretty good introduction to Gaussian mixture models.

We've also seen a little bit about the EM algorithm.

I encourage you to sort of read through that.

If you have any questions after you've gone through the Gaussian Mixture Model lesson,

please let us know in the course forums.

And of course, good luck.