Congratulations, you made it to the end of the form

of the expectation maximization algorithm.

So let's summarize some of the properties and thinks about this Algorithm.

First of all, Algorithm is a method for training latent variable models.

So if your model has both latent variables, so

not observed, and observed variables x.

Then you can apply Algorithm to train it sometimes more efficiently

than some general purpose optimizational rhythms like stochastic gradient decent.

Well one of the nice properties about the Is that it allows you

to handle missing data.

And sometimes it allows you to do it with your favorite

machine learning algorithms which are not probabilistic.

So in this case, sometimes it's possible to treat your machine

learning algorithm in probabilistic terms.

And then you can say that, for example, if you do not observe your variable x to 3,

for example, so you don't know the value of this feature.

You can say that it's a latent variable.

So all over x's are observed, but this one is latent.

And then you can apply To train this model with latent variables,

to get a reasonable solution in this case.

Well, of course you can always use some heuristics to treat missing values.

For example, you can just throw away some of the columns and

rows to get rid of the missing values.

Or maybe you can substitute the missing values with zeros or some averages.

But in a heuristic sense, really none of this will work.

So although Is not like guarantees you to always work,

usually in practice if you can apply it to your problem with missing values,

is better than just to substitute your missing values with zeroes or

to throw away a bunch of data.

The main idea of the expectation maximization algorithm is

that you want to maximize your margin log likelihood, but it's hard.

It's a hard optimization problem.

So instead of that, you build a variation lower bound,

which depends both on the original parameters theta and on the variational

parameters q which is some of distributions you have just introduced.

And then you're trying to maximize this lower bond with respect to both

theta and q.

In kind of low core and professional.

So in durations you are fix one, you fix q for example and

maximize with respect to theta.

And then you fix theta and maximize with respect to q.

And this way you're, instead of your original complicated optimization problem,

you're solving a sequence of some different optimization problems,

which are in some, in practical cases sometimes much simpler.

So for example, in Gaussian mixture model,

you can solve each of the substep's optimization programs

analytically with almost no timing condition complexity.

The Expectation Maximization Algorithm encourages you to converge to some

critical point maybe not optimal, but at least local maximum or settle point.

And sometimes, it helps you to handle some complicated constraints.

So in the original problem, you may had a constraint like

your matrix sigma should be positive semi-definite.

So it can represent a valid covariance matrix, right?

And it's hard to apply, it's hard to force this constraint on

your favorite stochastic gradient descent, or some other method.

But, in Algorithm,

you're substituting your original problem by a sequence of simpler ones.

And on the nth step you will, of course, also have to enforce this constraint.

But since the optimization task is so much simpler, it's usually much simpler to

enforce these constraints on the nth step, than in the original problem.

So in the Gaussian mixture model case, for example, on the nth step, we have

some concave optimization, which is really much easier than the original one.

And adding this nontrivial positive semi-definitive constraint doesn't

make the problems that much harder.

So we can still solve the nth step analytically.

The expectation maximization algorithm has numerous extensions.

And we will talk about some of them later in this course.

So If your distribution q, so

your Pasteur distribution on the latent variables given the data and

the parameters is too hard to work with, you may do some approximations.

First of all, you can restrict the set of possible qs, restrict the set of

qs you consider, and find the best q In this kind of strategic family.

We'll see a small example of this in the next module but

mostly this will be what week three and week five is about.

Also if you don't want to approximate, if you want something,

at least expect it to be, at least occurs on average.

In this case you can use sampling. And instead of working with the Pasteur

distribution directly, you can try to sample points from it.

And then approximate the expected value on the M-step and solve it approximately.

And this is what week four is about.

How to sample from complicated distributions.

So some negative features of the Is that it gives you a local maximum not

the global one.

But it's kind of expected because it's an problem and

you can't expect anyone to give you optimal solution in reasonable time.

And also it requires a little bit of math which you have to get used to.

But, after you do, after you, kind of, make yourself familiar with these kind of

durations, it becomes very natural to do in your applications and new problems.

[MUSIC]