0:00

We talked about the problem of learning models with missing variables and missing data, and discussed some of the complexities of that learning problem.

So now we're going to basically bite the bullet and try to understand how one might nevertheless go about doing, say, parameter estimation in the context of missing data. So let's remind ourselves of one of the more important difficulties in learning parameters with missing data.

So let's draw here the likelihood function that we might have. The X-axis here is a pictorial depiction of the parameter vector theta. It's drawn as one-dimensional, but usually, of course, it will be multi-dimensional. And what we see here on the Y-axis is the likelihood function.

In the context of complete data, as we might remember, the likelihood function had a really nice form: it was a nice, log-concave function, and in the context of Bayesian networks it could actually be optimized in closed form using a simple formula. In the context of missing data, however,

the situation became considerably more complicated,

and the likelihood function looks more like this.

It's this nasty, complicated, multi-modal function,

which is really impossible to optimize in closed form.

So how do you go about optimizing a function that looks like this?

There are, it turns out, two general classes of strategies.

The first is to use a generic optimization method, such as gradient ascent. So here, for example, we have a pictorial depiction of gradient ascent. We might start out at a particular point in the parameter space, compute the gradient at that point, and go in the direction of steepest ascent, which might take us to the next point, at which point we compute the gradient again, and we continue until we reach what is generally going to be a local optimum of the function that we're trying to optimize. Now, what I just described is a very simple gradient ascent over parameter space.
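As an illustration, here is a minimal sketch of that loop on a hypothetical one-dimensional multi-modal function standing in for the likelihood surface (the function, the starting point, and the step size below are all my own choices, not anything from the lecture):

```python
import math

def f(t):
    # A hypothetical multi-modal objective, bounded above.
    return -0.1 * t**2 + math.sin(3 * t)

def df(t):
    # Its gradient (derivative), computed analytically.
    return -0.2 * t + 3 * math.cos(3 * t)

theta = 2.0    # starting point in "parameter space"
step = 0.01
for _ in range(2000):
    theta += step * df(theta)   # move in the direction of steepest ascent

# theta now sits near a stationary point: df(theta) is roughly zero.
# It is a local optimum, not necessarily the global one.
```

Because the starting point determines which basin of attraction we end up in, running this from a different initial theta can converge to a different, possibly better, local optimum.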

But, in general, there are more efficient methods for improving the rate of convergence of these algorithms, such as line search or conjugate gradient methods, which are basically going to get us faster to whatever local optimum we're going to end up at.

So a key question when applying gradient ascent to a function such as this is how difficult it is to compute the gradient in this high-dimensional space.

Fortunately, in the context of Bayesian networks we can actually come up with a closed-form solution for that gradient, and that solution takes the following form. Consider the gradient of our log-likelihood function relative to a particular parameter theta of xi given ui; this is in the context of a standard, say, multinomial Bayesian network, where little xi is a particular assignment to big Xi, and little ui is an assignment to its parents. That gradient is one over the value of the parameter at the point at which we're computing the gradient, times the summation over data instances m of the probability of that particular joint assignment xi, ui to the node and its parents, given all of the evidence in the m-th instance, at the current value of the parameters:

∂ log L(θ : D) / ∂ θ_{xi|ui} = (1 / θ_{xi|ui}) · Σ_m P(xi, ui | d[m], θ)
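To make this concrete, here is a minimal sketch that computes this gradient for a tiny two-node network X → Y where X is sometimes missing. The network, its parameters, and the data below are all my own hypothetical choices; inference is done by brute-force enumeration, which only works for such tiny networks:

```python
# Hypothetical two-node network X -> Y with binary variables.
# Lecture formula: d logL / d theta_{y|x}
#   = (1 / theta_{y|x}) * sum_m P(X=x, Y=y | d[m], theta)

theta_x = {0: 0.6, 1: 0.4}               # P(X)
theta_y = {0: {0: 0.7, 1: 0.3},          # P(Y | X=0)
           1: {0: 0.2, 1: 0.8}}          # P(Y | X=1)

def joint(x, y):
    return theta_x[x] * theta_y[x][y]

def family_posterior(x, y, obs):
    """P(X=x, Y=y | observed values in this instance), by enumeration.
    obs maps 'X'/'Y' to a value, or to None when the value is missing."""
    def consistent(xv, yv):
        return obs['X'] in (None, xv) and obs['Y'] in (None, yv)
    num = joint(x, y) if consistent(x, y) else 0.0
    den = sum(joint(xv, yv)
              for xv in (0, 1) for yv in (0, 1) if consistent(xv, yv))
    return num / den

# Three data instances; X is missing in two of them.
data = [{'X': 0, 'Y': 1}, {'X': None, 'Y': 1}, {'X': None, 'Y': 0}]

def grad(x, y):
    """Gradient of the log-likelihood w.r.t. theta_{y|x}."""
    expected_count = sum(family_posterior(x, y, d) for d in data)
    return expected_count / theta_y[x][y]
```

Note that each term of the sum requires running inference for that instance; in a real network one would use, say, a calibrated clique tree rather than enumeration.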

3:39

So we basically look at each and every one of the data instances and compute the probability of this event, xi comma ui, given whatever observations I happen to have seen, whether xi and ui themselves are observed or not. We then add all of those probabilities up and divide by the value of the parameter. So that's a nice, usable formula, except that we have to understand some of its repercussions.

Specifically, this requires that we compute this probability over a node and its parents, that is, a family, given the data instance d[m], and we need to compute that for all variables i and for all data instances m. So this effectively means that we have to run inference for every instance. And furthermore, we need to compute the probability for every family in the Bayesian network.

Fortunately, we can at least reduce the second part of this computational cost, the fact that we need to run this for all i. If we, say, calibrate a clique tree or cluster graph, then when we're done with the calibration, because of the family preservation property, a node and its parents are all going to be in the same clique or the same cluster. So a single calibration process is going to give us all of these probabilities at once. But that's only true for a given data instance m, which means that effectively we end up with the following computational cost: we need to run inference for each and every data instance at every iteration of the gradient process, which is a very costly computation. Now, what are the pros of gradient ascent

as an optimization algorithm? It's a very flexible strategy. Whereas we just showed how the formula applies to table CPDs, it can also be extended to non-table CPDs via the chain rule for derivatives, which is not the same as the chain rule for Bayesian networks.
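As a hedged illustration of that chain-rule extension, here is a sketch where each row of a CPD is reparameterized as a softmax over unconstrained weights, and the gradient is pushed through it with the chain rule. The weights, data, and distributions below are entirely my own, and for simplicity the instances are fully observed; with missing data, the same chain rule would be applied on top of the posterior probabilities from the lecture's gradient formula:

```python
import math

def softmax(row):
    e = [math.exp(v) for v in row]
    s = sum(e)
    return [v / s for v in e]

def log_likelihood(w, data, p_x):
    """Log-likelihood of fully observed (x, y) pairs when the CPD row
    P(Y | X=x) is softmax(w[x]); P(X) is held fixed."""
    ll = 0.0
    for x, y in data:
        ll += math.log(p_x[x] * softmax(w[x])[y])
    return ll

def grad_w(w, data):
    """Chain rule through the softmax: for each instance with X=x,
    d ll / d w[x][k] = 1{y == k} - softmax(w[x])[k]."""
    g = [[0.0 for _ in row] for row in w]
    for x, y in data:
        theta = softmax(w[x])
        for k in range(len(theta)):
            g[x][k] += (1.0 if y == k else 0.0) - theta[k]
    return g
```

A side benefit of this parameterization is that the CPD entries are automatically non-negative and sum to one, so the optimization over the weights is unconstrained.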

5:59

However, it also has some significant cons. First, it turns out to be a little bit tricky (not horrible, but a little bit tricky) to ensure that the parameters define legal CPDs. This gives rise to a constrained optimization problem, where we need to make sure that the entries in each CPD are all greater than or equal to zero and sum to one, and that imposes an additional burden on the optimization problem. Second, it turns out that if you want to get any kind of reasonable convergence, you can't just do straight gradient ascent; one needs to do something more clever, like conjugate gradients or line search, which again tends to increase the computational cost. The second strategy, which one typically

uses for optimizing the likelihood function, is one that is specifically geared toward likelihood functions. So it's a special-purpose algorithm that doesn't optimize any old function, but rather works specifically on likelihood functions.

What's the intuition behind this algorithm, called expectation maximization, or EM for short? The intuition is that we really have a chicken-and-egg problem here. If we had complete data, then parameter

estimation would be easy, because we know how to do maximum likelihood parameter estimation in Bayesian networks.

In fact, it's a closed-form problem. Conversely, if we had the full set of parameters, then computing the probability of the missing data would also be easy, in the sense that it can be done using straight probabilistic inference. That may or may not be computationally easy, but conceptually, computing the probability of the missing data is an easy, well-defined problem. So we have two sets of things that we

need to infer or estimate. One is the parameters.

8:01

The other is the values of the missing data. And each of those is easy, given the other.

And so what the EM method does is effectively an iterative process, or in fact, as one can formally show, a coordinate-ascent process, where we start out with one set of parameters, use that to estimate the values of the missing data, and then use that completion to re-estimate the parameters, and so on and so forth, until we reach convergence. So more formally, we pick some starting

point for the parameters. In fact, one can also initialize with values for the missing data; in many applications it's more standard to initialize with the parameters, but both are completely legitimate.

So say we pick a starting point for the parameters, and then we iterate two steps. The first is what's called the E-step, where E stands for expectation, in which we complete the data using the current parameters. The second is the M-step, standing for maximization, which estimates the parameters relative to that data completion; the maximization basically refers to maximum likelihood estimation, and that's where the M in maximization comes from. And it turns out that this iterative

process is guaranteed to improve the likelihood function at each iteration; that is, at each iteration, the likelihood function is going to be better than it was at the previous one. And so this is an iterative improvement process. Let's make this process more precise.

So let's consider each of those, the E-step and the M-step, separately. The expectation, or E-step, remember, goes from the parameters to the missing variables. For each data case d, and each family X, U, we are going to compute the probability of X, U given the observed values in that data instance and given the current parameter setting theta t. Note that this completion is a soft completion.

10:20

It's a probability distribution over the missing variables, not a single assignment to X and U.

With that soft completion, we're going to compute what are called the expected sufficient statistics. These are very similar to the sufficient statistics that we had before. When we were computing sufficient statistics, we were counting the number of times that we saw a particular combination x comma u; remember, we had M[x, u] as our sufficient statistics. Now we have a soft completion, and so we have expected, or soft, sufficient statistics, which we're going to denote with an M-hat, subscripted by the parameter vector that gave rise to the completion. M-hat sub theta t, for a particular assignment x comma u, is going to be simply the sum, over all the data instances, of the probability that we just computed, the probability of x comma u given the assignment d sub m and the parameters theta t:

M-hat_{θ^t}[x, u] = Σ_m P(x, u | d[m], θ^t)

And that is the notion of expected counts, or expected sufficient statistics, which we're now, in the context of the M-step, going to use as if they were real sufficient statistics.

And so we're going to take these expected sufficient statistics and use them to perform maximum likelihood estimation, using the exact same formula that we would have used had these been real sufficient statistics.

So specifically, for example, in the context of multinomial Bayesian networks, the next value of the parameter, theta at step t plus one of x given u, is the expected count M-hat sub theta t of x comma u, divided by M-hat sub theta t of u, which, using the soft counts, is the fraction of x, u among the u's:

θ^{t+1}_{x|u} = M-hat_{θ^t}[x, u] / M-hat_{θ^t}[u]

Let's consider a concrete, very simple

example, which is that of Bayesian clustering. What we see here is your standard Naive Bayes model, where we have a single class variable and a bunch of observed features, and the assumption here is that the class variable is missing; that is, it's sometimes, or in most cases never, observed in the data. And so, effectively, we're trying to identify the categories or classes of instances in this data, under the assumption that, within each class, the feature values are independent. That is, the features are conditionally independent given the class, as in the Naive Bayes model. So, instantiating our formulas from the

EM algorithm for this setting, we get the following very natural formulas.

So the sufficient statistic for the class variable is simply the sum, over all of the data instances, of the probability of the class variable given the observed features for that instance and the current value of the parameters.

That is, it's the soft count of how many instances fall into each category, when each instance is allowed to divide itself up across multiple categories, so as to use a soft assignment. The second set of sufficient statistics,

which represent the dependency between one of the features and the class variable, is M-hat sub theta of xi, say, comma c. That once again sums over all the data instances, and simply looks at the probability of a particular value of the class variable and a particular feature value, given the observed features in that particular instance. With these expected counts, we can

now go ahead and do maximum likelihood estimation.

And that gives rise to the following very simple equations: theta of c is equal to M-hat sub theta of c divided by M, the total number of data instances, and theta of little xi given c is assigned M-hat sub theta of xi comma c divided by M-hat sub theta of c. This, again, is the standard formula, but using soft rather than hard counts. So, to summarize the EM algorithm, we

have, in this case, the exact same problem that we had in the context of the gradient ascent algorithm, which is that we need to run inference, computing the probability of Xi, C for each data instance at each iteration.

As for the gradient ascent algorithm, this only requires a single calibration per instance, because the same family preservation property that was useful there applies equally here. What are the pros of the EM algorithm?

It's very easy to implement on top of an existing maximum likelihood estimation procedure that one might have for complete data. So, having implemented an MLE procedure, one can easily build on top of it by adding a separate E-step that computes the expected sufficient statistics. Empirically, it turns out that EM

also makes very rapid progress in terms of converging to a better and better set of parameters, especially in early iterations.

And we'll see some empirical results for that in a bit.

The cons of the EM algorithm are that convergence generally slows down at later iterations, and so it sometimes requires that later iterations use, say, some kind of conjugate gradient algorithm, because it turns out that that's more effective there.
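Pulling the pieces together, here is a minimal end-to-end EM sketch for the Bayesian clustering setting described above: one hidden binary class C and three binary features that are conditionally independent given C. The data set, the starting parameters, and the iteration count are all hypothetical choices of mine:

```python
import math

# Five hypothetical instances of (x1, x2, x3); the class C is never observed.
data = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0)]
n_feat = 3

theta_c = 0.5                      # P(C=1)
theta_x = [[0.6, 0.6, 0.4],        # P(Xi=1 | C=0), asymmetric start
           [0.4, 0.4, 0.6]]        # P(Xi=1 | C=1)

def joint(c, xs):
    """P(C=c, x1..x3) under the current parameters."""
    p = theta_c if c == 1 else 1.0 - theta_c
    for i, x in enumerate(xs):
        p *= theta_x[c][i] if x == 1 else 1.0 - theta_x[c][i]
    return p

history = []                       # log-likelihood trace across iterations
for _ in range(20):
    history.append(sum(math.log(joint(0, xs) + joint(1, xs)) for xs in data))
    # E-step: soft completion P(C=1 | d[m], theta^t) for each instance
    resp = [joint(1, xs) / (joint(0, xs) + joint(1, xs)) for xs in data]
    # M-step: expected sufficient statistics, then MLE with soft counts
    m_c1 = sum(resp)               # M_hat[C=1]
    m_c0 = len(data) - m_c1        # M_hat[C=0]
    for i in range(n_feat):
        m_x1c1 = sum(r for r, xs in zip(resp, data) if xs[i] == 1)
        m_x1c0 = sum(1 - r for r, xs in zip(resp, data) if xs[i] == 1)
        theta_x[1][i] = m_x1c1 / m_c1   # M_hat[Xi=1, C=1] / M_hat[C=1]
        theta_x[0][i] = m_x1c0 / m_c0
    theta_c = m_c1 / len(data)
```

Consistent with the guarantee mentioned in the lecture, the recorded log-likelihood never decreases from one iteration to the next, although which local optimum it reaches depends on the starting parameters.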