0:00

We talked about the problem of learning models with missing variables, or missing

data, and discussed some of the complexities of that learning problem. So now we're going to bite the bullet and try to understand how one might nevertheless go about doing, say, parameter estimation in the context of missing data. So let's remind ourselves of one of the more important difficulties in learning parameters with missing data.

So let's draw here the likelihood function that we might have. The x-axis here is a pictorial depiction of the parameter vector theta. It's a one-dimensional depiction, but usually, of course, it'll be multi-dimensional. And what we see here on the y-axis is the likelihood function. In the context of complete data, as we might remember, the likelihood function had a really nice form: it was a nice log-concave function, and in the context of Bayesian networks it could actually be optimized in closed form using a simple formula. In the context of missing data, however, the situation became considerably more complicated, and the likelihood function looks more like this: a nasty, complicated, multi-modal function, which is really impossible to optimize in closed form. So how do you go about optimizing a function that looks like this?

There are, it turns out, two general classes of strategies. The first is to use a generic optimization method, such as gradient ascent. So here, for example, we have a pictorial depiction of gradient ascent. We might start out at a particular point in the parameter space, compute the gradient at that point, and move in the direction of steepest ascent, which might take us to the next point, at which point we compute the gradient again, and we continue until we reach what is generally going to be a local optimum of the function that we're trying to optimize. Now, what I just described is a very simple gradient ascent over parameter space. But in general there are more efficient methods for improving the rate of convergence of these algorithms, methods such as line search or conjugate gradient, which are basically going to get us faster to whatever local optimum we're going to end up at.
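The loop just described can be sketched in a few lines. This is a generic sketch with a made-up one-dimensional objective standing in for the multi-modal likelihood surface; the step size, tolerance, and objective are illustrative choices, not anything prescribed by the lecture.

```python
import numpy as np

def gradient_ascent(grad, theta0, step=0.05, tol=1e-6, max_iters=10_000):
    """Plain steepest ascent: repeatedly move in the gradient direction
    until the updates become negligible (a local optimum, in general)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        theta_next = theta + step * grad(theta)
        if np.linalg.norm(theta_next - theta) < tol:
            break
        theta = theta_next
    return theta

# A made-up multi-modal objective standing in for a log-likelihood:
# f(t) = -0.1 t^4 + t^2 + 0.3 t, with gradient f'(t) = -0.4 t^3 + 2 t + 0.3.
grad_f = lambda t: np.array([-0.4 * t[0] ** 3 + 2 * t[0] + 0.3])

theta_star = gradient_ascent(grad_f, [0.5])  # climbs to a nearby local optimum
```

In practice one would replace the fixed step size with a line search, or use conjugate gradient, exactly as the lecture notes; the fixed-step loop is only the simplest possible version.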

So a key question when applying gradient ascent to a function such as this is how difficult it is to compute the gradient in this high-dimensional space. Fortunately, in the context of Bayesian networks, we can actually come up with a closed-form solution for that gradient, and that solution takes the following form. Consider the gradient of our log-likelihood function relative to a particular parameter theta of x_i given u_i; this is in the context of a standard, say, multinomial Bayesian network, where little x_i is a particular assignment to big X_i and little u_i is an assignment to its parents. That gradient has the following form: it's one over the value of the parameter at the point at which we're computing the gradient, times the summation over data instances m of the probability of that particular joint assignment x_i, u_i to the node and its parents, given all of the evidence in the m-th instance, at the current value of the parameters:

  d log L / d theta(x_i | u_i)  =  (1 / theta(x_i | u_i)) * sum_m P(x_i, u_i | d_m, theta)
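In code, that gradient is just a sum of per-instance posteriors divided by the current parameter value. The posterior probabilities P(x_i, u_i | d_m, theta) would come out of inference run on each instance (for example, from a calibrated clique tree); the numbers below are hypothetical placeholders, used only to show the shape of the computation.

```python
def cpd_entry_gradient(theta_value, posteriors):
    """Gradient of the log-likelihood with respect to one table-CPD entry
    theta(x_i | u_i):  (1 / theta) * sum_m P(x_i, u_i | d_m, theta).
    `posteriors` lists P(x_i, u_i | d_m, theta) for each data instance m,
    as produced by inference."""
    return sum(posteriors) / theta_value

# Hypothetical posteriors for three data instances:
grad = cpd_entry_gradient(0.4, [0.9, 0.2, 0.5])  # (0.9 + 0.2 + 0.5) / 0.4
```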

Â 3:39

So, we basically look at each and every one of the data instances, compute the probability of this event, x_i comma u_i, given whatever observations we happen to have seen, whether x_i and u_i are themselves observed or not; then we add all of these probabilities up and divide by the value of the parameter. So that's a nice, usable formula, except that we have to understand some of its repercussions. Specifically, this requires that we compute this probability over a node and its parents, that is, a family, given the data instance d_m, and we need to compute that for all variables i and for all data instances m. So this effectively means that we have to run inference for every instance, and furthermore we need to compute the probability for every family in the Bayesian network. Now, fortunately, we can at least reduce the second part of this computational cost, the fact that we need to run this for all i, by remembering that if we, say, calibrate a clique tree or cluster graph, then, when we're done with the calibration, because of the family preservation property, a node and its parents are all going to be in the same clique or the same cluster. So a single calibration process is going to give us all of these probabilities at once. But that's only true for a given data instance m, which means that effectively we end up with the following computational cost: we need to run inference for each and every data instance at every iteration of the gradient process, which is a very costly computation.

Now, what are the pros of gradient ascent as an optimization algorithm? It's a very flexible strategy. So, whereas we just showed how the formula applies to table CPDs, it can also be extended to non-table CPDs via the chain rule for derivatives (which is not the same as the chain rule for Bayesian networks).
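As an illustration of that chain-rule extension (a made-up parameterization, not one from the lecture): suppose a CPD entry is reparameterized as theta = sigmoid(w). Then the gradient with respect to w is the table-CPD gradient times dtheta/dw.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_wrt_w(w, dL_dtheta):
    """Chain rule for derivatives: with theta = sigmoid(w),
    dL/dw = (dL/dtheta) * (dtheta/dw), where dtheta/dw = theta * (1 - theta).
    `dL_dtheta` stands for the table-CPD gradient computed as above."""
    theta = sigmoid(w)
    return dL_dtheta * theta * (1.0 - theta)

g = grad_wrt_w(0.0, 4.0)  # sigmoid(0) = 0.5, so g = 4.0 * 0.25 = 1.0
```

A side benefit of such a reparameterization is that theta is automatically a legal probability for any real w, which connects to the constrained-optimization issue discussed next.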

Â 5:59

However, it also has some significant cons. First, it turns out to be a little bit tricky (not horrible, but a little bit tricky) to ensure that the parameters define legal CPDs. So this gives rise to a constrained optimization problem, where we need to make sure that the entries in each CPD are all greater than or equal to zero and sum to one. And that imposes some additional burden on the optimization problem. And then it turns out that if you want to get any kind of reasonable convergence, you can't just do straight gradient ascent; one needs to do something more clever, like conjugate gradient or line search, which again tends to increase the computational cost.

The second strategy, the one typically used for optimizing the likelihood function, is a strategy that is specifically geared for likelihood functions. So it's a special-purpose algorithm that doesn't optimize any old function, but rather specifically likelihood functions. What's the intuition behind this algorithm, called expectation maximization, or EM for short? The intuition is that we really have a chicken-and-egg problem here. If we had complete data, then parameter estimation would be easy, because we know how to do maximum likelihood parameter estimation in Bayesian networks; in fact, it's a closed-form problem. Conversely, if we had the full set of parameters, then computing the probability of the missing data would also be easy. Easy in the sense that it can be done using straight probabilistic inference, which may or may not be computationally easy; but conceptually, computing the probability of the missing data is an easy, well-defined problem. So we have two sets of things that we need to infer or estimate. One is the parameters.

Â 8:01

The other is the missing data, and each of those is easy given the other. And so what the EM method does is effectively an iterative process (in fact, one can formally show that it is a coordinate-ascent process) where we start out with one set of parameters, use that to estimate the values of the missing data, then use that completion to re-estimate the parameters, and so on and so forth, until we reach convergence. So, more formally, we pick some starting point for the parameters. In fact, one can also initialize on the values of the missing data; it's more standard in many applications to initialize with the parameters, but both are completely legitimate. So say we pick a starting point for the parameters, and then we iterate these two steps. The first is what's called the E-step, where E stands for expectation, in which we complete the data using the current parameters. And the second is the M-step, standing for maximization, which estimates the parameters relative to that data completion; the maximization comes from the fact that this is basically maximum likelihood estimation, and that's where the M in maximization comes from. And it turns out that this iterative process is guaranteed to improve the likelihood function at each iteration; that is, at each iteration, the likelihood function is going to be better than it was at the previous one. And so this process is an iterative improvement process. Let's make this process more precise.
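As a minimal concrete instance of this alternation (a made-up toy problem, not one from the lecture), consider estimating a single Bernoulli parameter from data in which some observations are missing: the E-step soft-completes each missing value using the current parameter, and the M-step re-estimates the parameter from the completed counts.

```python
def em_bernoulli(data, theta=0.5, iters=50):
    """EM for theta = P(X = 1) of a single binary variable, where missing
    observations are marked None."""
    for _ in range(iters):
        # E-step: soft-complete the data; a missing value contributes its
        # expected value under the current parameters, namely theta.
        expected_ones = sum(theta if x is None else x for x in data)
        # M-step: maximum likelihood estimation on the completed counts.
        theta = expected_ones / len(data)
    return theta

theta = em_bernoulli([1, 1, 0, None, None])  # converges to 2/3
```

Here each iteration improves the likelihood, and the procedure converges to theta = 2/3, which in this toy case is also the maximum likelihood estimate from the observed values alone.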

So let's consider each of those, the E-step and the M-step, separately. The expectation, or E, step (remember, this is going from the parameters to the missing variables) works as follows: for each data case d, and each family X, U, we are going to compute the probability of X, U given the observed values in that data instance and given the current parameter setting theta t. Note that this completion is a soft completion.

Â 10:20

Because it's a probability distribution over the missing variables, not a single assignment to X and U. With that soft completion, we're going to compute what are called the expected sufficient statistics. The expected sufficient statistics are very similar to the sufficient statistics that we had before. When we were computing sufficient statistics, we were counting the number of times that we saw a particular combination x comma u; remember, we had M[x, u] as our sufficient statistics. Now we have a soft completion, and so we have expected, or soft, sufficient statistics, which we're going to denote with an M hat, subscripted by the parameter vector that gave rise to the completion. And M hat sub theta t, for a particular assignment x comma u, is simply the sum, over all the data instances, of the probability that we just computed: the probability of x comma u given the assignment d sub m and the parameters theta t:

  Mhat_{theta t}[x, u] = sum_m P(x, u | d_m, theta t)

And that is the notion of expected counts, or expected sufficient statistics, which we're now, in the context of the M-step, going to use as if they were real sufficient statistics. So we're going to take these expected sufficient statistics and use them to perform maximum likelihood estimation, using the exact same formula that we would have used had these been real sufficient statistics. So, specifically, in the context of multinomial Bayesian networks, for example, the next value of the parameters is

  theta^{t+1}(x | u) = Mhat_{theta t}[x, u] / Mhat_{theta t}[u]

which is, using the soft counts, the fraction

of x, u among the u's. Let's consider a concrete, very simple example: that of Bayesian clustering. What we see here is your standard Naive Bayes model, where we have a single class variable and a bunch of observed features, and the assumption here is that the class variable is missing; that is, it's sometimes, and in most cases never, observed in the data. And so, effectively, we're trying to identify the categories or classes of instances in this data, under the assumption that, within each class, the feature values are independent. That is, the features are conditionally independent given the class, as in the Naive Bayes model. So, instantiating our formula from the EM algorithm for this setting, we get the following very natural formula.

So the sufficient statistic for the class variable is simply the sum, over all of the data instances, of the probability of the class variable given the observed features for that instance and the current value of the parameters. That is, it's the soft count of how many instances fall into each category, where each instance is allowed to divide itself up across multiple categories, that is, to use a soft assignment. The second set of sufficient statistics, which represent the dependency between one of the features and the class variable, is M hat sub theta of x_i comma c. That, once again, sums over all the data instances, and simply looks at the probability of a particular class value and a particular feature value given the observed features of that particular instance. Now, with these expected counts, we can go ahead and do maximum likelihood estimation. And that gives rise to the following very simple equations: theta of c is equal to M hat sub theta of c divided by M, the total number of data instances; and theta of little x_i given c is assigned M hat sub theta of x_i comma c, divided by M hat sub theta of c. Which, again, is the standard formula, but using soft rather than hard counts. So, to summarize: the EM algorithm has in this case the exact same problem that we had in the context of the gradient ascent algorithm, which is that we need to run inference, to get the probability of x_i, c for each data instance, at each iteration. As for the gradient ascent algorithm, this only requires a single calibration per instance, because the same family preservation property that was useful there applies equally here. What are the pros of the EM algorithm?

It's very easy to implement on top of an existing maximum likelihood estimation procedure that one might have for complete data. So, having implemented an MLE procedure, one can easily build on top of it by adding a separate E-step for computing the expected sufficient statistics. Empirically, it turns out that EM also makes very rapid progress, in terms of converging to a better and better set of parameters, especially in early iterations. And we'll see some empirical results for that in a bit.

The cons of the EM algorithm are that convergence generally slows down at later iterations, and so it sometimes requires that later iterations switch to, say, some kind of conjugate gradient algorithm, because it turns out that that's more effective there.

Â