0:00

Now, we have tackled the problem of learning a model's structure or parameters from complete data. We're now going to move to what turns out to be a much harder situation, where we're trying to learn when we have only partially observed data. This arises in a variety of settings. It arises in scenarios where some variables are just never observed: they're hidden, or latent. It also occurs where some variables are just missing, because some measurements weren't taken. It turns out, as we'll see, that these settings present significant challenges, both in terms of the foundations, defining the learning task in a reasonable way, and from a computational perspective, where the computational issues that arise in this incomplete data setting are considerably more challenging.

I mentioned latent variables. Let's try to argue why we might care about latent variables.

So one reason is that latent variables can often give rise to sparser, and therefore easier to learn, models. So let's imagine that this is my true network, G star, where we have three variables leading into this variable H, and then the three variables at the bottom. If all variables are binary, then this is a network that can be parameterized with seventeen independent parameters. But now let's imagine that I've decided that H is latent, and I'm just going to learn a network over the observable variables, which are the x's and the y's. And so what is the network that correctly captures the structure of the distribution P over x1, x2, x3, y1, y2, and y3? It turns out that this network, if you think about it, has, first, because H is not there, an edge from every x to every y. And furthermore, because the y's are no longer conditionally independent given the x's (they are only conditionally independent given the H that I don't observe), I also have edges between the y's directly. So the spaghetti actually turns out to look like this, with a total of 59 parameters in the network.
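As a sanity check on those two counts, here is a small sketch; the two structures are the ones just described, all variables are binary, and the counting rule (2^k free parameters for a variable with k binary parents) is the standard one for discrete CPDs:

```python
# Independent-parameter counts for the two discrete networks above,
# assuming all variables are binary: a variable with k binary parents
# contributes 2**k free parameters (one per parent assignment).
def num_params(parents_of):
    return sum(2 ** len(ps) for ps in parents_of.values())

# True network G*: x1, x2, x3 -> H -> y1, y2, y3.
g_star = {'x1': [], 'x2': [], 'x3': [],
          'h': ['x1', 'x2', 'x3'],
          'y1': ['h'], 'y2': ['h'], 'y3': ['h']}

# Network over the observables once H is dropped: an edge from every
# x to every y, plus edges between the y's.
g_marginal = {'x1': [], 'x2': [], 'x3': [],
              'y1': ['x1', 'x2', 'x3'],
              'y2': ['x1', 'x2', 'x3', 'y1'],
              'y3': ['x1', 'x2', 'x3', 'y1', 'y2']}

print(num_params(g_star))      # 17
print(num_params(g_marginal))  # 59
```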

So by dropping this one latent variable, I've created a model that is much harder to learn. Now of course, learning a model with latent variables is by itself a problematic situation, but it may well be worth the tradeoff.

The other reason why we might care about learning latent variables is because they might be interesting. They might provide us with an interesting characterization of structure in the data. I'll give you details of that in a later module, but for the moment, just as a teaser, imagine that we have a data set corresponding to 3D point clouds that are scans of a human body, and we would like to discover from that the limb structure of the person to which the scans correspond. That is, we want to identify clusters in the data, clusters in the point cloud, that correspond to body parts. And so we want to basically end up with an output where each point has a latent variable representing which body part it belongs to.

3:52

So, having motivated why we might care about missing data, let's think about some of the complexities that arise. So, let's imagine that somebody gives us this sequence over here and says, here are some question marks that correspond to missing data. How do we treat this? And the answer is, if you don't know why these data are missing, you have no idea how to proceed. And so, to understand, let's consider two different scenarios.

The first one is that an experimenter is asked to toss a coin, and occasionally the coin misses the table and drops on the floor, and the experimenter is too tired to crawl under the table to see what happened, so they don't record the value of the coin in the cases where it fell on the floor. Case two is that the coin is tossed, but the experimenter doesn't like tails; for some reason, tails give them the heebie-jeebies, and so tails are sometimes not reported.

Note that these two cases really should give rise to very different estimation procedures if we are trying to learn from this data set. Specifically, in the first case, we should probably just ignore the question marks and just learn from the sequence of observed instances, H, T, H, H, because for the other ones, the fact that they are missing doesn't tell us anything about the coin. In case two, on the other hand, we can't really ignore the missing measurements. We need to learn from the sequence H, T, T, T, H, T, H, because ignoring the missing values is effectively ignoring something that is predominantly or entirely tails, and so we would get incorrect estimates if we just ignored them.
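The effect of ignoring the question marks in the two scenarios can be sketched with a few lines of arithmetic; the coin bias of 0.6 and the drop probability of 0.5 are illustrative assumptions, not values from the lecture:

```python
# Expected value of the naive "ignore the question marks" estimator
# of P(heads) in the two scenarios. All numbers are illustrative.
p_heads = 0.6   # true coin bias
drop = 0.5      # chance that an affected toss goes unrecorded

# Case 1: tosses are lost independently of the outcome, so the
# recorded tosses are an unbiased sample of the coin.
case1_estimate = p_heads

# Case 2: only tails go unrecorded, so heads are over-represented
# among the recorded tosses and the naive estimate is biased upward.
p_recorded_heads = p_heads
p_recorded_tails = (1 - p_heads) * (1 - drop)
case2_estimate = p_recorded_heads / (p_recorded_heads + p_recorded_tails)
# case2_estimate is about 0.75, well above the true 0.6
```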

6:25

And zero otherwise. So we always know whether we observed the variable or not, and so Oi is always observed. And now we're going to add a new set of random variables, which are also always observed. These are the variables that we're going to call Yi, each of which has the same value space as Xi, except that there is also an extra "I didn't get to observe it" value. And so in the real scenario, we basically get to observe the Y's, we get to observe the O's, and the X's are not observed. Now, the Y's are a deterministic function of the X's and the O's: Yi is equal to Xi when Oi is one, that is, when Xi is observed, and Yi takes the "unobserved" value when Oi is zero. So in the cases where Oi is one, I can reconstruct the value of Xi, but in the cases where I don't have the observation, I can't. This is just a way of defining the observability pattern that I have.

With this set of variables, I can now model the two different scenarios that we had before. In this scenario, which corresponds to the coin falling on the ground every once in a while, we have a separate model over here that represents our observability pattern. We see that a variable is sometimes observed by chance, and that the observed value Y depends on X and on O, but there is no interaction between the value of the coin and whether I end up observing it or not. By comparison, in the case where the experimenter doesn't like tails, we see that the true value of the variable X affects whether it's observed or not, and so we have an edge from X to O.

So in which of these cases can we ignore the missing data mechanism and focus only on the likelihood of the stuff that I get to observe? The answer is that one can define a notion called missing at random. Missing at random is the way for me to say: I can ignore the mechanism for the observability and focus only on this piece over here. One can show that it suffices, for focusing only on the likelihood, that the distribution over X, Y, and O has the following characteristic: the observation variables O are independent of the unobserved X's,

9:47

which we're going to denote H, given the observed values Y, which are my data instances. That means that if you tell me the values that you observe, then the fact that something may or may not have been observed doesn't carry any additional information.

This is a little bit of a tricky notion, so let's try and give an example. Imagine that a patient comes into the doctor's office, and the doctor chooses what set of tests to perform; for example, the doctor chooses to perform, or not perform, say, a chest X-ray. The fact that the doctor didn't choose to perform a chest X-ray is probably because the person didn't come in with a deep cough or some other symptoms that suggested tuberculosis or pneumonia, and therefore the test wasn't performed. So the observation, or lack thereof, of a chest X-ray, the fact that a chest X-ray doesn't exist in my patient record, is probably an indication that the patient didn't have tuberculosis or pneumonia. So these are not independent, and in that model the missing at random assumption does not hold, because the observability pattern tells me something about the disease, which is the unobserved variable that I care about.

On the other hand, if I have in my medical record things like the primary complaint that the patient came in with, for example, a broken leg, then, given that the primary complaint was a broken leg, I already know that the patient likely didn't have tuberculosis or pneumonia. And therefore, given that observed feature, the observed variable which is the primary complaint, the observability pattern no longer gives me any information about the variables that I didn't observe. And so that is the difference between a scenario that is missing at random and a scenario that isn't missing at random. For the purposes of our discussion, we're going to make the missing at random assumption from here on.
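As a toy numerical check of this condition in the two coin scenarios, we can compare P(X | O = 0), the distribution of the coin given that the toss went unrecorded, with the prior P(X); the probabilities are illustrative assumptions:

```python
# P(X | O=0) versus the prior P(X) in the two coin scenarios.
p_x = {'H': 0.6, 'T': 0.4}   # illustrative coin bias

def p_x_given_unrecorded(p_o0_given_x):
    # Bayes' rule from the prior p_x and P(O=0 | X).
    joint = {x: p_x[x] * p_o0_given_x[x] for x in p_x}
    z = sum(joint.values())
    return {x: joint[x] / z for x in joint}

# Case 1: the coin falls on the floor regardless of its value.
case1 = p_x_given_unrecorded({'H': 0.2, 'T': 0.2})
# Case 2: only tails sometimes go unrecorded.
case2 = p_x_given_unrecorded({'H': 0.0, 'T': 0.5})

# In case 1 the posterior equals the prior: the missingness carries
# no information about X, and the mechanism can be ignored. In
# case 2 an unrecorded toss is certainly tails, so it cannot be.
```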

What's the next complication with the case of incomplete data? It turns out that the likelihood can have multiple global maxima. Intuitively, that's almost obvious. Because if you have a hidden variable that has two values, zero and one, the values zero and one don't mean anything. We could rename them one and zero and just invert everything, and it would basically give us an exactly equivalent model to the one with zero and one, because the names don't mean anything. And so that immediately means that I have a reflection of my likelihood function that occurs when I rename the values. And it turns out that this is not something that happens just in this case: when we have multiple hidden variables, the problem only becomes worse, because the number of global maxima becomes exponentially large in the number of hidden variables. And so now we have a function with exponentially many reflections of itself. It turns out that this can also occur when you have missing data, not just with hidden variables. So even if all I have are data where only some occurrences of a variable are missing their values, even that can give me multiple local and global maxima.
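A minimal sketch of this label-switching symmetry, for a single hidden binary variable H with one observed child X; the parameter values are illustrative:

```python
# Swapping the labels of a hidden binary H gives a different
# parameter vector that assigns the same probability to every
# observation, so the likelihood has (at least) two global maxima.
def p_x1(p_h0, q0, q1):
    # P(X=1) = P(H=0)*P(X=1|H=0) + P(H=1)*P(X=1|H=1)
    return p_h0 * q0 + (1 - p_h0) * q1

theta = (0.3, 0.9, 0.2)          # one parameterization
theta_swapped = (0.7, 0.2, 0.9)  # the same model with H's labels swapped

# Both give the same observed marginal P(X=1), hence the same
# likelihood for any data set over X alone.
```

With n hidden binary variables, each can be relabeled independently, which is where the exponentially many equivalent maxima come from.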

So to understand that in a little more depth, let's go back to the comparison between the likelihood in the complete data case and the likelihood in the incomplete data case. So here is a simple model where I have two variables, X and Y, with X being a parent of Y, and I have three instances. If we just go ahead and write down the complete data likelihood, it turns out to have the following beautiful form, which we've already seen before: we have the product of the probabilities of the three instances (we've omitted writing the parameters for clarity), and that's going to be equal to theta x0 times theta y0 given x0 for the first instance, times the corresponding terms for the second and the third instances. And the point is that this ends up being a nice, decomposable function of the parameters, in terms of a product, which, if we take the log, ends up being a sum. The likelihood decomposes: it decomposes by variables, and it decomposes within the CPDs. What about the incomplete data case?

Let's make our life a little bit more complicated: whereas before we had these complete instances, now notice that two of these instances have an incomplete observation of the variable X. And now let's write down the likelihood function in this case. Well, the likelihood function is now the probability of y0, which is the first data instance, times the probability of (x0, y1), which is the second data instance, times another probability of y0. So since P(y0) appears twice, we've squared this term over here. And the probability of y0 is the sum over x of the probability of (x, y0): we have to consider both possible ways of completing the data, for the different values of x, namely x0 and x1. And so if we unravel the expression inside the parentheses, it ends up looking like this: theta x0 times theta y0 given x0, plus theta x1 times theta y0 given x1. And the important observation about this expression is that it is not a product of parameters of the model, which means we cannot take its log and have it decompose over the parameters, because the log of a summation doesn't decompose. And so that means that our nice decomposition properties of the likelihood function have disappeared in the case of incomplete data.
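Here is a small sketch of that computation for the X -> Y network with the three instances just described; the parameter values are illustrative assumptions, and the missing X's are handled by summing over both completions:

```python
# Likelihood of the instances (x=?, y0), (x0, y1), (x=?, y0) for the
# network X -> Y. Parameter values are illustrative.
theta_x = {'x0': 0.7, 'x1': 0.3}                          # P(X)
theta_y = {('x0', 'y0'): 0.8, ('x0', 'y1'): 0.2,
           ('x1', 'y0'): 0.4, ('x1', 'y1'): 0.6}          # P(Y | X)

def p_xy(x, y):
    # A complete instance contributes a plain product of parameters.
    return theta_x[x] * theta_y[(x, y)]

def p_y(y):
    # A missing X forces a sum over both completions: already a tiny
    # sum-product (inference) computation, and the log of this sum
    # does not decompose over the parameters.
    return sum(p_xy(x, y) for x in theta_x)

likelihood = p_y('y0') * p_xy('x0', 'y1') * p_y('y0')
```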

It does not decompose by variables: notice that we have a theta for the X variable sitting in the same expression as an entry from the P(Y | X) CPD. It does not decompose within CPDs. And even computing this likelihood function actually requires that we do a sum-product computation, so it effectively requires a form of probabilistic inference. So what does that imply? Both of these

properties that we talked about in the previous slides, what do they imply about the likelihood function? Before, our likelihood function had the form of these gray lines over here; so, for example, like this. This is a likelihood function in a complete data scenario. When we have a case of incomplete data, we're effectively summing up the probability of all possible completions of the unobserved variables, and so the overall likelihood function ends up being a summation of likelihood functions that correspond to the different ways that I could complete the data, with this being one such summand. So the likelihood function ends up being a sum of these nice likelihood functions, each log-concave, but the point is, when you add them all up, it doesn't look so nice at all. It ends up having multiple modes, and it's very much harder to deal with.

The second problem that we have, in addition to multimodality, is the fact that the parameters start being correlated with each other. So if you remember, in the case of complete data, we had the likelihood function decomposing as a product of little likelihoods for the different parameters. What happens when we have an incomplete data scenario? When you look at this, you can see, for example, what happens when X is not observed. So, when X is not observed,

18:54

you have an active V-structure that goes from theta Y given X, through Y, all the way to theta X. And so, intuitively, that suggests to us that there is a correlation, an interaction, between the values that I choose for theta Y given X and for theta X. And when you think about the intuition for that, it makes perfect sense. Because, for example, if theta X chooses to make x0 very, very likely, then most of the instances where X is unobserved will be assigned to the x0 case. And that is going to change the estimates of theta Y given x0 relative to theta Y given x1, because most of the instances now correspond to x0 rather than to x1.
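A quick numerical sketch of that coupling: the posterior probability that a missing X takes the value x0 (the soft assignment used when estimating theta Y given X), computed under two different choices of theta X. The CPD values are illustrative, and the instance is assumed to have the observed value Y = y0:

```python
# How the choice of theta_X shifts the soft completion of an instance
# whose X is missing. CPD values are illustrative; the instance has
# the observed value Y = y0.
p_y0_given_x = {'x0': 0.8, 'x1': 0.4}   # P(Y=y0 | X)

def posterior_x0(theta_x0):
    # P(X=x0 | Y=y0) by Bayes' rule.
    num = theta_x0 * p_y0_given_x['x0']
    den = num + (1 - theta_x0) * p_y0_given_x['x1']
    return num / den

low = posterior_x0(0.2)    # x0 unlikely a priori: about 0.33
high = posterior_x0(0.9)   # x0 very likely a priori: about 0.95
# Changing theta_X alone changes how the incomplete instances are
# (softly) assigned to x0 versus x1, and hence the estimates of
# theta Y given x0 and theta Y given x1.
```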

And that's the correlation between them. To see that correlation manifesting, let's look at some graphs. What we're seeing here is the correlation between two entries in the CPD theta Y given X: over here we see theta Y given x0, and here is theta Y given x1. What you see here is the contour plot of the likelihood function. This one has eight data points and zero missing measurements, and you can see that this is a nice product likelihood function, with a nice peak in the middle and no interaction between the two parameters. But as we start to gain more and more missing measurements, you can see that the contour plot starts to deform. And even with three missing measurements, you can see that there is significant interaction between the value that I end up picking for theta Y given x1 and the value that I end up with for theta Y given x0.

So, to summarize: incomplete data is something that arises very often in practice, and it raises multiple challenges and issues. How the missing values were generated, what makes them missing, turns out to be very important. There are certain components of the model that are just unidentifiable, because there are several equally good solutions; so if you pick the best one, you'd better realize that there are others out there that are equally good. And finally, the complexity of the likelihood function is another significant complication when trying to deal with incomplete data.