0:00

Perhaps the most common use of the expectation maximization algorithm is the problem of learning with latent variables. These are situations where there is a certain set of variables that we never get to observe, but that turn out to be important for capturing some latent structure in the distribution of the data instances that we have.

So let's look at a number of these examples and talk about how EM is applied in those settings. Perhaps the simplest example is in the context of Bayesian clustering, where we have some kind of latent class variable and some set of observed features, and we're trying to segment the instances that we have into categories, where we assume that for each category there is a similar distribution over the observed features. Specifically, in many of these applications, that distribution takes the form of a naive Bayes model; that is, the features are independent once we're given the class. So one example application of this is segmenting a set of users into similarity groups.
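As a concrete, heavily simplified sketch of what EM does in this Bayesian clustering setting, here is a small mixture-of-Bernoullis EM loop. Everything in it is an invented illustration, not the actual system discussed below: the toy click matrix, the two-cluster choice, and the deterministic initialization on the first feature (used just to break symmetry).

```python
def em_bernoulli_mixture(data, n_classes, n_iters=30):
    """EM for a naive Bayes model with a latent class and binary features."""
    n, d = len(data), len(data[0])
    # Deterministic seeding (a simplification, and it assumes
    # n_classes == 2): soft-assign each instance by its first feature.
    resp = [[0.8, 0.2] if row[0] else [0.2, 0.8] for row in data]
    prior = theta = None
    for _ in range(n_iters):
        # M-step: re-estimate class priors and per-class feature
        # probabilities from the soft assignments (with smoothing).
        weight = [sum(resp[i][c] for i in range(n)) for c in range(n_classes)]
        prior = [w / n for w in weight]
        theta = [[(sum(resp[i][c] * data[i][j] for i in range(n)) + 1.0)
                  / (weight[c] + 2.0) for j in range(d)]
                 for c in range(n_classes)]
        # E-step: posterior over the latent class for each instance,
        # using the naive Bayes independence of features given the class.
        for i in range(n):
            w = []
            for c in range(n_classes):
                p = prior[c]
                for j in range(d):
                    p *= theta[c][j] if data[i][j] else 1.0 - theta[c][j]
                w.append(p)
            z = sum(w)
            resp[i] = [x / z for x in w]
    return resp, theta

# Toy "did the user click this story?" matrix: three tech-leaning
# users (features 0-1) and three sports-leaning users (features 2-3).
clicks = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
          [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
resp, theta = em_bernoulli_mixture(clicks, n_classes=2)
```

On this cleanly separable toy data, the soft assignments sharpen quickly and the two groups of users end up in different clusters.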

One cute example of that is the following segmentation of users on the MSNBC website, based on which of the stories on the MSNBC site the users chose to click. This is due to Jack Breese and colleagues at Microsoft Research. Here we're going to see four clusters. The first of these is a cluster that a human gave a name to, because the machine doesn't give a name to the cluster; it just calls it cluster one. It turns out to be readers of commerce and technology stories. These are people for whom the highest-probability clicks were on stories such as "E-mail delivery isn't exactly guaranteed" or "Should you buy a DVD player?" So remember, the stories are the features, and they are binary variables: did you or did you not click on that particular article. And the clusters are the segmentation of the users into these different categories; so this is category one.

The second cluster are people who primarily clicked on sports stories. The highest-probability features for this user group were stories such as "Cowboys are reborn in win over Eagles," or something like that. The third cluster are the people who went to the top page and basically clicked on the top stories; this is 29% of the users. One can date this study by at least this article, though this one seems pretty perennial. So that's the third category.

The last one was the surprising one. These are people who traversed all over the site. For the previous three categories, there were actual pages that existed for sports, for top promoted stories, and for commerce and technology. This last category are people who went all over the site to look for stories that you might call softer news. So this was an interesting discovery for the MSNBC website about its user population, and it suggested that the website might be better redesigned to capture this kind of structure and give these users a page where they can find these articles. A very different application is to speech

recognition, which is one that we've talked about before. In the context of speech recognition, we've already discussed that the standard representation of choice is a hidden Markov model, and in a hidden Markov model, we have hidden variables and we have parameters. The parameters are the parameters of the HMM: the transition probabilities, for example between one phone and the next, and the probabilities over the acoustic signal that we observe for different pieces, or states, within a phone. The latent variables are the segmentation of the initial acoustic signal into the states it belongs to; that is, which phone, and which part of which phone. Those are the hidden variables, and that's an awful lot of hidden variables, because if we are just given an undifferentiated, unsegmented signal, it's very difficult to assign a small slice of that speech to one phone versus another. So in order to train a speech recognition HMM well, the standard strategy is a bottom-up bootstrap training, where one first trains the phone HMM separately for each phone using a phone database.
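The per-frame segmentation is exactly the kind of latent quantity the E step of EM (Baum-Welch, in the HMM setting) computes. As a hedged sketch, here is the forward-backward computation of per-frame state posteriors for a tiny discrete-observation HMM; all of the numbers are made-up illustrative parameters, not a real acoustic model.

```python
def forward_backward(obs, pi, A, B):
    """Return per-time posteriors gamma[t][s] = P(state_t = s | obs)."""
    n_states, T = len(pi), len(obs)

    # Forward pass: alpha[t][s] = P(obs[0..t], state_t = s).
    alpha = [[0.0] * n_states for _ in range(T)]
    for s in range(n_states):
        alpha[0][s] = pi[s] * B[s][obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            alpha[t][s] = B[s][obs[t]] * sum(
                alpha[t - 1][r] * A[r][s] for r in range(n_states))

    # Backward pass: beta[t][s] = P(obs[t+1..] | state_t = s).
    beta = [[1.0] * n_states for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for s in range(n_states):
            beta[t][s] = sum(A[s][r] * B[r][obs[t + 1]] * beta[t + 1][r]
                             for r in range(n_states))

    # Combine and normalize into posteriors over the hidden state.
    gamma = []
    for t in range(T):
        w = [alpha[t][s] * beta[t][s] for s in range(n_states)]
        z = sum(w)
        gamma.append([x / z for x in w])
    return gamma

# Two sub-phone states, two discrete acoustic symbols (toy example).
pi = [0.9, 0.1]
A = [[0.7, 0.3], [0.1, 0.9]]   # roughly left-to-right transitions
B = [[0.8, 0.2], [0.2, 0.8]]   # emission probabilities
gamma = forward_backward([0, 0, 1, 1], pi, A, B)
# Early frames are attributed mostly to state 0, late frames to state 1.
```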

Â 5:08

Now this too is done using EM, because the breakdown of a phone into these little constituent pieces is not labeled. So it still requires EM to address the issue of latent variables, but at least it's now self-contained, because you're only training a model for a single phone, like "puh". With that model trained, one can now take entire words, initialize with the models trained on individual phones, and train the higher-level model. One still retrains the phone HMM parameters in the context of this larger training regime, where one trains on entire words, but the fact that one seeded the model with this much more correct initialization of the parameters allows the segmentation in the E step to be performed moderately correctly, and gives rise to a much better local optimum in the speech recognition problem. A yet different application is that of 3D

robot mapping. This is an application that's due to Thrun et al. Here the input to the problem is a sequence of point clouds, obtained by a laser range finder, that were collected by a moving robot, and the target output is a 3D map of the environment as a set of planes; you will see why we want the planar map when we look at the demo. Here the parameters of the model are the locations and angles of the walls, or planes, in the environment, so we have no idea a priori where the walls are and how they are situated. The latent variables are association variables, which assign points to walls. And so the EM algorithm effectively, in the E step, figures out which points go with which wall, and in the M step, it figures out how to move the wall around to better fit the points that were assigned to it.
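This alternation can be sketched in a deliberately simplified (hard-assignment) form. In the sketch below, "walls" are modeled as vertical lines x = c, with a location parameter only and no orientation, whereas the real system fits full 3D planes; the points are invented toy laser returns.

```python
def fit_walls(points, walls, n_iters=20):
    """Hard EM: alternate point-to-wall assignment and wall refitting."""
    assign = []
    for _ in range(n_iters):
        # E-step: assign each point to its nearest wall.
        assign = [min(range(len(walls)), key=lambda k: abs(p[0] - walls[k]))
                  for p in points]
        # M-step: move each wall to the mean x of its assigned points.
        for k in range(len(walls)):
            xs = [p[0] for p, a in zip(points, assign) if a == k]
            if xs:
                walls[k] = sum(xs) / len(xs)
    return walls, assign

# Noisy laser returns from two walls near x = 0 and x = 5.
points = [(0.1, 0.0), (-0.2, 1.0), (0.05, 2.0),
          (5.1, 0.0), (4.9, 1.0), (5.2, 2.0)]
walls, assign = fit_walls(points, walls=[1.0, 4.0])
```

After a few iterations the two wall estimates settle near x = 0 and x = 5, with each point associated to the wall it came from.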

So what we see here is the raw data collected by the robot. The red box is the robot moving around in its environment. The red beams that emanate from the robot are the directions that the laser took in order to collect the point cloud, and what we see here is the point cloud that was collected by the robot as it traversed the environment. Even just looking at this image, we can already see that there is a lot of noise in the laser range data, and that is going to give rise, as we can see, to a very noisy map of the environment. If we now look at the model constructed by EM: so now we are going to see the planar map constructed by the robot, and this is done on the fly, actually, as the robot is moving. We can see that walls are constructed when there is enough data to support them; that is, when enough points are assigned to a wall, that wall gets constructed, and its pose in the environment gets determined by the EM algorithm. And so we can see that EM is able to construct a much more plausible and realistic map of the environment than just looking at the raw point cloud data.

A different application is also in the context of 3D laser range scans. And, you know, we pick these not because they are the most common applications of EM, but because they give rise to some pretty cool movies. So here is the problem of getting, in this case, 3D range scans of a person in different poses, and the goal is to see whether, by collecting a bunch of these poses, one can reconstruct a 3D skeleton of the person, seen from the front and from the back. In this particular case, the first problem is actually to correspond points in the different scans to each other, and that was done by a different algorithm that I'm not going to talk about now, although it also used graphical models; in fact, it used a belief propagation algorithm.

But now let's talk about the clustering problem, that is, the problem of assigning points in each of those scans to body parts. So here we have the notion of a cluster, which in this case corresponds to a part. Each part, in a given scan of the person, has a transformation; that's why we have a plate around each of the parts, because for each of those parts there are multiple instantiations, across the multiple scans that we have of the same person in different poses. Now, a landmark is a particular point on that person, in that part. And if we knew the part, that is, if we had the part assignment, which is a latent, unobserved variable, then we could predict how the point that we see in this scan would translate to the point in that scan, because that would be effectively a deterministic, or close to deterministic, function of the unobserved transformation of the part; that is, of the fact that the arm moved from this pose over here to this raised-arm pose over here. So given the transformation, and given that we know which part the point, or landmark, is assigned to, we can predict how the point transformed from its original position to its new position.

So this is a way of clustering points into parts, and one can run EM on that. If one does, one effectively gets pretty much garbage, because it turns out that there is enough muscle deformation to make the actual positions here fairly noisy, and it's very difficult to get correct part assignments from that. But now we add an additional component into this model, specifically continuity of space. We can consider, say, two points that are adjacent to each other, and impose the soft constraint, not a hard constraint but a soft constraint, that the part assignments of two adjacent points should be softly the same. It doesn't have to be exactly the same, because otherwise, of course, everything would be assigned to the same part; rather, it's a soft constraint, which is actually an MRF constraint. In fact, it's an MRF constraint that is associative, or regular, as we discussed before. And so this model, which now does the clustering not of each point in isolation, but rather assigns points jointly to parts, taking into consideration the geometry of the person's body, gives rise to considerably better outputs. What we see here is the algorithm in action, and we can see that the algorithm converges very nicely to a partition of the points into different parts, and from this one can easily reconstruct the skeleton.

A final application that uses EM in a different model is one that was used in this helicopter demo: trajectory alignment. This is work that was done at Stanford by Andrew Ng's group. Here the input to the algorithm is basically different trajectories of the same aerobatic maneuver, flown by different pilots. They were all trying to do the same thing, but each person has their own idiosyncratic way of doing it, and so the exact sequences weren't, as we'll see, exactly the same. The goal was to produce an output which aligns the trajectories to each other and, at the same time, learns a probabilistic model of what the target, or template, trajectory ought to have been.

Â 14:25

So how does this get represented as a graphical model? Here we have a latent trajectory, which is the intended trajectory that the pilots were aiming for; the states, or nodes, are labeled as Z's, and we have parameters that tell us how one would move from Z0 to Z1 and so on. Those are the model parameters of this Markov model, although it's a hidden Markov model, as we'll see. The expert demonstrations, that is, what we saw in terms of what the expert did at each point in time, are noisy observations of the intended trajectory. But we don't actually know which point in the trajectory each expert observation comes from, and so we have this additional set of latent variables, which are the time indices. They basically tell us, via tau zero, which of the true states in the intended trajectory the expert was at at time zero; tau one tells us which point in the intended trajectory the expert was at at time one, and similarly for time two, and so on. Here, k indexes the expert demonstration.
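In this model, the time correspondences tau are inferred probabilistically inside EM. As a much-simplified, non-probabilistic analogue of that alignment step, classic dynamic time warping also lines up a demonstrated sequence against a template by letting either sequence dwell; the sequences below are invented toy data.

```python
def dtw(a, b):
    """Minimum total alignment cost between sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Match, or let one sequence dwell (pilot flying slow/fast).
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

template = [0.0, 1.0, 2.0, 1.0, 0.0]
slow_pilot = [0.0, 0.0, 1.0, 1.0, 2.0, 1.0, 0.0]  # lingers early on
# After warping, the slow demonstration matches the template exactly.
```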

Â 17:32

So one important issue regarding latent variables is the number of values of the latent variables, and this is something that has been implicit in many of the applications that I've mentioned in this module. How many user clusters do we have? Or how many different pieces of a phone are there? So how do we pick that cardinality for the latent variable? If we use likelihood to evaluate a model with fewer values for the latent variable versus one with more, then, in the same way that likelihood always overfits, we will find that more values is always better, because it gives a strictly more expressive model that can therefore achieve a higher likelihood.
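One standard remedy, which the lecture turns to next, is a complexity-penalized score such as BIC, which subtracts (number of parameters / 2) times log(number of samples) from the log-likelihood, so a richer model must buy its extra parameters with a real likelihood gain. The formula is standard; the likelihood values and parameter counts below are purely hypothetical numbers chosen for illustration.

```python
import math

def bic_score(log_likelihood, num_params, num_samples):
    """BIC = log-likelihood minus a complexity penalty."""
    return log_likelihood - 0.5 * num_params * math.log(num_samples)

n = 1000
# A model with more latent-variable values fits slightly better...
loglik_small, params_small = -5000.0, 10
loglik_big, params_big = -4995.0, 50
# ...so raw likelihood prefers it, but BIC's penalty outweighs the
# small likelihood gain and prefers the smaller cardinality.
```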

Now, one can circumvent that in the same way that we circumvented the likelihood-overfitting phenomenon in the context of other structure learning algorithms. We can use, for example, a score that penalizes complexity, such as the BIC score. As we mentioned back then, the BIC score tends to underfit, and so there has been a range of extensions to the Bayesian score for the context of incomplete data. These are always approximations, because we can't actually do the integration required for the Bayesian score with incomplete data, and so we have approximations to the Bayesian score that turn out to work much better than BIC in many practical applications. There are also other strategies that people

have adopted. For example, we can use metrics of cluster coherence: we look at a cluster and say, "Ooh, this cluster seems really incoherent; maybe there are really two clusters in there, so maybe we should split it up." Or, "These two clusters are very similar to each other; maybe we should merge them." And there are various heuristic exploration algorithms that basically split and merge clusters in a way that hopefully converges to the right number of clusters. Finally, of great popularity in the last few years is a range of methods based on Bayesian techniques. The most commonly used class of these methods is called Dirichlet processes. Instead of picking a single cardinality for the latent variable, these methods maintain a distribution over the cardinality, and then, instead of picking a single value for that cardinality, they average over the different cardinalities, often using techniques such as Markov chain Monte Carlo, although there are also other approaches. The problem of picking the latent variable cardinality often turns out to be one of the trickier issues that one has to deal with when learning with latent variables.

To summarize: latent variables are a really common scenario in the context of learning graphical models, and perhaps the most common setting for incomplete data. It turns out that when we want to construct models for richly structured domains, the introduction of these latent variables can be critical to the success of the model. We've already mentioned that latent variables satisfy the missing-at-random assumption, and so the expectation maximization algorithm is applicable in this case, and is very commonly used for optimizing latent variable models. However, as we've also discussed previously, when learning with latent variables we have serious problems both with identifiability of the learned model and with multiple optima. These optima are often symmetric reflections of each other, which is one facet of unidentifiability, but there are also optima that are not just symmetric permutations of each other, and those really necessitate a good initialization strategy. And we mentioned that picking the cardinality of the latent variables is a key question in these latent variable models, and one to which a lot of work has been devoted, especially in the last few years.
