We've previously talked about maximum likelihood estimation for Markov random fields. When we first defined Markov random fields, we also defined an extension of MRFs called conditional random fields, or CRFs. So now let's think about how the ideas that we developed in the context of MRFs can be utilized for maximum likelihood estimation of a conditional random field.

As a reminder, the point of the CRF was to compute the probability of a particular set of target variables Y given a set of observed variables X. The idea behind this was that although the unnormalized density, P tilde of X, Y, is defined in exactly the same kind of parameterization as in the MRF case, whether as a product of factors or as a log-linear model, the difference here is that we have an X-specific partition function that guarantees that what we have is a normalized distribution over the Y variables given a particular X. This partition function is the sum over all Ys of the unnormalized measure P tilde theta of X, Y, and so it normalizes only over the Ys, not over the Xs.

When learning CRFs, our data is now a set

of pairs: a set of input variables, or observed features, X, and a set of target variables, Y. In the training data, both of these are observed, and we'd like to use this to learn the conditional distribution of Y given X.
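To make the setup concrete, here is a minimal sketch of a log-linear CRF over a handful of binary variables, with the X-specific partition function computed by brute-force enumeration of Y; the two feature functions are illustrative, not the lecture's:

```python
import itertools
import math

# Illustrative feature functions for a log-linear CRF P(Y | X):
# each takes the full assignment (x, y) and returns a number.
features = [
    lambda x, y: sum(xi * yi for xi, yi in zip(x, y)),              # x-y agreement
    lambda x, y: sum(y[i] == y[i + 1] for i in range(len(y) - 1)),  # smoothness on Y
]
theta = [0.5, 1.0]  # one parameter per feature

def unnormalized(x, y, theta):
    """p_tilde_theta(x, y) = exp(sum_i theta_i * f_i(x, y))."""
    return math.exp(sum(t * f(x, y) for t, f in zip(theta, features)))

def partition(x, theta, n_y):
    """Z_theta(x): sum over all Y assignments only -- x stays fixed."""
    return sum(unnormalized(x, y, theta)
               for y in itertools.product([0, 1], repeat=n_y))

def conditional_prob(x, y, theta):
    """P_theta(y | x) = p_tilde_theta(x, y) / Z_theta(x)."""
    return unnormalized(x, y, theta) / partition(x, theta, len(y))

x = (1, 0, 1)
probs = [conditional_prob(x, y, theta)
         for y in itertools.product([0, 1], repeat=3)]
print(sum(probs))  # ~1.0: normalized over Y for this fixed x
```

Note that the partition function depends on the particular x: a different observation gives a different normalizer, which is exactly what distinguishes this from the MRF partition function.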

So the right objective function that we're going to apply in this context is what's often called the conditional log-likelihood, but it really should be called the log conditional likelihood, because the conditioning is inside the log. What we have in terms of this objective is the sum over all data instances m of the log of the conditional probability of Y given X.

So if we look at the conditional log

likelihood and basically open it up, we end up with an expression that looks very similar to what we saw in the case of MRFs. For the case of a log-linear model, we have a sum over i of the parameter theta i multiplied by the value of the feature f i, for a particular instance: theta i times f i of x[m], y[m]. And then we have a negative term, minus the log of the partition function for x[m], and this uses the exact same derivation that we had for the case of an MRF. So now we can sum up over multiple data instances m and compute the derivative, relative to a particular parameter theta i, of one over M of the log conditional likelihood.
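Written out, the objective and its gradient, as just derived, are:

```latex
% Log conditional likelihood for a log-linear CRF, and its gradient:
\ell_{Y \mid X}(\theta : \mathcal{D})
  = \sum_m \Big[ \sum_i \theta_i\, f_i(x[m], y[m]) \;-\; \log Z_{x[m]}(\theta) \Big]

\frac{\partial}{\partial \theta_i} \, \frac{1}{M}\, \ell_{Y \mid X}(\theta : \mathcal{D})
  = \frac{1}{M} \sum_m \Big( f_i(x[m], y[m]) \;-\; \mathbf{E}_{\theta}\big[ f_i(x[m], Y) \big] \Big)
```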

What we end up with is then, again, an average, in this case of a difference of two expectations. We have a summation over m: the first term is the value of the feature function f i of x[m], y[m], and the second term is the expectation of that feature f i relative to x[m] and Y. Now, it's important to look carefully at the second expression here, E theta of f i of x[m], Y. Note that here x[m] is fixed, and the expectation is taken only over the variables Y.
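A sketch of that gradient for a toy log-linear model, computing the per-instance expectation by enumerating Y while x[m] stays fixed (features and data are made up for illustration):

```python
import itertools
import math

# Toy log-linear CRF: f_i(x, y), with x held fixed inside expectations.
features = [
    lambda x, y: sum(xi * yi for xi, yi in zip(x, y)),
    lambda x, y: sum(y[i] == y[i + 1] for i in range(len(y) - 1)),
]

def cond_dist(x, theta, n_y):
    """Return {y: P_theta(y | x)} by enumerating Y only; x is instantiated."""
    scores = {y: math.exp(sum(t * f(x, y) for t, f in zip(theta, features)))
              for y in itertools.product([0, 1], repeat=n_y)}
    z = sum(scores.values())  # X-specific partition function Z_theta(x)
    return {y: s / z for y, s in scores.items()}

def gradient(theta, data):
    """d/d theta_i of (1/M) * log conditional likelihood:
    average over instances of f_i(x[m], y[m]) - E_theta[f_i(x[m], Y)]."""
    grad = [0.0] * len(theta)
    for x, y_obs in data:
        dist = cond_dist(x, theta, len(y_obs))  # one inference per instance
        for i, f in enumerate(features):
            observed = f(x, y_obs)
            expected = sum(p * f(x, y) for y, p in dist.items())
            grad[i] += (observed - expected) / len(data)
    return grad

data = [((1, 0, 1), (1, 0, 1)), ((0, 1, 1), (0, 1, 1))]
print(gradient([0.0, 0.0], data))  # [1.0, -0.5] for this toy data at theta = 0
```

The key point shows up in `cond_dist`: the distribution, and hence the expectation, is recomputed for each x[m], unlike the MRF case, where one expectation over all variables serves every instance.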

And this is in contrast to the case of MRFs, where we were taking the expectation relative to the entire set of variables.

Let's see how this pans out in the context of a particular example. We're going to look at a really brain-dead model for image segmentation, where we just have two features, f1 and f2. The first feature, f1, is a singleton feature, and it takes the value of the average green value for the pixels in a superpixel s.

So if this is my superpixel s, we average the green values of all of its pixels, but we only put that value into the feature in cases where the superpixel is labeled with G for grass. And notice that this feature is repeated across all superpixels s in the image, so this is a model with shared parameters. The second feature, f2, is a pairwise feature, and here we get a one in cases where two adjacent superpixels s and t are labeled with the same value, say both grass or both cow. This allows the model to learn how likely it is for superpixels that are adjacent to each other to have the same class label.
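As a sketch, with a made-up three-superpixel instance (the ids, adjacency, labels, and greenness values are all illustrative):

```python
# Hypothetical segmentation instance: superpixel labels, adjacency, greenness.
labels = {"s1": "grass", "s2": "grass", "s3": "cow"}
adjacent_pairs = [("s1", "s2"), ("s2", "s3")]
avg_green = {"s1": 0.8, "s2": 0.6, "s3": 0.2}  # G(s): average greenness of s

def f1(labels):
    """Singleton feature, shared across all superpixels s:
    sum over s of 1{Y_s = grass} * G(s)."""
    return sum(avg_green[s] for s, y in labels.items() if y == "grass")

def f2(labels):
    """Pairwise feature, shared across pairs AND across classes:
    counts adjacent superpixels carrying identical labels."""
    return sum(labels[s] == labels[t] for s, t in adjacent_pairs)

print(f1(labels))  # roughly 1.4: G(s1) + G(s2), the two grass superpixels
print(f2(labels))  # only the (s1, s2) pair matches
```

Because each feature is a sum over all superpixels (or pairs), one parameter per feature serves the whole image; that sharing is what the sums in the gradient below range over.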

Notice that this feature, too, is shared, both across pairs of superpixels and also across different classes. That is, we have the same parameter for cow-cow as we do for grass-grass or sky-sky.

So let's plug these two features into the model for the gradient that we had on the previous slide. This is the general equation; let's see what happens for theta 1.

So the partial derivative relative to the parameter theta 1 is, again, the difference of two expectations: an empirical expectation and a model expectation. This is for a particular instance m, so we're not currently averaging across multiple training images; we have a single image m that we're working over. Here, because the model has shared parameters, we're summing over all superpixels s in the image, and the empirical expectation sums up the product of the indicator function for the superpixel s with the average greenness of the superpixel. The model-based expectation has exactly the same form: we're also summing over all superpixels s, and we similarly have a product of two terms. Only here, because in this model-based expectation we don't have the Y labels, we're computing the probability of the Y label taking the value grass. That is, for each superpixel s, we compute the probability that Y s takes the value grass, given what we know about the image, that is, the values x[m], and that gets multiplied with the average greenness again.

So that's the gradient for the first

parameter. The gradient for the second parameter is very similar in spirit. Here, we're looking at the sum over all pairs of adjacent superpixels s and t, and the empirical expectation just sums, over all those pairs, an indicator function of whether Y s is equal to Y t. So we get a one if Y s is the same as Y t, say they're both labeled grass or both labeled cow, and a zero otherwise. And in the model-based expectation term, we have the probability of that event, that is, the probability that Y s is equal to Y t given the image. Once again, that's summed over all pairs of adjacent superpixels s and t. So, again, in both cases a difference of two expectations: an empirical expectation and a model-based expectation.

Taking that, and thinking about this, let's

compare the computational cost of two training regimes: one is MRF training, and the second is CRF training.

So in MRF training, remember, our gradient relative to a parameter theta i was the difference of two expectations. And we've already pointed out that the computation of the second expectation, the model-based expectation, requires that we run inference at each gradient step, and we thought that was pretty expensive, or could be. But now let's see what happens for CRFs. Let's look at that gradient one more time, for the case of CRFs, and notice that here we have a summation over m, that is, a summation over all of the data instances, of a difference of two terms. The first is the feature value in that particular instance, and the second is the expected feature value in that particular instance, x[m]. The important observation to come out of this is that the second expectation, relative to theta, is an expectation that is different for each of our input instances. So that means that we need to rerun this inference for each and every one of our x[m]'s at each gradient step. So whereas MRFs require one inference at each gradient step, this requires M inferences at each gradient step, where M is the number of training instances. And so here, the inference required to do learning is considerably more expensive.

However, one needs to weigh the balance of all of the factors that contribute to computational cost.
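The one-versus-M inference pattern just described can be sketched with a stand-in inference routine:

```python
# Count inference calls per gradient step under each training regime.
def run_inference(model, evidence=None):
    run_inference.calls += 1  # stand-in for an actual inference procedure
run_inference.calls = 0

def mrf_gradient_step(model, data):
    run_inference(model)  # a single inference over Y and X jointly
    # the empirical expectation needs only counts from the data, no inference

def crf_gradient_step(model, data):
    for x, y in data:
        run_inference(model, evidence=x)  # one conditional inference per x[m]

data = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]
run_inference.calls = 0
mrf_gradient_step("mrf", data)
mrf_calls = run_inference.calls
run_inference.calls = 0
crf_gradient_step("crf", data)
crf_calls = run_inference.calls
print(mrf_calls, crf_calls)  # 1 versus M = 3
```

The raw call counts favor the MRF, but as the next paragraphs argue, each CRF call conditions on X and can therefore be far cheaper than one joint MRF inference.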

So, when we're doing inference of the conditional probability P of Y given X, we only need to worry about the Y variables. The Xs are all fixed; their values are instantiated. And so the factors in the resulting model are factors only over the variables Y. If we were to want to use MRF training, because, say, we decided that CRF training is too expensive due to the additional cost of inference, we would need to compute the joint probability P of Y, X, which might be much more complicated. And it might be much more complicated not only because we now have more variables in the model, both the Ys and the Xs, but also because in many cases where we would like to apply a CRF, the distribution that we might have over the Xs becomes quite unmanageable if we want to compute a distribution over it.

So to understand that, let's go back to the previous example and consider this very simple image segmentation model. Notice that the variable X sub s is the average greenness of the superpixel. Now, in the context of CRFs, we don't need to compute a distribution over average greenness, because we're conditioning on X s, and so the greenness of s is now fixed; it's instantiated. And we're only computing the probability distribution over the Ys. But if we wanted to train this model as an MRF, we would need to somehow maintain a distribution over this average greenness. And since that's a continuous quantity, it actually requires that we think about parametric forms. Is it a Gaussian, or a mixture of Gaussians? It becomes quite hairy to manage. And so although CRF-based training in this case might have additional cost, it can also reduce the cost of any particular inference step, as well as avoid a bunch of thorny issues regarding the parameterization of the distribution over various features.

So to summarize: CRF learning, in

terms of the mathematical formulation, is very similar to MRF learning. The likelihood function has the same form, and it's concave. It's similarly optimized using gradient ascent, usually with the same L-BFGS algorithm. However, the gradient computation is now much more expensive, because it requires inference not just once per gradient step, but once per gradient step and data instance, as compared with once per gradient step in the context of MRFs. But, as we just said, the conditional model is often much simpler, so the inference calls for a CRF and for an MRF are not the same, and we're really comparing apples and oranges.

And so, in the context of any given

model, one really needs to think about which of these approaches is more appropriate, not only based on the computational costs of training, but also in terms of the overall performance and generalization.
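Putting the pieces together, here is a self-contained toy training loop that maximizes the log conditional likelihood; plain gradient ascent stands in for L-BFGS, and the enumerable features and data are illustrative:

```python
import itertools
import math

# Toy log-linear CRF trained by maximizing the log conditional likelihood.
features = [
    lambda x, y: sum(xi * yi for xi, yi in zip(x, y)),
    lambda x, y: sum(y[i] == y[i + 1] for i in range(len(y) - 1)),
]

def cond_dist(x, theta, n_y):
    """{y: P_theta(y | x)} via brute-force enumeration of Y."""
    scores = {y: math.exp(sum(t * f(x, y) for t, f in zip(theta, features)))
              for y in itertools.product([0, 1], repeat=n_y)}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def log_cond_likelihood(theta, data):
    return sum(math.log(cond_dist(x, theta, len(y))[y]) for x, y in data)

def gradient(theta, data):
    grad = [0.0] * len(theta)
    for x, y_obs in data:
        dist = cond_dist(x, theta, len(y_obs))  # M inferences per step
        for i, f in enumerate(features):
            grad[i] += f(x, y_obs) - sum(p * f(x, y) for y, p in dist.items())
    return grad

data = [((1, 0, 1), (1, 0, 1)), ((0, 1, 1), (0, 1, 1))]
theta = [0.0, 0.0]
before = log_cond_likelihood(theta, data)
for _ in range(200):  # gradient ascent on the concave objective
    g = gradient(theta, data)
    theta = [t + 0.1 * gi for t, gi in zip(theta, g)]
after = log_cond_likelihood(theta, data)
print(before, after)
```

Because the objective is concave, small-step ascent monotonically improves it; a real implementation would swap the loop for an L-BFGS call and replace enumeration with a proper inference algorithm.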
