We've previously talked about maximum likelihood estimation for Markov random fields. When we first defined Markov random fields, we also defined an extension of MRFs called conditional random fields, or CRFs. So now let's think about how the ideas that we developed in the context of MRFs can be utilized for maximum likelihood estimation of a conditional random field. As a reminder, the point of the CRF was to compute the probability of a particular set of target variables Y given a set of observed variables X. The idea behind this was that the unnormalized density P tilde of X, Y is defined with exactly the same kind of parameterization as in the MRF case, whether as a product of factors or as a log-linear model. The difference here is that we have an X-specific partition function that guarantees that what we have is a normalized distribution over the Y variables given a particular X. This partition function is the sum over all Ys of the unnormalized measure P tilde theta of X, Y, and so it normalizes only over the Ys, not over the Xs. When learning CRFs, our data is now a set of pairs: a set of input variables, or observed features, X, and a set of target variables Y. In the training data, both of these are observed, and we'd like to use this to learn the conditional distribution of Y given X. So the right objective function to apply in this context is what's often called the conditional log-likelihood, but really should be called the log conditional likelihood, because the conditioning is inside the log. What we have in terms of this objective is the sum over all data instances m of the log of the conditional probability of Ym given Xm. If we look at the conditional log-likelihood and open it up, we end up with an expression that looks very similar to what we saw in the case of MRFs.
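To make the X-specific partition function concrete, here is a minimal brute-force sketch of the log conditional likelihood for a log-linear CRF over binary Y variables. The `features` interface (a list of functions `f_i(x, y)`) is a hypothetical way of packaging the feature functions, not something from the lecture; the key point is that Z is computed by summing over assignments to Y only, with X held fixed.

```python
import itertools
import numpy as np

def log_conditional_likelihood(theta, features, x, y):
    """Log P(y | x) for a log-linear CRF (brute-force sketch).

    theta:    list of parameters theta_i.
    features: list of feature functions f_i(x, y) -> float (hypothetical
              interface assumed for this sketch).
    The partition function Z_theta(x) sums the unnormalized measure over
    all assignments to the Y variables only -- X stays fixed.
    """
    def score(x, y):
        # log of the unnormalized measure: sum_i theta_i * f_i(x, y)
        return sum(t * f(x, y) for t, f in zip(theta, features))

    # Z_theta(x): enumerate every joint assignment of the binary Y
    # variables. This is exponential in len(y) -- fine for a toy example,
    # but real models would use proper inference here.
    all_ys = itertools.product([0, 1], repeat=len(y))
    log_Z = np.log(sum(np.exp(score(x, ys)) for ys in all_ys))
    return score(x, y) - log_Z
```

Because Z normalizes over Y given X, the conditional probabilities exp(log P(y | x)) sum to one over all assignments y for any fixed x.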
We have a sum over i, and this is for the case of a log-linear model, of the parameter theta i multiplied by the value of the feature fi. This is for a particular instance (Xm, Ym), so it's theta i times fi of Xm, Ym. And over here, we have minus the log of the partition function for Xm, and this uses the exact same derivation that we had for the case of an MRF. So now we can sum up over multiple data instances m and compute the derivative, relative to a particular parameter theta i, of one over M of the log conditional likelihood. What we end up with is, again, an average, in this case of a difference of two expectations. We have a summation over m, where the first term is the value of the feature function fi of Xm, Ym, and the second term is the expectation of that feature fi relative to Xm and Y. Now, it's important to look carefully at the second expression over here, E theta of fi of Xm, Y. Note that here Xm is fixed, and the expectation is taken only over the variables Y. This is in contrast to the case of MRFs, where we were taking the expectation relative to the entire set of variables. Let's see how this pans out in the context of a particular example. We're going to look at a really simple model for image segmentation, where we just have two features, f1 and f2. The first feature, f1, is a singleton feature, and it takes the value of the average green value for pixels in a superpixel s. So if this is my superpixel s, we average the green values of all of its pixels, but we only put that into the feature in cases where the superpixel is labeled with g for grass. And notice that this feature is repeated across all superpixels s in the image, so this is a model with shared parameters.
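The gradient formula above, the average over instances of fi(Xm, Ym) minus E theta of fi(Xm, Y), can be sketched directly in code. This is a brute-force version, reusing the same hypothetical `features` interface and enumerating all binary Y assignments to compute each instance-specific expectation; its correctness can be checked against a finite-difference approximation of the log conditional likelihood.

```python
import itertools
import numpy as np

def crf_gradient(theta, features, data, n_y):
    """Gradient of the average log conditional likelihood (brute force).

    data: list of (x_m, y_m) pairs; n_y: number of binary Y variables.
    Component i averages, over instances,
        f_i(x_m, y_m) - E_theta[ f_i(x_m, Y) ],
    where the expectation fixes X = x_m and runs only over Y.
    """
    def score(x, y):
        return sum(t * f(x, y) for t, f in zip(theta, features))

    grad = np.zeros(len(theta))
    for x_m, y_m in data:
        ys = list(itertools.product([0, 1], repeat=n_y))
        weights = np.array([np.exp(score(x_m, y)) for y in ys])
        probs = weights / weights.sum()              # P_theta(y | x_m)
        for i, f in enumerate(features):
            empirical = f(x_m, y_m)                  # observed feature value
            expected = sum(p * f(x_m, y) for p, y in zip(probs, ys))
            grad[i] += empirical - expected
    return grad / len(data)
```

Note that `probs` has to be recomputed for each x_m, which is exactly the per-instance inference cost discussed later in this section.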
The second feature, f2, is a pairwise feature, and here we get a one for cases where two adjacent superpixels s and t are labeled with the same value, say both grass or both cow. This allows the model to learn how likely it is for superpixels that are adjacent to each other to have the same class label. Notice that this feature, too, is shared, both across pairs of superpixels and across different classes. That is, we have the same parameter for cow, cow as we do for grass, grass, or sky, sky. So let's plug these two features into the expression for the gradient that we had on the previous slide. This is the general equation, and let's see what happens for theta one. The partial derivative relative to the parameter theta one is, again, the difference of two expectations: an empirical expectation and a model expectation. This is for a particular instance m, so we're not currently averaging across multiple training images; we have a single image m that we're working over. Here, because the model has shared parameters, we're summing over all possible superpixels s in the image, and the empirical expectation sums up the product of the indicator function for the superpixel s being labeled grass, multiplied by the average greenness of the superpixel. That's the empirical expectation. The model-based expectation has exactly the same form: we're also summing over all superpixels s, and we similarly have a product of two terms, only here, because in this model-based expectation we don't have the Y labels, we're computing the probability of the Y label taking the value grass. That is, for each superpixel s, we compute the probability that Ys takes the value grass given what we know about the image, that is, the values Xm, and that gets multiplied with the average greenness again. So that's the gradient for the first parameter. The gradient for the second parameter is very similar in spirit.
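The theta-one gradient for a single image can be written out as a small sketch. Here the per-superpixel marginals P(Ys = grass | x) are assumed to come from an inference routine that isn't shown; the function names and list-based representation are illustrative, not from the lecture.

```python
def theta1_gradient(greenness, labels, p_grass):
    """Gradient contribution of the singleton 'grass' feature, one image.

    greenness[s]: average green value of superpixel s (the fixed x_s).
    labels[s]:    observed label of superpixel s ('grass', 'cow', ...).
    p_grass[s]:   marginal P(Y_s = 'grass' | x), assumed precomputed
                  by some inference routine (not shown here).

    Empirical term: indicator(label_s == grass) * greenness_s, summed over s.
    Model term:     P(Y_s = grass | x)          * greenness_s, summed over s.
    """
    empirical = sum(g for g, lab in zip(greenness, labels) if lab == "grass")
    expected = sum(p * g for p, g in zip(p_grass, greenness))
    return empirical - expected
```

Because the parameter is shared, both terms sum over every superpixel in the image; the model term replaces the observed indicator with the marginal probability of the grass label.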
So here, we're looking at the sum over all pairs of adjacent superpixels s and t, and the empirical expectation just sums, over all those pairs, an indicator function of Ys being equal to Yt. So we get a one if Ys is the same as Yt, say they're both labeled grass or they're both labeled cow, and zero otherwise. In the model-based expectation term, we have the probability of that event, that is, the probability that Ys is equal to Yt given the image. And once again, that's summed over all pairs of adjacent superpixels s and t. So, again, a difference of two expectations in both cases: an empirical expectation and a model-based expectation. Taking that, and thinking about this, let's compare the computational cost of two training regimes. One is MRF training, and the second is CRF training. In MRF training, remember that our gradient relative to parameter theta i was the difference of two expectations, and we've already pointed out that the computation of the second expectation, the model-based expectation, requires that we run inference at each gradient step, which we thought was pretty expensive, or could be. But now let's see what happens for CRFs. Let's look at that gradient one more time, for the case of CRFs, and here notice that we have a summation over m, that is, a summation over all of the data instances, of a difference of two terms. The first is the feature value in that particular instance, and the second is the expected feature value in that particular instance, Xm. So the important observation to come out of this is that the second expectation, relative to theta, is different for each of our input instances. That means we need to rerun this inference for each and every one of our Xm's at each gradient step. So whereas MRFs require one inference at each gradient step, this requires M inferences at each gradient step.
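The difference in inference cost per gradient step can be made explicit by sketching the two gradient computations side by side. The inference routines are passed in as hypothetical callbacks, since the lecture doesn't commit to a particular algorithm; the point is only how many times each one is called per step.

```python
import numpy as np

def mrf_gradient_step(theta, empirical_counts, run_joint_inference):
    """MRF gradient: one inference call per gradient step.

    empirical_counts: data-averaged feature values (do not depend on theta).
    run_joint_inference(theta) -> E_theta[f_i] over the full joint P(Y, X);
    this expectation is the same for every data instance, so one call suffices.
    """
    return empirical_counts - run_joint_inference(theta)   # 1 inference

def crf_gradient_step(theta, data, feature_fn, run_conditional_inference):
    """CRF gradient: one inference call per data instance per step.

    run_conditional_inference(theta, x_m) -> E_theta[f(x_m, Y)], with x_m
    fixed and the expectation over Y only -- different for each x_m.
    """
    grad = np.zeros_like(theta)
    for x_m, y_m in data:
        grad += feature_fn(x_m, y_m) - run_conditional_inference(theta, x_m)
    return grad / len(data)                                # M inferences
```

So a training run of T gradient steps costs T joint-inference calls for the MRF, versus T times M conditional-inference calls for the CRF, although each conditional call is typically over a much simpler model, as discussed next.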
Where M is the number of training instances. And so here, the inference required to do learning is considerably more expensive. However, one needs to weigh the balance of all of the factors that contribute to computational cost. When we're doing inference of the conditional probability, P of Y given X, we only need to worry about the Y variables. The Xs are all fixed; their values are instantiated. And so the factors in the resulting model are factors only over the variables Y. If we wanted to use MRF training, because, say, we decided that CRF training is too expensive due to the additional cost of inference, we would need to compute the joint probability P of Y, X, which might be much more complicated. And that might be much more complicated not only because we now have more variables in the model, both the Y's and the X's, but also because in many cases where we would like to apply CRFs, the distribution over the X's becomes quite unmanageable if we want to model it. To understand that, let's go back to the previous example and consider this very simple image segmentation model. Notice that the variable X sub s is this average greenness of the superpixel. Now, in the context of CRFs, we don't need to compute a distribution over average greenness, because we're conditioning on X of s. And so G of s is now fixed; it's instantiated, and we're only computing the probability distribution over the Ys. But if we wanted to train this model as an MRF, we would need to somehow maintain a distribution over this average greenness. And since that's a continuous quantity, it actually requires that we think about parametric forms. Is it a Gaussian, or a mixture of Gaussians? And it becomes quite hairy to manage.
Although CRF-based training in this case might have additional cost, it can also reduce the cost of any particular inference step, as well as avoid a bunch of thorny issues regarding parameterization of the distribution over various features. So to summarize, CRF learning, in terms of the mathematical formulation, is very similar to MRF learning. The likelihood function has the same form, and it's concave. It's similarly optimized using gradient ascent, usually the same L-BFGS algorithm. However, the gradient computation is now much more expensive, because it requires inference once per gradient step and data instance, as compared to once per gradient step in the context of MRFs. But, as we just said, the conditional model is often much simpler, so the inference calls for a CRF and for an MRF are not the same, and comparing them directly is comparing apples and oranges. So, in the context of any given model, one really needs to think about which of these models is more appropriate, not only based on the computational costs of training, but also in terms of the overall performance and generalization.