Local structure that doesn't require full table representations is important in both directed and undirected models. How do we incorporate local structure into undirected models? The framework for that is called log-linear models, for reasons that will become clear in just a moment. In the original representation of the unnormalized density, we defined P tilde to be a product of factors phi_i(D_i), each of which is potentially a full table. Now we're going to shift that representation to something that uses a linear form, which is subsequently exponentiated. That's why it's called log-linear: the logarithm is a linear function. So what is this form? It's a linear function built from things called coefficients, the w_j's, and things called features, the f_j's. Features, like factors, each have a scope, which is the set of variables on which the feature depends, but different features can have the same scope: you can have multiple features, all of which are over the same set of variables. Notice that each feature has just a single parameter w_j that multiplies it. So what does this give rise to? If we push the exponent through the summation, we get a product of exponential functions. You can think of each of these as effectively a little factor, but a factor that has only a single parameter w_j. Since this is all a bit abstract, let's look at an example. Specifically, let's look at how we might represent a simple table factor as a log-linear model. Here is a factor phi over two binary random variables, X1 and X2, so a full table factor would have four parameters: a00, a01, a10, and a11. We can capture this factor in a log-linear model using a set of features that are indicator functions. So this is an indicator function: it takes the value 1
if X1 is 0 and X2 is 0, and takes the value 0 otherwise. That's the general notion of an indicator function: it looks at the event or constraint inside the curly braces and returns 0 or 1 depending on whether that event is true or not. To represent this factor as a log-linear model, we simply sum over all four combinations of k and l, each of which is 0 or 1, so we're summing over all four entries, and each term has a coefficient w_kl that multiplies the corresponding indicator feature. Only one indicator fires for any assignment: we get exp(-w00) when X1 = 0 and X2 = 0, exp(-w01) when X1 = 0 and X2 = 1, and so on. It's not difficult to convince ourselves that if we define w_kl to be the negative log of the corresponding entry in the table, we get back exactly the factor we defined to begin with. This shows that the log-linear representation is fully general: we can take any factor and represent it as a log-linear model simply by including all of the appropriate indicator features. But we don't generally want to do that; generally, we want a much finer-grained set of features, so let's look at some examples of features that people use in practice. Here are the features used in a language model, one that we discussed previously. First, let's remind ourselves of the setup: we have two sets of variables. We have the variables Y, which represent the annotations for each word in the sequence, indicating which category each word corresponds to.
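Stepping back to the table-factor construction for a moment, here is a minimal sketch of it in code. The table entries are arbitrary illustrative values; the point is that setting w_kl = -ln(a_kl) makes the log-linear model reproduce the original factor exactly.

```python
import numpy as np

# Hypothetical 2x2 table factor over binary X1, X2 (entries chosen arbitrarily)
a = np.array([[2.0, 0.5],
              [1.5, 4.0]])   # a[k, l] = phi(X1=k, X2=l)

# One weight per indicator feature f_kl = 1{X1=k, X2=l}
w = -np.log(a)               # w_kl = -ln(a_kl)

def phi_loglinear(x1, x2):
    # Sum of w_kl * f_kl; exactly one indicator fires for any (x1, x2)
    total = sum(w[k, l] * int(x1 == k and x2 == l)
                for k in (0, 1) for l in (0, 1))
    return np.exp(-total)

# The log-linear model reproduces the original table exactly
for k in (0, 1):
    for l in (0, 1):
        assert np.isclose(phi_loglinear(k, l), a[k, l])
```

Since exactly one indicator is 1 for each assignment, the exponentiated sum collapses to exp(-w_kl) = a_kl, which is the generality argument from the lecture in executable form.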
So the labels are things like: the beginning of a person name, the continuation of a person name, the beginning of a location, the continuation of a location, and so on, as well as a catch-all for words that are none of person, location, or organization, which are all labeled "other". The value of Y tells us, for each word, what category it belongs to; we're trying to identify people, locations, and organizations in a sentence. We have another set of variables, X, which are the actual words in the sentence. Now, we could use a full table representation that relates each Y to every possible word in the English language, but that would be very expensive and require a very large number of parameters. Instead, we define features. For example, a feature f(Y_i, X_i), where Y_i is the label for the i-th word in the sentence and X_i is that i-th word, might be the indicator function for "Y_i equals person and X_i is capitalized". That feature doesn't look at the identity of individual words; it just looks at whether the word is capitalized. Now we have just a single parameter that looks only at capitalization and parameterizes how important capitalization is for recognizing that something is a person. We could also have a different feature, which could be part of the same model (and to be a little more precise here, these labels might be "beginning of person" and "beginning of location"), that is the indicator for "Y_i equals location and X_i appears in some atlas".
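A minimal sketch of such indicator features in code. The tag names, gazetteer contents, and weights below are hypothetical illustrations, not taken from any specific system; the key property is that each feature has one shared weight reused at every position i.

```python
# Hypothetical gazetteer standing in for "appears in some atlas"
ATLAS = {"paris", "london", "tokyo"}

def f_person_cap(y_i, x_i):
    # Indicator: 1{Y_i = B-PER and X_i is capitalized}
    return int(y_i == "B-PER" and x_i[:1].isupper())

def f_loc_atlas(y_i, x_i):
    # Indicator: 1{Y_i = B-LOC and X_i appears in the atlas}
    return int(y_i == "B-LOC" and x_i.lower() in ATLAS)

# One weight per feature, shared across all positions (illustrative values)
weights = {"person_cap": 1.3, "loc_atlas": 2.0}

def score(y_i, x_i):
    # Position i's contribution to the linear term sum_j w_j f_j
    return (weights["person_cap"] * f_person_cap(y_i, x_i)
            + weights["loc_atlas"] * f_loc_atlas(y_i, x_i))

print(score("B-PER", "Alice"))   # capitalization feature fires: 1.3
print(score("B-LOC", "Paris"))   # atlas feature fires: 2.0
```

Two scalar weights replace a table over the entire vocabulary, which is exactly the parameter saving the lecture describes.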
Now, there are other things that appear in an atlas besides locations, but if a word appears in an atlas, there's presumably a much higher probability that it's actually a location. So we might have a weight for this feature that increases the probability of Y_i being labeled this way. You can imagine constructing a very rich set of features in this spirit, all of which look at certain aspects of the word, rather than enumerating all possible words and giving a parameter to each and every one of them. Let's look at some other examples of feature-based models. This one comes from statistical physics; it's called the Ising model. The Ising model is a pairwise Markov network: it looks at pairs of adjacent variables and gives a coefficient to their product. This is a case where the variables are binary, but take values in {-1, +1} rather than {0, 1}. So now we have a model parameterized by features that are just the products of the values of adjacent variables. Where might this come up? It comes up, for example, in modeling the spins of electrons in a grid. Here, each spin can point in one of two directions: the atoms marked with a blue arrow spin along one rotational axis, and those marked with a red arrow spin in the opposite direction. The probability distribution over the joint set of spins in the model depends on whether adjacent atoms have the same spin or opposite spins.
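The pairwise spin features just described can be sketched as follows. The grid size, coupling w, and the division of the coupling by a temperature T are illustrative assumptions for this sketch.

```python
import numpy as np

# Random spin configuration on a 4x4 grid, spins in {-1, +1}
rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(4, 4))

def unnorm_log_prob(spins, w=1.0, T=1.0):
    # Sum over adjacent pairs of (w / T) * x_i * x_j; each product feature
    # is +1 when neighbors agree and -1 when they disagree
    horiz = np.sum(spins[:, :-1] * spins[:, 1:])
    vert = np.sum(spins[:-1, :] * spins[1:, :])
    return (w / T) * (horiz + vert)

# With w > 0 (ferromagnetic), a fully aligned grid maximizes agreement:
# a 4x4 grid has 24 adjacent pairs, all contributing +1
aligned = np.ones((4, 4), dtype=int)
print(unnorm_log_prob(aligned))          # -> 24.0
print(unnorm_log_prob(aligned, T=10.0))  # higher T weakens the coupling: 2.4
```

Flipping the sign of w would favor disagreeing neighbors instead, which corresponds to the antiferromagnetic case.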
Notice that 1 * 1 is the same as -1 * -1, so this feature really just looks at whether the two spins are the same or different, and there is a parameter attached to that. Depending on the value of this parameter, the model goes one of two ways: if it goes one way, we favor systems where adjacent atoms spin in the same direction; if it goes the other way, we favor atoms that spin in different directions. These are called ferromagnetic and antiferromagnetic models, respectively. Furthermore, you can define in these systems the notion of a temperature. The temperature says how strong the connections are. Notice that as the temperature T grows, the w_ij's get divided by T and all shrink toward 0, which means the strength of the connection between adjacent atoms effectively vanishes and they become almost decoupled from each other. On the other hand, as the temperature decreases, the effect of the interaction between the atoms becomes much more significant, and they impose much stronger constraints on each other. This is a model of a real physical system, with real temperatures and real atoms, and sure enough, if you look at what happens to these models as a function of temperature: at high temperature there is a lot of mixing between the two types of spin, while at low temperature the configuration imposes much stronger constraints on the spins of adjacent atoms. Another kind of feature used in many practical applications is the notion of a metric feature in an MRF. So what's a metric feature?
This is something that comes up mostly in cases where you have a set of random variables X that all take values in some joint label space V. For example, they might all be binary, or they might all take values in {1, 2, 3, 4}. We have variables X_i and X_j that are connected to each other by an edge, and we want X_i and X_j to take similar values. In order to enforce that, we need a notion of similarity, and we're going to encode it using a distance function mu that takes two values, one for X_i and one for X_j, and says how close they are to each other. So what does a distance function need to satisfy? The standard conditions on a distance function, or metric. First of all, the function needs to be non-negative. Reflexivity means that if the two variables take on the same value, the distance must be zero. Symmetry means the distances are symmetrical: the distance between two values v1 and v2 is the same as the distance between v2 and v1. And finally, the triangle inequality says that the distance between v1 and v2 is no more than the distance between v1 and v3 plus the distance from v3 to v2; the standard triangle inequality. If a distance satisfies reflexivity and symmetry, it's called a semimetric; if it satisfies all three conditions, it's called a metric, and both are actually used in practical applications. So how do we take this distance and put it in the context of an MRF? We have a feature that looks at two variables X_i and X_j, and that feature is the distance between X_i and X_j; we put it together by multiplying it by a coefficient w_ij, such that w_ij has to be greater than 0.
In a metric MRF, the effect this has is that the lower the distance, the higher the exponent, because of the negative coefficient in the exponent, which means the higher the probability. So the more pairs that are close to each other, and the closer they are, the higher the probability of the overall configuration, which is exactly what we want to happen. Conversely, if you have values that are far from each other under the distance metric, the probability of the configuration is lower. Here are some examples of metric MRFs. The simplest possible one uses a distance of 0 when the two classes are equal and a distance of 1 everywhere else, so it's just a step function. That gives rise to a potential with zeros on the diagonal: we get a bump in the probability when two adjacent variables take on the same label, and a reduction in the probability otherwise, regardless of which particular values they take. A somewhat more expressive example comes up when the values V are actually numerical, in which case you can look at the difference between them: the distance is the absolute value of v_k minus v_l, so it is 0 when v_k equals v_l and grows linearly as v_k and v_l move apart. A more interesting notion that comes up a lot in practice is when you don't want to penalize arbitrarily labels that are far from each other in label space. This is what's called a truncated linear penalty: beyond a certain threshold, the penalty just becomes constant, so it plateaus.
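A small sketch of two of these example distances, a mechanical check of the metric conditions over a finite label space, and the way a metric feature enters the unnormalized density as exp(-w_ij * mu(x_i, x_j)) with w_ij > 0. The label space, truncation threshold, and weight are illustrative assumptions.

```python
import itertools
import numpy as np

V = [0, 1, 2, 3, 4]   # illustrative finite label space

def zero_one(vk, vl):
    return 0.0 if vk == vl else 1.0     # step distance

def truncated_linear(vk, vl, thresh=2.0):
    return min(abs(vk - vl), thresh)    # plateaus beyond thresh

def is_metric(mu, V):
    # Non-negativity, reflexivity, symmetry, and the triangle inequality
    return (all(mu(a, b) >= 0 for a in V for b in V)
            and all(mu(a, a) == 0 for a in V)
            and all(mu(a, b) == mu(b, a) for a in V for b in V)
            and all(mu(a, b) <= mu(a, c) + mu(c, b)
                    for a, b, c in itertools.product(V, repeat=3)))

def pair_factor(mu, xi, xj, w=1.0):
    # Smaller distance -> larger factor -> higher probability
    return np.exp(-w * mu(xi, xj))

print(is_metric(zero_one, V), is_metric(truncated_linear, V))
# Past the threshold the penalty no longer grows, so these factors are equal:
print(pair_factor(truncated_linear, 1, 3) == pair_factor(truncated_linear, 1, 4))
```

Dropping the triangle-inequality clause from `is_metric` would give the semimetric check instead.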
So there is a penalty, but it doesn't keep increasing as the labels get further from each other. One example where metric MRFs are used is image segmentation, where we tend to favor segmentations in which adjacent superpixels take the same class: there is no penalty when adjacent superpixels take the same class, and some penalty when they take different classes. This is a very common, albeit simple, model for image segmentation. Let's look at a different MRF, also from computer vision, used for image denoising. Here we have a noisy version of a real image, with a kind of white noise overlaid on top, and what we'd like to do is recover a cleaned-up version of the image. We have a set of variables X that correspond to the noisy pixels, and a set of variables Y that correspond to the clean pixels, and we'd like a probabilistic model that relates X and Y. We'd like two effects on the pixels Y. First, we'd like Y_i to be close to X_i, but if you just do that, you're just going to stick with the original noisy image. The main additional constraint we can employ on the image in order to clean it up is the fact that adjacent pixels tend to have similar values.
So in this case, we're going to constrain the Y_i's to be close to their neighbors, with a bigger penalty the further away they are, and that's a metric MRF. Now, we could use just a linear penalty, but that gives a very fragile model: the right answer obviously isn't the one where all pixels are equal to each other in intensity, because that would just be a single grayish-looking image. What you'd like is to let one pixel depart from its adjacent pixel if it's getting pulled in a different direction, either by its own observation or by other adjacent pixels. So the right model to use here is actually the truncated linear model, which is commonly used and very successful for image denoising. Interestingly, almost exactly the same idea is used in the context of stereo reconstruction. There, the values you'd like to infer, the Y_i's, are the depth disparities of the pixels in the image (how deep each one is), and here, too, we have spatial continuity: we'd like the depth of one pixel to be close to the depth of an adjacent pixel. But once again, we don't want to enforce this too strongly, because there are depth discontinuities in the image, and eventually things need to be able to break away from each other. So once again, one typically uses some kind of truncated linear model for stereo reconstruction, often augmented by other little tricks. For example, we might look at the actual pixel appearance, such as color and texture: if the color and texture are very similar, we might want a stronger constraint on similarity, versus the case where the color and texture of the adjacent pixels are very different from each other.
In that case, they may be more likely to belong to different objects, and we don't want to enforce quite as strong a similarity constraint.
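Putting the denoising pieces together: a data term that pulls each clean pixel Y_i toward its noisy observation X_i, plus a truncated-linear smoothness term between adjacent pixels. This is a minimal sketch; the squared-error data term, the weights, and the truncation threshold are illustrative assumptions, and lower energy corresponds to higher probability (the unnormalized density is exp(-energy)).

```python
import numpy as np

def energy(Y, X, w_data=1.0, w_smooth=1.0, thresh=2.0):
    # Data term: keep each clean pixel near its noisy observation
    data = w_data * np.sum((Y - X) ** 2)
    # Smoothness term: truncated-linear penalty between adjacent pixels,
    # so real edges only pay the plateaued cost `thresh` per pair
    horiz = np.minimum(np.abs(Y[:, :-1] - Y[:, 1:]), thresh).sum()
    vert = np.minimum(np.abs(Y[:-1, :] - Y[1:, :]), thresh).sum()
    return data + w_smooth * (horiz + vert)

X = np.array([[0., 0., 10.],
              [0., 0., 10.]])       # toy "noisy" image with a sharp edge
print(energy(np.zeros_like(X), X))  # oversmoothing pays a big data cost: 200.0
print(energy(X.copy(), X))          # keeping the edge is cheap: 4.0
```

Because the smoothness penalty plateaus, preserving the 0-to-10 edge costs only the threshold per adjacent pair, which is exactly why the truncated model is less fragile than a plain linear one.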