A lot of our thinking so far has gone into constructing algorithms for computing marginals in a graphical model. But as we previously discussed, there is a very different type of inference problem, one that has many applications in itself: finding the single coherent joint assignment that has the highest probability. And we showed already that you can't just solve the marginals problem and pick the highest-probability assignment for each variable; that does not give you the MAP assignment. And so we need a different set of algorithms for solving the MAP problem. We're going to talk now about the first of these, which effectively follows the exact same lines that we did for exact inference for computing marginals; it turns out that one can repurpose those algorithms, with a small change, to compute the MAP assignment. Now, the first operation that we need in order to get this to work is to take products and turn them into summations. And we're going to do that by observing that our distribution P of X is, as we know, proportional to a product of factors. And if we're trying to find the argmax of that product, it can also be reformulated as the argmax of a summation, where each of these theta k's is the log of the potential phi k, which is basically the same as taking the table that represents the factor and converting each entry into its logarithm. And that gives you something that, rather than being a product of factors, is a sum of these log factors. And there are really good reasons for doing that: first, because summations are easier to handle than products, but also because if you're multiplying a whole bunch of very small numbers, you're going to get numerical underflows. Converting this into summations gives a much more robust algorithm, numerically, in terms of a practical implementation.
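As a minimal sketch of this product-to-summation conversion, we can represent a factor as a table mapping assignments to values and replace each entry by its logarithm. The table below is hypothetical, with entries chosen (like the lecture's own example) so the base-2 logs come out to round numbers.

```python
import math

# A factor is a dict mapping joint assignments (tuples) to values.
# Hypothetical factor phi(A, B) with strictly positive entries.
phi = {(1, 1): 4.0, (1, 2): 1.0, (2, 1): 0.5, (2, 2): 2.0}

# theta = log2(phi): every table entry is replaced by its logarithm,
# so a product of factors becomes a sum of log-factors, and the argmax
# is unchanged because log is monotonically increasing.
theta = {assignment: math.log2(v) for assignment, v in phi.items()}
# theta == {(1, 1): 2.0, (1, 2): 0.0, (2, 1): -1.0, (2, 2): 1.0}
```

Working in log-space also sidesteps the numerical underflow mentioned above: adding moderate-sized logs is far more stable than multiplying many values close to zero.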
And so now we're going to consider the problem that we just wrote here, which is the argmax of an expression that we're going to call theta of X1 up to XN, defined as the sum of factors over smaller scopes, which are these theta k's. And this is the example that I meant to show a moment ago: this is taking a factor over the scope A, B and converting it to its log factor by taking log base two. And you'll notice that the numbers were carefully selected to give me nice outputs, but that doesn't have to be the case. So now that we've reduced the problem to one where we are taking the max over a summation, we're going to go back to our simple example of a chain, and we're going to think about how we can do max-sum variable elimination in a chain. And this is going to look almost identical to the sum-product algorithm that we had before, so let's understand what we're trying to do here. Let's assume for the moment that we forget the argmax and just want to find the max of this expression, theta of A, B, C, D, and E. And so, just like in the sum-product case, we can break up the maxes into a max over A, and then a max over B, a max over C, and a max over D. And while we might not be quite as used to the kind of operation that I'm about to show, it's equally valid: because none of theta 2, theta 3, or theta 4 depends on A, we can add them up after computing the max over A of theta 1 of A, B. And so that's going to give me a factor here which looks like max over A of theta 1 of A, B, and I'm going to call that lambda 1 of B, because notice that this is not a constant; it's a function that depends on B. For different values of B, you're going to have different A's that maximize, and different values of the max expression. And so this lambda 1 of B is simply the max over A of theta 1 of A, B.
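The elimination step just described can be sketched in a few lines. The table for theta 1 here is hypothetical (the entries are borrowed from the small two-clique example used later in the lecture); what matters is that the result lambda 1 is a function of B, not a single number.

```python
# Hypothetical log-factor theta1(A, B); each variable takes values 1, 2.
theta1 = {(1, 1): 3.0, (1, 2): 0.0, (2, 1): -1.0, (2, 2): 1.0}

# Eliminate A: lambda1(b) = max_a theta1(a, b). The result is not a
# constant but a function of B -- a different value of A may achieve
# the max for each value of B.
lambda1 = {b: max(theta1[(a, b)] for a in (1, 2)) for b in (1, 2)}
# lambda1[1] == 3.0 (achieved by a=1); lambda1[2] == 1.0 (achieved by a=2)
```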
And that process has effectively now eliminated A from this expression, and given me a maximization problem over one fewer variable: only B, C, D, and E. Just like in the context of sum-product algorithms, we can view all of these operations as operations over factors, rather than just thinking about them in terms of expressions. So whereas before we defined things like factor product and factor marginalization, we can now define analogous operations that correspond to factor summation and factor maximization. Factor summation, a very obvious operation, does exactly what we did here: just as in the case of factor product, if we want to define the row for a1, b1, c1, we're going to add up the entry three from a1, b1 and the entry four from b1, c1, and that's going to give me seven. And I can similarly define all of the other entries in the factor summation, which we're showing here on the right. Factor maximization is the analogue of the way that we could sum-marginalize a factor; this is something that max-marginalizes, so it's called max-marginalization. And what it does is this: if I'm trying to get rid of B, I have these two rows here, say for example a1, b1, c1 and a1, b2, c1, that differ only in B, and my new entry here for a1, c1 is going to be the max of these two entries, which in this case is going to be seven. And similarly for, say, a1, c2, I'm going to end up with the max of 4.5 and two, which is 4.5. So this is another form of marginalization, where I remove the variable B using an operation other than summation. So, now that we've defined these two operations, we can go back and define max-sum variable elimination in chains as a set of factor operations, where I eliminate A, then B, and so on and so forth. So, for example, having eliminated A as in the previous step, I can now do the exact same operation by noticing that the factors that depend on B are theta 2 of B, C and lambda 1 of B; the other factors I can move outside of the maximization.
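Both operations are easy to write down concretely. In the tables below, only the entries actually quoted in the text (theta 1 of a1, b1 equals 3, theta 2 of b1, c1 equals 4, and the max-marginal values 7 and 4.5) come from the lecture's example; the remaining entries are assumptions filled in for illustration.

```python
A, B, C = ('a1', 'a2'), ('b1', 'b2'), ('c1', 'c2')

# Hypothetical log-factors; only the quoted entries match the lecture.
theta1 = {('a1', 'b1'): 3.0, ('a1', 'b2'): 0.0,
          ('a2', 'b1'): -1.0, ('a2', 'b2'): 1.0}
theta2 = {('b1', 'c1'): 4.0, ('b1', 'c2'): 1.5,
          ('b2', 'c1'): 2.0, ('b2', 'c2'): 2.0}

# Factor summation: (theta1 + theta2)(a, b, c) = theta1(a, b) + theta2(b, c),
# aligning the two tables on their shared variable B.
psi = {(a, b, c): theta1[(a, b)] + theta2[(b, c)]
       for a in A for b in B for c in C}
# psi[('a1', 'b1', 'c1')] == 7.0, i.e. 3 + 4 as in the text

# Factor max-marginalization of B: for each (a, c), keep the larger of
# the two entries that differ only in B.
tau = {(a, c): max(psi[(a, b, c)] for b in B) for a in A for c in C}
# tau[('a1', 'c1')] == 7.0 == max(7, 2)
# tau[('a1', 'c2')] == 4.5 == max(4.5, 2)
```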
And that's going to give me a max over B of a factor that is the sum of two factors: theta 2 of B, C, and the lambda 1 of B that I got from the previous elimination step. And that's going to give me a new factor, lambda 2 of C, and the process continues in exactly the same way as variable elimination did for the case of sum-product. And that's basically the algorithm. So now let's think about what we get at the end of this execution for the final factor. What is lambda 4, having gone through lambda 2 and lambda 3? Now we get lambda 4 of E. What is lambda 4 for a given value little e? Lambda 4 of little e is what we got by maximizing over A, B, C, and D of theta of A, B, C, D, and E. So this is the best possible value that I can possibly get if we mandate that E equals little e. And so this is a factor, and it gives me back, for each value little e, the score of the best possible completion. This is called a max-marginal. In the same way that we took variable elimination for sum-product and used it to define a clique tree algorithm, we can do exactly the same thing for max-product, or rather max-sum, using the exact same type of data structure. So here we're going to use the exact same clique tree: we have cliques over AB, BC, CD, and DE, and we're going to, as usual, assign the potentials to the appropriate cliques using the family preservation property. And now let's go ahead and see how messages are passed in this clique tree architecture. So, initially, the AB clique defines the message lambda 1,2, which is obtained by maximizing A out of theta 1, and the resulting message over B gets passed to clique 2.
Clique 2 can take that message and combine it with its own factor theta 2. This is max-sum, remember, so I'm going to add theta 2 to lambda 1,2, and that gives me the message lambda 2,3. And the same thing happens when clique 3 passes a message to clique 4, in this case over the scope D. So it's exactly the same message-passing process, except using max and sum instead of sum and product. We can equally well pass messages in the other direction: 4 to 3 over the scope D, then 3 to 2, and 2 to 1. And all of the properties that we said hold in the context of sum-product clique trees hold here as well. First, notice, for example, that once this message lambda 1,2 is sent, it never changes again; it's defined once and for all. Lambda 2,3, once it receives the message lambda 1,2, has also stabilized and will never change, and similarly for lambda 3,4. So we can do one pass from left to right to compute all of the left-to-right messages, and a similar pass from right to left to compute all of the right-to-left messages, and as soon as we do that, they've all converged. The second thing to notice is the value of this clique over here. So if we look at it, we can see that when I finally get all of the messages from both sides, I have integrated in theta 1 and theta 2, coming in from the message on the left; theta 3, which is stored in the clique itself; and theta 4, coming in from the message on the right; and maximized out all of the variables A, B, and E, so that what I have left is a factor over clique 3 which is the max over A, B, and E of the sum of all of the factors in this network. So, just to summarize: once a clique Ci receives the final message from all of its neighbors except Cj, then its message to Cj is also final and will never change; the messages from all leaves are immediately final; and so what we have is an algorithm that converges after two passes and gives us correct max-marginals at every single one of the cliques.
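The two-pass schedule on a chain can be sketched directly. The chain here is hypothetical (four binary variables, three pairwise log-factors with made-up entries), but the structure is exactly as described: a forward pass where each message is the max-marginal of the clique's factor plus its incoming message, a symmetric backward pass, and, after both, the final forward message is the max-marginal over the last variable, whose max is the value of the MAP assignment.

```python
# thetas[i] is a hypothetical log-factor over (X_i, X_{i+1});
# all variables are binary with values 0 and 1.
thetas = [
    {(0, 0): 2.0, (0, 1): -1.0, (1, 0): 0.5, (1, 1): 1.0},
    {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 0.0, (1, 1): -2.0},
    {(0, 0): 0.0, (0, 1): 1.5, (1, 0): 2.0, (1, 1): 1.0},
]
n = len(thetas)
vals = (0, 1)

# Forward (left-to-right) pass:
#   fwd[i](y) = max_x [ thetas[i](x, y) + fwd[i-1](x) ]
# Each message is final as soon as it is computed.
fwd = []
prev = {v: 0.0 for v in vals}          # no incoming message at the left leaf
for th in thetas:
    prev = {y: max(th[(x, y)] + prev[x] for x in vals) for y in vals}
    fwd.append(prev)

# Backward (right-to-left) pass, symmetrically.
bwd = []
nxt = {v: 0.0 for v in vals}
for th in reversed(thetas):
    nxt = {x: max(th[(x, y)] + nxt[y] for y in vals) for x in vals}
    bwd.append(nxt)
bwd.reverse()

# fwd[-1] is the max-marginal over the last variable; its max is the
# score of the MAP assignment, and the backward pass agrees.
map_value = max(fwd[-1].values())
assert map_value == max(bwd[0].values())
```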
So let's take a simple example, just to convince ourselves that this is doing the right thing. I'm looking now at a much simpler network that only has A, B, and C. So there are two factors: theta 1 over A, B, and theta 2 over B, C. And first, we're going to construct the overall max objective theta, which is the sum of theta 1 and theta 2. So a factor summation went into this expression; let's look at it and see what the MAP assignment is in this example. And if we just look at the numbers that we computed, we see that the MAP assignment is a1, b1, c1, with a value of seven. Now let's look at the message-passing process that you would have on this very simple clique tree with two cliques. So theta 1 is assigned to clique 1, theta 2 is assigned to clique 2, and let's look at the two messages that are passed. The AB clique passes a message to the BC clique, and that message is the max-marginalization over the variable A. And so we can see that here we have a max between three and negative one, so we get three, because this is the max over the two values of A that are consistent with b1. And for b2 we have the max over zero and one, so we get one. And exactly the same process gives us this message, where I'm max-marginalizing C: I get this message that's passed from right to left. Now, each of these two cliques takes its incoming message. Notice that I've gotten immediate convergence here, because there's only one message to pass in each direction. And so what I get here is the sum of this factor plus the incoming message. I've added these two together, and so I have, for example, for the first row, a1, b1: I have 3 from here and 4 from there, so I get 3 plus 4, which is equal to 7. For a1, b2 I get a 0 from here and a 2 from there: 0 + 2 = 2. And I can do exactly the same operation to get this factor on the right: I get that by summing up this with that, and I get this factor over here.
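The whole two-clique computation fits in a few lines. The quoted entries (3, -1, 0, 1 for theta 1; theta 2 of b1, c1 equals 4; the message values 3, 1, 4, 2; the belief entries 7 and 2) are as stated in the text; the remaining theta 2 entries are assumptions chosen to stay consistent with those quoted values.

```python
A, B, C = ('a1', 'a2'), ('b1', 'b2'), ('c1', 'c2')

# theta1 is assigned to clique 1 (scope A, B), theta2 to clique 2 (B, C).
theta1 = {('a1', 'b1'): 3.0, ('a1', 'b2'): 0.0,
          ('a2', 'b1'): -1.0, ('a2', 'b2'): 1.0}
theta2 = {('b1', 'c1'): 4.0, ('b1', 'c2'): 1.5,   # unquoted entries assumed
          ('b2', 'c1'): 2.0, ('b2', 'c2'): 2.0}

# Message 1 -> 2: max-marginalize A out of theta1.
msg12 = {b: max(theta1[(a, b)] for a in A) for b in B}  # max(3,-1)=3, max(0,1)=1
# Message 2 -> 1: max-marginalize C out of theta2.
msg21 = {b: max(theta2[(b, c)] for c in C) for b in B}  # 4 and 2

# Beliefs: each clique's own factor PLUS its incoming message (max-sum).
beta1 = {(a, b): theta1[(a, b)] + msg21[b] for a in A for b in B}
beta2 = {(b, c): theta2[(b, c)] + msg12[b] for b in B for c in C}
# beta1[('a1', 'b1')] == 3 + 4 == 7, the MAP value; beta1[('a1', 'b2')] == 0 + 2 == 2
```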
And you'll notice that, miraculously, the max in each of these factors separately is a1, b1 on the left and b1, c1 on the right. So, sure enough, what I got here was the score of the most likely assignment consistent with a1, b1, and over here the most likely assignment consistent with b1, c1. And you'll notice, if you go back and check, and I'm not going to do that right now, but if you go back and check this big table over here, you can convince yourselves that this doesn't only hold true for a1, b1. If you look, for example, at a2, b1, you'll see that the value you get here, three, is the value of the most likely assignment that is consistent with a2, b1. So actually, let's go back and check that. Here is a2, b1. We have two assignments consistent with a2, b1: one is 0.5 and one is 3. Three is the best one, and sure enough, the value that we get here is 3, so it all works. So what can we say about this algorithm once it converges? The important thing is that we can compute beliefs at each clique that represent exactly the max-marginals of that clique. So how do we compute these beliefs? Analogously to what we had in the context of sum-product, we look at the factor assigned to the clique, which is theta i, and we add to it the incoming messages. Before, you remember, we had the factor assigned to the clique multiplied by the incoming messages, because we were doing products instead of summations. What does this belief encode? This belief is exactly the max-marginal. So specifically, we can consider, for any given assignment to the clique Ci, the score of the best possible completion of that assignment, and that is the value of the belief at that clique. It's the max over all of the variables Wi that are not in the clique. One important consequence of the fact that we have max-marginals is that we get a calibration property here as well: the cliques must agree on the variables that they share.
So to understand that, let's look at the two cliques that we have in the simple two-clique example. And here are the beliefs that we had for clique 1 and for clique 2: the sum of theta 1 plus the incoming message over here, and theta 2 plus lambda 2 over there. And let's look at what the calibration property tells us. If we look at the implications of this clique regarding the variable B, this clique tells us that the best possible completion for b1 has a score of seven, and the best possible completion for b2 has a score of three. This clique, if we look at it, tells us that the best possible completion for b1 also has a score of seven, and the best possible completion for b2 has a score of three. So these two cliques agree regarding the variable in their shared scope, which is the variable B. So to summarize: we can apply, in the context of max-sum, exactly the same clique tree algorithm used for sum-product. Messages are passed in the same way; the clique tree is constructed in the same way; the only difference is that the message-passing operations use max and sum as opposed to sum and product. And exactly as in sum-product, convergence is achieved after a single up-and-down pass, and the result of that is a set of beliefs that represent max-marginals at each of those cliques, where, as a reminder, the max-marginals tell us, for each assignment, the score of the best possible completion of that assignment.
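The calibration property can be checked directly on the beliefs from the two-clique example. The belief tables below are the ones the example arrives at (7, 2, 3, 3 for clique 1 and 7, 4.5, 3, 3 for clique 2, with the 4.5 entry following from the assumed unquoted factor entries); max-marginalizing each belief down to the shared variable B must give the same result on both sides.

```python
# Beliefs from the two-clique example: theta_i plus the incoming message.
beta1 = {('a1', 'b1'): 7.0, ('a1', 'b2'): 2.0,
         ('a2', 'b1'): 3.0, ('a2', 'b2'): 3.0}
beta2 = {('b1', 'c1'): 7.0, ('b1', 'c2'): 4.5,
         ('b2', 'c1'): 3.0, ('b2', 'c2'): 3.0}

# Max-marginalize each belief onto the shared variable B.
mu1 = {b: max(beta1[(a, b)] for a in ('a1', 'a2')) for b in ('b1', 'b2')}
mu2 = {b: max(beta2[(b, c)] for c in ('c1', 'c2')) for b in ('b1', 'b2')}

# Calibration: both cliques report the same best-completion score for
# each value of B -- 7 for b1 and 3 for b2, as in the text.
assert mu1 == mu2 == {'b1': 7.0, 'b2': 3.0}
```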