0:00

A lot of our thinking so far has gone into constructing algorithms for the problem of computing marginals in a graphical model. But, as we previously discussed, a very different type of inference problem, but one that has many applications in itself, is that of finding a single coherent joint assignment that has the highest probability. And we showed already that this doesn't reduce to the marginals problem: you can't just solve the marginals problem, pick the highest-probability assignment for each variable, and have that give you the MAP assignment. That's not what happens.

And so we need a different set of algorithms for solving the MAP problem. And we're going to talk now about the first of these, which effectively follows the exact same lines that we did for exact inference when computing marginals. Only it turns out that one can repurpose those algorithms, with a small change, to computing the MAP assignment. Now, the first operation that we need to do in order to get this to work is to take products and turn them into summations. And we're going to be doing that by

observing that our distribution P of X is proportional, as we know, to a product of factors. And if we're trying to find the argmax of that particular product, it can also be reformulated as the argmax of a summation, where each of these theta k's is the log of the potential phi k, which is basically the same as taking the table that represents the factor and converting each entry into its logarithm.

And that gives you something that, rather than being a product of factors, is the sum of these log-factors. And there's really good reason for doing that: first, because summations are easier to handle than products; but also because, if you're multiplying a whole bunch of very small numbers, you're going to get numerical underflows. And converting this into summations gives a much more robust algorithm, numerically, in terms of a practical implementation.
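Spelled out, the transformation being described is (with D_k denoting the scope of the k-th factor):

```latex
P_\Phi(X_1,\dots,X_n) \propto \prod_k \phi_k(D_k)
\quad\Longrightarrow\quad
\arg\max_{x_1,\dots,x_n} \prod_k \phi_k(D_k)
\;=\; \arg\max_{x_1,\dots,x_n} \sum_k \theta_k(D_k),
\qquad \theta_k = \log \phi_k .
```

The argmax survives both steps: the normalizing constant doesn't move the maximum, and the log is monotone.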

And so now we're going to consider the problem that we just wrote here, which is the argmax of an expression that we're going to call theta of X1 up to Xn, defined as the sum of factors over smaller scopes, which are these theta k's. And this is the example that I meant to show a moment ago: this is taking a factor over scope A and B and converting it to its log-factor by taking log base two. And you'll notice that the numbers were carefully selected to give me nice outputs, but that doesn't have to be the case.
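As a small sketch of that conversion in Python: the entries of phi below are not read off the slide (which isn't reproduced here); they're illustrative powers of two, chosen so that log base 2 gives whole-number log-factor entries.

```python
import math

# A factor phi over scope (A, B), stored as a table keyed by assignments.
# Illustrative entries, not the slide's: powers of two, so that log base 2
# produces clean log-factor values.
phi = {
    ('a1', 'b1'): 8.0,
    ('a1', 'b2'): 1.0,
    ('a2', 'b1'): 0.5,
    ('a2', 'b2'): 2.0,
}

# Converting the factor to a log-factor: take the log of every table entry.
theta1 = {assignment: math.log2(value) for assignment, value in phi.items()}

print(theta1)   # {('a1', 'b1'): 3.0, ('a1', 'b2'): 0.0, ('a2', 'b1'): -1.0, ('a2', 'b2'): 1.0}
```

Maximizing theta1 picks out the same assignment as maximizing phi, since the log is monotone.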

Â 3:06

So now that we've reduced the problem to one where we are taking the max over a summation, we're going to go back to our simple example of a chain, and we're going to think about how we can do max-sum variable elimination in a chain. And this is going to look almost identical to the sum-product algorithm that we had before. So let's understand what we're trying to do here. Let's forget the argmax for the moment, and assume that we just want to find the max of this expression, theta of A, B, C, D, and E. And so, just like in the sum-product case, we can break up the max into a max over A, and then a max over B, and a max over C, and a max over D. And what we might not be quite as used to

doing is the kind of operation that I'm about to show. It's equally valid to point out that, because none of these guys, say theta 2, theta 3, or theta 4, depend on A, we can add them up after computing the max over A of this guy, theta 1 of A, B. And so that's going to give me a factor here which looks like max over A of theta 1 of A, B, and I'm going to call that lambda 1 of B, because notice that this is not a constant; it's a function that depends on B. For different values of B, you're going to have different maximizing A's, and different values of the max expression. And so this lambda 1 of B is simply the max over A of theta 1 of A, B. And that process has effectively now eliminated A from this expression, and given me a maximization problem over one fewer variable: only B, C, D, and E.

Just like in the context of sum-product algorithms, we can view all of these operations as operations over factors, rather than just thinking about them in terms of expressions. So whereas before we defined things like factor product and factor marginalization, we can now define analogous operations that correspond to factor summation and factor maximization.

So factor summation, a very obvious operation, does exactly what we did here. Just as in the case of factor product, if we want to define the row for a1, b1, c1, we're going to add up the entry 3 from a1, b1 and the entry 4 from b1, c1, and that's going to give me 7. And I can similarly define all of the other entries in the factor summation, which we're showing here on the right.
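As a sketch in Python, here is factor summation on tables keyed by assignments, followed by the max-marginalization operation the lecture defines next. The quoted entries (3 from a1, b1 and 4 from b1, c1, summing to 7; the max-marginals 7 and 4.5) follow the lecture; the remaining entries are reconstructed, and theta2's (b2, c1) entry in particular is an assumed filler.

```python
# Factor summation: theta1(A,B) + theta2(B,C) yields a factor over (A,B,C)
# whose rows add the matching entries.
theta1 = {('a1', 'b1'): 3.0, ('a1', 'b2'): 0.0,
          ('a2', 'b1'): -1.0, ('a2', 'b2'): 1.0}
theta2 = {('b1', 'c1'): 4.0, ('b1', 'c2'): 1.5,
          ('b2', 'c1'): 0.0, ('b2', 'c2'): 2.0}   # (b2, c1) entry is a guess

def factor_sum(t1, t2):
    """Add factors with scopes (A,B) and (B,C); the result has scope (A,B,C)."""
    return {(a, b, c): v1 + v2
            for (a, b), v1 in t1.items()
            for (bb, c), v2 in t2.items()
            if b == bb}

def max_marginalize_b(factor):
    """Max out B: for each (a, c), keep the max over rows differing only in B."""
    out = {}
    for (a, b, c), v in factor.items():
        out[(a, c)] = max(out.get((a, c), float('-inf')), v)
    return out

joint = factor_sum(theta1, theta2)
print(joint[('a1', 'b1', 'c1')])              # 3 + 4 = 7.0
lam = max_marginalize_b(joint)
print(lam[('a1', 'c1')], lam[('a1', 'c2')])   # 7.0 4.5
```

The two printed max-marginal entries, 7 for (a1, c1) and 4.5 for (a1, c2), reproduce the values worked through in the lecture.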

Factor maximization is the analogue of the way we could sum-marginalize a factor; this is something that max-marginalizes. It's called max-marginalization, and what it does is this: if I'm trying to get rid of B, I have these two rows here, say, for example, a1, b1, c1 and a1, b2, c1, that differ only in B, and my new entry here, a1, c1, is going to be the max of those two entries over here, which, in this case, is going to be 7. And similarly, for, say, a1, c2, I'm going to end up with the max of 4.5 and 2, which is 4.5. So this is another form of marginalization, where I remove the variable B, but using an operation other than summation. So, now that we've defined these two

operations, we can go back and define max-sum variable elimination in chains as a set of factor operations, where I eliminate A, then B, and so on and so forth. So, for example, having eliminated A as in the previous step, I can now do the exact same operation by noticing that the only factors that depend on B are theta 2 of B, C and lambda 1 of B; the other two I can move outside of the maximization. And that's going to give me a max over B of a factor that is the sum of the two factors: theta 2 of B, C, and the lambda 1 of B that I got from the previous elimination step. And that's going to give me a new factor, lambda 2 of C, and the process continues in exactly the same way as variable elimination did for the case of sum-product. And that's basically the algorithm.
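As a sketch of the whole elimination pass on the chain A-B-C-D-E in Python; the four pairwise log-factors here are made up for illustration (they are not the lecture's tables), and each variable takes values in {0, 1}.

```python
# Max-sum variable elimination along the chain A-B-C-D-E, eliminating
# A, then B, then C, then D. Illustrative log-factors, not the lecture's.
vals = (0, 1)
theta = [
    {(a, b): float(a + 2 * b) for a in vals for b in vals},   # theta1(A,B)
    {(b, c): float(b * c - 1) for b in vals for c in vals},   # theta2(B,C)
    {(c, d): float(2 * c - d) for c in vals for d in vals},   # theta3(C,D)
    {(d, e): float(d + e) for d in vals for e in vals},       # theta4(D,E)
]

def eliminate(factor, message):
    """Compute max over X of factor(X, Y) + message(X), a function of Y."""
    out = {}
    for (x, y), v in factor.items():
        score = v + message.get(x, 0.0)
        out[y] = max(out.get(y, float('-inf')), score)
    return out

# lambda1(B) = max_A theta1(A,B); lambda2(C) = max_B theta2(B,C) + lambda1(B); ...
lam = {}
for t in theta:
    lam = eliminate(t, lam)

# lam is now lambda4(E), the max-marginal of E; its max is the MAP score.
print(lam, max(lam.values()))   # {0: 5.0, 1: 6.0} 6.0
```

Each call to eliminate produces the next lambda factor, exactly the step described above; the final dictionary is lambda 4 of E.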

So now let's think about what we get at the end of this execution for the final factor. What is lambda 4? Having gone through lambda 2 and lambda 3, we now get lambda 4 of E. What is lambda 4 for a given value little e?

Â 9:01

It is the max of theta over all of the other variables, if we mandate that E equals little e. Okay? And so this is a factor, and it gives me back a value, effectively a score, for each value little e of E. This is called a max-marginal.

In the same way that we took variable elimination for sum-product and used it to define a clique tree algorithm, we can do exactly the same thing for max-product, or rather max-sum, using the exact same type of data structure. So here we're going to use the exact same clique tree: we have cliques over AB, BC, CD, and DE, and we're going to, as usual, assign the potentials to the appropriate cliques using the family preservation property. And now let's go ahead and see how messages are passed in this clique tree architecture.

So, initially, the AB clique is going to define the message lambda 1,2, which is obtained by maximizing out A over theta 1, and the resulting message over B gets passed to clique 2. Clique 2 can take that message and combine it with its own factor theta 2; this is max-sum, remember, so I'm going to add theta 2 to lambda 1,2, and that gives me the message lambda 2,3. And the same thing can happen when clique 3 passes a message to clique 4, in this case over the scope D. So, exactly the same message passing process, except using max and sum instead of sum and product. We can equally well pass messages in the

other direction: 4 to 3 over scope D, 3 to 2, and 2 to 1. And all of the properties that we said hold in the context of sum-product clique trees hold here as well. First, notice, for example, that once this message lambda 1,2 is sent, it never changes again; it's defined once and for all. Lambda 2,3, once it receives the message lambda 1,2, it too has stabilized and will never change. And similarly for lambda 3,4. So we can do one pass from left to right to compute all of the left-to-right messages, and a similar pass from right to left to compute all of the right-to-left messages. And as soon as we do that, they've all converged.

The second thing to notice is what the value of this clique over here is. So if we look at this, we can see that when I finally get all of the messages from both sides, I have integrated in theta 1 and theta 2 from the left, theta 3, which is stored in the clique itself, and theta 4, coming in from the message on the right, and maximized out all of the variables A, B, and E, so that what I have left is a factor over clique three, which is the max over A, B, and E of the sum of all of the factors in this network.

So, just to summarize: once clique Ci receives the final message from all of its neighbors except Cj, then its message to Cj is also final and will never change, and the messages from all leaves are immediately final. And so what we have is an algorithm that converges after two passes and gives us correct max-marginals at every single one of the cliques.
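The two-pass schedule just described can be sketched on the four-clique chain [AB]-[BC]-[CD]-[DE]; the pairwise log-factors below are again made up for illustration. After one left-to-right and one right-to-left pass, every clique's belief attains the same maximum value, the score of the MAP assignment.

```python
# Two-pass max-sum message passing on the clique chain [AB]-[BC]-[CD]-[DE].
# Illustrative log-factors, not the lecture's tables; values are {0, 1}.
vals = (0, 1)
theta = [
    {(a, b): float(a + 2 * b) for a in vals for b in vals},   # theta1(A,B)
    {(b, c): float(b * c - 1) for b in vals for c in vals},   # theta2(B,C)
    {(c, d): float(2 * c - d) for c in vals for d in vals},   # theta3(C,D)
    {(d, e): float(d + e) for d in vals for e in vals},       # theta4(D,E)
]

def send(factor, incoming, keep_right):
    """Max out one end of factor(X,Y) after adding the incoming message."""
    out = {}
    for (x, y), v in factor.items():
        kept, summed = (y, x) if keep_right else (x, y)
        score = v + incoming.get(summed, 0.0)
        out[kept] = max(out.get(kept, float('-inf')), score)
    return out

# Forward pass (left to right), then backward pass (right to left).
fwd, msg = [], {}
for t in theta[:-1]:
    msg = send(t, msg, keep_right=True)
    fwd.append(msg)
bwd, msg = [], {}
for t in reversed(theta[1:]):
    msg = send(t, msg, keep_right=False)
    bwd.append(msg)
bwd.reverse()

# Belief at each clique: its own factor plus all incoming messages.
beliefs = []
for i, t in enumerate(theta):
    left = fwd[i - 1] if i > 0 else {}
    right = bwd[i] if i < len(bwd) else {}
    beliefs.append({(x, y): v + left.get(x, 0.0) + right.get(y, 0.0)
                    for (x, y), v in t.items()})

# Every clique's belief has the same maximum: the global max of theta.
print([max(b.values()) for b in beliefs])   # [6.0, 6.0, 6.0, 6.0]
```

Note that no message is ever recomputed: each is final the moment it is sent, which is exactly the two-pass convergence property described above.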

So let's take a simple example, just to convince ourselves that this is doing the right thing. I'm looking now at a much simpler network: we only have A, B, and C. So there's two factors: theta 1, over A, B, and theta 2, over B, C. And first, we're going to construct the overall theta, which is the sum of theta 1 and theta 2; so this is a factor summation. And let's look at it and see what the MAP assignment is in this example. And if we just look at the numbers that we computed, we see that the MAP assignment is a1, b1, c1, with a value of 7.

Â 14:03

Now let's look at the message passing process that you would have on this very simple clique tree with two cliques. So theta 1 is assigned to clique 1, theta 2 is assigned to clique 2, and let's look at the two messages that are passed. So here, the AB clique passes a message to BC, and that message is the max-marginal, our max-marginalization over the variable A. And so we can see that here we have a max between 3 and negative 1, so we get 3, because this is the max over the two values of A that are consistent with b1. And for b2 we have the max over 0 and 1, so we get 1. And exactly the same process gives us this message, where I'm max-marginalizing C, that's passed in the other direction. Now, each of these two cliques takes its incoming message. Notice that I've gotten immediate convergence here, because there's only one message to pass in each direction. And so what I get here is the sum of this factor plus the incoming message.

Â 15:55

And I can do exactly the same operation to get this factor on the right: I get that by summing up this with that, and I get this factor over here. And you'll notice that, miraculously, the max in each of these factors separately is a1, b1 on the left and b1, c1 on the right. So, sure enough, what I got here was the most likely assignment consistent with a1, b1, and over here the most likely assignment consistent with b1, c1.

And you'll notice, if you go back and check, and I'm not going to do that right now, but if you go back and check this big table over here, you can convince yourselves that this doesn't only hold true for a1, b1. If you look, for example, at a2, b1, you'll see that what you get here, the value 3, is the value of the most likely assignment that is consistent with a2, b1. So actually, let's go back and check that. So here is a2, b1: we have two assignments consistent with a2, b1, one is 0.5 and one is 3. 3 is the best one, and sure enough, the value that we get here is 3, so it all works. So, what can we say about this algorithm

once it converges. The important thing is that we can compute beliefs at each clique that represent exactly the max-marginals of that clique. So how do we compute these beliefs? Analogously to what we had in the context of sum-product, we look at the factor assigned to the clique, which is the theta i, and we add to it the incoming messages. Before, you remember, we had the factor assigned to the clique multiplied by the incoming messages, because we were doing products instead of summations. What does this belief encode? This belief is exactly the max-marginal. So specifically, we can consider, for any given assignment to the clique Ci, the score of the best possible completion of that assignment, and that is the value of the belief at that clique. So it's the max over all of the variables Wi that are not assigned to the clique.

One important consequence of the fact that we have max-marginals is that we get a calibration property here as well: the cliques must agree on the variables that they share. So, to understand that, let's look at the two cliques that we have in the simple two-clique example. And here are the beliefs that we had for clique 1 and for clique 2: the sum of theta 1 plus the incoming message over here, and theta 2 plus its incoming message. And let's look at what the calibration property tells us. It tells us, for example, if we look at the implications of this clique regarding the variable B, that the best possible completion for b1 has a score of 7 and the best possible completion for b2 has a score of 3. This clique, if we look at it, tells us that the best possible completion for b1 also has a score of 7 and the best possible completion for b2 has a score of 3. So these two cliques agree regarding the variable in the shared scope, which is the variable B. So, to summarize: we can apply, in the

context of max-sum, exactly the same clique tree algorithm used for sum-product. Messages are passed in the same way; the clique tree is constructed in the same way. The only difference is that the message passing operations use max and sum, as opposed to sum and product. And exactly as in sum-product, convergence is achieved after a single up-and-down pass, and the result of that is a set of beliefs that represent max-marginals at each of the cliques, where, as a reminder, the max-marginals tell us, for each assignment, the score of the best possible completion of that assignment.
