0:01

We defined the belief propagation algorithm as a message passing algorithm over a fairly general data structure called a cluster graph, and we showed that if the cluster graph has certain properties, then so does the belief propagation algorithm. What we're going to do now is look at a special case of the belief propagation algorithm, where we pass messages not over a general cluster graph, but rather over a data structure called a clique tree, which, as we'll show, is a special case of a cluster graph. But interestingly, when we pass messages over a clique tree, we can show that the belief propagation algorithm has much stronger performance guarantees than it does over a general graph. Specifically, we can show that not only do we get convergence much faster, we also get convergence to the exact answers for probabilistic inference.

So let's consider the case of passing messages over a tree, and let's start with the simplest kind of tree, which is just a chain. So in this case, we have a chain A, B, C, D, E, and we're going to assume this is an undirected MRF or CRF. For simplicity, we're only going to consider pairwise potentials, without loss of generality. And so we're going to have just pairwise potentials here: for example, in the cluster A, B, we're going to have the pairwise potential over A, B, and so on. So here we have psi one, psi two, psi three, and psi four.

So now let's think about the message passing process, and we're going to start passing messages from the left. So cluster one sends a message to cluster two, and since there are no incoming messages at this point, the message that gets passed is simply the sum over A of psi one of A, B. And that's a message whose scope is B. Now we're going to have cluster two take that message and pass a message to cluster three. And so now it's taking its own potential, psi two, and multiplying this message into it. The result is a factor over B and C, and we're going to sum out over B, because we only want to restrict to the scope C, which is the shared scope between cluster two and cluster three. And so delta two three is this message over here. Now we're going to pass messages one more time, all the way down to the end. And so now, again, we're taking delta two three, multiplying it with psi three, which comes from there, and that gives us a message over scope D, which gets sent to cluster four.

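The forward pass on this chain can be sketched numerically. This is a hedged illustration with made-up binary potentials (random tables, not numbers from the lecture), just to show the shape of each message:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up pairwise potentials for the chain A-B-C-D-E; each psi_k(X, Y)
# is stored as a 2x2 array indexed by (x, y), all variables binary.
psi1 = rng.random((2, 2))  # psi_1(A, B)
psi2 = rng.random((2, 2))  # psi_2(B, C)
psi3 = rng.random((2, 2))  # psi_3(C, D)
psi4 = rng.random((2, 2))  # psi_4(D, E)

# Forward messages, left to right along the chain:
delta_12 = psi1.sum(axis=0)                        # sum over A; scope B
delta_23 = (psi2 * delta_12[:, None]).sum(axis=0)  # sum over B; scope C
delta_34 = (psi3 * delta_23[:, None]).sum(axis=0)  # sum over C; scope D

print(delta_34.shape)  # (2,): a message over the single variable D
```

Each message multiplies the cluster's own potential by the incoming message, then sums out the variable that is not in the shared scope.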
Â 2:54

You could also pass messages in the other direction. So this, for example, is what would happen if we passed messages from cluster four to cluster three. Regardless of when that message gets passed, there are no other messages that cluster four would multiply in when sending a message to cluster three. Because, whether it got the message from three or not, it's not going to multiply that in. Because, remember, you only send a cluster a message that doesn't involve messages that you received from it. And so, in this case, the message that you get is the sum over E of the factor psi four, which we get from here. Passing messages in the other direction again, we see that the message that cluster three passes to cluster two, which is delta three two, involves psi three and this message over here. And finally, exactly the same idea.

Â 3:54

These are the messages that get passed. Okay. The one important property that you have when you run message passing in the way that we just showed is that, somewhat surprisingly relative to what we've seen before, the resulting beliefs that you get are correct. And let's convince ourselves of that relative to cluster three. In the order that we defined, cluster one passed delta one two to cluster two, which was then used to define delta two three; at the same time we had delta four three. And so the belief that we get here is the product of psi three, delta two three, and delta four three. Let's unpack each of those messages. So delta two three, as we remember, was defined here. And so we're going to rewrite delta two three as the summation over B of psi two times delta one two. And similarly, we unpack delta four three using this expression that we have over here, right there.

Â 5:08

Now we're going to do one further level of unpacking. And we're going to remind ourselves of what delta one two is, which is as defined over here. And so that gives us this expression over there. And now look at what we have. We have a product of all four initial factors in the network: psi one, psi two, psi three, and psi four. And we have a summation over all the variables that are not in cluster three, which are A, B, and E. C and D are in that cluster, so we sum out over everything else. So what we see is that we have a product of factors, with some of the variables marginalized out.

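This correctness claim is easy to check numerically: the belief at cluster three (its own potential times both incoming messages) should equal the brute-force result of multiplying all four factors and summing out A, B, and E. A small sketch, again with made-up binary potentials:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up pairwise potentials on the chain A-B-C-D-E, all binary.
psi1 = rng.random((2, 2))  # psi_1(A, B)
psi2 = rng.random((2, 2))  # psi_2(B, C)
psi3 = rng.random((2, 2))  # psi_3(C, D)
psi4 = rng.random((2, 2))  # psi_4(D, E)

# Messages arriving at cluster 3:
delta_12 = psi1.sum(axis=0)                        # scope B
delta_23 = (psi2 * delta_12[:, None]).sum(axis=0)  # scope C
delta_43 = psi4.sum(axis=1)                        # scope D (sum over E)

# Belief at cluster 3: psi_3 times both incoming messages.
beta3 = psi3 * delta_23[:, None] * delta_43[None, :]

# Brute force: product of all four factors, with A, B, E summed out.
brute = np.einsum('ab,bc,cd,de->cd', psi1, psi2, psi3, psi4)

print(np.allclose(beta3, brute))  # True: the two factors over (C, D) agree
```

The agreement holds for any choice of potentials, which is the point of the argument in this section.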
Â 6:12

So this is really reminiscent of variable elimination. And if you recall, when we talked about the correctness of variable elimination, we said that variable elimination is correct as long as, when you sum out a variable, you first multiply in all factors that involve that variable in their scope. So let's convince ourselves that that is in fact what's going on here. Let's look first at the variable E. When we sum out E, we're only involving psi four, but that's okay, because only cluster four has E in its scope. Here we're summing out A, and once again, only psi one involves A in its scope. Here we're summing out B, and really there should be parentheses here; they were there implicitly before. And so you can see that when I'm summing out B, I've involved psi two, which has B in its scope, as well as what was left over from summing out psi one, which was the other factor that has B in its scope. And so you can see that I've marginalized out the variables in a legal order, multiplying in all of the expressions that I needed to multiply before I did the summation. And so this is a legal order of operations, and therefore it gives me the correct result at the end.

Â 7:49

So now let's expand that into a somewhat broader case. We're now going to define this notion called the clique tree. And a clique tree is an undirected tree, kind of like a cluster graph was an undirected graph; in this case, it's a tree. Such that, once again, the nodes are clusters, and once again, we have edges between the clusters, which are annotated with sepsets. And whereas before the sepset needed to be only a subset of Ci intersection Cj, so it could be equal or it could be smaller, here we're going to require that the sepset is actually the intersection of the two adjacent clusters.

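As a tiny illustration (using the cluster names from the chain example; the dictionary representation is my own), the sepsets of a clique tree can be computed directly as intersections of adjacent clusters:

```python
# Clusters of the chain clique tree, stored as sets of variables,
# plus the tree edges. In a clique tree, the sepset on each edge is
# required to be exactly the intersection of the two adjacent clusters.
clusters = {1: {'A', 'B'}, 2: {'B', 'C'}, 3: {'C', 'D'}, 4: {'D', 'E'}}
edges = [(1, 2), (2, 3), (3, 4)]

sepsets = {(i, j): clusters[i] & clusters[j] for (i, j) in edges}
print(sepsets)  # {(1, 2): {'B'}, (2, 3): {'C'}, (3, 4): {'D'}}
```

In a general cluster graph, any subset of these intersections would be a legal sepset; the clique tree definition pins it down to the full intersection.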
Â 8:39

So now let's go and understand what properties the clique tree that we showed has, as a special case of a cluster graph. So for a cluster graph, we had the notion of family preservation, and the reason that we had family preservation is that if we have a set of factors Phi, we needed to take each phi k in Phi and assign it to some cluster, so that each piece of information we have is represented somewhere. And the cluster needed to be such that it could accommodate the factor's scope, so that the scope of the factor was a subset of the set of variables in the cluster. And the same requirement holds here, because a clique tree is a special case of a cluster graph: for every factor phi k, there must exist at least one cluster whose scope is big enough to accommodate that factor.

The second property that we had in cluster graphs was the running intersection property, and let me clarify that the property that we see here is the cluster graph variant; we're going to make it specific to clique trees in just a minute. So the cluster graph variant told me before that for each pair of clusters Ci and Cj and a variable X that is present in both of them, there needed to exist a unique path between Ci and Cj for which all clusters and sepsets contain X. So what does that mean in the context of clique trees? Let's assume that we have C7 containing X and C3 containing X. The running intersection property says there needs to exist a path between them through which all clusters and sepsets contain X. Well, there's only one path in a tree between C7 and C3: it's this one. And so what that property now tells me is that X has to be here, here, here, and here, because otherwise the running intersection property would be violated. So it means that in the context of trees, the running intersection property can actually be written much more simply. It can be read as saying that for each pair of clusters Ci and Cj and each variable X in their intersection, each cluster and sepset on the unique path between them must contain X. And so this is just the instantiation of the running intersection property in the context of trees.

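The tree form of the property is easy to check mechanically. This sketch (my own representation, not code from the lecture) finds the unique path between every pair of clusters sharing a variable and verifies that the variable appears in every cluster along it; since sepsets are intersections of adjacent clusters, that also covers the sepsets:

```python
from collections import defaultdict

def tree_path(edges, start, goal):
    """Return the unique path between two nodes of a tree, via DFS."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

def satisfies_rip(clusters, edges):
    """Tree-form running intersection property: for every variable X and
    every pair of clusters containing X, each cluster on the unique path
    between them (and hence each sepset) must also contain X."""
    for x in set().union(*clusters.values()):
        holders = [c for c, scope in clusters.items() if x in scope]
        for i in holders:
            for j in holders:
                if any(x not in clusters[k] for k in tree_path(edges, i, j)):
                    return False
    return True

# The chain clique tree from the example satisfies the property...
chain = {1: {'A', 'B'}, 2: {'B', 'C'}, 3: {'C', 'D'}, 4: {'D', 'E'}}
print(satisfies_rip(chain, [(1, 2), (2, 3), (3, 4)]))  # True

# ...while a tree where B skips the middle cluster does not.
broken = {1: {'A', 'B'}, 2: {'C'}, 3: {'B', 'D'}}
print(satisfies_rip(broken, [(1, 2), (2, 3)]))  # False
```

Because the path in a tree is unique, no search over alternative paths is needed, which is exactly why the property simplifies here.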
Â 11:25

So now let's look at a more complicated clique tree. This is the clique tree that corresponds to the network that we previously used to demonstrate variable elimination; it's the elaboration of our student network. And so, unlike the previous example, this is actually a clique tree for a Bayesian network as opposed to a Markov network, but as you'll see, there's really no difference. So here we have a set of factors; in this case, the factors are CPDs. And because of the family preservation property, we can be assured that each factor must be assignable to some clique. And so here we have P of C and P of D given C assigned to this clique, P of G given I and D here, P of I and P of S given I here, and so on and so forth. So each of the CPDs we have is assigned to one and exactly one of the cliques in this clique tree.

And we can convince ourselves that this tree satisfies, for example, the running intersection property. So let's look, for example, at the variable G. The variable G exists here and it exists here, and sure enough it exists here, here, here, here, and here. And you can similarly convince yourselves that the running intersection property holds for any of the other variables in this clique tree.

Let's look at what message passing looks like for this more elaborate clique tree. It's only a slight bit more complicated. So here's delta two three, for example, which is the message passed between cluster two and cluster three. It takes psi two, which is the product of the factors assigned to this clique; there's actually only one in this case, so it's only P of G given I and D. It multiplies that with the incoming message delta one two and sums out over D. The only slight difference between this and the previous example is that here, for example, when cluster four passes a message to cluster three, we can see that over here there are two variables, J and L, both of which must be summed out in order to produce the message over the requisite sepset.

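Summing out more than one variable changes nothing structurally: the message is still the cluster potential times the incoming messages, summed over every axis not in the sepset. A hedged sketch with made-up binary tables (the axis ordering and the incoming message's scope are assumptions for illustration, not taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up potential for a cluster over (G, J, S, L), all binary; the
# sepset with the neighboring cluster is {G, S}, so the outgoing
# message must sum out both J and L.
psi4 = rng.random((2, 2, 2, 2))  # axes ordered (G, J, S, L)
delta_in = rng.random((2, 2))    # assumed incoming message over (G, J)

# Multiply in the incoming message, then sum out J (axis 1) and L (axis 3):
delta_out = (psi4 * delta_in[:, :, None, None]).sum(axis=(1, 3))
print(delta_out.shape)  # (2, 2): a factor over the sepset (G, S)
```

Whether one variable or several get summed out, the scope of the result is always exactly the sepset between the two clusters.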
Â 14:01

Now, previously, we defined the running intersection property. One of the really nice aspects of the running intersection property is that it, by itself, together with family preservation, suffices to prove the correctness of message passing in a clique tree. And the reason for that is the following. So let's consider a variable X over here. And what we want to prove

Â 14:41

is that if we eliminate X at some point in the tree, at some point in the message passing process, then we have multiplied in all factors that involve X. And so here we have, for example, X here and here, and by the running intersection property, we have that X is here and here as well, and on this edge. And let's imagine that X is eliminated on this edge. That is, the message that's passed from C3 to C5 doesn't involve X, and so X must be summed out at this point.

Â 16:03

And the reason we know that if X is in C5, then X is in S3,5, is the running intersection property. So the running intersection property relates the structure of the graph to the time at which variables are eliminated. So now let's convince ourselves of the correctness in this slightly more complicated example, which, as we've seen, satisfies the running intersection property. Here, once again, we have psi one, which is the product of these two factors; psi two, which is this one; psi three, which is the product of these two factors; psi four; and psi five.

Â 16:57

And once again, let's, for example, consider what we get at cluster three. And we can see that we have multiplied in all of the potentials: psi one, psi two, psi three, psi four, and psi five. And we have eliminated all of the variables except for G, S, and I. So here we've eliminated C, D, J, L, and H, leaving us only with the variables G, S, and I, which we want to keep. And we can similarly convince ourselves, as we did before, that when we eliminate a variable, we've multiplied in all of the factors that involve that variable.

Â