An alternative class of algorithms to the variable elimination algorithms is the class of message passing algorithms. And, as we'll see, this is a class that is in some ways closely related to variable elimination. It also offers us additional flexibility in how we do the summation and factor product steps, so as to potentially come up with a lower complexity than would be required by even the minimal elimination ordering. So let's consider a simple Markov network of the following type, and let's assume that I don't want to go to the cost of eliminating variables, although in this network it's not going to be too expensive. Instead, what we're going to do is construct what we call a cluster graph. A cluster graph is a data structure in which we're going to take little bits of knowledge from this graphical model and place them in things called clusters. So here we have four clusters in this example. Cluster one is a cluster whose jurisdiction of influence is the pair of variables A, B. Cluster two has jurisdiction over B and C. Three is over C and D. And four is over A and D. And these clusters are going to sit there, and they're going to talk to each other, and they're going to try and convince each other that what they think about a variable that they both consider to be under their jurisdiction is correct. So, for example, cluster one is going to talk to cluster two about the variable B, and it's going to tell cluster two what it thinks about B, so that cluster two might become more informed about the distribution over B. And so what we're going to do is, initially, each cluster is going to have its own little piece of information: phi one is going to go here, phi two is going to go there, phi three goes there, and phi four goes there. And now the clusters are going to communicate via these things called messages. So we're going to slightly rename things. We're going to call the initial set of beliefs, if you will, or evidence that a cluster has over the variables in its jurisdiction, the psis. In this example the psis are just the phis in the original model, but as we'll see, sometimes it can become a little bit more complicated than that. So now cluster one has the factor psi one, cluster two has psi two, and so on. And now let's imagine that cluster two wants to send a message to cluster one. So it has to figure out what it believes, and a priori we're going to start out by initializing with totally uninformative messages, because initially the clusters haven't even started talking to each other. So all messages are initialized to be one. But now cluster two can come back and say, okay, I'm going to pass on the information, uninformative as it was, that I got from my other neighbor, cluster three. And notice that I call this message delta two one, two being the "from" and one being the "to". And so, having received delta two one, cluster one is going to say, well, I'm going to take that, multiply it with my current thoughts about the variables A, B, and that's going to give me a more informed factor. And now I'm going to communicate that information to cluster four.
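To make that first exchange concrete, here is a minimal sketch, not from the lecture itself, of cluster one receiving the uninformative message delta two one and producing delta one four on that A, B, C, D loop; the numeric potential values below are assumptions purely for illustration.

```python
import numpy as np

# Toy Markov network over binary variables A, B, C, D arranged in a loop.
# Cluster 1 holds psi_1(A, B); axis 0 is A, axis 1 is B.
psi_1 = np.array([[10.0, 1.0],
                  [1.0, 10.0]])   # assumed toy values, not from the lecture

# All messages start out uninformative (all ones).
delta_2_1 = np.ones(2)            # message from cluster 2 to cluster 1, over B

# Cluster 1 multiplies the incoming message into its own beliefs ...
informed_1 = psi_1 * delta_2_1    # broadcasting multiplies along the B axis

# ... and sums out B, the variable cluster 4 (over {A, D}) does not care about,
# to produce the message delta_{1,4}, which is a factor over A alone.
delta_1_4 = informed_1.sum(axis=1)
print(delta_1_4)                  # -> [11. 11.], still uninformative about A here
```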
But cluster four doesn't care about B, and so what I'm going to communicate to cluster four, which is this message delta one four, from one to four, is the sum over B, the variable that four doesn't want to hear about, of the incoming message times my initial beliefs, my initial factor. Now, this general process is what keeps the message passing algorithm going. Each cluster is going to send messages to its adjacent clusters that reflect this exact same process. So, to take a different example, here is delta three four. This is the message that goes from three to four, and that message takes into consideration the evidence that two sent to three, which is this guy over here, times whatever three thought about C, D to begin with. Notice, and this is important, that the message that three sends to four doesn't take into consideration the information that it got from four. So there isn't, in this expression over here, a contribution from delta four three. Because you want to avoid the case where I repeat back to you a rumor that you just told me, and we all become more convinced about the truth of this rumor, because, oh, I thought about this first, but now you're reinforcing me by telling it to me again, and the beliefs are just going to go up and up and up. And so what happens here is that we deliberately restrict attention to evidence that comes in from other sources. So three only uses evidence from two when reporting to four, and conversely, it only uses evidence from four when reporting to two. And so this defines a communication protocol by which one cluster in the graph can communicate information to its neighbors. So how do we generalize this message passing process? Let's construct a more general version of it. This uses a data structure called a cluster graph. A cluster graph is an undirected graph whose nodes are not variables anymore; this is not a graphical model of the type that we've seen. The nodes in this undirected graph are clusters that correspond to subsets of variables, just like we had before. And we're going to connect two adjacent clusters C i and C j by an edge annotated with what's called the sepset S i j, which is the set of variables that they choose to talk about. And clearly each one can only talk about variables that it knows about, which is why the sepset S i j has to be a subset of both C i and C j. So once again, C i is the jurisdiction of cluster i; these are the variables that it understands. And S i j is the communication between two adjacent clusters in the cluster graph. So now, given a set of factors Phi, we're going to initialize the model by giving each cluster in the graph a certain amount of information. Each of my initial factors phi k is going to be assigned to one and only one cluster. And this is important: there needs to be at least one, so that the information is taken into account somewhere, and it shouldn't be more than one, because if you give it to more than one cluster, you're going to double count the evidence; they're each going to think it's an independent piece of evidence, and it's going to be counted twice. Now, where do we put the information corresponding to factor k?
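As a rough sketch of what that data structure might look like in code, and of the rule that a factor may only be assigned to a cluster whose scope contains it, here is an illustrative example; the particular clusters, sepsets, and scopes are assumptions of this sketch and are not meant to reproduce the lecture's exact graph.

```python
# A minimal cluster-graph sketch (names and structure are illustrative, not the
# lecture's exact example): nodes are clusters (sets of variables), each edge
# carries a sepset S_ij that must be contained in both endpoint clusters.
clusters = {
    1: {"A", "B", "C"},
    2: {"B", "C", "D"},
    3: {"B", "D", "F"},
}
edges = {
    (1, 2): {"B", "C"},   # sepset between clusters 1 and 2
    (2, 3): {"B", "D"},
}

# Sanity check: every sepset is a subset of both clusters it connects.
for (i, j), sepset in edges.items():
    assert sepset <= clusters[i] and sepset <= clusters[j]

def assign_factor(scope, clusters):
    """Assign a factor to exactly one cluster whose scope contains it."""
    for i, c in clusters.items():
        if scope <= c:          # the cluster must understand every variable
            return i
    raise ValueError(f"no cluster can hold a factor over {scope}")

print(assign_factor({"B", "C"}, clusters))   # could go to 1 or 2; returns 1 here
```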
We can only put it in a cluster that understands every single variable in that factor. So if we have a factor with a certain scope, that scope had better be a subset of the variables that the cluster understands, because otherwise we can't even talk about those variables. So once we've done that, we can now define the initial beliefs of a particular cluster as the product of all of the factors that are assigned to it. Now, some clusters might be assigned one factor, in which case psi is just equal to that phi; that was the case in the example that we just saw. Some clusters might have several factors assigned to them, in which case we need to multiply them to create a single factor that is, sort of, the total informed beliefs of the cluster. And some clusters might have no factors assigned to them, in which case this is a null product and is equal to one. So now let's look at an example of a somewhat more interesting cluster graph than the trivial one that we showed earlier. Here we have a set of factors: phi one of A, B, C; phi two of B, C; phi three; phi four; phi five; phi six; and so on. So initially we have to figure out, for each of those factors, a cluster in which to put it. For A, B, C there's really only one choice, because there's only one cluster in this entire graph that understands all of A, B, and C, and that's this cluster over here. B, C, however, has two choices: it can go into cluster one or it can go into cluster two, because both cluster one and cluster two understand B and C. We're going to put it in cluster two. I mean, we could have put it in either one; both are fine. Phi three has, again, two choices. It can go into cluster two, or it can go into cluster three. We're going to go ahead and make the decision to put it in cluster two. Phi four has only one choice, because only one cluster has both D and E in its scope, so it goes here. Phi five, similarly, only over here. B, D, again, there is more than one choice: we can put it in cluster two, or we can put it in cluster three. Let's, for simplicity, put it in cluster three. And B, D, F: only one choice. This is one possible way of assigning the factors to clusters. There are other alternatives, as I said, that would work. If we do this, we end up, for example, with psi two being the product of phi two times phi three, whereas psi one is simply equal to phi one, and psi three is equal to phi six times phi seven. Here is a different assignment of the same factors to the clusters, and we can see that it's also equally legitimate. Okay. This was one cluster graph for this set of factors. Here's another cluster graph, so let's compare them, for the exact same set of factors. And notice that, in fact, the clusters haven't changed. What's changed is the edges and the sepsets between them. Different cluster graph. Okay. So now let's think about message passing in the context of this more richly structured cluster graph, to see what it looks like here. So here, for example, if we're interested in passing a message from cluster one to cluster four, we are going to take psi one, which is the product of the factors that were initially assigned to cluster one. We're going to take in the message that this cluster got from its other neighbor, two.
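Here is a small illustrative sketch of that initial-potential computation, with assumed toy factors that are not the lecture's: psi of a cluster is the product of the factors assigned to it, and an empty product gives one.

```python
# Factors as (scope, table) pairs, where a table maps a tuple of values
# (ordered by the scope) to a number.  Scopes and values here are assumed
# toy choices, not the lecture's exact factors.
phi = {
    "phi1": (("A", "B"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),
    "phi2": (("B",),     {(0,): 3.0, (1,): 1.0}),
}

# Each factor is assigned to exactly one cluster.
factor_to_cluster = {"phi1": 1, "phi2": 1}   # both happen to land in cluster 1

def initial_potential(cluster_id, values):
    """psi_i evaluated at a full assignment: the product of the factors
    assigned to cluster i, or 1 (the empty product) if it got none."""
    result = 1.0
    for name, target in factor_to_cluster.items():
        if target == cluster_id:
            scope, table = phi[name]
            result *= table[tuple(values[v] for v in scope)]
    return result

print(initial_potential(1, {"A": 0, "B": 0}))   # 2.0 * 3.0 = 6.0
print(initial_potential(2, {"A": 0, "B": 0}))   # no factors assigned -> 1.0
```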
We are going to multiply them, and then we are going to sum out all of the variables that one understands but four doesn't. So here, for example, one understands A and C and four doesn't, so we have to sum out over A and C, and that is the message delta one four. What about delta four one? Delta four one is the message that goes in the other direction. And here, notice that four gets messages from all sorts of other clusters. So, in addition to its original factor psi four, it gets a message from cluster two, it gets a message from cluster five, and it gets a message from cluster three, each over its own sepset. All of these are going to be multiplied together to give the current most informed beliefs over the variables B and E, which we then marginalize over E, which cluster one doesn't understand, to produce a message over B. And once again, the message from one is not used to inform the message that four sends back. So that gives us, overall, the following expression for message passing between cluster i and cluster j. Delta i j is a factor over the sepset S i j, and it has the following general form: take the factors initially assigned to cluster i, multiply in all of the incoming messages other than the one from j, and then sum out the variables that cluster j doesn't know about, that is, everything that's in the scope of C i but not in the scope of C j. And that gives us a factor over the sepset, which is produced as the message. Putting that together gives us an algorithm which is generally called belief propagation, for the reason that clusters are propagating, if you will, informed beliefs to each other. And here is a summary of the algorithm. Each factor phi is first assigned to a cluster. That is used to construct our initial potentials, these psis that we talked about. We initialize all of the messages, before anybody starts communicating, to be one, and then we repeat the following process: we select some edge in the graph between adjacent clusters, and we pass the message between cluster i and cluster j over that edge. This is the expression for that message, the one that we saw in the previous slide, and the process is repeated again and again. At the end of the process, a cluster needs to know what to believe about the variables that it understands. And so it takes its own initial beliefs and all of the information that it got from all of its neighbors, multiplies it all together, and that produces these things which are called beliefs. Now, there are several important aspects of this algorithm that are left undefined, and that we're going to need to talk about later on. The first of these is: repeat until when? When do we decide that we're done and that we can stop? We'll talk about that in the context of different variants of this algorithm later on. The second thing that's left undefined is how I select the edges: which edge do I select to pass messages over? Here there is the obvious simple thing, which is just to go in some pre-specified order, and that's one option, round-robin message passing. But it turns out that there are better strategies than that, and we'll talk about that too. So is this algorithm any good? Well, it turns out that, you know, yes and no.
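For concreteness, here is one possible sketch of the whole loop on the simple A, B, C, D cluster graph from the start of the lecture: the message delta i j sums out, over C i minus S i j, the product of psi i with all incoming messages except the one from j, and the final beliefs multiply psi i by all incoming messages. The random potentials, the fixed number of round-robin sweeps, and the per-message normalization are all assumptions of this sketch, not part of the algorithm as stated.

```python
import numpy as np

# Loopy belief propagation sketch on the A-B-C-D loop.  The pairwise
# potentials are assumed toy values; a fixed number of sweeps is just one
# possible stopping rule.
rng = np.random.default_rng(0)

# Cluster i owns the ordered pair of variables vars_of[i]; psi[i] is its
# initial potential, indexed in that order.
vars_of = {1: ("A", "B"), 2: ("B", "C"), 3: ("C", "D"), 4: ("D", "A")}
psi = {i: rng.uniform(0.5, 2.0, size=(2, 2)) for i in vars_of}

# Edges of the cluster graph, each labelled with its sepset (one variable here).
neighbors = {1: (2, 4), 2: (1, 3), 3: (2, 4), 4: (3, 1)}
sepset = {(1, 2): "B", (2, 3): "C", (3, 4): "D", (4, 1): "A"}
sepset.update({(j, i): s for (i, j), s in sepset.items()})

# All messages start out uninformative.
delta = {(i, j): np.ones(2) for i in neighbors for j in neighbors[i]}

def send(i, j):
    """delta_{i->j}: multiply psi_i by messages from neighbors other than j,
    then sum out the variables of cluster i that are not in the sepset."""
    belief = psi[i].copy()
    for k in neighbors[i]:
        if k == j:
            continue                       # never echo j's own message back
        axis = vars_of[i].index(sepset[(k, i)])
        belief = belief * (delta[(k, i)] if axis == 1
                           else delta[(k, i)][:, None])
    keep = vars_of[i].index(sepset[(i, j)])
    return belief.sum(axis=1 - keep)

# Round-robin schedule for a fixed number of sweeps.
for _ in range(50):
    for i in neighbors:
        for j in neighbors[i]:
            m = send(i, j)
            delta[(i, j)] = m / m.sum()    # normalize to keep numbers stable

# Final (pseudo-)beliefs: own potential times all incoming messages, normalized.
def belief(i):
    b = psi[i].copy()
    for k in neighbors[i]:
        axis = vars_of[i].index(sepset[(k, i)])
        b = b * (delta[(k, i)] if axis == 1 else delta[(k, i)][:, None])
    return b / b.sum()

print(belief(1))   # approximate marginal over (A, B)
```

Normalizing each message is a common numerical convenience rather than part of the definition: since the beliefs are renormalized at the end anyway, it does not change the answer, it only keeps the message values from growing or shrinking over many sweeps.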
So first of all, if you pass messages over a graph like the one that I showed before, like the little A, B, C, D loop, you see that we eventually achieve convergence: we do converge to some probability, but that probability is a little bit off. That is, the answer is not exact. So, in general, with exceptions that we will talk about, this is an approximate algorithm, which shows that there's no free lunch: if the problem was NP-hard, it's not like I now have an easy way of solving it. So given all these problems with the belief propagation algorithm, what makes us think that it's an effective algorithm to use? When the algorithm was rediscovered in the 1990s, Murphy, Weiss and Jordan, as well as others, looked at its performance in the context of real networks, and specifically networks that have a lot of loops, which is what causes the belief propagation algorithm to misbehave. And so here is an example network. It's called the pyramid network; it's a network that is analogous to one that arises in image analysis. And what they showed is that when you plot, on the one hand, the exact marginals on the x axis, for different marginals in the network, and on the y axis the marginals computed by loopy belief propagation, the marginals, by and large, sit almost exactly on a straight line, with few exceptions, showing that loopy belief propagation is very close to accurate on this network. Here's another network. This is a simplified version of a medical diagnosis network. And once again, when we compare the correct marginals on the x axis to the belief propagation marginals on the y axis, the marginals sit almost exactly on this straight line, which shows that, again, they're very close, so belief propagation is very close to accurate. So, to summarize: the belief propagation algorithm passes messages over a graph of clusters that are connected to each other via sepsets. Adjacent clusters pass information to each other in these messages, transmitting information only about the variables in the sepset, which are the ones that they have in common. The message that cluster i sends to cluster j summarizes everything that i knows about the variables in the sepset, except for information that it obtained from j, so that one avoids direct double counting of the same piece of evidence, where j gets its own information back again via i. We've seen that the algorithm may not converge, and that it can have oscillatory behavior, and that the resulting beliefs are pseudo-marginals, in that they are not necessarily the exact marginals from the theoretical perspective. But nevertheless, as we've seen, the algorithm actually performs quite well in a range of practical applications, which is why it's quite commonly used.