0:00

So we've previously seen the notion of pairwise Markov networks, but now we're going to define a much more general notion that is considerably more expressive than the pairwise case. That definition is called the Gibbs distribution. In order to motivate the notion of a Gibbs distribution, let's look at the most expressive Markov network that we could possibly define in the context of pairwise interactions. So here we have four random variables, A, B, C, D, and I've introduced all of the possible pairwise edges between them. And so the question that we'd like to ask ourselves is, is this good enough? That is, is this fully expressive?

0:49

Or in other words, can it represent any probability distribution over four random variables? One way to convince ourselves of whether it can or can't is to go to the general case and do a little bit of asymptotics. So consider a fully connected pairwise Markov network over n random variables, and let's assume that each variable Xi has d values. How many parameters does the network have? For the moment, let's focus on just the pairwise potentials and ignore the potentials associated with single nodes. Now let's do a little bit of analysis: if we have n variables, how many edges are there in the network, and how many parameters per edge?

2:06

Right, so can we represent any probability distribution using O of n squared times d squared parameters? How many parameters are there in a general probability distribution over n random variables, where each has d values? How many free parameters do we have? Well, d to the n minus one, which is much, much bigger than that. Which means that, even just thinking about it intuitively without getting into a formal argument, pairwise Markov networks are not sufficiently expressive to capture all probability distributions. So how do we increase the coverage of this undirected representation? We need to move away from pairwise edges.
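As a sanity check on the asymptotics, here is the count for a small concrete case (the choice of n = 10 and d = 3 is my own illustration, not from the lecture):

```python
from math import comb

n, d = 10, 3  # example: 10 variables, each taking 3 values

# Fully connected pairwise network: one d x d potential table per edge.
pairwise_params = comb(n, 2) * d ** 2        # O(n^2 d^2)

# General joint distribution over n d-valued variables.
general_params = d ** n - 1                  # d^n entries, minus 1 for normalization

print(pairwise_params)  # 405
print(general_params)   # 59048
```

Even at this small size the general distribution needs two orders of magnitude more parameters, and the gap grows exponentially in n.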

2:55

So, in order to parameterize what we call a general Gibbs distribution, we're going to parameterize it using general factors, each of which has a scope that might contain more than two variables. Whereas before we just had factors over pairs, now we have factors over triplets, quadruplets, and anything else. Can we now represent every probability distribution? Well, sure, because we can have a factor over all n variables together, and that immediately allows us to define a general probability distribution. In fact, we even said earlier that a probability distribution is simply a factor whose scope is X1 up to Xn.

3:52

And we're going to define this distribution in two steps. Three steps, even. The first thing we're going to do, just like in the case of pairwise Markov networks, is take all of the factors, so here we have k factors, and multiply them. This is just the familiar operation of factor product, which we've seen in multiple contexts before. Now, this is perfectly fine, but the problem is that the result is not necessarily a probability distribution. In fact, it almost never will be a probability distribution, because we have put no constraints on the factors. That's why we have this tilde here, which highlights the fact that this is what we previously called an unnormalized measure.

4:58

The second step is to define the partition function, which we get by summing the unnormalized measure over all possible assignments to the variables X1 up to Xn. That partition function can then be used to divide all of the entries in the unnormalized measure, to give us something that no longer has a tilde on it but rather is a probability distribution, P sub phi of X1 up to Xn.

Now, that was just a definition of the distribution in terms of a set of factors. Where is the Markov network in all this? So let's think about what Markov network we would like to have for a Gibbs distribution with a certain set of factors phi. In order to get the intuition, let's look at a distribution that involves two factors: phi 1, whose scope is A, B, and C, and phi 2, whose scope is B, C, and D. What edges should the Markov network have when we want to encode the fact that A, B, and C all get to interact with each other, together? Intuitively, what we'd like is a network with an edge between A and B, an edge between B and C, and an edge between A and C, because that captures the fact that there are direct probabilistic relationships between all of them. What about the other factor? Here we get edges between B and C, C and D, and B and D.
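The three steps can be sketched for this toy example with factors phi 1 over A, B, C and phi 2 over B, C, D; the factor values below are arbitrary illustrations of my own, chosen only so the numbers are easy to follow:

```python
import itertools

# Toy Gibbs distribution over binary variables A, B, C, D.
def phi1(a, b, c):            # scope (A, B, C); values are arbitrary
    return 1.0 + a + 2 * b + c

def phi2(b, c, d):            # scope (B, C, D); values are arbitrary
    return 1.0 + b + c + 3 * d

# Step 1: unnormalized measure = product of the factors.
def p_tilde(a, b, c, d):
    return phi1(a, b, c) * phi2(b, c, d)

# Step 2: partition function = sum of the measure over all assignments.
Z = sum(p_tilde(*x) for x in itertools.product([0, 1], repeat=4))

# Step 3: divide by Z to get the distribution P_phi.
def p(a, b, c, d):
    return p_tilde(a, b, c, d) / Z

total = sum(p(*x) for x in itertools.product([0, 1], repeat=4))
print(round(total, 10))  # 1.0 -- after normalization it is a proper distribution
```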

6:59

More generally, if we have a set of factors phi, where each phi i has a particular scope Di, the induced Markov network, which we're going to call H sub phi, has an edge between a pair of variables Xi and Xj whenever there exists a factor phi m in phi such that Xi and Xj are both in the scope of that factor. That is, two variables are connected whenever they appear together in the same scope.
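This construction can be sketched directly from the definition (the function name is my own):

```python
from itertools import combinations

# Induced Markov network H_phi: connect Xi and Xj whenever they
# appear together in the scope of some factor.
def induced_edges(scopes):
    edges = set()
    for scope in scopes:
        for u, v in combinations(sorted(scope), 2):
            edges.add((u, v))
    return edges

# The lecture's example: phi 1 with scope {A, B, C}, phi 2 with scope {B, C, D}.
print(sorted(induced_edges([{"A", "B", "C"}, {"B", "C", "D"}])))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```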

7:56

Now we can turn this around and define a notion, just like we have for Bayesian networks, of when a probability distribution P factorizes over a graph H. That is, at what point can I represent P over a particular graph H? This basically asks the question: is there a set of factors phi that will let me represent the probability distribution P? So this is just a straightforward application of the definition we had before: does there exist a phi such that P is equal to P sub phi, where P sub phi is defined, as previously, as the normalized product of factors, and such that H is the induced graph for the set of factors phi? So P factorizes over H if there exists a set of factors phi such that I can represent P using those factors, and H is the induced graph for that set of factors. That's when I can encode a distribution P over a graph H.

9:25

So now let's ask ourselves the question: if you give me a graph H, can I tell you what the factorization of a distribution P over that graph would be? That is, which representation of the distribution did I have in mind when I drew this graph? So here we have this graph over A, B, C, and D. Which Gibbs distribution would induce the graph H? Well, let's look at the candidates one at a time. Here we have phi 1 of A, B, D and phi 2 of B, C, D. From phi 1 we see that there's an edge between A and B, between A and D, and between B and D, and correspondingly from phi 2 between B and C, B and D, and C and D. So the answer here is yes: this set of factors would induce this graph. Okay, what about the next one?

10:32

Take this one and ask yourself the same question. Well, A, B would induce this edge; B, C, yep; C, D, yep; and likewise A, D and B, D. Well, huh. Here's another distribution, with a different set of factors, that induces the exact same graph. The third one is the same principle: from the factor over A, B, and D we get the edges A, B and B, D and A, D, and then we have B, C and C, D. So here's yet another distribution that induces the exact same graph. What that tells us, and this is important, is that we cannot read the factorization from the graph.
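This can be checked mechanically: computing the induced edges of the three factor sets from the slide (using a small helper of my own, not from the lecture) shows they are identical.

```python
from itertools import combinations

def induced_edges(scopes):
    # Connect every pair of variables that share some factor scope.
    return {pair for s in scopes for pair in combinations(sorted(s), 2)}

# The three factorizations from the slide:
f1 = [{"A", "B", "D"}, {"B", "C", "D"}]                             # two triplets
f2 = [{"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "D"}, {"B", "D"}]   # all pairwise
f3 = [{"A", "B", "D"}, {"B", "C"}, {"C", "D"}]                      # a mix

print(induced_edges(f1) == induced_edges(f2) == induced_edges(f3))  # True
```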

11:34

That is, we have different factorizations, which can be quite different in their expressive power, all of which induce the exact same graph. And we've already seen an example of that: when we had the fully connected pairwise Markov network, we had one parameterization with O of n squared times d squared parameters, and another parameterization with a full factor over all n variables, so it had d to the n parameters. These are two very different representations, with very different expressive powers, that nevertheless induce the exact same graph.

But we must ask the question: why, then, is the graph the same? What does the graph really tell us, given that it doesn't tell us the structure of the factorization? So, going back to the example on the previous slide, we have these two factorizations, one of which uses triplet factors and the second of which uses pairwise factors. Let's think about the flow of influence in these factors. When can one variable influence another? Thinking about it intuitively, when can B influence D? Is this different in the two distributions? And the answer is, well, not really. Once we have a factor, in one case phi 1 and in the other phi 5, that ties B and D directly, the fact is that B can influence D. What about A: can A influence C? A can influence C by going through, in the first case, the A, B, D factor and then subsequently utilizing the dependencies within the B, C, D factor; in the second case, it can use the A, B factor and then the B, C factor. So the point is that although the parameterizations of the two distributions are different, the trails in the graphs through which influence can flow are the same, regardless of this finer-grained structure of the factorization, which is why the graphs in those two cases are the same.

So let's formalize this as a definition: the notion of an active trail in a Markov network. This is actually a very simple definition, much simpler than the analogous definition in the context of Bayesian networks. We have that a trail going from X1 up to Xn is active, given a set of observed variables Z, if basically no Xi along the trail is in Z, because an active trail only flows through variables that are unobserved. Once we observe a variable along the trail, influence stops, because that variable is now set, and so you can't really influence it. And if you can't influence it, you can't influence anything subsequently along that path. So, for example, the trail from B to D is active, but not if A is observed: once I observe A, B can no longer influence D via A.
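The definition translates almost directly into code (a sketch; the function name is mine, and the trail here includes its endpoints):

```python
# Active trail in a Markov network: a trail X1 - ... - Xn is active
# given observed set Z iff no variable along the trail is in Z.
def is_active(trail, observed):
    return not any(x in observed for x in trail)

# The lecture's example trail B - A - D:
print(is_active(["B", "A", "D"], observed=set()))   # True: nothing observed
print(is_active(["B", "A", "D"], observed={"A"}))   # False: observing A blocks it
```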

15:13

So, to summarize: we defined the notion of a Gibbs distribution, which represents a distribution P as a normalized product of factors. We connected that to a graph structure, the induced Markov network, which connects every pair of nodes that appear together in the same factor. And although we noted that the Markov network structure doesn't fully specify the factorization of P, the justification for why the graphs for different factorizations are the same is that the active trails in a graph depend only on the graph structure.
