0:01

We've previously shown that the graph structure encodes a set of independencies, and that that set of independencies then necessarily holds for every distribution that can be encoded as a Bayesian network over that graph. What we want to talk about now is the converse question: how do we take a distribution that satisfies a certain set of independencies and encode it within a graph's structure? How well can we capture the distribution's independencies in the context of a graphical model?

So first, let's understand what the independencies of a distribution are. We're going to define this notion of I of P, for a distribution P, as the set of all independence statements, X is independent of Y given Z, that hold for the distribution P. One can write down a potentially exponentially large set of independence statements, and each of those is going to be either true in a given distribution P or not.
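To make this concrete, here is a sketch, not from the lecture itself, of how one could test such an independence statement for a small discrete distribution, represented as a dictionary from full assignment tuples to probabilities (all function and variable names here are my own):

```python
def marginal(P, idxs):
    """Marginalize P (a dict: assignment tuple -> probability) onto
    the variable positions listed in idxs."""
    out = {}
    for assign, p in P.items():
        key = tuple(assign[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

def independent(P, X, Y, Z, tol=1e-9):
    """Return True if (X independent of Y given Z) holds in P.
    X, Y, Z are disjoint lists of variable positions; we check
    P(x, y | z) == P(x | z) * P(y | z) wherever P(z) > 0."""
    pz, pxz, pyz = marginal(P, Z), marginal(P, X + Z), marginal(P, Y + Z)
    pxyz = marginal(P, X + Y + Z)
    for xz in pxz:
        x, z = xz[:len(X)], xz[len(X):]
        for yz in pyz:
            y, z2 = yz[:len(Y)], yz[len(Y):]
            if z != z2 or pz[z] <= tol:
                continue
            lhs = pxyz.get(x + y + z, 0.0) / pz[z]
            rhs = (pxz[xz] / pz[z]) * (pyz[yz] / pz[z])
            if abs(lhs - rhs) > tol:
                return False
    return True

# Example: two independent fair coins, where (X0 independent of X1) holds.
P_coins = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
```

Running this check over every disjoint triple of variable subsets X, Y, Z would enumerate I of P, which is exactly why that set can be exponentially large.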

And what we're going to do is define I of P to be the ones that are true in that distribution; these are the independencies that hold in P. We've already seen that if P factorizes over a particular graph G, then G is an I-map for P. That means that every independence that holds in G, that is, every independence implied by the d-separation properties of G, also holds in P. But that doesn't mean that the converse holds: there can be independencies that hold in P, members of I of P, that are not encoded by the graph structure.

So, why does that matter? Because if we have a graph that doesn't capture some of the independencies in P, then it's unnecessarily complicated. Conversely, the more independencies a graph encodes, the sparser it is, and therefore the fewer parameters we need to elicit or learn. Fewer edges also means that inference is more efficient. And finally, a sparser graph is intrinsically more informative about the properties of P. So, we would like graphs that capture as many of the independence properties of P as possible. What do we want, then, in terms of sparsity?

Something fairly basic that we might choose to require is what's called a minimal I-map. That is, we want an I-map for the distribution P that at least doesn't have redundant edges. For example, if we have a graph X -> Y where Y doesn't really depend on X, in the sense that P of Y given X = 0 in the CPD is equal to P of Y given X = 1, then we could remove this edge and still have an I-map for the distribution. And so, this edge is redundant.
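As a tiny sketch of that redundancy check, with made-up CPD numbers that are not from the lecture:

```python
# Hypothetical CPD for Y given X, stored as P_Y_given_X[x][y];
# the numbers are invented for illustration.
P_Y_given_X = {
    0: {0: 0.3, 1: 0.7},
    1: {0: 0.3, 1: 0.7},
}

def edge_is_redundant(cpd, tol=1e-9):
    """The edge X -> Y is redundant if P(Y | x) is identical for every x."""
    rows = list(cpd.values())
    first = rows[0]
    return all(abs(first[y] - row[y]) <= tol for row in rows[1:] for y in row)
```

Here the two rows of the CPD are identical, so the edge carries no information and can be dropped without losing I-map status.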

From our definition, that would mean that this I-map is not minimal, because it has edges that can be removed. So, requiring a minimal I-map might seem a reasonable strategy, but it turns out not to be sufficient to ensure that our graph is as sparse as it might be.

To understand why that's the case, let's look at a distribution corresponding to this example that we've seen before, where the student's grade G depends on the two variables D and I. That graph is, in fact, a minimal I-map for a distribution where D and I are independent and G depends on both of them. But it turns out this is not the only minimal I-map for this distribution. A different minimal I-map is this one, where we have an edge that goes from D to I, from D to G, and from G to I. Let's convince ourselves that this is indeed a minimal I-map, by trying to remove each of those edges and seeing whether we still get an I-map for the original distribution.

If, for example, we try to remove the edge from D to G, that would correspond to the statement that D is independent of G, which is certainly not the case in our original distribution. So, that one doesn't work. What about the second one? If we try to remove the edge from G to I, that would correspond to the statement that G is independent of I given D, which is also not the case in our original distribution.

Â 5:37

What about the final edge, the one that doesn't correspond to an edge in our original network? That one seems the most promising of the lot, but if we remove it, that would correspond to the assumption that D is independent of I given G. And that is exactly the dependence induced by the intercausal reasoning that we've seen before. So, in fact, this is a minimal I-map: none of the edges can be removed. There is a better minimal I-map, but this is a minimal I-map too. So, minimal I-maps are not necessarily the best tools for capturing structure in a distribution.

What we'd really like is a perfect map. A perfect map is one where the independencies in G exactly correspond to the independencies in P, that is, I of G equals I of P, so that G perfectly captures the independencies in P. And sure enough, if we can get that, it would be ideal. Unfortunately, perfect maps are often hard to come by.

So, here's an example of one scenario that doesn't have a perfect map. This is a distribution P that is actually represented by the pairwise Markov network that we've seen before, the one with pairwise interactions between A and B, B and C, C and D, and A and D. We know that this distribution satisfies certain independencies because of the Markov network properties: specifically, A is independent of C given B and D, because B and D separate A and C in the graph, and at the same time, B is independent of D given A and C. So let's imagine that we have a distribution P that satisfies this pair of independencies, and let's try to encode those independencies in a Bayesian network that is a perfect map. One possible attempt would be to just direct the edges, say, this way.

Â 9:52

And that's it, in terms of independencies. So this, in fact, is an I-map for the distribution, because in this case I of G is a subset of I of P, which is this set over here. But it doesn't capture all of the independencies; it only captures one out of the two. So, this is a distribution that does not have a perfect map as a Bayesian network.

Let's provide another example of a distribution without a perfect map. In fact, this is a distribution that doesn't have a perfect map as either a Bayesian network or as a Markov network. This is the famous example of the XOR, which is, as we'll see, a counterexample to many, many things. Here, we have two random variables X1 and X2, which we assume are binary valued, each taking the values 0 and 1 with probability 50-50. Y, on the other hand, is the exclusive or of X1 and X2, which means that Y is equal to one if and only if exactly one of X1 or X2 is equal to one.

So, let's look at what this probability distribution P looks like. Here it is: there are four possible configurations that have nonzero probability, each with equal probability 0.25. So, X1 and X2 can each be zero or one with probability 50-50, and here's the value of Y, over here. But now, let's think about what independence statements are true for this distribution.

Most obviously, looking at this graph, we see that X1 is marginally independent of X2. But if you cover up this part of the image and just look at the right-hand side, you'll see that X1, X2, and Y are really all symmetrical in terms of their structure. And so, it's not difficult to verify that X1 is also independent of Y, and that X2 is independent of Y. All three of these pairwise independencies hold in this distribution.
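The table from the slide isn't reproduced in the transcript, but the XOR distribution is easy to rebuild, and the three pairwise independencies can be verified numerically. A sketch, with variable names of my own choosing:

```python
from itertools import product

# The XOR distribution: X1 and X2 are fair coins and Y = X1 xor X2,
# so exactly four assignments each carry probability 0.25.
P = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

def pairwise_independent(P, i, j, tol=1e-9):
    """Check P(Vi = a, Vj = b) == P(Vi = a) * P(Vj = b) for all a, b."""
    for a, b in product((0, 1), repeat=2):
        p_ab = sum(p for v, p in P.items() if v[i] == a and v[j] == b)
        p_a = sum(p for v, p in P.items() if v[i] == a)
        p_b = sum(p for v, p in P.items() if v[j] == b)
        if abs(p_ab - p_a * p_b) > tol:
            return False
    return True

# All three pairs (X1, X2), (X1, Y), (X2, Y) pass this test, even though
# conditioning on the remaining third variable makes each pair dependent.
```

The symmetry the lecture points out shows up directly: the same check succeeds for every pair of positions.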

So the graph on the left is not, in fact, a perfect map for this distribution, because there are independencies that hold in P that are not visible in the graph. In fact, you can arrange the nodes in this graph any way you like, but you cannot capture all three of these independencies at the same time, because the only way to do that would be to have X1, X2, and Y be completely disconnected, and of course that wouldn't be an I-map for the distribution.

Now, we've talked about Bayesian networks as perfect maps. What about Markov networks as perfect maps? The definition here is the same, except that we've replaced G by H: a Markov network H is a perfect map if the independencies induced by H exactly correspond to the independencies in P. And can we capture all possible distributions in terms of a Markov network that is a perfect map? I'm sure you're all expecting at this point that the answer is no, and sure enough, that's true.

So here is an example of a distribution that has a perfect map as a Bayesian network, but not as a Markov network. This is the famous V-structure example, in this case the difficulty-intelligence-grade structure. Let's think about what we would need in terms of edges for a Markov network to be an I-map of this distribution. Clearly, we need to introduce an edge between D and G and an edge between I and G, because it's certainly not the case that we can make D and G independent given I, or vice versa. And so, this would be the obvious I-map for this distribution. But if we choose this as a candidate I-map, it would imply, among other things, that D is independent of I given G. And that, of course, is exactly wrong, because we know that when we condition on G, D and I become dependent. So, the only Markov network I-map for this distribution is one that has all three edges, and that loses the independence that we had over here, which is that D is marginally independent of I. So once again, there's no perfect map for this distribution as a Markov network.

The final question that we might ask ourselves is the extent to which the representation of a distribution by a graph is, in fact, unique. Say that we can represent the distribution using some graph, say, as a perfect map; is that a unique representation? To understand that, let's look first at the simplest example, the one where we have two variables X and Y. Here, we have one graph, G1, that captures, in this case, no independence assumptions, so I of G1 is equal to the empty set.

And here is G2; it looks the same, except that the one edge is inverted, and once again this is a different graph with exactly the same set of independencies. So, here we have two distinct graphs that make the exact same set of independence assumptions, and because of that, they can represent

Â 16:07

exactly the same set of distributions, which, in this case, is all distributions over the variables X and Y. Now, one might think of this as a degenerate example, because it's a fully connected graph, but that turns out not to be the case. There are many other situations where two distinct graphs, with edges going in different directions, represent the exact same set of independence assumptions. For example, when we look at graphs over three random variables, it turns out that, of these four scenarios, one of them represents a different set of independence assumptions than all the others, while all the others are the same. So, which is the odd man out?

Which of the following graphs does not encode the same independence assumptions as the others? As I'm sure most of you realized, the answer here is the V-structure, which is one that we've seen before. This one has the independence assumption that X is marginally independent of Z, whereas the other three have the independence assumption that X is independent of Z given Y. So, these three graphs, again, represent the same set of independence assumptions, which is this set over here. And any probability distribution that can be represented by one of these graphs can also be represented by the others.

The formal term for this notion is I-equivalence: two graphs, G1 and G2, over the same space are said to be I-equivalent if they make the exact same set of independence assumptions. In the previous example, we saw that X -> Y -> Z is I-equivalent to, say, the graph where Y is a parent of both X and Z, and to the third one over here, whereas the V-structure is not I-equivalent to them.
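As an aside not covered in this segment, there is a well-known graphical criterion for I-equivalence, due to Verma and Pearl: two DAGs are I-equivalent if and only if they have the same skeleton and the same immoralities (v-structures whose two parents are non-adjacent). A sketch of that test, with my own representation of a DAG as a set of directed edges:

```python
def skeleton(edges):
    """Undirected version of the DAG: each directed edge as a frozenset."""
    return {frozenset(e) for e in edges}

def immoralities(edges):
    """V-structures a -> b <- c whose parents a, c are not adjacent."""
    skel = skeleton(edges)
    found = set()
    for (a, b) in edges:
        for (c, b2) in edges:
            if b == b2 and a != c and frozenset((a, c)) not in skel:
                found.add((frozenset((a, c)), b))
    return found

def i_equivalent(g1, g2):
    """Verma-Pearl criterion: same skeleton and same immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# The four three-variable graphs from the quiz:
chain1  = {("X", "Y"), ("Y", "Z")}   # X -> Y -> Z
chain2  = {("Z", "Y"), ("Y", "X")}   # X <- Y <- Z
fork    = {("Y", "X"), ("Y", "Z")}   # X <- Y -> Z
vstruct = {("X", "Y"), ("Z", "Y")}   # X -> Y <- Z  (the odd man out)
```

The first three graphs come out mutually I-equivalent, while the v-structure does not, matching the quiz above.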

Now, why is I-equivalence an important notion? It's important because it tells us that certain aspects of the graphical model are unidentifiable. That means that if we end up, for whatever reason, thinking that, say, this is the graph that represents our probability distribution, it could just as easily be this one or that one. Without prior knowledge of some kind, for example a preference for X to be a parent of Y, there's no way for us to select among these different choices. And it turns out that this is not an exception; rather, it's the rule. Most graphs have a large number of I-equivalent variants, and that turns out to be a complicating factor, especially when we get to learning models from data.

So, in summary, we prefer graphs that capture as much of the structure in I of P as possible, because they're more compact, and therefore easier to learn and easier to do inference with, and because they provide us with more insight about the distribution. We talked about the minimal I-map as one option for obtaining sparse graphs, and we showed that it's not a particularly good option, because it might fail to capture structure that is present in the distribution, even when that structure is representable as a Bayes net or another graphical model.
