0:00

One of the most elegant properties of probabilistic graphical models is the intricate connection between the factorization of the distribution as a product of factors and the independence properties that it needs to satisfy. Now we're going to talk about how that connection manifests in the context of directed graphical models, or Bayesian networks. So let's first remind ourselves why independence and factorization are related to each other.

So, for example, the definition that P of X, Y is the product of two factors, P of X and P of Y, is the definition of independence. And at the same time, it's a factorization of the joint distribution as a product of two factors. Similarly, one of the definitions that we gave for conditional independence, which is that the joint distribution over X, Y and Z is a factor over X and Z times a factor over Y and Z, is the definition of conditional independence. So, once again, independence as factorization. So we see that factorization of the distribution corresponds to independencies that hold in that distribution. And the question is: if we know that a distribution P factorizes over G, can we say something about the independencies that the distribution P must satisfy, just by looking at the structure of the graph G? So what are independencies that might hold in a probabilistic graphical model?

So we talked about the notion of flow of influence in a probabilistic graphical model, where we have, for example, the notion of an active trail that goes up through I, down through G, and up through D, if, for example, G is observed, that is, G is in Z. And that gave us an intuition about where probabilistic influence might flow. We can now turn this notion on its head and ask the question: what happens when we tell you that there are no active trails in the graph, that is, influence can't flow?

So we're going to make that notion formal using the notion of d-separation. And we're going to say that X and Y are d-separated in a graph G given a set of observations Z if there is no active trail between them.
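To make the definition concrete, here is a brute-force sketch of a d-separation test. This is my own illustration, not code from the course: the edge set encodes a small version of the student network (Difficulty and Intelligence into Grade, Intelligence into SAT, Grade into Letter, with single-letter names), and it enumerates trails exhaustively, which is only feasible for tiny teaching graphs.

```python
# Small version of the student network: D -> G <- I, I -> S, G -> L.
# (Illustrative edge set; variable names are my shorthand.)
edges = {("D", "G"), ("I", "G"), ("I", "S"), ("G", "L")}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    seen, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for a, b in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

def d_separated(x, y, z):
    """True iff every trail between x and y is blocked given observed set z."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)

    def active(path):
        # A trail is active iff every consecutive triple along it is active.
        for a, b, c in zip(path, path[1:], path[2:]):
            if (a, b) in edges and (c, b) in edges:       # v-structure a -> b <- c
                if b not in z and not (descendants(b) & z):
                    return False                           # collider not activated
            elif b in z:                                   # chain/fork blocked by z
                return False
        return True

    def trails(cur, path):
        # Enumerate all simple undirected paths from x to y.
        if cur == y:
            yield path
        for n in nbrs.get(cur, ()):
            if n not in path:
                yield from trails(n, path + [n])

    return not any(active(p) for p in trails(x, [x]))
```

With nothing observed, `d_separated("D", "S", set())` holds, because the only trail runs through the unactivated v-structure at G; observing G activates that v-structure, so `d_separated("D", "S", {"G"})` is False.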

And the intuition that we'd like to demonstrate in this context is that this notion of "influence can't flow" corresponds, much more formally, to the rigorous notion of conditional independence in the graph. So let's actually prove that that's in fact the case.

So the theorem that we'd like to state is: if P factorizes over the graph G, and we have a d-separation property that holds in the graph, so X and Y are d-separated in the graph given Z, that is, there are no active trails between them, then P satisfies the conditional independence statement: X is independent of Y given Z.
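Written compactly, the theorem being stated is:

```latex
\textbf{Theorem.}\quad P \text{ factorizes over } G
\;\text{ and }\; \operatorname{d-sep}_G(X, Y \mid Z)
\;\;\Longrightarrow\;\; P \models (X \perp Y \mid Z).
```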

So d-separation implies independence, if the probability distribution factorizes over G. So we're now going to prove the theorem, not in its full glory; we're going to prove it by example, because that example really does illustrate the main points of the derivation. So the assumption is that here is our graph G, and here is the factorization of the distribution, according to the chain rule for Bayesian networks. And this is a factorization that we've seen before. And so now we'd like to prove that a d-separation statement follows from this assumption. And the d-separation statement that we'd like to prove follows as an independence is the one that says that D is independent of S.

First, let's convince ourselves that D and S are in fact d-separated in this graph. And so we see that there is only one possible trail between D and S in this graph; it goes through G and I. Given that G is not observed in this case, and neither is L, the trail is not activated, and so the two nodes are d-separated. And so we'd like to prove that this independence holds. So what is the joint probability distribution of D and S? Well, it's the sum over this joint distribution over here, marginalizing out the three variables that are not D and S, so G, L, and I. So we have the sum over G, L, and I of this big product that's defined by the chain rule. And as we've previously seen, when we have a summation like that, one of the things that we can do is push summations into the product, so long as we don't push them through terms that involve the summation variable. So here, for example, we might have a summation over L, and we can push in that summation because the only term that involves L is the probability of L given G.

So if we push in the summations in this way, we end up with the following expression, and we see that we have this summation over L here, a summation over G, and then finally the summation over I at the very beginning. And what do we know about the summation over L of P of L given G? We know that this is a probability distribution over L, and therefore it necessarily sums to one, because we're summing up over all of the possible values of L. Once this term is one, we can look at the next term, which is this one. And we can ask ourselves: well, the one cancels out, so what is the sum over G of the probability of G given D and I? And that, too, is one. And by the time we're done, we end up with the following expression.
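For reference, the on-slide derivation has this shape: start from the chain-rule factorization, push each summation past the terms that don't mention its variable, and use the fact that each CPD sums to one.

```latex
\begin{aligned}
P(D, S) &= \sum_{G, L, I} P(D)\, P(I)\, P(G \mid I, D)\, P(S \mid I)\, P(L \mid G) \\
        &= P(D) \sum_{I} P(I)\, P(S \mid I) \sum_{G} P(G \mid I, D)
           \underbrace{\sum_{L} P(L \mid G)}_{=\,1} \\
        &= P(D) \sum_{I} P(I)\, P(S \mid I)
           \underbrace{\sum_{G} P(G \mid I, D)}_{=\,1} \\
        &= \underbrace{P(D)}_{\phi_1(D)}\;
           \underbrace{\textstyle\sum_{I} P(I)\, P(S \mid I)}_{\phi_2(S)}
\end{aligned}
```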

And the important thing to notice about this expression is that this is phi one of D and this is phi two of S. And therefore, because we partitioned this joint distribution as a product of two factors, we end up with something that corresponds directly to the definition of marginal independence, therefore proving that P satisfies this independence statement.
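The same argument can be checked numerically. A minimal sketch with invented CPD numbers; only the graph structure D -> G <- I, I -> S, G -> L matches the lecture:

```python
import itertools

# Invented CPDs for the structure D -> G <- I, I -> S, G -> L
# (all variables binary here; only the factorization shape matters).
P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
P_G = {(0, 0): [0.3, 0.7], (0, 1): [0.95, 0.05],   # key is (i, d)
       (1, 0): [0.1, 0.9], (1, 1): [0.5, 0.5]}
P_S = {0: [0.95, 0.05], 1: [0.2, 0.8]}             # key is i
P_L = {0: [0.9, 0.1], 1: [0.4, 0.6]}               # key is g

def joint(d, i, g, s, l):
    """Chain rule for Bayesian networks on this graph."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]

# Marginalize out G, L, and I to get P(D, S).
P_DS = {(d, s): sum(joint(d, i, g, s, l)
                    for i, g, l in itertools.product((0, 1), repeat=3))
        for d, s in itertools.product((0, 1), repeat=2)}

# P(D, S) splits into a factor of D times a factor of S, which is
# exactly the definition of marginal independence.
P_Dm = {d: P_DS[(d, 0)] + P_DS[(d, 1)] for d in (0, 1)}
P_Sm = {s: P_DS[(0, s)] + P_DS[(1, s)] for s in (0, 1)}
for d, s in itertools.product((0, 1), repeat=2):
    assert abs(P_DS[(d, s)] - P_Dm[d] * P_Sm[s]) < 1e-12
```

The final loop passes for any choice of CPD numbers, which is the point: the independence follows from the factorization alone.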

Having convinced ourselves that d-separation is an interesting notion, let's look at some other d-separation statements that might hold in a graph. So one general statement that we can show is that any node in the graph is d-separated from its non-descendants given its parents. So let's look at the variable Letter.

And let's see what non-descendants it has that are not its parents. So it only has two descendants; let's mark those in blue: it's this one and that one. So these are the descendants. What about the non-descendants? Well, here's one non-descendant, here's another, another, and another. So those are the four non-descendants in the graph that are not parents. So now let's convince ourselves that Letter is d-separated from all of its non-descendants given its parents. And let's take just an arbitrary one of those, let's take SAT, for example, and let's see if we can find a trail from SAT to Letter that is active given the parents, which in this case is just the variable Grade. Okay. So what's one potential trail? We can go this way, up, and then back down to Letter. But is that trail active? No, it's not active, because Grade is observed, and so Grade blocks the trail. In fact, it will block any trail that comes in from the top. So any trail into Letter that comes in through its parent Grade is going to be blocked by the fact that Grade is observed. So that means we have to come in through the bottom. So let's look at that. So let's look at the trail from SAT, through Job, and up again to Letter. Well, is that trail active? No. The reason it's not active is that only Grade is observed, and Job, since it's a descendant of Letter and so can't be one of its parents, is necessarily not observed. And so neither Job nor Happy are observed, and so this v-structure over here cannot be activated, and so trails that come in from the bottom don't work either. And so, again, this is a proof by demonstration, but it shows the general property in a very simple way.
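The descendant bookkeeping in this example is easy to mechanize. A sketch, where the edge list is my reconstruction of the extended student network from the narration (Coherence into Difficulty; Difficulty and Intelligence into Grade; Intelligence into SAT; Grade into Letter; SAT and Letter into Job; Grade and Job into Happy):

```python
# Reconstructed extended student network (this edge set is an assumption).
edges = {("C", "D"), ("D", "G"), ("I", "G"), ("I", "S"), ("G", "L"),
         ("S", "J"), ("L", "J"), ("G", "H"), ("J", "H")}
nodes = {n for e in edges for n in e}

def descendants(node):
    """Nodes reachable from `node` along directed edges, excluding itself."""
    seen, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for a, b in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

parents_L = {a for a, b in edges if b == "L"}   # Letter's parents: {'G'}
non_desc = nodes - {"L"} - descendants("L")     # Letter's non-descendants
print(sorted(descendants("L")))       # the two descendants -> ['H', 'J']
print(sorted(non_desc - parents_L))   # the four others -> ['C', 'D', 'I', 'S']
```

So Letter's two descendants are Job and Happy, and the four non-descendants that are not its parent Grade are Coherence, Difficulty, Intelligence, and SAT, matching the count in the walkthrough.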

So that tells us, by the theorem that we proved on the previous slide, that if P factorizes over G, then we know that any variable is independent of its non-descendants given its parents. Which is a kind of nice intuition when you think about it, because when we motivated the structure of Bayesian networks, we basically said that what makes us pick the parents for a node is that they are the only variables that this variable depends on. So the parents of a node are the variables on which it depends directly, and this gives a formal semantics to that intuition: the variable depends only on its parents.

So, now that we've defined a set of independencies that hold for any distribution that factorizes over a graph, we're going to define a general notion, which is just a term for this. So we defined d-separation in a graph G, and we showed that when you have the d-separation property, P satisfies the corresponding independence statement. And so now we can basically look at the independencies that are implied by the graph G. So I of G, which is the set of independencies that are implicit in a graph G, is the set of all independence statements, X is independent of Y given Z, that correspond to d-separation statements within the graph. So this is the set of all independencies that are derived from d-separation statements, and those are the ones that the graph implies hold for a distribution P, as we just showed in the previous demonstration. So now we can give this a name.

We're going to say that, if P satisfies all the independencies of G, then G is an I-map of P; I-map stands for independency map. And the reason it's an independency map is that, by looking at G, we get a map of independencies that hold in P. That is, it maps independencies that we know to be true. So let's look at an example of I-maps. Let's look at two distributions; here is one distribution, P1.

11:11

One graph, G1, has D and I as separate variables, with no edge between them. And the other, G2, has an edge between D and I.

So what independencies does G1 imply? What is I of G1? Well, I of G1 is the independence statement that says D is independent of I, because there are no edges between them. What's I of G2? Well, G2 implies no independencies, because D and I are connected by an edge, and there are no other independencies that can be implied; and so I of G2 is the empty set. So what does that tell us? Let's look at the connection between these graphs and these two distributions. Is G1 an I-map of P1? Well, if you look at P1, you can convince yourself that P1 satisfies "D is independent of I", so G1 is an I-map of P1. What about G1, is G1 an I-map of P2? Well, if we look at P2, we can again convince ourselves by examination that P2 doesn't satisfy "D is independent of I". So the answer is no: G1 is not an I-map of P2. What about the other direction, what about G2? Well, G2 has no independence assumptions, and so both P1 and P2 satisfy the empty set of independence assumptions. As it happens, P1 also satisfies additional independence assumptions, but the definition of I-map didn't require that a graph exactly represent the independencies in a distribution, only that any independence that is implied by the graph holds for the distribution. And so we have that G2 is actually an I-map for both P1 and P2.
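The I-map check here is mechanical. The lecture's actual tables for P1 and P2 are not reproduced in this transcript, so the numbers below are stand-ins with the same qualitative behavior: P1 satisfies "D is independent of I" and P2 does not.

```python
import itertools

# Stand-in joint distributions over binary D and I (invented numbers).
P1 = {(0, 0): 0.42, (0, 1): 0.18, (1, 0): 0.28, (1, 1): 0.12}  # independent
P2 = {(0, 0): 0.50, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.30}  # dependent

def satisfies_D_indep_I(P):
    """Check P(d, i) == P(d) * P(i) for every assignment (d, i)."""
    Pd = {d: P[(d, 0)] + P[(d, 1)] for d in (0, 1)}
    Pi = {i: P[(0, i)] + P[(1, i)] for i in (0, 1)}
    return all(abs(P[(d, i)] - Pd[d] * Pi[i]) < 1e-9
               for d, i in itertools.product((0, 1), repeat=2))

# I(G1) = {D _|_ I}, so G1 is an I-map of P iff this check passes.
# I(G2) is empty, so G2 is vacuously an I-map of any distribution.
print(satisfies_D_indep_I(P1))   # True:  G1 is an I-map of P1
print(satisfies_D_indep_I(P2))   # False: G1 is not an I-map of P2
```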

Â 13:54

So we've shown that the independencies in I of G hold in P; that follows by knowing that P factorizes over G.

Interestingly, the converse of this theorem also holds. This version of the theorem says that if G is an I-map for P, then P factorizes over G. So what does that mean? It means that if the distribution satisfies the independence assumptions that are implicit in the graph, then we can take that distribution and represent it as a Bayesian network over that graph.

So once again, let's do a proof by example. So here we have the little graph that we're going to use. And so now, what do we know? We're going to write down the joint distribution over these five random variables using this expression, which is the chain rule for probabilities, not the chain rule for Bayesian networks, because we don't know yet that this is representable as a Bayesian network. So we're going to write the chain rule for probabilities. And that has us writing it as P of D times P of I given D, which, if you multiply the two together using the definition of conditional probability, give us the probability of I, D. If we multiply that by P of G given I and D, then we end up with the probability of G, I, D. And so on: we can construct this telescoping product, and altogether, by multiplying all these conditional probabilities, we can reconstruct the original joint probability distribution.
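In the variable ordering the lecture uses, the telescoping product is just the generic chain rule for probabilities:

```latex
P(D, I, G, S, L)
  = P(D)\, P(I \mid D)\, P(G \mid I, D)\, P(S \mid D, I, G)\, P(L \mid D, I, G, S)
```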

Great, now what? So we haven't used any notions of independence yet. So where do we use them? Well, we're going to look at each one of these terms, and we're going to see how we can simplify it. So P of D is already as simple as it gets. So now let's look at the next term, which is the probability of I given D. And let's look at this graph. We can see that D is a non-descendant of I, and we're not conditioning on any parents, and so we have in this graph, as we've already shown, that I is independent of D. And so, using one of the definitions that we have of conditional independence, we can erase D from the right-hand side of the conditioning bar and just get P of I. The next term is already in the form that we want, so we don't need to do anything with it. But now let's look at the probability of S given D, I, and G. So here we have the probability of a variable, S, given one of its parents, I, and two of its non-descendants, D and G. And we know that S is independent of its non-descendants given its parent I, and so we can erase both D and G and end up with the probability of S given I. Finally, we have the last term: the probability of L given D, I, G, and S. And here again, we have the probability of a variable, L, given its parent G and a bunch of non-descendants, so we can again erase the non-descendants. And so altogether, that gives us exactly the expression that we wanted. And so now, again by example, we proved that from these independence statements we can derive the fact that P factorizes according to the chain rule for Bayesian networks. To summarize, we have provided two

equivalent views of the graph structure. The first is the graph as a factorization: as a data structure that tells us how the distribution P can be broken up into pieces. So, in this case, the fact that G allows P to be represented as a set of factors, or CPDs, over the network. The second is the I-map view: the fact that the structure of G encodes independencies that necessarily hold in P. Now, what we've shown is that each of these actually implies the other.