0:00

So, now we're getting into Bayesian networks. And we're finally going to start talking about the actual representations that are going to be the bread and butter of what we're going to describe in this class. We're going to start by defining the basic semantics of a Bayesian network and how it's constructed from a set of factors. So let's start by looking at a running example that will accompany us throughout a large part of at least the first part of this course, and this is what we call the student example.

So in the student example, we have a student who's taking a class for a grade, and we're going to use the first letter of the word to denote the name of the random variable, just like we did in previous examples. So, here, the random variable is going to be G.

Now, the grade of the student obviously depends on how difficult the course he or she is taking is, and on the intelligence of the student. So that gives us, in addition to G, the variables D and I. And we're going to add a couple of extra random variables just to make things a little bit more interesting. We're going to assume that the student has taken the SAT, so he may or may not have scored well on it; that's another random variable, S. And then finally, we also have the recommendation letter, L, that the student gets from the instructor of the class.

Okay? And we're going to grossly oversimplify this problem by basically binarizing everything except for grade. So everything has only two values, except for grade, which has three, and this is only so I can write things compactly. This is not a limitation of the framework; it's just so that the probability distributions don't become unmanageable.

Okay. So now, let's think about how we can construct the dependencies in this probability distribution. Let's start with the random variable grade. I'm going to put G in the middle and ask ourselves what the grade of the student depends on. And, from a completely intuitive perspective, it seems clear that the grade of the student depends on the difficulty of the course and on the intelligence of the student. And so we already have a little baby Bayesian network with three random variables. Let's now take the other random variables and introduce them into the mix. For example, the SAT score of the student doesn't seem to depend on the difficulty of the course or on the grade that the student gets in the course; the only thing it's likely to depend on, in the context of this model, is the intelligence of the student.

And finally, caricaturing the way in which instructors write recommendation letters, we're going to assume that the quality of the letter depends only on the student's grade. The professor is teaching, you know, 600 students, or maybe 100,000 online students, and so the only thing one can say about the student is by looking at their actual grade record. And so, regardless of anything else, the quality of the letter depends only on the grade.

Now, this is a model of the dependencies, but it's only one model that one could construct for these dependencies. For example, I could easily imagine other models, for instance ones in which students who are brighter take harder courses, in which case there might potentially be an edge between I and D. But we're not going to use that model, so let's erase that edge, because we're going to stick with the simpler model for the time being. This is only to highlight the fact that a model is not set in stone; it's a representation of how we believe the world works. So, here is the model drawn out a little bit more nicely than the picture before.

And now let's think about what we need to do in order to turn this into a representation of a probability distribution, because right now, all it is is a bunch of nodes stuck together with edges. So how do we actually get this to represent a probability distribution? The way we're going to do that is to annotate each of the nodes in the network with a CPD. We previously defined CPDs; a CPD, just as a reminder, is a conditional probability distribution, using the abbreviation here. Each of these is a CPD, so since we have five nodes, we have five CPDs. Now, if you look at some of these CPDs, they're kind of degenerate. For example, the difficulty CPD isn't actually conditioned on anything. It's just an unconditional probability distribution that tells us, for example, that courses are 40% likely to be difficult and 60% likely to be easy. Here is a similar unconditional probability distribution for intelligence.

Now this gets more interesting when you look at the actual conditional probability distributions. Here, for example, is a CPD that we've seen before: the CPD of the grades A, B, and C. This is the conditional probability distribution, which we've already seen, for the probability of grade given intelligence and difficulty, and we've already discussed how each of these rows necessarily sums to one, because each row is a probability distribution over the variable grade. And we have two other CPDs here: in this case, the probability of SAT given intelligence and the probability of letter given grade. So, just to write this out completely, we have P of D, P of I, P of G given I and D, P of L given G, and P of S given I.

And that now is a fully parameterized Bayesian network, and what we'll show next is how this Bayesian network produces a joint probability distribution over these five variables. So, here are my CPDs, and what we're going to define now is the chain rule for Bayesian networks. That chain rule basically takes all these little CPDs and multiplies them together, like that.

Now, before we think about what that means, let us first note that this is actually a factor product, in exactly the same way that we just defined. Here, we have five factors with overlapping scopes, and what we end up with is a factor product that gives us one big factor whose scope is all five variables. So what does that translate into when we apply the chain rule for Bayesian networks in the context of this particular example? Let's look at this particular assignment; remember, there are going to be a bunch of these different assignments, and I'm just going to compute the probability of this one. So, the probability of d0, i1, g3, s1, and l1: well, the first thing we need is the probability of d0, and the probability of d0 is 0.6.

Â 7:18

The next thing, from the next factor, is the probability of i1, which is 0.3. What about the next one? The next one is a conditional factor, and for that, we need the probability of g3. Because we have g3 here, we want this column over here; and which row do we want? We want the row that corresponds to d0 and i1, which is this row, so 0.02. Continuing on, we now have g3, so we want this row, and we want the probability of l1, so we want this entry, 0.01. And finally, here, we have the probability of s1 given i1, so that would be this one over here, 0.8.

And in this exact same fashion, we're going to end up defining all of the entries in this joint probability distribution.
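The arithmetic above can be sketched in a few lines of Python. The numbers are the ones read off the CPDs in the lecture; only the entries needed for this one assignment are filled in, so the tables below are deliberately partial.

```python
# Partial CPDs from the student example; only the entries needed for
# the assignment (d0, i1, g3, s1, l1) are filled in.
P_D = {"d0": 0.6}                      # P(D)
P_I = {"i1": 0.3}                      # P(I)
P_G = {("i1", "d0"): {"g3": 0.02}}     # P(G | I, D)
P_S = {"i1": {"s1": 0.8}}              # P(S | I)
P_L = {"g3": {"l1": 0.01}}             # P(L | G)

# Chain rule for Bayesian networks: one CPD entry per variable.
p = (P_D["d0"] * P_I["i1"]
     * P_G[("i1", "d0")]["g3"]
     * P_L["g3"]["l1"]
     * P_S["i1"]["s1"])
print(p)  # 0.6 * 0.3 * 0.02 * 0.01 * 0.8, i.e. about 2.88e-05
```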

Great. So, what does that give us as a definition? A Bayesian network is a directed acyclic graph. Acyclic means that the graph has no cycles; that is, you can't follow the edges and get back to where you started. This is typically abbreviated as a DAG, and we're going to use the letter G to denote directed acyclic graphs. The nodes of this directed acyclic graph represent the random variables that we're interested in reasoning about, X1 up to Xn. And for each node Xi in the graph, we have a CPD that denotes the dependence of Xi on its parents in the graph G. Okay? So, this is a set of variables; this would be, for example, the probability of G given I and D. Here, the Xi would be G, and the parents of Xi would be I and D.

And this Bayesian network represents a joint probability distribution via the chain rule for Bayesian networks, which is the rule that's written over here. This is just a generalization of the rule that we wrote on the previous slide, where we multiplied one CPD for each variable. Here, again, we're multiplying over all the variables i, multiplying the CPD for each of the Xi's, and this is once again a factor product. So, that's great: I just defined the joint probability distribution, but who's to say that this joint probability distribution is even a legal joint probability distribution? So what do you have to show in order to demonstrate that something is a joint probability distribution? The first thing you have to show is that it's greater than or equal to zero, because probability distributions had better be non-negative. We need to show that for this distribution P, and as it happens, this is quite trivial, because P is a product of factors.

Â 10:36

And if you multiply a bunch of non-negative factors, you get a non-negative factor. That's fairly straightforward.

The next one's a little bit trickier. The other thing that we have to show, in order to demonstrate that something is a legal distribution, is that it sums to one. So how do we prove that this probability distribution sums to one? Let's actually work through this, and one of the reasons for working through it is not just to convince you that I'm not lying to you, but, more importantly, the techniques that we're going to use for this simple argument are going to accompany us throughout much of the first part of the course, so it's worth going through in a little bit of detail. I'm going to show this in the context of the example distribution we just constructed; the general case just ends up cluttered with notation, but the idea is still exactly the same. So,

let's sum up, over all possible assignments to D, I, G, S, and L, this hopefully-a-probability-distribution that I just defined, P of D, I, G, S, and L. Okay? We're going to break this up using the chain rule for Bayesian networks, because that's how we defined the distribution P.

Â Okay. So here is the magic trick that we use

Â whenever we deal with distributions like that.

Â It's to notice that when we have the summation over several random variables

Â here, and here inside the summations, we have a bunch of functions inside here

Â that, only some, of the, the, each, factor only involves a small subset of

Â the variables. And when you do, when you have a situation like that, you can push

Â in the summations. So specifically, we're going to start out

Â by pushing in the summation over, L, and which factors involve L? Only the last

Â one. So, we can move in the summation over L

Â to this position ending up with the following expression.

Notice that now the summation is over here; that's the first trick. The second trick is the definition of CPDs: we observe that the summation over L is summing over all possible mutually exclusive and exhaustive values of the variable L in the conditional probability distribution P of L given G. And that means, basically, that we're summing over a row of the CPD, and so the sum has to be one.

And so this term effectively cancels out, which gives us this. Now we can do exactly the same thing: we can take S and move it over here, which gives us that, and this too is a sum over a row of a CPD, so this too cancels and can be written as one. We can do the same with G; again, this has to be equal to one, so we can cancel it, and so on and so forth. And so, by the end of this, you're going to have cancelled all of the variables in the summation, and what remains is one.
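For readers who want to see this numerically, here is a small sanity check. The CPD values below are made up (any tables whose rows sum to one will do); the point is that the product of CPDs, summed over all 2 x 2 x 3 x 2 x 2 assignments of the student network, comes out to one.

```python
import itertools
import random

random.seed(0)

def random_row(n):
    """A random distribution over n values (normalized so it sums to one)."""
    row = [random.random() for _ in range(n)]
    total = sum(row)
    return [x / total for x in row]

# Made-up CPDs with the student network's structure.
P_D = random_row(2)                                                # P(D)
P_I = random_row(2)                                                # P(I)
P_G = {(i, d): random_row(3) for i in range(2) for d in range(2)}  # P(G | I, D)
P_S = {i: random_row(2) for i in range(2)}                         # P(S | I)
P_L = {g: random_row(2) for g in range(3)}                         # P(L | G)

# Sum the chain-rule product over every joint assignment.
total = sum(
    P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]
    for d, i, g, s, l in itertools.product(
        range(2), range(2), range(3), range(2), range(2))
)
print(round(total, 10))  # 1.0
```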

Now, as I said, this is a very simple proof, but these tools of pushing in summations and removing variables that are not relevant are going to turn out to be very important. Okay.

So now, let's just define a little bit of terminology that will, again, accompany us later on. We're going to say that a distribution P factorizes over G, meaning we can represent it over the graph G, if we can encode it using the chain rule for Bayesian networks. So the distribution factorizes over a graph G if I can represent it in this way, as a product of these conditional probabilities.
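The factorization just defined can be written as one small generic function; the dictionary layout here is my own illustration, not any particular library's API.

```python
def chain_rule(assignment, cpds, parents):
    """Chain rule for Bayesian networks:
    P(x1, ..., xn) = product over i of P(xi | parents(xi)).
    `cpds[var]` maps a tuple of parent values to a distribution over var."""
    p = 1.0
    for var in cpds:
        parent_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpds[var][parent_values][assignment[var]]
    return p

# A tiny two-variable network: D with no parents, L depending on D.
parents = {"D": (), "L": ("D",)}
cpds = {
    "D": {(): {"d0": 0.6, "d1": 0.4}},
    "L": {("d0",): {"l0": 0.3, "l1": 0.7},
          ("d1",): {"l0": 0.9, "l1": 0.1}},
}
p = chain_rule({"D": "d0", "L": "l1"}, cpds, parents)
print(p)  # 0.6 * 0.7
```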

Â 14:40

So let me just conclude with a little example, which is arguably the very first example of a Bayesian network ever. It was an example of Bayesian networks before anybody even called them Bayesian networks, and it was defined by statistical geneticists who were trying to model genetic inheritance in a population. So here is an example of genetic inheritance of blood types; let me give a little bit of background on the genetics. In this very oversimplified example, a person's genotype is defined by two values, because we have two copies of each of those chromosomes.

And so, for each copy, we have three possible values: the A value, the B value, or the O value. These are familiar to many of you from blood types, which is what we're modelling here. And so the total number of genotypes is listed over here: we have the AA genotype, AB, AO, BO, BB, and OO, so we have a total of six distinct genotypes. However, we have fewer phenotypes, because it turns out that AA and AO manifest exactly the same way, as blood type A, and BO and BB manifest as blood type B, so we have only four phenotypes. So how do we model genetic inheritance in a Bayesian network? Here is a little Bayesian network for genetic inheritance. We can see that each individual, say Marge, has a genotype variable, which takes on one of the six values that we saw before. And the phenotype, in this case the blood type, depends only on the person's genotype.

We also see that a person's genotype depends on the genotypes of her parents, because, after all, she inherits one chromosome from her mother and one chromosome from her father. And so a person's genotype depends just on those two variables. This is a very nice and very compelling Bayesian network, because everything fits in beautifully: things really do depend only on the variables that we stipulate in the model that they depend on.
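To make the genetics concrete, here is a sketch of the two CPDs this network needs: phenotype given genotype, and child genotype given the parents' genotypes, under the standard simplification that each parent passes one of their two alleles with probability one half. The function names are my own, not from the lecture.

```python
from itertools import product

ALLELES = ("A", "B", "O")
# The six unordered genotypes: AA, AB, AO, BB, BO, OO.
GENOTYPES = sorted({"".join(sorted(pair)) for pair in product(ALLELES, repeat=2)})

def phenotype(genotype):
    """Blood type from genotype: AA/AO -> A, BB/BO -> B, AB -> AB, OO -> O."""
    if genotype == "OO":
        return "O"
    if genotype == "AB":
        return "AB"
    # O is recessive, so drop it if present.
    return genotype[0] if genotype[0] != "O" else genotype[1]

def child_genotype_cpd(mother, father):
    """P(child genotype | mother's genotype, father's genotype), assuming
    each parent passes one of their two alleles with probability 1/2."""
    dist = {}
    for allele_m in mother:
        for allele_f in father:
            g = "".join(sorted(allele_m + allele_f))
            dist[g] = dist.get(g, 0.0) + 0.25
    return dist

print(GENOTYPES)                       # the six genotypes
print(child_genotype_cpd("AO", "BO"))  # AB, AO, BO, OO, each with probability 0.25
```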
