So, now we're getting into Bayesian networks. And we're finally going to start talking about the actual representations that are going to be the bread and butter of what we're going to describe in this class. We're going to start by defining the basic semantics of a Bayesian network and how it's constructed from a set of factors. So let's start by looking at a running example that will accompany us throughout at least the first part of this course, and this is what we call the student example. In the student example, we have a student who's taking a class for a grade, and we're going to use the first letter of the word to denote the name of the random variable, just like we did in previous examples. So here, the random variable is going to be G. Now, the grade of the student obviously depends on how difficult the course is that he or she is taking, and on the intelligence of the student. So in addition to G, we also have D and I. And we're going to add a couple of extra random variables just to make things a little bit more interesting. We're going to assume that the student has taken the SAT, so he may or may not have scored well on it; that's another random variable, S. And then finally, we also have the recommendation letter, L, that the student gets from the instructor of the class. Okay? And we're going to grossly oversimplify this problem by basically binarizing everything except for grades. So every variable has only two values except for grade, which has three, and this is only so I can write things compactly. This is not a limitation of the framework; it's just so that the probability distributions don't become unmanageable. Okay. So now, let's think about how we can construct the dependencies in this probability distribution. Let's start with the random variable grade.
I'm going to put G in the middle and ask ourselves what the grade of the student depends on. And it seems, just from a completely intuitive perspective, clear that the grade of the student depends on the difficulty of the course and on the intelligence of the student. So we already have a little baby Bayesian network with three random variables. Let's now take the other random variables and introduce them into the mix. So for example, the SAT score of the student doesn't seem to depend on the difficulty of the course, or on the grade that the student gets in the course. The only thing it's likely to depend on, in the context of this model, is the intelligence of the student. And finally, caricaturing the way in which instructors write recommendation letters, we're going to assume that the quality of the letter depends only on the student's grade, because the professor is teaching, you know, 600 students, or maybe 100,000 online students. So the only thing one can say about the student comes from looking at their actual grade record, and therefore, regardless of anything else, the quality of the letter depends only on the grade. Now, this is a model of the dependencies; it's only one model that one could construct for these dependencies. For example, I could easily imagine other models, for instance ones in which students who are brighter take harder courses, in which case there might potentially be an edge between I and D. But we're not going to use that model, so let's erase that edge, because we're going to stick with the simpler model for the time being. This is only to highlight the fact that a model is not set in stone; it's a representation of how we believe the world works. So, here is the model drawn out a little bit more nicely than the picture before.
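As a small aside, the dependency structure we just built can be written down directly, say as a Python dictionary mapping each variable to its parents. This is just a sketch; the dictionary layout is my own, but the structure is exactly the one from the lecture:

```python
# The student network's dependency structure as a parents map:
# each variable maps to the tuple of variables its CPD will condition on.
# Names follow the lecture: D(ifficulty), I(ntelligence), G(rade),
# S(AT), L(etter).
parents = {
    "D": (),          # difficulty: no parents
    "I": (),          # intelligence: no parents
    "G": ("I", "D"),  # grade depends on intelligence and difficulty
    "S": ("I",),      # SAT score depends only on intelligence
    "L": ("G",),      # letter quality depends only on the grade
}
```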
And now let's think about what we need to do in order to turn this into a representation of a probability distribution, because right now all it is is a bunch of nodes stuck together with edges. So how do we actually get this to represent a probability distribution? The way we're going to do that is to annotate each of the nodes in the network with a CPD. We previously defined CPDs; a CPD, just as a reminder, is a conditional probability distribution, using the abbreviation here. Each of these is a CPD, so since we have five nodes, we have five CPDs. Now, if you look at some of these CPDs, they're kind of degenerate. For example, the difficulty CPD isn't actually conditioned on anything; it's just an unconditional probability distribution that tells us, for example, that courses are only 40% likely to be difficult and 60% likely to be easy. Here is a similar unconditional probability distribution for intelligence. Now, this gets more interesting when you look at the actual conditional probability distributions. Here, for example, is a CPD that we've seen before: the CPD over the grades A, B, and C. This is the conditional probability distribution, which we've already seen, for the probability of grade given intelligence and difficulty, and we've already discussed how each of these rows necessarily sums to one, because each row is a probability distribution over the variable grade. And we have two other CPDs here: the probability of SAT given intelligence, and the probability of letter given grade. So, just to write this out completely, we have P of D, P of I, P of G given I, D, P of L given G, and P of S given I. That is now a fully parameterized Bayesian network, and what we'll show next is how this Bayesian network produces a joint probability distribution over these five variables.
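To make the "each row sums to one" point concrete, here is one way the grade CPD might be written in Python. The row corresponding to (i1, d0) contains the 0.02 entry used later in the lecture; the other numbers are illustrative fill-ins, not something to treat as definitive:

```python
# CPD for P(G | I, D): one row per (intelligence, difficulty) assignment,
# one column per grade value g1 (A), g2 (B), g3 (C).
# Only the (i1, d0) -> g3 entry (0.02) is quoted in the lecture;
# the remaining numbers are illustrative.
p_grade = {
    ("i0", "d0"): [0.30, 0.40, 0.30],
    ("i0", "d1"): [0.05, 0.25, 0.70],
    ("i1", "d0"): [0.90, 0.08, 0.02],
    ("i1", "d1"): [0.50, 0.30, 0.20],
}

# Every row is a distribution over G, so each must sum to one.
for parent_values, row in p_grade.items():
    assert abs(sum(row) - 1.0) < 1e-9, parent_values
```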
So, here are my CPDs, and what we're going to define now is the chain rule for Bayesian networks. That chain rule basically takes all these little CPDs and multiplies them together, like that. Now, before we think about what that means, let us first note that this is actually a factor product, in exactly the same sense that we just defined: here we have five factors with overlapping scopes, and what we end up with is one big factor whose scope is all five variables. So what does that translate into when we apply the chain rule for Bayesian networks in the context of this particular example? Let's look at this particular assignment; remember, there's going to be a bunch of different assignments, and I'm just going to compute the probability of this one. So, the probability of d0, i1, g3, s1, and l1. Well, the first thing we need is the probability of d0, which is 0.6. The next thing, from the next factor, is the probability of i1, which is 0.3. What about the next one? The next one is a conditional factor, and for that we need the probability of g3, so we want this column over here. And which row do we want? We want the row that corresponds to d0 and i1, which gives us 0.02. Continuing on, we now have g3, so we want this row, and we want the probability of l1, so we want this entry, 0.01. And finally, here, we have the probability of s1 given i1, so that would be this one over here, 0.8. In this exact same fashion, we can define all of the entries in this joint probability distribution. Great. So, what does that give us as a definition? A Bayesian network is a directed acyclic graph. Acyclic means that it has no cycles; that is, you can't traverse the directed edges and get back to where you started.
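The single-assignment computation above is just a product of five numbers, one per CPD. A quick check in Python, using exactly the entries quoted in the lecture:

```python
# Chain rule for the student network, evaluated at the single
# assignment (d0, i1, g3, s1, l1), using the CPD entries from the lecture.
p_d0 = 0.6        # P(d0)
p_i1 = 0.3        # P(i1)
p_g3 = 0.02       # P(g3 | i1, d0)
p_l1 = 0.01       # P(l1 | g3)
p_s1 = 0.8        # P(s1 | i1)

joint = p_d0 * p_i1 * p_g3 * p_l1 * p_s1
# joint is 2.88e-05 (up to floating-point error)
```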
This is typically abbreviated as a DAG, and we're going to use the letter G to denote directed acyclic graphs. The nodes of this directed acyclic graph represent the random variables that we're interested in reasoning about, X1 up to Xn. And for each node Xi in the graph, we have a CPD that denotes the dependence of Xi on its parents in the graph G. Okay? So this is a set of variables; this would be, for example, the probability of G given I and D, where Xi would be G and the parents of Xi would be I and D. And this Bayesian network represents a joint probability distribution via the chain rule for Bayesian networks, which is the rule that's written over here. This is just a generalization of the rule that we wrote on the previous slide, where we multiplied in one CPD for each variable. Here again, we're multiplying over all the variables i, multiplying in the CPD for each of the Xi's, and this is, once again, a factor product. So, that's great, and I've just defined a joint probability distribution, but who's to say that this is even a legal joint probability distribution? What do you have to show in order to demonstrate that something is a legal probability distribution? The first thing you have to show is that it's greater than or equal to zero, because probability distributions had better be non-negative, and so we need to show that for this distribution P. This, as it happens, is quite trivial, because P is a product of factors; it's actually a product of CPDs, and CPDs are non-negative. And if you multiply a bunch of non-negative factors, you get a non-negative factor. That's fairly straightforward. The next one's a little bit trickier. The other thing that we have to show in order to demonstrate that something is a legal distribution is to prove that it sums to one.
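The general chain rule can be sketched as a short function: given the parent sets and the CPDs, the joint probability of a full assignment is the product of one CPD entry per variable. This is a minimal sketch, with function and dictionary layouts of my own choosing:

```python
def joint_probability(assignment, parents, cpds):
    """Chain rule for Bayesian networks:
    P(x1, ..., xn) = product over i of P(xi | parents(Xi)).

    assignment: dict, variable name -> value
    parents:    dict, variable name -> tuple of parent names
    cpds:       dict, variable name -> function
                (value, tuple_of_parent_values) -> probability
    """
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpds[var](value, parent_values)
    return p
```

Because every factor in the product is a CPD entry, the result is automatically non-negative, which is the first half of the argument on this slide.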
So how do we prove that this probability distribution sums to one? Let's actually work through this, and one of the reasons for working through it is not just to convince you that I'm not lying to you, but, more importantly, that the techniques we're going to use for this simple argument are going to accompany us throughout much of the first part of the course, so it's worth going through in a little bit of detail. I'm going to show this in the context of the example distribution that we just defined, rather than the general case, which would just end up cluttered with notation; the idea is exactly the same. So, let's sum up, over all possible assignments to D, I, G, S, and L, this hopefully-a-probability-distribution that I just defined, P of D, I, G, S, and L. Okay? We're going to break this up using the chain rule for Bayesian networks, because that's how we defined the distribution P. Okay. So here is the magic trick that we use whenever we deal with distributions like this. Notice that we have a summation over several random variables here, and inside the summations we have a bunch of factors, each of which involves only a small subset of the variables. And when you have a situation like that, you can push in the summations. Specifically, we're going to start out by pushing in the summation over L. Which factors involve L? Only the last one. So we can move the summation over L to this position, ending up with the following expression. Notice that the summation is now over here; that's the first trick. The second trick uses the definition of CPDs: we observe that the summation over L is summing over all possible mutually exclusive and exhaustive values of the variable L in this conditional probability distribution, P of L given G.
And that means, basically, that we're summing up over a row of the CPD, and so the sum has to be one. So this term effectively cancels out, which gives us this. Now we can do exactly the same thing: we can take S and move it over here, and that gives us that, and this too is a sum over a row of a CPD, so it too cancels and can be written as one. We can do the same with G; again, this has to be equal to one, so we can cancel it, and so on and so forth. By the end of this, you will have cancelled all of the variables in the summation, and what remains is one. Now, as I said, this is a very simple proof, but these tools of pushing in summations and removing variables that are not relevant are going to turn out to be very important. Okay. So now, let's define a little bit of terminology that will, again, accompany us later on. We're going to say that a distribution P factorizes over G, meaning we can represent it over the graph G, if we can encode it using the chain rule for Bayesian networks. That is, the distribution factorizes over a graph G if I can represent it in this way, as a product of these conditional probabilities. Let me conclude with a little example which is, arguably, the very first example of a Bayesian network ever; it was an example of Bayesian networks before anybody even called them Bayesian networks, and it was defined by statistical geneticists who were trying to model genetic inheritance in a population. So here is an example of the genetic inheritance of blood types; let me give a little bit of background on the genetics of this. In this very simple, oversimplified example, a person's genotype is defined by two values, because we have two copies of each chromosome.
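To make the sum-to-one argument concrete, here is a brute-force check in Python over all 2·2·3·2·2 = 48 assignments of the student network. The entries quoted in the lecture (P(d0) = 0.6, P(i1) = 0.3, P(g3 | i1, d0) = 0.02, P(l1 | g3) = 0.01, P(s1 | i1) = 0.8) are included; the remaining numbers are illustrative fill-ins chosen so that each CPD row sums to one:

```python
from itertools import product

# CPDs for the student network; rows of each conditional table sum to one.
P_D = {"d0": 0.6, "d1": 0.4}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
       ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
       ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
       ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20}}
P_S = {"i0": {"s0": 0.95, "s1": 0.05},
       "i1": {"s0": 0.20, "s1": 0.80}}
P_L = {"g1": {"l0": 0.10, "l1": 0.90},
       "g2": {"l0": 0.40, "l1": 0.60},
       "g3": {"l0": 0.99, "l1": 0.01}}

# Sum the chain-rule product over every full assignment.
total = 0.0
for d, i, g, s, l in product(P_D, P_I, P_L, P_S["i0"], P_L["g1"]):
    total += P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]
# total is 1.0 up to floating-point error
```

Of course, the brute-force sum grows exponentially in the number of variables; the pushing-in-summations trick from the proof is exactly what lets us avoid enumerating all assignments in larger networks.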
And so, for each chromosome copy, we have three possible values: either the A value, the B value, or the O value. These are familiar to many of you from blood types, which is what we're modelling here. So the total number of genotypes is listed over here: we have the AA genotype, AB, AO, BO, BB, and OO, for a total of six distinct genotypes. However, we have fewer phenotypes, because it turns out that AA and AO manifest in exactly the same way, as blood type A, and BO and BB manifest as blood type B, so we have only four phenotypes. So how do we model genetic inheritance as a Bayesian network? Here is a little Bayesian network for genetic inheritance. We can see that for each individual, say Marge, we have a genotype variable, which takes on one of the six values that we saw before, and the phenotype, in this case the blood type, depends only on the person's genotype. We also see that a person's genotype depends on the genotypes of her parents, because, after all, she inherits one chromosome from her mother and one chromosome from her father, and so a person's genotype depends just on those two variables. This is a very nice and compelling Bayesian network, because everything fits in beautifully: things really do depend only on the variables that we stipulate in the model that they depend on.
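The genotype-to-phenotype CPD here is actually deterministic, and it's small enough to write out. A sketch of the standard mapping (A and B are dominant over O):

```python
# Deterministic genotype -> blood-type phenotype mapping:
# A and B are dominant over O, so AA and AO both show as type A,
# and BB and BO both show as type B.
def blood_type(genotype):
    alleles = set(genotype)          # e.g. "AO" -> {"A", "O"}
    if alleles == {"A", "B"}:
        return "AB"
    if "A" in alleles:
        return "A"
    if "B" in alleles:
        return "B"
    return "O"

# Six genotypes collapse to only four observable phenotypes.
phenotypes = {g: blood_type(g) for g in ["AA", "AB", "AO", "BO", "BB", "OO"]}
```

This is exactly why the phenotype node needs only the person's own genotype as a parent: given the genotype, nothing else in the network affects the blood type.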