So, now we're getting into Bayesian networks. And we're finally going to start talking

about the actual representations that are going to be the bread and butter of what

we're going to describe in this class. And so, we're going to start by defining

the basic semantics of a Bayesian network and how it's constructed

from a set of factors. So let's start by looking at a running

example that will accompany us through a large part of at least the first part of

this course, and this is what we call the student example.

So in the student example, we have a student who's taking the class for a

grade, and we're going to use the first letter of the word to denote the name

of the random variable just like we did in previous examples.

So, here, the random variable is going to be G.

Now, the grade of the student obviously depends on the difficulty of the course

that he or she is taking and on the intelligence of the student.

So that gives us in addition to G, we also have D and I. And we're going to

add a couple of extra random variables just to make things a little bit more

interesting. So we're going to assume that the student has taken the SAT,

so, he may or may not have scored well on the SAT,

so that's another random variable, S. And then finally, we also have the

recommendation letter, L, that the

student gets from the instructor of the class.

Okay? And we're going to grossly oversimplify this problem by basically

binarizing everything except for grades. So everything has only two values

except for grade, which has three, and this is only so I can write things compactly.

This is not a limitation of the framework, it's just so that the

probability distributions don't become unmanageable.
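The variables and their value spaces just described can be sketched as a small Python dictionary. The value names (d0/d1 and so on) follow the convention used later in this lecture; the mapping of g1/g2/g3 to the grades A/B/C is an assumption about that convention.

```python
# The five random variables of the student example and their values.
# Everything is binary except Grade, which has three values (A, B, C),
# written here as g1, g2, g3 (assumed to follow the lecture's naming).
domains = {
    "D": ["d0", "d1"],        # course Difficulty: easy / hard
    "I": ["i0", "i1"],        # student Intelligence: low / high
    "G": ["g1", "g2", "g3"],  # Grade: A / B / C
    "S": ["s0", "s1"],        # SAT score: low / high
    "L": ["l0", "l1"],        # recommendation Letter: weak / strong
}

# The joint distribution over these variables has
# 2 * 2 * 3 * 2 * 2 = 48 entries.
num_entries = 1
for values in domains.values():
    num_entries *= len(values)
```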

Okay. So now, let's think about how we can construct the dependencies in

this probability distribution. Okay.

So, let's start with the random variable grade.

I'm going to put G in the middle and ask ourselves what the grade of the student

depends on. And it seems, just, you know, from a

completely intuitive perspective, it seems clear that the grade of the student

depends on the difficulty of the course and on the intelligence of the student.

And so we already have a little baby Bayesian network with three random

variables. Let's now take the other random variables

and introduce them into the mix. So, for example, the SAT score of the

student doesn't seem to depend on the difficulty of the course or on the grade

that the student gets in the course. The only thing it's likely to depend on

in the context of this model is the intelligence of the student.

And finally, caricaturing the way in which instructors write recommendation

letters, we're going to assume that the quality of

the letter depends only on the student's grade,

because the professor's teaching, you know, 600 students or maybe 100,000 online

students. And so, the only thing that one can say about the student is by looking

at their actual grade record. And so, regardless of anything else, the

quality of the letter depends only on the grade.

Now, this is a model of the dependencies; it's only one of the models that one can

construct over these dependencies. So, for example, I could easily imagine

other models, for instance, ones that have students who

are brighter taking harder courses, in which case there might potentially be an

edge between I and D. But we're not going to use that model,

so let's erase that because we're going to stick with a simpler model for the

time being. But, this is only to highlight the fact

that a model is not set in stone, it's a representation of how we believe the

world works. So, here is the model drawn out a little

bit more nicely than the picture before.
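The graph structure just described (D and I as roots, G depending on both, S depending only on I, and L depending only on G) can be written down as a parent map. This is a minimal sketch for illustration, not code from the course:

```python
# Parent sets of the student network: an entry X -> [Y, Z] means
# Y and Z are the parents of X in the directed graph.
parents = {
    "D": [],          # Difficulty has no parents
    "I": [],          # Intelligence has no parents
    "G": ["I", "D"],  # Grade depends on intelligence and difficulty
    "S": ["I"],       # SAT depends only on intelligence
    "L": ["G"],       # Letter depends only on the grade
}

def topological_order(parents):
    """Return the variables in an order where parents come before children.

    Works because the graph is a DAG; a cycle would loop forever.
    """
    order, placed = [], set()
    while len(order) < len(parents):
        for var, pars in parents.items():
            if var not in placed and all(p in placed for p in pars):
                order.append(var)
                placed.add(var)
    return order
```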

And now let's think about what we need to do in order to turn this into a

representation of a probability distribution, because right now, all it is is a bunch

of, you know, nodes stuck together with edges. So how do we actually get this

to represent a probability distribution?

And the way which we're going to do that is we're going to annotate each of the

nodes in the network with what's called a CPD.

So, we previously defined CPDs. A CPD, just as a reminder,

is a conditional probability distribution;

we're using the abbreviation here. And each of these is a CPD,

so we have five nodes, we have five CPDs. Now, if you look at some of these CPDs,

they're kind of degenerate. So, for example, the difficulty CPD isn't

actually conditioned on anything. It's just an unconditional probability

distribution that tells us, for example, that courses are only 40%

likely to be difficult and 60% likely to be easy.

Here is a similar unconditional probability distribution for intelligence.

Now this gets more interesting when you look at the actual conditional

probability distributions. So here, for example, is a CPD that we've

seen before: this is the CPD over the grades A, B, and C.

So, here is the conditional probability distribution that we've already seen

before for the probability of grade given intelligence and difficulty, and we've

already discussed how each of these rows necessarily sums to one, because each is a

probability distribution over the variable grade. And we have two other

CPDs here. In this case, the probability of SAT

given intelligence and the probability of letter given grade.
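As a concrete check of the "each row sums to one" property, here is a sketch of the grade CPD as a table. The specific numbers are the ones commonly shown on the student-example slides and should be treated as illustrative; the passage above doesn't quote them.

```python
# P(G | I, D): one row per assignment to the parents (I, D), one
# column per grade value g1, g2, g3 (i.e., A, B, C). The numbers are
# the standard student-example values, reproduced here as an assumption.
p_g_given_i_d = {
    ("i0", "d0"): [0.30, 0.40, 0.30],
    ("i0", "d1"): [0.05, 0.25, 0.70],
    ("i1", "d0"): [0.90, 0.08, 0.02],
    ("i1", "d1"): [0.50, 0.30, 0.20],
}

# Each row is a distribution over G, so it must sum to one.
for parent_assignment, row in p_g_given_i_d.items():
    assert abs(sum(row) - 1.0) < 1e-9, parent_assignment
```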

So, just to write this out completely, we have P of D,

P of I, P of G given I, D, P of L given G, and P of S given I.

And that now is a fully parameterized Bayesian network and what we'll show next

is how this Bayesian network produces a joint probability distribution over these

five variables. So, here are my CPDs and what we're going

to define now is the chain rule for Bayesian networks, and that chain rule

basically takes all these little CPDs and

multiplies them together, like that.
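Written out, the chain rule for this network is simply the product of the five CPDs:

```latex
P(D, I, G, S, L) = P(D)\, P(I)\, P(G \mid I, D)\, P(S \mid I)\, P(L \mid G)
```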

Now, before we think about what that means, let us first note that this is actually a

factor product in exactly the same way that we just defined.
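The factor product operation referred to here can be sketched generically: two factors with overlapping scopes are multiplied entry by entry, matching them up on their shared variables. This is a minimal illustration of the operation, not the course's own code; the factor representation (a scope tuple plus a table of assignments) is my own choice.

```python
from itertools import product

def factor_product(f1, f2):
    """Multiply two factors.

    Each factor is (scope, table), where scope is a tuple of variable
    names and table maps assignments (tuples of values) to numbers.
    The result's scope is the union of the two scopes, and each entry
    is the product of the matching entries of f1 and f2.
    """
    (scope1, t1), (scope2, t2) = f1, f2
    scope = tuple(scope1) + tuple(v for v in scope2 if v not in scope1)

    # Collect each variable's domain from the values in the tables.
    domains = {}
    for sub_scope, table in ((scope1, t1), (scope2, t2)):
        for assignment in table:
            for var, val in zip(sub_scope, assignment):
                domains.setdefault(var, set()).add(val)

    def project(assignment, sub_scope):
        index = dict(zip(scope, assignment))
        return tuple(index[v] for v in sub_scope)

    result = {}
    for assignment in product(*(sorted(domains[v]) for v in scope)):
        result[assignment] = (t1[project(assignment, scope1)] *
                              t2[project(assignment, scope2)])
    return scope, result
```

For example, multiplying a factor over {I} with a factor over {I, S} yields a single factor whose scope is {I, S}; chaining the operation over all five CPDs yields the big factor over five variables described above.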

So here, we have five factors with overlapping scopes, and what we

end up with is a factor product that gives us one big factor whose scope is

five variables. So what does that translate into when we

apply the chain rule for Bayesian networks in the context of the particular

example? So let's look at this particular

assignment, and remember, there are going to be a bunch of these different assignments,

and I'm just going to compute the probability of this one.

So the probability of d0, i1, g3, s1, and l1, well, so the first thing we need is

the probability of d0 and the probability of d0 is 0.6.
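Carrying this computation through to the end requires the remaining CPD entries. Only P(d0) = 0.6 is stated explicitly in the passage above; the other numbers below are the ones commonly shown on the student-example slides and are included as illustrative assumptions.

```python
# P(d0, i1, g3, s1, l1) via the chain rule:
#   P(D) * P(I) * P(G | I, D) * P(S | I) * P(L | G)
# Only P(d0) = 0.6 is quoted in the text; the rest are the standard
# student-example values, used here as illustrative assumptions.
p_d0 = 0.6                 # course is easy
p_i1 = 0.3                 # student is intelligent
p_g3_given_i1_d0 = 0.02    # a C despite being smart in an easy class
p_s1_given_i1 = 0.8        # good SAT score given high intelligence
p_l1_given_g3 = 0.01       # strong letter despite a C

p_joint = p_d0 * p_i1 * p_g3_given_i1_d0 * p_s1_given_i1 * p_l1_given_g3
```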