0:02

We previously defined the notion of Bayesian estimation and showed how it could be applied in the context of a single random variable, say a multinomial random variable. Now we're going to step back to the world of probabilistic graphical models and think about the application of these ideas to the problem of estimating parameters in a Bayesian network.

So, let's draw again the probabilistic graphical model that represents Bayesian estimation in a Bayesian network. Just as before, in the single-variable case, we're going to inject into the model explicit random variables that define the parameters. And so here we have two random variables: theta X, which represents the CPD of X, and theta Y given X, which represents the CPD P(Y | X). Now notice that each of these is actually vector-valued, because there are multiple actual numbers in each of these CPDs, but we're going to draw them as single circles.

Now, once again we can look at this network and read off certain important conclusions. The first important conclusion is that the instances, these (X, Y) pairs, are independent given the parameters. We can see that by noticing that if I condition on both theta X and theta Y given X, then the (X, Y) pairs become d-separated from each other, and so we have conditional independence following as a consequence of the structure of the graphical model. Another explicit property that we can read from this diagram is that theta X and theta Y given X are marginally independent. So, a priori, the parameter prior over all of the parameters theta can be written as the product over i, where i ranges over the random variables in the network, of the prior over the CPD for Xi. That is, the prior is the product of little priors, one for each CPD.

Â 2:17

Now, it follows from this, just by writing down the graphical model and looking at its implications, that the posteriors over the parameters are also independent given complete data. The reason for that is that complete data d-separates the parameters for the two CPDs. If you look at this network over here and assume that all of these variables are observed, then you can see that there is no active trail between theta X and theta Y given X, because, for example, if we look at this trail, we can see that X[2] blocks the trail from theta X to theta Y given X. And so, again following directly from the structure of this network, we can see that the posterior distribution over theta X and theta Y given X, given D, decomposes as a product of the posterior over theta X given D times the posterior over theta Y given X, given D. Which means that, just as in maximum likelihood estimation, where you could break up the estimation problem into one of estimating each CPD separately, we can do the same here. Only now we can do it using Bayesian estimation, where instead of just picking a single parameter setting for each CPD, we compute the separate posteriors separately and then put them together into a single posterior.
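Written out, this decomposition follows directly from the likelihood. A sketch of the derivation, with D denoting the complete data, m indexing instances, and using the marginal independence of the parameters stated above:

```latex
\begin{aligned}
P(\theta_X, \theta_{Y|X} \mid \mathcal{D})
  &\propto P(\mathcal{D} \mid \theta_X, \theta_{Y|X})\; P(\theta_X)\; P(\theta_{Y|X}) \\
  &= \Big[\prod_m P(x[m] \mid \theta_X)\Big] P(\theta_X)
     \;\Big[\prod_m P(y[m] \mid x[m],\, \theta_{Y|X})\Big] P(\theta_{Y|X}) \\
  &\propto P(\theta_X \mid \mathcal{D})\; P(\theta_{Y|X} \mid \mathcal{D})
\end{aligned}
```

Each bracketed term involves only one CPD's parameters, which is exactly why the posterior splits into a product of per-CPD posteriors.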

Â 4:04

Now, it turns out that we can do an even finer breakdown in the context of table CPDs. So here we're now looking at the binary case, where X is a binary-valued random variable. Now we have two multinomials in our CPD for Y given X: one corresponding to the case of Y given x1, and the other to Y given x0. And it turns out that if, as in this model, we assume that these are independent a priori, which is what this diagram says, because you notice there are no edges between theta Y given x1 and theta Y given x0, so they are marginally independent, then they are also independent in the posterior. Now, that is a little bit trickier to show, because you cannot read it directly from the diagram: in fact, even given complete data, there appears to be an active trail that goes from theta Y given x1 through Y[1], which, since it's observed, activates the v-structure, and on into theta Y given x0. [COUGH] But it turns out that if we go back to some of the examples of context-specific independence that we had, specifically in the case of a multiplexer CPD, we can derive that these are in fact, despite the appearance of an activated v-structure, conditionally independent in the posterior as well. And so, once again, we can compute the posterior as a product of posteriors of the form P(theta X given D) times P(theta Y given x1, given D) times P(theta Y given x0, given D). Okay.

Â 6:08

So we can generalize this to a general Bayesian network. Let's assume that we have a Bayesian network with table CPDs, specified in terms of multinomial parameters of the form theta X given u, where u is some assignment to X's parents U. If, for each such multinomial parameter, we have a Dirichlet prior with appropriate hyperparameters, then we can show, using the kind of analysis that we just did, combined with the analysis of the posterior for a single multinomial, that the posterior is also Dirichlet, with hyperparameters that represent the prior that we had for that multinomial plus the sufficient statistics, which are the counts in the data of the particular combinations of the parent and the child. And so, for example, for the entry in the multinomial representing the value little x1 for X and the assignment little u for the parents U, we have this prior hyperparameter plus the count in the data for that combination of x and u.
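As a concrete sketch of this update (the function and the toy data below are illustrative, not from the lecture): for each combination of a value x and a parent assignment u, the posterior hyperparameter is just the prior hyperparameter plus the count M[x, u].

```python
from collections import Counter

def dirichlet_posterior(prior, data):
    """Posterior Dirichlet hyperparameters for a table CPD P(X | U).

    prior: dict mapping (x, u) -> prior hyperparameter alpha_{x|u}
    data:  list of observed (x, u) pairs (complete data)
    Returns a dict mapping (x, u) -> alpha_{x|u} + M[x, u].
    """
    counts = Counter(data)  # sufficient statistics M[x, u]
    return {key: alpha + counts[key] for key, alpha in prior.items()}

# Toy example: binary X with a binary parent U, uniform prior alpha = 1 each.
prior = {(x, u): 1.0 for x in (0, 1) for u in (0, 1)}
data = [(1, 0), (1, 0), (0, 0), (1, 1)]  # four sampled (x, u) instances
posterior = dirichlet_posterior(prior, data)
```

Note that each parent assignment u has its own row of hyperparameters, and each row is updated only by the instances in which U = u, which is exactly the per-row posterior independence discussed above.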

Now we know how to take a set of priors and use the data to update them to form posteriors. So let's think about where the priors might come from. A priori, it might seem very daunting to construct a set of priors for all of the nodes in a Bayesian network. It turns out, however, that there is a general-purpose scheme for doing that, which is both easy and has some good theoretical properties. That scheme works as follows. What we're going to do is define a prior Bayesian network that has some set of parameters theta zero, and we're going to define a single equivalent sample size.

Â 8:24

That equivalent sample size is going to be applied to all of the nodes in the Bayesian network. So, in order to specify the hyperparameter alpha of X given u, for an assignment X equals little x and U equals little u, we're simply going to compute the probability of x and u in this prior parameterized network, and multiply it by the equivalent sample size alpha. Now, in many cases you're just going to use theta zero to be the uniform parameters, which makes this a very easy computation. But this provides a simple, coherent way to specify all of the hyperparameters simultaneously.
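A minimal sketch of this scheme, assuming the prior network theta zero has uniform parameters (the function name and toy values here are made up for illustration): the hyperparameter for each (x, u) combination is the equivalent sample size times the prior network's joint probability of x and u.

```python
from itertools import product

def bde_hyperparameters(alpha, x_vals, u_vals, p0=None):
    """Hyperparameters alpha_{x|u} = alpha * P0(x, u) for a CPD P(X | U).

    p0: joint distribution of the prior network, as a dict (x, u) -> probability;
        defaults to the uniform distribution over all (x, u) combinations.
    """
    combos = list(product(x_vals, u_vals))
    if p0 is None:
        p0 = {c: 1.0 / len(combos) for c in combos}  # uniform theta zero
    return {c: alpha * p0[c] for c in combos}

# Binary X with a binary parent U, equivalent sample size alpha = 4:
hypers = bde_hyperparameters(4.0, (0, 1), (0, 1))
# each of the four (x, u) combinations gets 4 * 1/4 = 1 pseudo-count
```

For a root variable, u is the empty assignment, so this reduces to alpha times P0(x), which for a uniform prior over a binary variable gives alpha over two per value, matching the example that follows.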

And so, let's look at an example. Here is a network over X and Y that has no edge between them, and let's imagine that this is our prior network.
Â 9:47

And now, let's look at what we would get in a network where X is a parent of Y, which is the network whose parameters we actually want to estimate. Let's assume that X and Y are both binary. So theta X is going to be distributed as a Dirichlet with hyperparameters alpha over two, alpha over two. And theta Y given x0 is going to be distributed as a Dirichlet with hyperparameters alpha times the probability of x and y, which, since the prior is the uniform distribution, is a quarter. And similarly, theta Y given x1 is going to have the same Dirichlet distribution. If you think about this, it makes perfect sense, because it tells us that we have seen the same number of examples of X as we have of Y. It's just that for Y, we had to partition the examples over those where X = x0 and those where X = x1. If, on the other hand, we had used, say, Dirichlet with hyperparameters alpha over two, alpha over two also for the two multinomials corresponding to Y, this one and this one, it would imply that we've seen twice as many Y's as we've seen X's. So let's see what kind of effect using Bayesian estimation has on a pseudo-real-world example.

So, this is actually a real network. It was developed for monitoring patients in an ICU, and we call it the ICU-Alarm network. The ICU-Alarm network has 37 different variables that represent things like whether the patient was intubated, the patient's blood pressure, heart rate, and various other medical events that might happen. Overall, the network has 504 parameters. Now, there aren't actually data cases here; this was a hand-constructed network. So what we're going to do is sample instances from the network, and then pretend that we don't know the network parameters and see the extent to which we can recover them by learning from the instances that we sampled.

Â 12:54

I should say that this is a pseudo-realistic learning problem, because the instances that one samples from a network are always cleaner than the instances that one gets in a real-world data set. In a real-world scenario, it is rarely the case that the network whose structure you're trying to learn has the exact same structure as the true underlying distribution from which the data were generated. So this is a much cleaner scenario, but it's still useful and indicative. What we see here are the results of learning, where the x-axis is the number of samples and the y-axis is a distance function between the true distribution and the learned distribution. That distance function, which we're not going to talk about in detail at the moment, is a notion called the relative entropy, also called KL divergence. What we need to know about it for the purposes of the current discussion is that when two distributions are identical it's zero, and otherwise it's non-negative.
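As a brief aside (a sketch, not part of the lecture itself): for discrete distributions, the relative entropy is D(P || Q) = sum over x of P(x) log(P(x) / Q(x)), which is zero exactly when the two distributions are identical and positive otherwise.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P || Q) for discrete distributions given as lists."""
    # Terms with p_i = 0 contribute zero by convention, so skip them.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

identical = kl_divergence([0.5, 0.5], [0.5, 0.5])  # 0.0
different = kl_divergence([0.9, 0.1], [0.5, 0.5])  # positive
```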

So, what we see here is that the blue line corresponds to maximum likelihood estimation. We can see several things about that line. First of all, it's very jagged; there are a lot of bumps in it. And second, it's consistently higher than all of the other lines, which means that maximum likelihood estimation, although it does continue to get lower as we get more data, still hasn't gotten close to the true underlying distribution even with as many as five thousand data points. Conversely, let's see what happens with Bayesian estimation. This is all Bayesian estimation with a uniform prior and different equivalent sample sizes; that is, using a prior network with uniform parameters and different values of alpha. And what we see here is that alpha equals five, which is the green line, and alpha equals ten are almost sitting directly on top of each other, and they're both considerably lower than all of the other lines, including the maximum likelihood estimate. As we increase the prior strength, so that we have a firmer belief in the uniform prior, we can see that we move a little bit away, and the performance becomes a little worse. But notice that by around 2,000 data points we're already pretty close to where we were for an equivalent sample size of five. For 50, which is this dark blue line, it takes a little bit longer to converge, and it doesn't quite make it. But even with an equivalent sample size of 50, which is pretty high, you get convergence toward the correct distribution much faster than you do with maximum likelihood estimation.

So, to summarize: in Bayesian networks, if we're doing Bayesian parameter estimation and we're willing to stipulate that the parameters are independent a priori, then they're also independent in the posterior, which allows us to maintain the posterior as a product of posteriors over individual parameters. For multinomial Bayesian networks, we can perform Bayesian estimation using the exact same sufficient statistics that we used for maximum likelihood estimation, which are the counts corresponding to a value of the variable and a value of its parents. And whereas, in the context of maximum likelihood estimation, we would simply use the formula on the left, in the case of Bayesian estimation we're going to use the formula on the right, which has exactly the same form, only it also accounts for the hyperparameters. And in order to do this kind of process, we need a choice of prior, and we showed how that can be effectively elicited using both a prior distribution, specified say as a Bayesian network, as well as an equivalent sample size.
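The two formulas can be sketched side by side (a toy illustration; the counts and hyperparameters below are made up): the maximum likelihood estimate is M[x, u] / M[u], while the Bayesian estimate is (alpha_{x|u} + M[x, u]) / (alpha_u + M[u]), where alpha_u is the sum of the hyperparameters for that parent assignment.

```python
def mle_estimate(count_xu, count_u):
    """Maximum likelihood estimate of P(x | u): M[x, u] / M[u]."""
    return count_xu / count_u

def bayesian_estimate(alpha_xu, alpha_u, count_xu, count_u):
    """Bayesian estimate of P(x | u): (alpha_{x|u} + M[x, u]) / (alpha_u + M[u])."""
    return (alpha_xu + count_xu) / (alpha_u + count_u)

# Suppose we saw M[x, u] = 3 out of M[u] = 4 instances with this parent value,
# and the Dirichlet prior for this row is (1, 1), so alpha_u = 2.
theta_mle = mle_estimate(3, 4)                   # 0.75
theta_bayes = bayesian_estimate(1.0, 2.0, 3, 4)  # 4/6, pulled toward the prior
```

With few data points the hyperparameters dominate and the estimate stays near the prior; as the counts grow, the Bayesian estimate converges to the maximum likelihood one, which matches the learning-curve behavior described above.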
