0:00

We talked about Bayesian network structure learning as the problem of optimizing a scoring function over the space of network structures. And we also talked about how the likelihood score, which is the simplest score that one could come up with, has significant issues relative to overfitting. We're now going to look at a different score that's derived from Bayesian principles.

And we're going to see how that score, though it has some surface similarities to the likelihood score, is a considerably better score in terms of avoiding overfitting. So the Bayesian score is derived from the Bayesian paradigm, which says that anything we have uncertainty over needs to have a probability distribution over it. And so if we have uncertainty over graphs, then we need a prior over graphs; and if we have uncertainty over parameters, we need a probability distribution over parameters. And so we now define our optimization problem as that of finding the graph G that maximizes the probability of G given D. If we rewrite the probability of G given D using Bayes' rule, we end up with the following expression.

P of D given G, times P of G, divided by P of D. So let's look at each of those terms separately. The first of these is a term called the marginal likelihood: P of D given G is the probability of the data given the graph. The second term, P of G, is a prior over graph structures. And the final term, P of D, the denominator, is called the marginal probability of the data.

Now, importantly, this marginal probability of the data is independent of G, and so it's not going to affect the choice of which graph we select. We can therefore ignore it in the context of a model selection problem, that is, of finding the single graph, or a graph, that maximizes the score. And so we're going to define the Bayesian score, score B of G relative to a data set D, as the log of the numerator of this expression: the log of the marginal likelihood plus the log of the prior over graphs.
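In symbols, the definition just described is (dropping the G-independent denominator, P of D):

```latex
P(G \mid D) \;=\; \frac{P(D \mid G)\, P(G)}{P(D)},
\qquad
\mathrm{score}_B(G : D) \;=\; \log P(D \mid G) \;+\; \log P(G)
```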

Now, we might think that this score is going to avoid overfitting because of its use of the prior over graphs. And although that prior can play a role, it actually turns out to play a significantly less important role than the first of these terms, the marginal likelihood. So let's look at the marginal likelihood in a little more depth.

So, the marginal likelihood, P of D given G, is not, as it happens, the same as the log-likelihood, because the marginal likelihood is something that integrates over all possible settings of the parameters. And so, from a mathematical perspective, what we're going to do is introduce a variable, theta G, into the probability expression, and then integrate it out, which is the same as summing it out, only it's a continuous space, so we're going to use integrals.
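In symbols, this integration over parameters is:

```latex
P(D \mid G) \;=\; \int P(D \mid \theta_G, G)\; P(\theta_G \mid G)\; d\theta_G
```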

So P of D given G is the integral, over all possible model parameterizations theta G, of the probability of D given G and theta G, times the probability of theta G given G. Now, the first of these two terms is the likelihood, which is exactly the component that we used in the log-likelihood score. But, importantly, unlike the likelihood score, we're not computing P of D given G and theta G just for the maximum likelihood parameters, theta hat G.

Rather, we're computing this probability averaged over all possible parameter settings, which gives us a considerably less optimistic assessment of the probability of the data given a particular structure, because we have to take into consideration not just a single parameter setting that is geared exactly toward our data set, which is what the theta hat G parameters are, but rather average over all possible parameter settings theta G, using the prior over parameters. So that less optimistic assessment is sort of one intuition as to why this might not overfit as much.

But it turns out there is another, perhaps even more intuitive, explanation as to why this score reduces overfitting. So let's look at this marginal likelihood term, P of D given G, and rewrite it as the probability of all the instances, X1 up to XM, given G. And what we're going to do now is break up this joint distribution using the chain rule for probabilities.

Not the chain rule for Bayesian networks, just the chain rule for probabilities. And we can write that as the probability of X1 given G, times the probability of X2 given X1 and G, times the probability of X3 given X1 and X2, and so on, up to the probability of XM given all of the previous ones, X1 up to XM minus 1, and G. And if we look at each of these terms, each of these is effectively making a prediction about an unseen instance, say Xm, given X1 up to Xm minus 1. So you can think of it as almost doing some kind of cross-validation, or estimate of generalization ability, because we're estimating the ability to predict an unseen instance given the previous ones. And so the probability of D, in some sense, incorporates into it some kind of analysis of generalization ability.
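As an illustration (my own sketch, not from the lecture), for a single binary variable with a Dirichlet — here, Beta — prior, the chain-rule product of next-instance predictions can be computed directly, and it agrees with the closed-form gamma-function expression for the marginal likelihood:

```python
import math

def sequential_marginal_likelihood(data, alpha1=1.0, alpha0=1.0):
    """Chain-rule view of P(D): a product of predictions, each one
    P(x_m | x_1, ..., x_{m-1}) for the next, as-yet-unseen instance."""
    prob, n1, n0 = 1.0, 0, 0
    for x in data:
        total = alpha1 + alpha0 + n1 + n0
        if x == 1:
            prob *= (alpha1 + n1) / total
            n1 += 1
        else:
            prob *= (alpha0 + n0) / total
            n0 += 1
    return prob

def closed_form_marginal_likelihood(data, alpha1=1.0, alpha0=1.0):
    """The same quantity, written directly with (log-)gamma functions."""
    n1 = sum(data)
    n0 = len(data) - n1
    lg = math.lgamma
    log_p = (lg(alpha1 + alpha0) - lg(alpha1 + alpha0 + n1 + n0)
             + lg(alpha1 + n1) - lg(alpha1)
             + lg(alpha0 + n0) - lg(alpha0))
    return math.exp(log_p)

data = [1, 0, 1, 1, 0]
print(sequential_marginal_likelihood(data))   # 1/60 for a uniform Beta(1,1) prior
print(closed_form_marginal_likelihood(data))  # identical, up to float rounding
```

Note that each factor in the sequential computation uses only the instances that came before it, which is exactly the generalization-like behavior described above.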

Now, you might say, well, surely the standard likelihood score does exactly the same thing. But a little more thought reveals that if we want to do this kind of analysis for the maximum likelihood set of parameters, that maximum likelihood set of parameters, theta hat G, depends on all of D.

All of the instances in D. And so we can't break it down in this way, because if we had theta hat G on the right-hand side, it would already incorporate all of the instances, including the unseen ones, which is yet another intuition as to why the maximum likelihood score tends to overfit. Now, the Bayesian score might look a little bit scary, because it has this integral in it, and who knows how we would ever compute it.

It turns out that for the case of multinomial Bayesian networks, the Bayesian score can actually be written in closed form, using a function called the gamma function. The gamma function, as written over here, is also an integral, but it turns out that the gamma function is really just a continuous extension of the factorial function, because it satisfies the equality gamma of x plus 1 equals x times gamma of x, so that gamma of n equals n minus 1 factorial for integer n. And most computers have an implementation of the gamma function in their standard math libraries.
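For instance, in Python's standard library (an illustrative aside, not part of the lecture):

```python
import math

# Gamma extends the factorial: Gamma(n) = (n - 1)! for positive integers,
# and Gamma(x + 1) = x * Gamma(x) holds for all x > 0.
print(math.gamma(5))   # 24.0, i.e. 4!

# In practice one works with lgamma (log of Gamma) so that the large
# counts appearing in the score don't overflow a float:
print(math.lgamma(500) - math.lgamma(499))  # log(499), since Gamma(500)/Gamma(499) = 499
```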

So, with the gamma function, we can actually rewrite the probability of D given G as a product that looks as follows. First, we have a product over all variables i, and then we have two different products included in this big outer product. The first is a product over the parent set, multiplying over all possible values u_i of the parent set, and here we have an expression involving gamma functions, where this expression involves both the Dirichlet prior parameters, the alphas, as well as the sufficient statistics. So the first product enumerates over all possible assignments to the parents, and then we have an inner product, yet another product, which enumerates over all possible values x_ij of the variable X_i itself.
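Reconstructing the formula being described (my notation: M[·] are the sufficient statistics, and α_{u_i} is the sum of the hyperparameters over the values of X_i for parent assignment u_i):

```latex
P(D \mid G) \;=\;
\prod_i \;\prod_{u_i \in \mathrm{Val}(\mathrm{Pa}^G_{X_i})}
\frac{\Gamma(\alpha_{u_i})}{\Gamma\!\big(\alpha_{u_i} + M[u_i]\big)}
\;\prod_{x_i^j \in \mathrm{Val}(X_i)}
\frac{\Gamma\!\big(\alpha_{x_i^j \mid u_i} + M[x_i^j, u_i]\big)}{\Gamma\!\big(\alpha_{x_i^j \mid u_i}\big)}
```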

And once again, we have gamma terms that involve both the prior and the sufficient statistics. And while this expression might still look a little bit scary, it's something that one can just plug into the computer and compute quite easily. Another valuable property of this marginal likelihood, if we look at it a little more and also take its logarithm, is that this expression, which was originally a product over variables i of some complicated expression relating to X_i, turns, when we take the log, into a sum over i of something that can be viewed as the family score for the variable X_i and its parent set together. And so, just like other scores that we've seen before, this scoring function decomposes as a sum over i of terms involving only X_i and its parents.
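A sketch of one such family term (my own helper and naming, not from the lecture), computed with log-gamma to avoid overflow:

```python
from math import lgamma

def log_family_score(counts, alphas):
    """Log of one variable's factor in the marginal likelihood.
    counts[u][j]: sufficient statistic M[x_ij, u] for parent assignment u;
    alphas[u][j]: the matching Dirichlet hyperparameter."""
    total = 0.0
    for u in range(len(counts)):            # outer product: parent assignments
        alpha_u = sum(alphas[u])
        m_u = sum(counts[u])
        total += lgamma(alpha_u) - lgamma(alpha_u + m_u)
        for j in range(len(counts[u])):     # inner product: values of the variable
            total += lgamma(alphas[u][j] + counts[u][j]) - lgamma(alphas[u][j])
    return total

# A binary variable with no parents (one trivial parent assignment),
# observed as three 1s and two 0s, under a uniform Dirichlet(1, 1) prior:
print(log_family_score([[3, 2]], [[1, 1]]))  # log(1/60)
```

The full log marginal likelihood is then just the sum of these family terms over all the variables, which is the decomposition property described above.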

And we'll see that this property can be quite important from a computational perspective. Now, moving on from the marginal likelihood, the second term in this expression is the log of P of G.

And so, in order to accommodate that, we need to have a prior P of G over graphs, and there's a variety of different priors that people use. It turns out that a quite common choice is to simply make P of G constant, and although that doesn't explicitly impose any penalty on complexity, it turns out to work quite well, because of the penalty imposed on complexity by the marginal likelihood. But if we want to build in additional penalties on complexity, we can make the prior something that exponentially penalizes the number of edges, or penalizes the number of parameters, and that induces additional sparsity. Now, importantly, we don't actually want to have to define a prior probability over graph structures and make sure that it correctly normalizes over all possible structures, and even more so over all possible acyclic structures. But fortunately, we don't have to, because the normalizing constant in this distribution P of G is constant across different networks, and can therefore be ignored completely. We only need to consider the term that changes with the graph structure, and ignore the normalizing constant, or partition function. That's the structure prior.

What about the parameter prior? The parameter prior most commonly used in the context of the Bayesian score is something called the BDe prior. And the BDe prior is something we've actually seen before, when we were talking about parameter estimation for Bayesian networks. As a reminder, the BDe prior is defined by two components: one is an equivalent sample size, alpha, which is the total number of instances that we might have seen in some imaginary universe; and the second is a probability distribution P0, typically encoded by a prior Bayesian network B0, which encodes our prior beliefs about the universe.

And so we now define the imaginary counts that we might have seen, for a particular combination of a value of little x_i and a particular assignment to its parents, to be alpha, the total number of counts we've seen, times the probability of that particular event — x_i together with its parents in G — in the prior network B0. Now, an important note is that the parent set of the variable X_i in the graph G is not the same as the parent set of X_i in the prior network B0. In fact, in many cases we choose the network B0 to be one that has no edges, where all the variables are independent; but we compute the probability distribution in B0, and then use it to compute the hyperparameters alpha that are used in the learning problem for a given graph G. And the important thing now is that the single network B0 provides us with priors for all of the candidate networks, so we don't need to elicit priors for exponentially many networks: we have a single network B0 and a single equivalent sample size, and we can use those to compute the parameter prior, P of theta given G, for all possible networks that we're interested in evaluating. Now, other than its convenience, why this prior as opposed to others? It turns out that one can show that this is the unique prior over multinomial Bayesian networks with the important property that if two networks are I-equivalent, then they have the same Bayesian score.
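As a sketch of the BDe construction above (my own helper and names, not from the lecture), for the common special case where B0 is the empty network, so P0 is just a product of independent marginals:

```python
from itertools import product

def bde_alphas(alpha_ess, p0_marginals, var, parents):
    """alpha[x_i, u] = alpha_ess * P0(x_i, u), under an empty prior network B0,
    so P0(x_i, u) is a product of single-variable marginals."""
    hyper = {}
    for xi_val, p_xi in p0_marginals[var].items():
        for u in product(*(sorted(p0_marginals[p]) for p in parents)):
            p_u = 1.0
            for parent, val in zip(parents, u):
                p_u *= p0_marginals[parent][val]
            hyper[(xi_val, u)] = alpha_ess * p_xi * p_u
    return hyper

# Equivalent sample size 4, uniform marginals for binary X and Y,
# candidate family X with parent Y:
marginals = {"X": {0: 0.5, 1: 0.5}, "Y": {0: 0.5, 1: 0.5}}
hyper = bde_alphas(4.0, marginals, "X", ["Y"])
print(hyper)  # each of the 4 (x, y) cells gets alpha = 1.0; they sum to 4.0
```

Dividing the equivalent sample size across cells in proportion to P0 is precisely what yields the I-equivalence property just mentioned.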

That is, if we used a different set of Dirichlet priors, other than one that fits this mold, we could have a situation where two networks G and G-prime that are I-equivalent have different Bayesian scores. And there's really no justification for incorporating that into the parameter priors, because these networks are completely equivalent in their ability to represent the same set of probability distributions, so why would one of them have a different Bayesian score than the other? Or rather, if we do have some prior knowledge that one of these graphs is more suitable than the other, we should put that into our structure prior, not into our parameter prior. One interesting property of the BDe score has to do with its asymptotic behavior.

So, let's consider what happens to the BDe score, or in fact the Bayesian score in general, as M goes to infinity. And it turns out that as M goes to infinity, a network G with Dirichlet priors satisfies the following, in terms of the marginal likelihood: the log of the marginal likelihood is equal to the sum of three terms. The first of these is just the log-likelihood of the data given the maximum likelihood parameters.

Â 16:11

We've already seen that this term has some good properties in terms of trying to fit the data but, on the other hand, is also liable to overfit. The second term, however, is log of M over 2, where M is the number of instances, times Dim of G, which is the number of independent parameters — which, as we've seen in the context of a multinomial distribution, is the number of entries in the distribution minus one. And this term has a negative sign, which means that the log of the marginal likelihood is going to decrease as we increase the number of parameters, and so there is a tension between those two terms.
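Collecting the three terms (with ℓ denoting the log-likelihood and Dim[G] the number of independent parameters):

```latex
\log P(D \mid G) \;=\; \ell(\hat{\theta}_G : D) \;-\; \frac{\log M}{2}\,\mathrm{Dim}[G] \;+\; O(1)
```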

The first of these tries to produce more complicated networks, in order to maximize the fit to the data, as we've already seen; and the second tries to reduce the complexity of the model by decreasing the number of parameters.

The third term is this O of 1 term, which indicates, in formal notation, that it is effectively constant relative to M: it doesn't grow with the number of instances. That means that, as the number of instances grows, only the first two terms are going to play a role in evaluating which structure is selected. Now these first two terms, as it happens, have a name: the BIC score, which really is just the likelihood component plus the penalty term. In a different part of this course we talk about the BIC score and some of the properties that it has — specifically, for example, the fact that as M grows to infinity, the score is consistent, which means that the correct graph, or one of its I-equivalent graphs, is going to have the highest score among all possible graphs that we're considering.
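A minimal sketch of the BIC computation just described (my own helper names; `dim_g` counts independent parameters for table CPDs):

```python
import math

def dim_g(cardinalities, parents):
    """Number of independent parameters in a table-CPD network:
    for each variable, (|Val(X_i)| - 1) times the number of parent assignments."""
    dim = 0
    for var, card in cardinalities.items():
        n_parent_assignments = 1
        for p in parents.get(var, []):
            n_parent_assignments *= cardinalities[p]
        dim += (card - 1) * n_parent_assignments
    return dim

def bic_score(loglik_hat, num_instances, dimension):
    """BIC = log-likelihood at the MLE minus (log M / 2) * Dim[G];
    the O(1) remainder of the Bayesian score is dropped."""
    return loglik_hat - (math.log(num_instances) / 2.0) * dimension

# A binary chain X -> Y: 1 independent parameter for X, 2 for Y given X.
cards = {"X": 2, "Y": 2}
pars = {"Y": ["X"]}
print(dim_g(cards, pars))  # 3
print(bic_score(-120.0, 100, dim_g(cards, pars)))
```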

So this is yet another demonstration that the Bayesian score is a way of avoiding overfitting: at the large-sample limit, we will learn either the correct graph or one of its I-equivalent structures. To summarize, the Bayesian score uses Bayesian principles — specifically, averaging over parameters that we have uncertainty over — to avoid overfitting.

The Bayesian score is most often instantiated as a particular variant called BDe, which requires that we assess a prior network in order to compute the prior over parameters, but which also gives us a natural place to incorporate prior knowledge into the learning algorithm. And the BDe prior has the important property that I-equivalent networks have the same score. The Bayesian score is, at the large-sample limit, equivalent to a different score called BIC, which we analyze separately in this course. And as a consequence of this equivalence, one can show that it's asymptotically consistent, in that it learns the correct network as M goes to infinity.
