We talked about Bayesian network structure learning as the problem of optimizing a scoring function over the space of network structures. We also talked about how the likelihood score, which is the simplest score one could come up with, has significant issues with overfitting. We're now going to look at a different score that's derived from Bayesian principles, and we're going to see that although this score has some surface similarities to the likelihood score, it is considerably better in terms of avoiding overfitting. The Bayesian score is derived from the Bayesian paradigm, which says that anything we have uncertainty over needs to have a probability distribution over it. So if we have uncertainty over graphs, we need a prior over graphs; and if we have uncertainty over parameters, we need a probability distribution over parameters. We now define our optimization problem as that of finding the graph G that maximizes P(G | D). If we rewrite P(G | D) using Bayes' rule, we end up with the expression P(D | G) * P(G) / P(D). Let's look at each of those terms separately. The first, P(D | G), is a term called the marginal likelihood: the probability of the data given the graph. The second, P(G), is a prior over graph structures. The final term, the denominator P(D), is called the marginal probability of the data. Importantly, this marginal probability of the data is independent of G, so it's not going to affect which graph we select, and therefore we can ignore it in the context of a model selection problem, that is, the problem of finding a single graph that maximizes the score.
And so we're going to define the Bayesian score, score_B(G : D), of a graph G relative to a data set D as the log of the numerator of this expression: the log of the marginal likelihood plus the log of the prior over graphs. Now, we might think that this score is going to avoid overfitting because of its use of the prior over graphs. Although that prior can play a role, it turns out to be a significantly less important role than the first of these terms, which is the marginal likelihood. So let's look at the marginal likelihood in a little more depth. The marginal likelihood P(D | G) is not, as it happens, the same as the likelihood, because the marginal likelihood integrates over all possible settings of the parameters. From a mathematical perspective, what we're going to do is introduce a variable theta_G into the probability expression and then integrate it out, which is the same as summing it out, only it's a continuous space, so we use an integral. So P(D | G) is the integral, over all possible model parameterizations theta_G, of P(D | G, theta_G) times P(theta_G | G). The first of these two terms is the likelihood, which is exactly the component that we used in the likelihood score. But importantly, unlike the likelihood score, we're not computing P(D | G, theta_G) just for the maximum likelihood parameters theta-hat_G. Rather, we're computing this probability averaged over all possible parameter settings, which gives us a considerably less optimistic assessment of the probability of the data given a particular structure, because we have to take into consideration not just a single parameter setting theta-hat_G that is geared exactly toward our data set, but rather an average over all possible parameter settings theta_G, weighted by the prior over parameters.
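To make the contrast concrete, here is a minimal sketch (not from the lecture; function names are my own) for a single Bernoulli variable with a uniform Beta(1,1) prior. The marginal likelihood, which averages over all parameter values, gives a more pessimistic number than the likelihood evaluated at the maximum likelihood parameters:

```python
from math import lgamma, exp, log

def log_marginal_likelihood(m1, m0, a1=1.0, a0=1.0):
    # Beta-Bernoulli: log P(D) = log B(a1 + m1, a0 + m0) - log B(a1, a0),
    # where B is the Beta function, written via log-gamma for stability.
    def log_beta(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)
    return log_beta(a1 + m1, a0 + m0) - log_beta(a1, a0)

def log_max_likelihood(m1, m0):
    # Log likelihood at the maximum likelihood parameter theta-hat = m1 / M.
    M = m1 + m0
    ll = 0.0
    if m1: ll += m1 * log(m1 / M)
    if m0: ll += m0 * log(m0 / M)
    return ll

# 7 heads, 3 tails:
print(log_marginal_likelihood(7, 3))  # averaged over theta: more pessimistic
print(log_max_likelihood(7, 3))       # plugged-in theta-hat: more optimistic
```

With a uniform prior the marginal likelihood here is exactly m1! m0! / (M+1)!, so for 7 heads and 3 tails it comes out to 1/1320, strictly smaller than the maximized likelihood.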
So, that less optimistic assessment is one intuition as to why this score might not overfit as much. But it turns out there is another, perhaps even more intuitive, explanation of why this score reduces overfitting. Let's look at this marginal likelihood term P(D | G) and rewrite it as the probability of all the instances x[1], ..., x[M] given G. What we're going to do is break up this joint distribution using the chain rule for probabilities (not the chain rule for Bayesian networks, just the chain rule for probabilities), and write it as P(x[1] | G) times P(x[2] | x[1], G) times P(x[3] | x[1], x[2], G), and so on, up to P(x[M] | x[1], ..., x[M-1], G). If we look at each of these terms, each is effectively making a prediction about an unseen instance, say x[m], given the instances x[1], ..., x[m-1] that came before it. So you can think of this as almost doing some kind of cross-validation, an estimate of generalization ability, because we're estimating the ability to predict an unseen instance given the previous ones. And so P(D | G) in some sense incorporates an analysis of generalization ability. Now you might say: surely the standard likelihood score does exactly the same thing. But a little more thought reveals that if we try to do this kind of decomposition for the maximum likelihood parameters, the maximum likelihood parameters theta-hat_G depend on all of the instances in D, so we can't break it down this way: if we had theta-hat_G on the right-hand side of each term, it would already incorporate all of the instances, including the "unseen" ones, which is yet another intuition as to why the maximum likelihood score tends to overfit.
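This chain-rule decomposition can be checked numerically. In this sketch (again for a single-variable Beta-Bernoulli model; the function names are illustrative, not from the lecture), the closed-form marginal likelihood agrees exactly with the product of one-step-ahead predictive probabilities, each computed from only the preceding instances:

```python
from math import lgamma, log

def log_marginal_closed_form(data, a1=1.0, a0=1.0):
    # Closed-form Beta-Bernoulli marginal likelihood via log-Beta functions.
    m1, m0 = sum(data), len(data) - sum(data)
    log_beta = lambda a, b: lgamma(a) + lgamma(b) - lgamma(a + b)
    return log_beta(a1 + m1, a0 + m0) - log_beta(a1, a0)

def log_marginal_sequential(data, a1=1.0, a0=1.0):
    # Chain rule: multiply the predictive probability of each instance
    # given only the instances that came before it.
    logp, c1, c0 = 0.0, 0.0, 0.0
    for x in data:
        p1 = (a1 + c1) / (a1 + a0 + c1 + c0)   # predictive P(x = 1 | past)
        logp += log(p1 if x == 1 else 1.0 - p1)
        if x == 1:
            c1 += 1
        else:
            c0 += 1
    return logp

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
# The two computations agree: the marginal likelihood is exactly the
# product of one-step-ahead predictions on "unseen" instances.
assert abs(log_marginal_closed_form(data) - log_marginal_sequential(data)) < 1e-12
```

Note that the sequential version never sees an instance before predicting it, which is precisely why the marginal likelihood behaves like an internal cross-validation.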
Now, the Bayesian score might look a little bit scary, because it has this integral in it, and who knows how we would ever compute it. It turns out that for the case of multinomial Bayesian networks with Dirichlet priors, the marginal likelihood can actually be written in closed form using a function called the gamma function. The gamma function, as written over here, is itself an integral, but it turns out to be really just a continuous extension of the factorial function, because it satisfies the equality Gamma(x + 1) = x * Gamma(x), so that Gamma(n) = (n - 1)! for a positive integer n. And most computing environments have an implementation of the gamma function, or its logarithm, in their math library. With the gamma function, we can rewrite P(D | G) as a product that looks as follows. First we have a product over all variables i. Then, inside this big outer product, we have two further products. The first is a product over all possible assignments u_i to the parent set of X_i, and here we have an expression involving gamma functions, where this expression involves both the Dirichlet prior hyperparameters, the alphas, and the sufficient statistics, the counts. Then we have an inner product, which enumerates over all possible values x_i^j of the variable X_i, and once again we have gamma terms that involve both the prior and the sufficient statistics. While this expression might still look a little bit scary, it's something that one can just plug into the computer and compute quite easily. Another valuable property of this marginal likelihood, if we look at it a little more and also take its logarithm, is that this expression, which was originally a product over variables i of some complicated expression relating to X_i,
is actually something that, if we take the log, turns into a sum over i of terms, each of which can be viewed as a family score for the variable X_i together with its parent set. So, just like other scores that we've seen before, this scoring function decomposes as a sum over i of terms involving only X_i and its parents, and we'll see that this property can be quite important from a computational perspective. So how do we continue building up the score? The second term in this expression is log P(G), and in order to accommodate that, we need a prior over graphs P(G). There's a variety of different priors that people use. It turns out that a quite common choice is simply to make P(G) constant, and although that doesn't explicitly impose any penalty on complexity, it works quite well because of the penalty on complexity already imposed by the marginal likelihood. But if we want to build in additional penalties on complexity, we can make the prior something that exponentially penalizes the number of edges, or penalizes the number of parameters, and that induces additional sparsity. Now, importantly, we don't actually want to have to define a prior probability over graph structures and make sure that it correctly normalizes over all possible structures, and even more so over all possible acyclic structures. Fortunately, we don't have to, because the normalizing constant in this distribution P(G) is constant across different networks and can therefore be ignored completely. We only need to consider the term that changes with the graph structure, and ignore the normalizing constant, or partition function. That's the structure prior. What about the parameter prior? The parameter prior most commonly used in the context of the Bayesian score is something called the BDe prior.
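As a sketch of how the closed-form expression is computed in practice, here is a small Python function (names are my own; it assumes the standard Dirichlet-multinomial family formula, with hyperparameters alpha and observed counts per parent assignment) that computes one family's contribution using the log-gamma function. The full log marginal likelihood is just the sum of such family scores, one per variable, which is exactly the decomposability property:

```python
from math import lgamma

def family_log_marginal(counts, alphas):
    """Log marginal likelihood contribution of one family (X_i, Pa_i).

    counts[u][j] : number of instances with Pa_i = u and X_i = j
    alphas[u][j] : Dirichlet hyperparameter for that same configuration
    """
    score = 0.0
    for u in counts:
        a_u = sum(alphas[u])   # total pseudo-count for this parent assignment
        m_u = sum(counts[u])   # total real count for this parent assignment
        score += lgamma(a_u) - lgamma(a_u + m_u)
        for a, m in zip(alphas[u], counts[u]):
            score += lgamma(a + m) - lgamma(a)
    return score

# Toy family: binary X with a single binary parent, uniform prior alpha = 1 each.
counts = {0: [3, 1], 1: [0, 4]}          # counts[parent_value][x_value]
alphas = {0: [1.0, 1.0], 1: [1.0, 1.0]}
print(family_log_marginal(counts, alphas))

# The full log marginal likelihood is the sum of family scores, one per
# variable, which is what makes the score decomposable.
```

Working in log space with lgamma avoids the overflow that the raw gamma function would hit even for modest counts.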
And the BDe prior is something that we've actually seen before, when we were talking about parameter estimation for Bayesian networks. As a reminder, the BDe prior is defined by two components: one is an equivalent sample size alpha, which is the total number of instances that we might have seen in some imaginary universe, and the second is a probability distribution P0, typically encoded by a prior Bayesian network B0, which encodes our prior beliefs about the universe. We now define the imaginary counts that we might have seen for a particular combination of values, a value x_i together with an assignment to its parents in G, to be alpha, the total number of imaginary counts, times the probability of that particular event, P(x_i, Pa_i^G), computed in the prior network B0. An important note is that the parents of the variable X_i in the graph G are generally not the same as the parents of X_i in the prior network B0. In fact, in many cases we choose the network B0 to be one that has no edges, where all the variables are independent; we compute the probability distribution in B0, and then use that to compute the hyperparameters alpha that are used in the learning problem for a given graph G. The important thing is that this single network B0 provides us with priors for all of the candidate networks, so we don't need to elicit priors for exponentially many networks: we have a single network B0 and a single equivalent sample size, and we can use those to compute the parameter prior P(theta | G) for every network that we're interested in evaluating. Now, other than its convenience, why this prior as opposed to others? It turns out that one can show that this is the unique prior over multinomial Bayesian networks with the important property that if two networks are I-equivalent, then they have the same Bayesian score.
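A minimal sketch of how the BDe hyperparameters fall out of B0, under the simplifying choice mentioned above that B0 is the empty graph, so its joint distribution factorizes into per-variable marginals. The function and variable names here are illustrative, not from the lecture:

```python
from itertools import product

def bde_hyperparameters(alpha, p0, var, parents):
    """Imaginary counts alpha_{x, u} = alpha * P0(x, u) for one family.

    alpha   : equivalent sample size
    p0      : dict mapping each variable name to its marginal distribution
              (B0 is assumed to be the empty graph, so the joint factorizes)
    var     : name of X_i
    parents : names of the parents of X_i in the candidate graph G
    """
    hyper = {}
    parent_vals = [range(len(p0[p])) for p in parents]
    for u in product(*parent_vals):
        p_u = 1.0
        for p, v in zip(parents, u):
            p_u *= p0[p][v]        # P0(Pa_i = u) under the empty-graph B0
        # one pseudo-count per value of X_i under this parent assignment
        hyper[u] = [alpha * p0[var][x] * p_u for x in range(len(p0[var]))]
    return hyper

# Uniform B0 over three binary variables, equivalent sample size 8.
p0 = {"A": [0.5, 0.5], "B": [0.5, 0.5], "C": [0.5, 0.5]}
hyper = bde_hyperparameters(8, p0, "C", ["A", "B"])
print(hyper)  # each of the 4 parent assignments gets [1.0, 1.0]
```

Note that the pseudo-counts always sum to alpha across the whole family, and that the same p0 and alpha would be reused to produce hyperparameters for any other candidate parent set, which is the point: one prior network covers all candidate graphs.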
That is, if we used a different set of Dirichlet priors, other than one that fits this mold, we could have a situation where two networks G and G' that are I-equivalent have different Bayesian scores, and there's really no justification for incorporating that into the parameter prior, because these networks are completely equivalent in their ability to represent the same set of probability distributions. So why would one of them have a different Bayesian score than the other? Or rather, if we do have prior knowledge that one of these graphs is more suitable than the other, we should put that into our structure prior, not into our parameter prior. One interesting property of the BDe score has to do with its asymptotic behavior. So let's consider what happens to the BDe score, or in fact the Bayesian score in general, as the number of instances M goes to infinity. It turns out that as M goes to infinity, a network G with Dirichlet priors satisfies the following, in terms of its marginal likelihood: the log of the marginal likelihood is equal to the sum of three terms. The first of these is just the log likelihood of the data given the maximum likelihood parameters theta-hat_G, so this is just the likelihood score, which we've already seen has some good properties in terms of fitting the data, but on the other hand is also liable to overfit. The second term is (log M) / 2, where M is the number of instances, times Dim[G], which is the number of independent parameters, which, as we've seen in the context of a multinomial distribution, is the number of entries in the distribution minus one. This term has a negative sign, which means the log marginal likelihood decreases as we increase the number of parameters, and so there is a tension between these two terms.
The first of these pushes toward more complicated networks in order to maximize the fit to the data, as we've already seen, and the second tries to reduce the complexity of the model by decreasing the number of parameters. The third term is O(1), which indicates, in formal notation, that it is effectively constant relative to M: it doesn't grow with the number of instances. That means that as the number of instances grows, only the first two terms play a role in determining which structure is selected. Now these first two terms, as it happens, have a name, and that name is the BIC score, which really is just the likelihood component plus the penalty term. In a different part of this course we talk about the BIC score and some of the properties it has, specifically, for example, the fact that as M grows to infinity the score is consistent, which means that the correct graph, or one of its I-equivalent graphs, is going to have the highest score among all the graphs we're considering. So this is yet another demonstration that the Bayesian score is a way of avoiding overfitting: at the large sample limit, we will learn a correct graph, either the correct graph or one of its I-equivalent structures. To summarize: the Bayesian score uses Bayesian principles, specifically averaging over the parameters that we have uncertainty about, to avoid overfitting. The Bayesian score is most often instantiated as a particular variant called BDe, which requires that we assess a prior network in order to compute the prior over parameters, but which also gives us a natural place to incorporate prior knowledge into the learning algorithm. And the BDe prior has the important property that I-equivalent networks have the same score.
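The asymptotic relationship can be illustrated numerically. In this sketch (again for a single Bernoulli variable with a uniform prior; the setup is my own, not from the lecture), the gap between the exact log marginal likelihood and the BIC approximation stays bounded as M grows, even though both quantities themselves diverge:

```python
from math import lgamma, log

def log_marginal(m1, m0):
    # Beta(1,1)-Bernoulli marginal likelihood: B(1 + m1, 1 + m0) / B(1, 1)
    return lgamma(1 + m1) + lgamma(1 + m0) - lgamma(2 + m1 + m0)

def bic(m1, m0):
    # Likelihood at theta-hat minus (log M / 2) * Dim[G]; Dim[G] = 1 here,
    # since a Bernoulli has two entries and hence one free parameter.
    M = m1 + m0
    ll = m1 * log(m1 / M) + m0 * log(m0 / M)
    return ll - 0.5 * log(M) * 1

# As M grows (70% heads), the gap between the exact log marginal
# likelihood and its BIC approximation stays O(1).
for M in [10, 100, 1000, 10000]:
    m1 = int(0.7 * M)
    print(M, log_marginal(m1, M - m1) - bic(m1, M - m1))
```

The printed gaps settle toward a small constant, which is exactly what the O(1) term in the asymptotic expansion asserts.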
The Bayesian score is, at the large sample limit, equivalent to a different score called BIC, which we analyze separately in this course, and as a consequence of this equivalence one can show that it's asymptotically consistent, in that it learns the correct network as M goes to infinity. Now, while this asymptotic behavior is an important one, it's also important to realize that we often don't have a very large number of samples, at least not in a large number of applications, and so it's important to consider the behavior for a more moderate number of samples M as well. In that case, it can be shown that the BIC score tends to underfit the model structure, in the sense that it will learn models that are too sparse.