We talked about Bayes net structure learning as the task of optimizing a score over a space of network structures, and one of the big design decisions we have to make in this context is which scoring function to use. Our first attempt was the likelihood score, but as we've already seen, the likelihood score is very susceptible to overfitting and will invariably learn the most complicated network that it can, given the constraints that we impose on our space. We now consider a different approach in which, rather than constraining the space (or perhaps in addition to constraining the space), we impose a penalty on the complexity of the structure that we learn, so as to force the learning algorithm to trade off model complexity with the fit to data. Specifically, the score we're going to look at right now is called the BIC score. So let's look at one way of penalizing complexity, which is the one that the BIC score uses. The BIC score has two terms in it. The first is just the likelihood of the graph and its maximum likelihood parameters relative to the data. This is a familiar term: it is the same as the likelihood score, score_L(G : D), and were we to use it in isolation, we would get the same overfitting behavior. But we add to this a penalty term, the second term on the right: we subtract (log M / 2) times Dim[G], so that score_BIC(G : D) = score_L(G : D) − (log M / 2) Dim[G]. Let's understand what these pieces are. M is the number of training instances, and Dim[G] is the number of independent parameters in the network. We talked about the concept of independent parameters in the context of multinomial networks; as a reminder, a multinomial distribution has one fewer independent parameter than the number of entries in the multinomial.
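As a concrete sketch of these definitions, here is a small Python example that counts the independent parameters of a multinomial network and applies the BIC penalty. The function names and the toy two-variable network are my own, for illustration only:

```python
import math

def dim_of_graph(cards, parents):
    """Dim[G] for a multinomial network.

    cards   : dict mapping variable -> cardinality |Val(X)|
    parents : dict mapping variable -> list of its parent variables

    Each CPD for X contributes (|Val(X)| - 1) independent parameters
    per joint assignment to its parents.
    """
    total = 0
    for x, pa in parents.items():
        n_parent_configs = 1
        for p in pa:
            n_parent_configs *= cards[p]
        total += (cards[x] - 1) * n_parent_configs
    return total

def bic_score(loglik, M, dim_g):
    """score_BIC(G : D) = score_L(G : D) - (log M / 2) * Dim[G]."""
    return loglik - (math.log(M) / 2.0) * dim_g

# toy network X -> Y over two binary variables:
# X has 1 independent parameter, Y has 1 per value of X, so Dim[G] = 3
dim_g = dim_of_graph({"X": 2, "Y": 2}, {"X": [], "Y": ["X"]})
```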
From that we can compute the number of independent parameters for any multinomial network. So Dim[G] basically counts the number of degrees of freedom that we have in the network, in terms of the independent parameters we estimate. These two terms balance each other out. We've already seen that the term on the left, the log-likelihood, pushes toward a better fit to the training data, whereas the term on the right, the penalty term, tries to keep the number of independent parameters, and therefore the network complexity, down. And so this score provides us with a trade-off between fit to data and model complexity. Now, this is one choice of trade-off, and in fact there are other scores that trade off those two terms differently, but this one is very commonly used in practice and has several distinct motivations, some of which we'll talk about and others not. One comment worth making, though, is that the negation of this score is often called the MDL criterion, where MDL stands for Minimum Description Length. This notion of minimum description length is an information-theoretic justification for the score, and the other justification is derived from a more Bayesian criterion, which is why BIC stands for Bayesian Information Criterion. Let's look at the asymptotic behavior of this penalized score. We've already seen that in the context of the likelihood score, it really doesn't matter how much training data we have: we will almost always pick the most densely connected network that our assumptions allow. This is no longer the case with the penalized score. To understand that, let's break down the first of the two terms, the likelihood score.
Remind ourselves that, at least in the context of multinomial networks, the likelihood score can be rewritten in the following way. It has two terms. The first is the number of data instances M, times the sum over the variables X_i of the mutual information between X_i and its parents in the network, where that mutual information is relative to the empirical distribution P-hat. The second term is again M times the sum over the variables of the entropy of each variable, again in the empirical distribution. As we've already discussed, this second term is independent of G, and so it doesn't affect which structure is selected, because it's the same for all structures. So we have two terms playing off against each other: the mutual-information term, M times the sum of the mutual informations, and the complexity term, (log M / 2) times the number of independent parameters in G. Now, if we consider these two terms, we see that the mutual-information term grows linearly with M, whereas the complexity term grows logarithmically with M. So as we get more and more data instances, we put more emphasis on the fit to data and less emphasis on the model complexity; intuitively, as we have more data instances, we'd be more amenable to learning more complicated structures. That property gives rise to a very important result regarding the BIC score, a result called consistency. Consistency tells us what behavior we would get, what network we would learn, as the number of samples grows to infinity. Here we're going to assume that the data is actually generated from a particular true structure G*. So there is a G*, because otherwise the result can't really be stated.
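To make the decomposition concrete, the following sketch checks, on a tiny made-up dataset for a two-variable network X → Y, that M times the mutual information minus M times the entropies equals the maximum-likelihood log-likelihood computed directly. The data and helper names are illustrative, and the mutual-information helper handles only a single parent:

```python
import math
from collections import Counter

def entropy(counts, M):
    """Entropy H_phat of one variable from its empirical counts."""
    return -sum(c / M * math.log(c / M) for c in counts.values())

def mutual_info(pair_counts, x_counts, pa_counts, M):
    """I_phat(X; Pa) from joint and marginal empirical counts (single parent)."""
    mi = 0.0
    for (x, pa), c in pair_counts.items():
        p_joint = c / M
        mi += p_joint * math.log(p_joint / ((x_counts[x] / M) * (pa_counts[pa] / M)))
    return mi

# tiny illustrative dataset over binary X, Y; network is X -> Y
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
M = len(data)
cx = Counter(x for x, y in data)    # empirical counts of X
cy = Counter(y for x, y in data)    # empirical counts of Y
cxy = Counter(data)                 # joint counts of (X, Y)

# decomposition: score_L = M * [I(Y; X) + I(X; {})] - M * [H(X) + H(Y)]
# X has no parents, so its mutual-information term is 0.
cyx = Counter({(y, x): c for (x, y), c in cxy.items()})
score_decomp = M * mutual_info(cyx, cy, cx, M) - M * (entropy(cx, M) + entropy(cy, M))

# direct MLE log-likelihood: sum over instances of log phat(x) + log phat(y | x)
score_direct = sum(math.log(cx[x] / M) + math.log(cxy[(x, y)] / cx[x]) for x, y in data)
```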
What the consistency property says is that as M grows to infinity, the true structure G* is going to be the one that maximizes the score. Now, that by itself is not quite right, because the true structure G* might have other structures that are I-equivalent to it, and we've already seen that the likelihood score, and in fact it also turns out the penalty term, are the same for I-equivalent networks. So it's not just the true structure alone that will maximize the score; it's the true structure or any other structure that's I-equivalent to it. But as far as we're concerned, that's fine, because we've effectively learned the correct representation of the probability distribution. To understand why this result is true, we're going to give just a very high-level intuitive argument; we're not going to prove the theorem, which is a little bit tricky to prove completely formally. So let's first consider why this is not going to overfit, that is, why we're not going to learn spurious edges when maximizing the score as the number of instances grows to infinity. Going back to the formula above, we see that as M grows to infinity, P-hat approaches P*, where P* is the true underlying distribution. And so what we effectively have in the first term is the mutual information, relative to P*, between X_i and its parents. And in P*, you don't get any benefit from adding spurious parents: the mutual information in P* does not grow when spurious parents are added, because in the true underlying distribution P* there is no additional correlation for them to capture.
And so, at that point, the more complicated structure is about the same as G* in terms of the first term, but the spurious edges are going to cost us in terms of the number of parameters in the penalty term. The spurious edges will not contribute to the data-likelihood term, but will be penalized more, and so we will choose the sparser network that corresponds to G*. Conversely, why do we not underfit? That is, why do we eventually learn all correct edges? That is because the data-likelihood term tells us that edges that are required, that is, edges that are in G* (or in an I-equivalent network), will, if we leave them out, make this mutual-information term lower than it could be. So we will have a higher score by including those edges in the model. And because the mutual-information term grows linearly with M, versus the penalty term which grows logarithmically, eventually the first term will dominate, and it will be beneficial for the learning algorithm to add the required edge in G*. So this is a high-level, very hand-wavy argument, but at least it gives an intuition as to why consistency holds. To summarize: the BIC score is a very simple score that trades off model complexity with fit to data, and therefore has the important property of asymptotic consistency, which means that if the data were actually generated by a network G* which is a perfect map for the distribution, then either G* or networks I-equivalent to it will have the highest score as M grows to infinity.
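The linear-versus-logarithmic growth behind this argument can be illustrated numerically. Assuming a hypothetical required edge that contributes mutual information I = 0.05 under P* and adds two parameters (both numbers made up for illustration), the fit gain M·I eventually overtakes the penalty (log M / 2)·ΔDim:

```python
import math

def edge_is_worth_adding(M, mi_gain, extra_params):
    """Does the likelihood gain M * I exceed the BIC penalty (log M / 2) * dDim?"""
    return M * mi_gain > (math.log(M) / 2.0) * extra_params

# hypothetical edge: I = 0.05 nats of mutual information, 2 extra parameters
for M in [10, 100, 1000, 10000]:
    gain = M * 0.05
    penalty = (math.log(M) / 2.0) * 2
    print(f"M={M:6d}  gain={gain:8.2f}  penalty={penalty:6.2f}  "
          f"add edge: {edge_is_worth_adding(M, 0.05, 2)}")
```

With few instances the penalty wins and the edge is left out, but as M grows the linear term dominates and the required edge is added, which is exactly the consistency intuition above.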