Let's look at the asymptotic behavior of this penalized score.

We've already seen that with the likelihood score, it really doesn't matter how much training data we have: we will almost always pick the most densely connected network that our assumptions allow. This is no longer the case once we have the penalized score.

To understand that, let's break down the first of these two terms, the likelihood score, and remind ourselves that, at least in the context of multinomial networks, the likelihood score can be rewritten in the following way. The decomposition has two terms. The first is the number of data instances M, times the sum over the variables X_i of the mutual information between X_i and its parents in the network, where that mutual information is taken relative to the empirical distribution P̂. The second term in the likelihood score is also M times a sum over the variables, this time of the entropy of each variable, again under the empirical distribution. As we've already discussed, this second term is independent of G, and so it doesn't affect which structure is selected, because it's the same for all structures.
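In symbols, the decomposition just described is (writing P̂ for the empirical distribution and Pa notation for the parents of X_i in G):

\[
\text{score}_L(G : \mathcal{D}) \;=\; M \sum_i I_{\hat P}\bigl(X_i ;\, \mathrm{Pa}^{G}_{X_i}\bigr) \;-\; M \sum_i H_{\hat P}(X_i).
\]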

And so we have two terms that are playing off against each other: the first, red, term, which is M times the sum of the mutual information terms, and the second, blue, term, which is (log M)/2 times the number of independent parameters in G. If we consider these two terms, we see that the mutual information term grows linearly with M, whereas the complexity term grows only logarithmically with M.
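Putting the pieces together, the penalized score discussed here has the form

\[
\text{score}_{\text{BIC}}(G : \mathcal{D}) \;=\; M \sum_i I_{\hat P}\bigl(X_i ;\, \mathrm{Pa}^{G}_{X_i}\bigr) \;-\; M \sum_i H_{\hat P}(X_i) \;-\; \frac{\log M}{2}\,\mathrm{Dim}[G],
\]

where Dim[G] is the number of independent parameters in G; only the first and third terms depend on the structure.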

So as we get more and more data instances, we put more emphasis on the fit to data and less emphasis on the model complexity. Intuitively, then, as we have more data instances, we become more amenable to learning more complicated structures.
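To make that concrete: consider a single candidate edge whose addition raises the mutual-information term by some amount ΔI > 0 and adds Δk independent parameters (this ΔI/Δk notation is just for illustration). Adding the edge improves the score exactly when

\[
M \,\Delta I \;>\; \frac{\log M}{2}\,\Delta k,
\]

and since the left side grows linearly in M while the right side grows only logarithmically, any fixed ΔI > 0 eventually wins; a smaller ΔI simply requires more data.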

That property gives rise to a very important result that we can state regarding the BIC score, a result called consistency. Consistency tells us what behavior we would get, what network we would learn, as the number of samples grows to infinity. Here we're going to assume that the data is actually generated from a particular true structure G*. There has to be a G*, because otherwise the result can't really be stated. What the consistency property says is that as M grows to infinity, the true structure G* is going to be the one that maximizes the score.

Now, that by itself is not quite right, because the true structure G* might have several other structures that are I-equivalent to it, and we've already seen that the likelihood score, and, as it turns out, the penalty term as well, are the same for I-equivalent networks. So it's not just the true structure alone that will maximize the score; it's the true structure or any other structure that's I-equivalent to it. But as far as we're concerned that's fine, because we've effectively learned the correct representation of the probability distribution.
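Stated a bit more carefully (the notation here is for illustration): if Ĝ_M denotes a structure maximizing the BIC score on M samples drawn from the distribution defined by G*, consistency says

\[
\Pr\bigl[\hat G_M \text{ is } I\text{-equivalent to } G^{\star}\bigr] \;\longrightarrow\; 1 \quad \text{as } M \to \infty.
\]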

To understand why this result is true, we're going to give just a very high-level intuitive argument; we're not going to prove the theorem, which is a little bit tricky to prove completely formally. Let's first consider why this score is not going to overfit, that is, why we're not going to have spurious edges learned when maximizing the score as the number of instances grows to infinity.

Going back to the formula above, we see that as M grows to infinity, P̂ approaches P*, where P* is the true underlying distribution. So the first term effectively becomes the mutual information, relative to P*, between each X_i and its parents. And in P*, you don't get any benefit from adding additional, spurious parents: the mutual information in P* does not grow when spurious parents are added, because in the true underlying distribution P* they contribute no additional dependence, no additional correlation.
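The key fact here is the chain rule for mutual information: for a spurious parent Z that is conditionally independent of X_i given its true parents,

\[
I_{P^{\star}}\bigl(X_i ;\, \mathrm{Pa}_{X_i} \cup \{Z\}\bigr) \;=\; I_{P^{\star}}\bigl(X_i ;\, \mathrm{Pa}_{X_i}\bigr) \;+\; \underbrace{I_{P^{\star}}\bigl(X_i ;\, Z \mid \mathrm{Pa}_{X_i}\bigr)}_{=\,0}.
\]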

And so, at that point, the more complicated structure is about the same as G* in terms of this first term, but the spurious edges are going to cost us in terms of the number of parameters in the blue term. These spurious edges will not contribute to the data-likelihood term, yet will be penalized more, and so we will choose the sparser network that corresponds to G*.

Conversely, why do we not underfit? That is, why do we eventually learn all the correct edges? Because the data-likelihood term tells us that for required edges, that is, edges that are in G* (or in an I-equivalent network), if we don't have them there, the mutual information term will be lower than it could be, and so we will have a higher score by including them in the model. And because the mutual information term grows linearly with M, versus the penalty term, which grows logarithmically, eventually the first term will dominate and it will be beneficial for the learning algorithm to add the required edge in G*.

So this is a high-level, very hand-wavy argument, but at least it gives an intuition as to why consistency holds. To summarize: the BIC score is a very simple score that trades off model complexity with fit to data, and it therefore has the important property of asymptotic consistency, which means that if the data were actually generated by a network G* that is a perfect map for the distribution, then either G* or a network I-equivalent to it will have the highest score as M grows to infinity.