0:00

We talked about Bayes net structure learning as the task of optimizing a score over a space of network structures, and one of the big design choices that we have to make in this context is which scoring function to use.

Our first attempt at that was to use the likelihood score, but as we've already seen, the likelihood score is very susceptible to overfitting and will invariably learn the most complicated network that it can, given the constraints that we impose on our space.

We now consider a different approach in which, rather than constraining the space, or maybe in addition to constraining the space, we're also going to impose a penalty on the complexity of the structure that we learn, so as to force the learning algorithm to trade off model complexity with the fit to data. Specifically, the score we're going to look at right now is something called the BIC score.

So, let's look at one way of penalizing complexity, which is the one that the BIC score uses. The BIC score has two terms in it. The first is just the likelihood of the graph and its maximum-likelihood parameters relative to the data. This is a familiar term: it represents the same thing as the likelihood score. So this term is just score_L(G : D), and were we to use it in isolation, we would get the same overfitting behavior.

But we add to this a penalty term, the second term here on the right. That term subtracts log(M)/2 times the dimension of G.

So let's understand what these different pieces are. M is the number of training instances, and the dimension of G, Dim(G), is the number of independent parameters in the network.
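Putting the two terms together, the BIC score can be written as follows (with θ̂_G denoting the maximum-likelihood parameters of G and Dim(G) its number of independent parameters):

```latex
\mathrm{score}_{BIC}(G : D) \;=\; \log P\big(D \mid \hat{\theta}_G, G\big) \;-\; \frac{\log M}{2}\,\mathrm{Dim}(G)
```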

So we talked about the concept of independent parameters in the context of multinomial networks. As a reminder, a multinomial distribution has one fewer independent parameter than the number of entries in the multinomial, and from that we can compute the number of independent parameters for any multinomial network. So Dim(G) basically counts the number of degrees of freedom that we have in the network in terms of estimating independent parameters in it.
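As a small illustration of this counting, here is a minimal sketch, assuming every variable has a full table CPD; the function name `dim_g` and the `card`/`parents` dictionaries are hypothetical names chosen just for this example:

```python
from math import prod

def dim_g(card, parents):
    """Count independent parameters of a table-CPD Bayesian network.

    card    -- dict mapping each variable to its number of values
    parents -- dict mapping each variable to a list of its parents

    Each CPD row over a variable with k values has k - 1 free
    parameters, and there is one row per joint parent assignment.
    """
    total = 0
    for var, pa in parents.items():
        rows = prod(card[p] for p in pa) if pa else 1
        total += rows * (card[var] - 1)
    return total

# X -> Y with binary X and ternary Y:
# X contributes 1 free parameter, Y contributes 2 rows * 2 = 4.
print(dim_g({"X": 2, "Y": 3}, {"X": [], "Y": ["X"]}))  # 5
```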

And so these two terms balance each other out. We've already seen that the term on the left, the log-likelihood, tries to push toward a better fit to the training data, whereas the term on the right, the penalty term, is going to try to keep the number of independent parameters, and therefore the network complexity, down. And so this score provides us with some kind of trade-off between fit to data and model complexity.

Now, this is one choice in terms of this trade-off, and in fact there are other scores that use a different trade-off between those two terms, but this one is very commonly used in practice and has several distinct motivations, some of which we'll talk about and others not.

One comment that's worth making, though, is that the negation of this score, if we negate this entire term, is often called the MDL criterion, where MDL stands for Minimum Description Length. So, in fact, this notion of minimum description length is an information-theoretic justification for this score, and the other justification is derived from a more Bayesian criterion, which is why BIC stands for Bayesian Information Criterion.

Â 4:07

Let's look at the asymptotic behavior of this penalized score. We've already seen that in the context of the likelihood score, it really doesn't matter how much training data we have: we will almost always pick the most densely connected network that our assumptions allow. This is no longer the case when we have this penalized score.

To understand that, let's break down the first of these two terms, which is the likelihood score, and remind ourselves that, at least in the context of multinomial networks, the likelihood score can be rewritten in the following way. It has two terms. The first is the number of data instances M times the sum, over the variables X_i, of the mutual information between X_i and its parents in the network, where that mutual information is relative to the empirical distribution P-hat. The second term in the likelihood score is also M times a sum over the variables, this time of the entropy of each variable, again in the empirical distribution. And, as we've already discussed before, this entropy term is independent of G, and so it doesn't affect the choice of which structure is selected, because it's the same for all structures.
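Written out, the decomposition just described is (with P̂ the empirical distribution and Pa^G_{X_i} the parents of X_i in G):

```latex
\mathrm{score}_L(G : D) \;=\; M \sum_i I_{\hat{P}}\big(X_i ;\, \mathrm{Pa}^G_{X_i}\big) \;-\; M \sum_i H_{\hat{P}}(X_i)
```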

And so we have these two terms playing off against each other. We have the term over here, the red term, which is M times the sum of the mutual informations, and we have the second term, the blue term, which is log(M)/2 times the number of independent parameters in G.

Now, if we consider these two terms, we see that the mutual-information term grows linearly with M, whereas the complexity term grows logarithmically with M.
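To make this concrete, here is a tiny numeric sketch with made-up values: suppose adding one extra edge buys a small per-instance mutual-information gain but costs two extra parameters. The numbers (0.01 and 2) are purely illustrative:

```python
from math import log

# Hypothetical numbers: an extra edge buys a small mutual-information
# gain per instance but costs two extra independent parameters.
mi_gain = 0.01      # per-instance gain in the sum of mutual informations
extra_params = 2    # increase in Dim(G) from the extra edge

def score_delta(m):
    """Change in BIC score from adding the edge, given m instances."""
    return m * mi_gain - (log(m) / 2) * extra_params

# With few instances the logarithmic penalty wins;
# with many, the linear data term wins.
print(score_delta(100))    # negative: edge rejected
print(score_delta(10000))  # positive: edge accepted
```

Because the first term is linear in M and the second only logarithmic, for any fixed positive gain there is some sample size beyond which the edge becomes worth adding.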

And so as we get more and more data instances, we put more emphasis on the fit to data and less emphasis on the model complexity, so intuitively, as we have more data instances, we'd be more amenable to learning more complicated structures.

That property gives rise to a very important result that we can state regarding the BIC score.

And that is a result called consistency. Consistency tells us what behavior we would get, what network we would learn, as the number of samples grows to infinity. Here we're going to assume that the data is actually generated from a particular true structure G*; there has to be a G*, because otherwise the result can't really be stated. What the consistency property says is that as M grows to infinity, the true structure G* is going to be the one that maximizes the score.

Now, that by itself is not quite right, as we can see, because the true structure G* might have several other structures that are I-equivalent to it. We've already seen that the likelihood score is the same for I-equivalent networks, and it turns out that the penalty term is the same for I-equivalent networks as well. And so it's not just the true structure alone that will maximize the score; it's the true structure or any other structure that's I-equivalent to it. But as far as we're concerned, that's fine, because we've effectively learned the correct representation of the probability distribution.

So, to understand why this result is true,

we're going to give just a very high-level intuitive argument. We're not going to prove the theorem; it's a little bit tricky to prove completely formally.

So let's first consider why this score is not going to overfit, that is, why we're not going to have spurious edges learned in maximizing the score as the number of instances grows to infinity. Here we go back and look at this formula over here, and we see that as M grows to infinity, P-hat is going to approach P*, where P* is my true underlying distribution. And so what we're going to have in this first term over here is effectively the mutual information, relative to P*, between X_i and its parents. And in P*, you don't get any benefit from adding additional, spurious parents. That is, the mutual information in P* is not going to grow when we add spurious parents, because in the true underlying distribution P* there is no additional correlation for those parents to capture.

And so, at that point, the more complicated structure is about the same as G* in terms of this first term, but the spurious edges are going to cost us in terms of the number of parameters in the blue term. So these spurious edges will not contribute to the data-likelihood term, but will be penalized more, and so we will choose the sparser network that corresponds to G*.
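As a rough illustration with made-up counts, here is a sketch where X and Y are (almost) independent, so an edge between them would be spurious. On finite data the empirical mutual information is a small positive amount of sampling noise, and the BIC penalty outweighs the resulting likelihood gain:

```python
from math import log

def empirical_mi(counts, n):
    """Mutual information (in nats) of the empirical joint distribution,
    given a dict of pair counts and the total sample size n."""
    px, py = {}, {}
    for (x, y), c in counts.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c
    mi = 0.0
    for (x, y), c in counts.items():
        p = c / n
        mi += p * log(p / ((px[x] / n) * (py[y] / n)))
    return mi

# X and Y are (nearly) independent binary variables, so an edge
# X -> Y is spurious; the small positive MI is just sampling noise.
counts = {(0, 0): 1260, (0, 1): 1240, (1, 0): 1245, (1, 1): 1255}
m = 5000

gain = m * empirical_mi(counts, m)   # likelihood gain from adding the edge
penalty = (log(m) / 2) * 1           # the edge adds one independent parameter
print(gain, penalty)                 # the gain is tiny compared to the penalty
```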

Conversely, why do we not underfit? That is, why do we eventually learn all the correct edges? That is because the data-likelihood term tells us that for edges that are required, that is, edges that are in G* (or in the I-equivalent network), if we leave them out, this mutual-information term will be lower than it could be, and so we will have a higher score by including those edges in the model. And because this mutual-information term grows linearly with M, versus the penalty term, which grows logarithmically, eventually the first term will dominate, and it will be beneficial for the learning algorithm to add that required edge in G*.

So this is a high-level, very hand-wavy argument, but at least it gives an intuition as to why consistency holds.

To summarize: the BIC score is a very simple score that trades off model complexity with the fit to data, and therefore has the important property of asymptotic consistency, which means that if the data were actually generated by a network G* which is a perfect map for the distribution, then either G* or networks I-equivalent to it will have the highest score as M grows to infinity.
