
So, let's look at a simple example to see what the likelihood score looks like. Here, we have a distribution over two random variables X and Y, and let's consider two graphs: G0 on the left has no edge between X and Y, and G1 on the right has an edge from X to Y. The Y-to-X case would be symmetrical, so we're just going to look at these two graphs.

The likelihood score of G0 relative to D, by the decomposition of the likelihood function that we've already seen, is the sum over all instances m of the log of the parameter for the value of X in the m-th data instance, plus the log of the parameter for the Y value in the m-th data instance. So, this is just the decomposition of the log-likelihood that we used when we talked about parameter estimation.

Conversely, when we look at G1, we have a very similar formula where, again, we sum over all instances. For X, which has no parents, we have log of theta hat of x[m], where x[m] is the value of X in the m-th data instance. And for the Y value, we have log of theta hat of y[m] given x[m]. In both cases, the theta hats are the maximum likelihood parameters in the respective graphs: on the left, the theta hats for the graph G0, and on the right, the theta hats for the graph G1.
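As a concrete check, here is a small sketch (my own, not from the lecture; the toy data set and variable names are made up for illustration) that computes both scores from counts on a tiny binary data set:

```python
from collections import Counter
from math import log

# Toy data set D: one (x, y) pair per instance (made up for illustration).
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]
M = len(data)

# Sufficient statistics: counts of x, of y, and of (x, y) jointly.
c_x = Counter(x for x, _ in data)
c_y = Counter(y for _, y in data)
c_xy = Counter(data)

# Score of G0: sum_m log theta_hat(x[m]) + log theta_hat(y[m]),
# where the theta_hats are the MLE (frequency) parameters.
score_g0 = sum(log(c_x[x] / M) + log(c_y[y] / M) for x, y in data)

# Score of G1: sum_m log theta_hat(x[m]) + log theta_hat(y[m] | x[m]),
# with theta_hat(y | x) = M[x, y] / M[x].
score_g1 = sum(log(c_x[x] / M) + log(c_xy[(x, y)] / c_x[x]) for x, y in data)

print(score_g0, score_g1)
```

On this data set, G1 scores at least as high as G0, which foreshadows the overfitting discussion below.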

Now, let's refine this analysis a little bit by converting it to somewhat different notation. First, we're going to look at the difference between these two likelihood scores: we take the score for G1, subtract away the score for G0, and then see whether that difference is positive or negative, that is, whether we prefer G1 to G0 or not. Notice that the first term, the X term, is the same in both of these graphs, and so it cancels out. What we have left over is the sum over m of log of theta hat of y[m] given x[m], minus log of theta hat of y[m].

Now, we can reformulate this by looking at the sufficient statistics. Specifically, the term theta hat of y[m] given x[m] recurs every time we have a particular value y and a particular value x. So, we sum over all values little x and little y, and the parameter log of theta hat of y given x occurs M[x,y] times, where M[x,y] is the sufficient statistic, or count, for that particular configuration of events in the data set D. The second term, the one that we're subtracting, is a sum over y of M[y] times log of theta hat of y, where again, M[y] is a sufficient statistic.

Now, we're going to rewrite this in terms of the distribution over here called P hat. P hat is what's called the empirical distribution: it's what we get when we look at our data set D and just read off the frequencies of different events. So, M[x,y] is simply the number of data instances, M, times P hat of x,y, because P hat of x,y is the frequency of the event and M is the total number of data instances, and so M[x,y] is just the product of those two. So, we can write the first term as M times the sum over x,y of P hat of x,y times log of theta hat of y given x. And interestingly, we can also rewrite the maximum likelihood parameters in terms of P hat, because theta hat of y given x is simply the fraction of y cases among all the x cases, which is exactly the same as P hat of y given x. Similarly, the second term turns into M times the sum over y of P hat of y times log of P hat of y. So, that now converts all of the expressions, the counts and the thetas, into a single vocabulary, which is the empirical distribution P hat.

So now, let's take the M out of the equation, and furthermore, rewrite the sum over y as a sum over pairs x,y. We've artificially introduced the variable x into the second summation: the sum over y of P hat of y becomes the sum over x,y of P hat of x,y. We can do that because x doesn't appear in the expression log of P hat of y, and the sum over x of P hat of x,y is equal to P hat of y. So, looking at these two expressions together, we can now move things around by utilizing properties of the logarithm and derive that this is equivalent to the following equation: M times the sum over x,y of P hat of x,y times log of P hat of x,y divided by P hat of x times P hat of y. The reason why that's the case is that P hat of y given x is equal to P hat of x,y divided by P hat of x, and by moving around the logarithms, we get exactly this expression.

Now, importantly, the expression inside the summation here has a name: it is called the mutual information, and it is usually denoted I sub P hat, in this case for the distribution P hat, between X and Y. So why is this called mutual information? It's because it measures, on average, the distance between the joint distribution of X and Y, which is P hat of x,y in the numerator, and the distribution we would get if the distribution were a product of marginals, that is, P hat of x times P hat of y. So, you can think of this term inside the log as a relative error, if you will, a distance between the joint distribution and the product of the marginals. And we're taking the average of the log of this expression, weighted by how frequent the different cases x,y are. So, you can think of it as an average distance between the joint distribution and the one that we would get if this were a product of marginals. This is an information-theoretic quantity which, as I said, is called the mutual information.
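A quick numerical sketch of this identity (my own, not from the lecture; the toy data set is made up): the per-instance score difference and M times the empirical mutual information come out equal.

```python
from collections import Counter
from math import log

# Toy data set D of (x, y) pairs (made up for illustration).
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]
M = len(data)
c_x = Counter(x for x, _ in data)
c_y = Counter(y for _, y in data)
c_xy = Counter(data)

# Direct score difference:
# sum_m [log theta_hat(y[m] | x[m]) - log theta_hat(y[m])].
diff = sum(log(c_xy[(x, y)] / c_x[x]) - log(c_y[y] / M) for x, y in data)

# Mutual information under the empirical distribution P_hat:
# I(X;Y) = sum_{x,y} P_hat(x,y) log [P_hat(x,y) / (P_hat(x) P_hat(y))].
mi = sum((c_xy[(x, y)] / M)
         * log((c_xy[(x, y)] / M) / ((c_x[x] / M) * (c_y[y] / M)))
         for (x, y) in c_xy)

print(diff, M * mi)  # equal, up to floating-point error
```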

If we now generalize this analysis to an arbitrary network, it turns out that the likelihood score for any graph can be rewritten as the number of data instances, M, times the sum over all variables i of the mutual information between a node X_i and its parents in the graph, minus a term that is constant relative to the graph structure. That second term is M times the sum of the entropies of the individual variables, and it doesn't depend on G. Now, why is this a significant result?
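Written out as a formula (my rendering of the decomposition just described, with I the mutual information and H the entropy, both taken under the empirical distribution P hat):

```latex
\mathrm{score}_L(G : D) \;=\; M \sum_{i=1}^{n} I_{\hat{P}}\!\left(X_i \,;\, \mathrm{Pa}_{X_i}^{G}\right) \;-\; M \sum_{i=1}^{n} H_{\hat{P}}(X_i)
```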

It's significant because it tells us that the score of the network, if we use the likelihood score, is higher if X_i is correlated with its parents, which is a very intuitive property: the more a variable is correlated with its parents, the better the network structure. Which means we would want to put as a parent of a variable one that is highly correlated with it, which is exactly the behavior that you would intuitively hope for. So, this is a good result, because it tells us that the parents that we pick for a variable are the ones that have the highest correlation with it, the ones that have the highest mutual information. And that's a very satisfying result.

However, as we now show, this mutual information result also has some negative consequences. To understand that, let's go back to our simple example and look at the difference between the scores of these two graphs: the graph G1 that has the edge, and the graph G0 that doesn't. If that difference is positive, it suggests that we should pick the graph G1; if it's negative, it tells us that the maximum likelihood score will pick the graph G0. And that difference is M, the number of samples, times the mutual information between the variables X and Y in the empirical distribution P hat.

Now, a well-known result from information theory is that the mutual information, this quantity over here, is always non-negative, for any distribution P. Furthermore, this mutual information is equal to zero, that is, the inequality turns into an equality, if and only if X and Y are actually independent in the distribution relative to which we're computing the mutual information, which in this case is the distribution P hat. Now, even if X and Y were actually independent in the original distribution, the one that generated the training instances, it is still very unlikely that we will see exact and perfect independence in the empirical distribution, just because of statistical fluctuations in the set of samples that are generated. And so, even if X and Y are independent, it is almost never the case that they are independent in P hat. Which means that in almost all cases, this mutual information between X and Y is going to be greater than zero.
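A small simulation (my own sketch, not from the lecture) illustrates this: we sample X and Y independently, yet the empirical mutual information comes out as a small positive number rather than exactly zero.

```python
import random
from collections import Counter
from math import log

random.seed(0)
M = 1000
# X and Y sampled independently: the true mutual information is 0.
data = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(M)]

c_x = Counter(x for x, _ in data)
c_y = Counter(y for _, y in data)
c_xy = Counter(data)

# Mutual information under the empirical distribution P_hat.
mi = sum((c_xy[(x, y)] / M)
         * log((c_xy[(x, y)] / M) / ((c_x[x] / M) * (c_y[y] / M)))
         for (x, y) in c_xy)

print(mi)  # typically a small positive number, not exactly 0
```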

Â 12:46

Which tells us that adding this edge can never hurt, and almost always helps. And that's true not just in this simple example, but in other cases as well. That is, in terms of the likelihood score, it is almost always better to have more edges rather than fewer edges: adding an edge will never decrease the likelihood score. Which means that, except in very unusual circumstances, the likelihood score will be maximized by the fully-connected network. So, that gives rise to a very significant overfitting effect, because as we've already seen, the more edges we have, the more difficult it is to fit the parameters: we have fragmentation of our data set into tiny little buckets, each of which has a very small number of instances in it.

So, how do we avoid overfitting? It turns out that there are several different strategies that are typically employed. The first is to restrict the hypothesis space, to basically prevent the algorithm from overfitting. We can do that by restricting the number of parents that we allow per node, or by some kind of restriction on the number of parameters. Restricting the number of parents is usually easier to enforce, and so that's the more common strategy. A somewhat more flexible strategy is to use a scoring function that makes a better set of trade-offs, that is, one where there is a penalty on the complexity of the model at the same time that we're trying to get a good fit to the training data.

That's more flexible than a hard constraint on the model complexity, because if there is a very strong signal for some correlation between a pair of variables, it can eventually outweigh the complexity penalty and allow that edge to be added in. Whereas with a hard constraint, we might never be able to learn the appropriate model, simply because it's not going to fit into the restricted hypothesis space.

These complexity-penalizing scores generally fall into two categories. There are scores that explicitly penalize complexity, and then there is the class of scores called the Bayesian scores, which, as we'll see, average over all possible parameter values, following the Bayesian paradigm that says that anything we have uncertainty over, we should have a distribution over.
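One standard score in the explicit-penalty family (not covered in detail in this clip) is the BIC score. A minimal sketch, where log_lik and num_params are hypothetical inputs the caller supplies:

```python
# Sketch (not from the lecture) of a complexity-penalized score in the
# BIC style. log_lik and num_params are hypothetical inputs.
from math import log

def bic_score(log_lik, num_params, M):
    """Fit term minus a penalty that grows with the number of independent
    parameters but only logarithmically with the number of samples M."""
    return log_lik - (log(M) / 2) * num_params
```

Note how this realizes the flexibility described above: the gain from a genuinely correlated edge grows linearly in M (M times the mutual information), while the penalty grows only as log M, so a strong enough signal eventually overcomes the penalty.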

And as we'll see, that gives rise naturally to a score that avoids overfitting.

To summarize: the likelihood score evaluates a model G by looking at the log-likelihood of the data relative to G, using the MLE parameters for G. That set of parameters was optimized to maximize the likelihood of D, and is therefore very well geared to capturing the exact characteristics of the data, for better and for worse. For better, it gives us a very nice information-theoretic interpretation of the set of edges, and the set of parents, that are chosen for a given variable, in terms of the dependencies that we encode in the graph G. Conversely, that same characterization also shows us that we're guaranteed to overfit the model to the training data, unless we somehow impose constraints or otherwise penalize model complexity.
