So, we talked about the fact that we can break Bayesian structure learning into two pieces. One is defining a scoring function that allows us to evaluate different networks. The second is a search procedure, an optimization procedure, that allows us to select among the networks the one that has the highest score. We're going to first talk about scoring functions, and initially about what is perhaps the simplest score we can consider, the likelihood score. We've already talked about likelihood as a way of evaluating the quality of a given network. Here, the likelihood score of a graph is the log-likelihood of the data for the graph accompanied by the parameter setting theta hat, where theta hat is the maximum likelihood estimate of the parameters given the graph G and the data D. So first, for any graph, we figure out the best possible parameters to use for that graph, and then we use the resulting log-likelihood as a way of evaluating the quality of the graph.

Let's look at a simple example to see what the likelihood score looks like. Here we have a distribution over two random variables X and Y, and let's consider two graphs: G0 on the left, which has no edge between X and Y, and G1 on the right, which has an edge from X to Y. The Y-to-X case is symmetrical, so we're just going to look at these two graphs. The likelihood score of G0 relative to D, by the decomposition of the likelihood function that we've already seen, is the sum over all instances m of the log of the parameter for the value of X in the m-th data instance, plus the log of the parameter for the value of Y in the m-th data instance. This is just the decomposition of the log-likelihood that we used when we talked about parameter estimation. For G1 we have a very similar formula where, again, we sum over all instances: for X, which has no parents, we have log of theta hat of x[m], where x[m] is the value of X in the m-th data instance, and for the Y value we have log of theta hat of y[m] given x[m]. In both cases, theta hat denotes the maximum likelihood parameters in the respective graph: on the left for the graph G0, and on the right for the graph G1.

Now, let's refine this analysis by looking at the difference between these two likelihood scores. We take the score for G1, subtract away the score for G0, and ask whether that difference is positive or negative, that is, whether we prefer G1 to G0 or not. Notice that the first term, the X term, is the same in both graphs, and so it cancels out. What we have left over is the sum over m of log of theta hat of y[m] given x[m], minus log of theta hat of y[m]. We can reformulate this in terms of the sufficient statistics: the term theta hat of y given x recurs every time we have a particular value of Y together with a particular value of X.
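In symbols, writing x[m] and y[m] for the values of X and Y in the m-th data instance, the score and the derivation so far are:

$$\text{score}_L(G : D) = \ell\big((\hat\theta, G) : D\big), \quad \text{where } \hat\theta \text{ is the MLE of the parameters for } G \text{ given } D$$

$$\text{score}_L(G_0 : D) = \sum_m \Big( \log \hat\theta_{x[m]} + \log \hat\theta_{y[m]} \Big), \qquad \text{score}_L(G_1 : D) = \sum_m \Big( \log \hat\theta_{x[m]} + \log \hat\theta_{y[m] \mid x[m]} \Big)$$

$$\text{score}_L(G_1 : D) - \text{score}_L(G_0 : D) = \sum_m \Big( \log \hat\theta_{y[m] \mid x[m]} - \log \hat\theta_{y[m]} \Big)$$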
So, we're going to sum up over all values little x and little y, and the term log of theta hat of y given x is going to occur M(x,y) times, where M(x,y) is the sufficient statistic, the count, for that particular configuration of values in the data set D. The second term, the one that we're subtracting, is a sum over y of M(y) times log of theta hat of y, where again M(y) is a sufficient statistic. Now, we're going to rewrite this in terms of a distribution called P hat, the empirical distribution: it's what we get when we look at our data set D and just record the frequencies of different events. M(x,y) is simply the number of data instances, M, times P hat of (x,y), because P hat of (x,y) is the frequency of the event and M is the total number of data instances, so M(x,y) is just the product of those two. So we can write the first term as M times the sum over (x,y) of P hat of (x,y) times log theta hat of y given x. And interestingly, we can also rewrite the maximum likelihood parameters in terms of P hat, because theta hat of y given x is simply the fraction of the y cases among all the x cases, which is exactly the same as P hat of y given x. Similarly, the second term turns into M times the sum over y of P hat of y times log of P hat of y. So we've now converted all of the expressions, the counts and the thetas, into a single vocabulary, the empirical distribution P hat.

Now let's take the M out of the equation, and furthermore rewrite the sum over y as a sum over pairs (x,y). We've artificially introduced the variable x into the second summation: the sum over y of P hat of y times log P hat of y becomes the sum over (x,y) of P hat of (x,y) times log P hat of y. We can do that because x doesn't appear in the expression log P hat of y, and because the sum over x of P hat of (x,y) is equal to P hat of y. Looking at these two summations together, we can now move things around using properties of the logarithm and derive that the difference is equivalent to M times the sum over (x,y) of P hat of (x,y) times log of P hat of (x,y) divided by P hat of x times P hat of y. The reason is that P hat of y given x is equal to P hat of (x,y) divided by P hat of x, and moving the logarithms around gives exactly this expression.

Now, importantly, the summation inside this expression has a name: it is called the mutual information, and it is usually denoted I sub P hat, in this case relative to the distribution P hat, between X and Y. Why is this called mutual information? Because it measures, on average, the distance between the joint distribution of X and Y, the P hat of (x,y) in the numerator, and the distribution we would get if X and Y were independent, the product of the marginals P hat of x times P hat of y. You can think of the term inside the log as a relative error, if you will, a distance between the joint distribution and the product of the marginals; and we're taking the average of the log of this expression, weighted by how frequent the different cases (x,y) are. So you can think of the mutual information as an average distance between the joint distribution and the one we would get if the distribution were a product of marginals. This is an information-theoretic quantity.
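Collecting the steps of this derivation in display form, with M(x,y) and M(y) the counts and P hat the empirical distribution:

$$\sum_m \Big( \log \hat\theta_{y[m] \mid x[m]} - \log \hat\theta_{y[m]} \Big) = \sum_{x,y} M(x,y) \log \hat\theta_{y \mid x} \;-\; \sum_y M(y) \log \hat\theta_y$$

$$= M \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y \mid x) \;-\; M \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y) \;=\; M \sum_{x,y} \hat{P}(x,y) \log \frac{\hat{P}(x,y)}{\hat{P}(x)\,\hat{P}(y)} \;=\; M \cdot I_{\hat{P}}(X;Y)$$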
And so, if we now generalize this analysis to an arbitrary network, it turns out that the likelihood score of any graph can be rewritten as the number of data instances M, times the sum over all variables i of the mutual information between the variable Xi and its parents in the graph, minus a term that is constant relative to the graph structure. That second term is M times the sum of the entropies of the individual variables, and it doesn't depend on G. Now, why is this a significant result? It's significant because it tells us that, under the likelihood score, the score of a network is higher when each variable Xi is highly correlated with its parents, which is a very intuitive property: the more a variable is correlated with its parents, the better the network structure. It means we would want to pick as the parents of a variable the variables that are most highly correlated with it, the ones that have the highest mutual information with it, which is exactly the behavior you would intuitively hope for. And that's a very satisfying result.

However, as we now show, this mutual information characterization also has some negative consequences. To understand that, let's go back to our simple example and look at the difference between the scores of the two graphs: G1, which has the edge, and G0, which doesn't. If that difference is positive, the likelihood score picks the graph G1, and if it's negative, it picks the graph G0. And that difference, as we derived, is M, the number of samples, times the mutual information between the variables X and Y in the empirical distribution P hat. Now, a well-known result from information theory is that the mutual information is always non-negative, for any distribution P. Furthermore, the mutual information is equal to zero, that is, the inequality turns into an equality, if and only if X and Y are actually independent in the distribution relative to which we're computing the mutual information, which in this case is the distribution P hat. But even if X and Y were actually independent in the original distribution, the one that generated the training instances, it is still very unlikely that we will see exact, perfect independence in the empirical distribution, just because of statistical fluctuations in the set of samples. And so, even if X and Y are independent, it is almost never the case that they are independent in P hat, which means that in almost all cases the mutual information between X and Y is going to be greater than zero. That tells us that adding this edge can never hurt the likelihood score and almost always helps. And that's true not just in this simple example, but in general: having more edges rather than fewer edges never decreases the likelihood score, and almost always increases it. Which means that, except in very unusual circumstances, the likelihood score is maximized by the fully connected network. So, that gives rise to a very significant overfitting effect.
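In display form, the general decomposition just described is

$$\text{score}_L(G : D) \;=\; M \sum_{i=1}^{n} I_{\hat{P}}\big(X_i ;\, \mathrm{Pa}_{X_i}^{G}\big) \;-\; M \sum_{i=1}^{n} H_{\hat{P}}(X_i)$$

And here is a small Python sketch of the overfitting phenomenon (the helper name and the binary-variable setup are mine, for illustration): X and Y are sampled independently, so their true mutual information is zero, yet the mutual information in the empirical distribution P hat, and hence the score gain from adding the edge X -> Y, comes out strictly positive in essentially every run:

```python
import math
import random

def empirical_mutual_information(pairs):
    """I_Phat(X;Y): mutual information of the empirical (frequency)
    distribution of a list of (x, y) samples, in nats."""
    m = len(pairs)
    joint, count_x, count_y = {}, {}, {}
    for x, y in pairs:  # tally the sufficient statistics M(x,y), M(x), M(y)
        joint[(x, y)] = joint.get((x, y), 0) + 1
        count_x[x] = count_x.get(x, 0) + 1
        count_y[y] = count_y.get(y, 0) + 1
    mi = 0.0
    for (x, y), c in joint.items():
        # Phat(x,y) * log( Phat(x,y) / (Phat(x) * Phat(y)) )
        mi += (c / m) * math.log(c * m / (count_x[x] * count_y[y]))
    return mi

random.seed(0)
M = 1000
# X and Y sampled independently: true mutual information is exactly 0.
samples = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(M)]
mi = empirical_mutual_information(samples)
print(f"I_Phat(X;Y)           = {mi:.6f}")      # tiny, but > 0
print(f"score(G1) - score(G0) = {M * mi:.4f}")  # likelihood favors the edge
```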
Because, as we've already seen, the more edges we have, the more difficult it is to estimate the parameters, because of the fragmentation of our data set into tiny little buckets, each of which contains a very small number of instances. So, how do we avoid overfitting? It turns out that there are several different strategies that are typically employed. The first is to restrict the hypothesis space so as to prevent the algorithm from overfitting. We can do that by restricting the number of parents that we allow per node, or by imposing some kind of restriction on the number of parameters; the bound on the number of parents is usually easier to enforce, and so it's the more common strategy. A somewhat more flexible strategy is to use a scoring function that makes a better set of trade-offs, that is, one where there is a penalty on the complexity of the model at the same time that we're trying to get a good fit to the training data. That's a more flexible strategy than a hard constraint on model complexity, because if there is a very strong signal for some correlation between a pair of variables, it can eventually outweigh the complexity penalty and allow that edge to be added in. Whereas if we have a hard constraint, we might never be able to learn the appropriate model, simply because it doesn't fit into the restricted hypothesis space. These complexity-penalizing scores generally fall into two categories. There are scores that explicitly penalize complexity (a standard example is sketched below), and there is the class of scores called the Bayesian scores, which, as we'll see, average over all possible parameter values, following the Bayesian paradigm that says that anything we have uncertainty over, we should have a distribution over. As we'll see, that naturally gives rise to a score that avoids overfitting.

To summarize: the likelihood score evaluates a model G by looking at the log-likelihood of the data relative to G, using the MLE parameters for G. That set of parameters was optimized to maximize the likelihood of D, and the score is therefore very well geared toward capturing the exact characteristics of the data, for better and for worse. For better, it gives us a very nice information-theoretic interpretation of the set of edges that are chosen, of the set of parents chosen for each variable, in terms of the dependencies encoded in the graph G. Conversely, that same characterization also shows us that we're guaranteed to overfit the model to the training data unless we somehow impose constraints, or otherwise penalize model complexity.
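As a preview of the first category, one standard explicitly penalized score, not derived in this lecture, is the BIC score, which subtracts from the log-likelihood a penalty proportional to Dim[G], the number of independent parameters in the graph:

$$\text{score}_{BIC}(G : D) \;=\; \ell\big((\hat\theta, G) : D\big) \;-\; \frac{\log M}{2}\,\mathrm{Dim}[G]$$

Because the log-likelihood grows linearly in M while the penalty grows only logarithmically, a sufficiently strong correlation in the data will eventually outweigh the penalty and justify adding an edge, which is exactly the flexible trade-off described above.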