For BIC or BDe, we know that there is a penalty term in the score.

So, even if introducing an edge from Xj to Xi increases the likelihood, we might

end up paying for it more in the penalty term than what we gained.

And so this weight can actually be negative,

at which point the optimal structure might actually be a forest.

That is, it might be detrimental to our overall score to make the graph fully

connected, that is, a tree rather than a forest.
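To make that tradeoff concrete, here is a small illustration in Python. The function name and the specific numbers are my own, not from the lecture: a BIC-style edge weight is roughly the likelihood gain, M times the mutual information, minus a penalty of (log M)/2 per added parameter, and with modest data or weak correlation the penalty can win.

```python
import math

# Hypothetical illustration (numbers and names are mine, not the lecture's):
# a BIC-style weight for adding edge Xj -> Xi is the likelihood gain
# M * I(Xi; Xj) minus a penalty of (log2(M) / 2) per added parameter.

def bic_edge_weight(M, mutual_info, delta_params):
    """Net score change from adding one edge under a BIC-style score."""
    gain = M * mutual_info                       # likelihood improvement
    penalty = 0.5 * math.log2(M) * delta_params  # complexity cost
    return gain - penalty

# Weak correlation, modest data: the penalty outweighs the gain,
# so the weight is negative and the edge should be left out.
print(bic_edge_weight(M=100, mutual_info=0.01, delta_params=2) < 0)  # True
# Strong correlation: the gain wins and the edge is worth adding.
print(bic_edge_weight(M=100, mutual_info=0.5, delta_params=2) > 0)   # True
```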

Okay. Now a second observation.

A score, remember, satisfies score equivalence, a property that we

defined before, if I-equivalent structures have the same score.

And all of the scores that we talked about are score equivalent.

It turns out that for score-equivalent scores, we can show that the weight from i

to j is equal to the weight from j to i. That is, we can compute this expression

over here for j to i or for i to j, and we're going to

get the exact same value. You can see that in the likelihood score directly,

because mutual information doesn't have any kind of directionality associated

with it: the mutual information of Xi and Xj is the same as the mutual information of Xj and

Xi. So for the likelihood score you can just see that, and it turns out that for

the other scores it also follows.

And so that means that we can define an undirected graph, where it doesn't really

matter whether I go from i to j or from j to i,

because the weight is the same for both.
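Here is a quick sketch of that symmetry using the empirical mutual information; the helper below is illustrative, not the lecture's code. Swapping the roles of the two variables leaves the value unchanged, so the weight is genuinely undirected.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) from paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

xs = [0, 0, 1, 1, 0, 1, 1, 0]
ys = [0, 1, 1, 1, 0, 1, 0, 0]
# Symmetry: I(X; Y) == I(Y; X), so the edge weight has no direction.
print(abs(mutual_information(xs, ys) - mutual_information(ys, xs)) < 1e-12)
```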

And so, for score-equivalent scores, we therefore get the following algorithm.

We can define an undirected graph whose nodes are the variables in my graphical

model. Then I define a weight, wij, to be

the max of the score between Xi and Xj and zero, and remember that it doesn't

matter whether I do ij or ji here; we're going to get the exact same thing for

score-equivalent scores. The zero is a way of

eliminating negative edges, so that I can optimize things efficiently without

having to worry about negative edge costs. Now I can go ahead and find a forest or a

tree that has the maximal weight. And the nice thing is, it turns

out that you can use any of the standard algorithms for maximum-weight

spanning trees in order to do this. So, for example, either Prim's or

Kruskal's algorithm can be used to solve this, and we can in fact find the solution

in time that is O(n squared), which is an inevitable cost given that I'm

considering all n squared pairs of edges.

And then, if I end up having edges of weight zero, that indicates that the

edge's potential contribution was derived from a negative component of the

score. So I remove all of the edges whose weight

is zero, to produce a forest. And that is an O(n squared) time

algorithm for finding the absolutely optimal tree or forest, relative to any of

these scores that we've defined.
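As a minimal sketch of that algorithm (the names and input format are assumed, not from the lecture): the weights are already truncated at zero, and Kruskal's algorithm with a union-find builds the maximum-weight forest, skipping the zero-weight edges along the way.

```python
# Minimal sketch of the tree-learning step: Kruskal's maximum-weight
# spanning forest over pairwise edge weights w_ij = max(score, 0).

def max_weight_forest(n, weights):
    """weights: dict mapping (i, j) with i < j to max(score(Xi, Xj), 0).
    Returns the edge list of a maximum-weight spanning forest."""
    parent = list(range(n))  # union-find over the n variables

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    forest = []
    # Kruskal's: consider edges from heaviest to lightest.
    for (i, j), w in sorted(weights.items(), key=lambda e: -e[1]):
        if w <= 0:  # zero weight means a negative score component: drop it
            continue
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge creates no cycle
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Example: 4 variables; pair (2, 3) scored non-positively, so node 3
# is left disconnected and the result is a forest, not a tree.
w = {(0, 1): 2.5, (0, 2): 1.0, (1, 2): 0.4, (2, 3): 0.0}
print(max_weight_forest(4, w))  # [(0, 1), (0, 2)]
```

A maximum-weight variant of Prim's algorithm would give an equally optimal forest; Kruskal's is shown only because dropping the zero-weight edges falls out naturally from the sorted pass.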

So let's look at an example. This is the ICU-Alarm network again: this

is a picture of the original network, and if we apply this

tree-learning algorithm, we end up with the following structure,

where the edges marked in red are edges that existed in the

original network, and the blue ones are edges that are spurious, that is, they

weren't in the original network. So this blue one over here, for example,

isn't in the original network. And so we see that although most of

the edges we end up with are edges in the original network, some

aren't; some are derived from correlations that went through indirect

paths in the original network. And the other thing that's important to

see, looking at this, is that the inferred edges in the tree are intrinsically

undirected. It doesn't mean that I can't force a direction on them, but the

direction isn't determined by my optimization algorithm, and that means

that I'm not able to determine what came before what in the original graph when

I'm learning a tree. So, to summarize: structure learning is

an optimization problem over the combinatorial space of graphs. The

decomposability property allows me to break the score up as a sum of terms for

different families, and that allows me, in the specific case of tree-

structured networks, to optimize very efficiently, in quadratic time,

using standard maximum-weight spanning tree algorithms.