
We previously described the problem of Bayes net structure learning as an optimization problem, where we separated the problem into two pieces: one is defining a scoring function, and the second is coming up with an algorithm that optimizes that scoring function. We talked about how the optimization problem can be solved in the case where we are trying to learn a network structure as a tree, or rather a forest, where each variable has at most one parent. But what about the more general case, where we are trying to learn an unrestricted network, or perhaps one that has weaker restrictions than simply a tree?

So how do we address that problem? First, as a reminder, our input is a training set, a scoring function, and some set of possible network structures. The desired output is a network structure that maximizes the score within the set of structures that we're willing to consider.
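A common choice of scoring function here is the BIC score, which trades off fit against complexity and decomposes into one term per family (a variable together with its parents). The following is a minimal sketch, assuming discrete data stored as a list of dicts; the function names and data format are illustrative, not from the lecture:

```python
import math
from collections import Counter

def family_bic_score(data, var, parents, card):
    """BIC contribution of one family (var given its parents).

    data    : list of dicts mapping variable name -> discrete value
    var     : the child variable
    parents : tuple of parent variable names
    card    : dict mapping variable name -> number of possible values
    """
    M = len(data)
    joint = Counter()   # counts of (parent assignment, child value)
    marg = Counter()    # counts of the parent assignment alone
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[var])] += 1
        marg[pa] += 1
    # Maximum-likelihood log-likelihood term for this family
    ll = sum(n * math.log(n / marg[pa]) for (pa, x), n in joint.items())
    # Penalty: number of independent parameters in this CPD
    n_params = (card[var] - 1) * math.prod(card[p] for p in parents)
    return ll - 0.5 * math.log(M) * n_params

def bic_score(data, graph, card):
    """Total BIC score of a DAG given as {var: tuple_of_parents}."""
    return sum(family_bic_score(data, v, ps, card) for v, ps in graph.items())
```

Because the score decomposes over families, changing one edge only changes the score of the affected family, which is what makes local search steps cheap to evaluate.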

So what happens if we try to solve this optimization problem in the case where the restricted network structure is not the class of trees? Well, in the context of trees we could apply a simple greedy algorithm for finding the maximum-weight spanning tree, but when we have a more general network, even one where we allow at most two parents per node, this greedy algorithm that builds up a tree no longer works. It's no longer guaranteed to find the optimal scoring network.

In fact, it turns out to be not surprising that this simple algorithm is not going to work, because of the following theorem. The theorem states that the problem of finding the best network, the one that has the highest score, with at most k parents is an NP-hard problem for any k greater than 1. So for k = 1 we're back in the context of trees or forests, where we have the polynomial-time algorithm, but even for k = 2 the problem turns out to be NP-hard. And so no efficient algorithm is likely to be found for this problem.

So what do we do? It turns out that the standard solution is some kind of heuristic hill-climbing search. So we have a current network, which is the one that we're trying to improve, and we'll talk later about where we start this process. And now we look at different changes that we could make to this network. So we can consider, for example, making local perturbations that may or may not improve the quality of the network relative to the score. What we see here, this bar, represents the score.

So, for example, this transition adds an edge from B to D, and we can see here that that turned out to improve the score, because this bar, as we see, is larger than this bar. We can consider other changes; for example, this transition removes the edge from B to C, and also gives rise to a slight improvement in the score, but not as large as the one where we added the edge from B to D. On the other hand, reversing the edge from B to C, which is this transition, gives rise to a decrease in the score. And so we can look around us and see which changes to the network improve the score and which ones don't, and then try to go in the direction that generally improves the score.

There are two main design choices that one needs to decide on when designing a heuristic search algorithm.

The first is the set of operators, that is, the set of steps that we can take in perturbing one network into another. So, for example, in the example that we just saw, we were taking local steps: we were adding edges, deleting edges, or reversing edges, and these are fairly small perturbations of the network. There are also algorithms that use much larger global steps, where we take an entire node out of the network and stick it in somewhere else. That's a larger step that is more costly, but it also takes larger moves in the space of networks. So this is one set of design choices.

The second set of design choices is which search technique we end up using. Do we do what is called greedy hill climbing, where we are just trying to climb up as quickly as we can, or do we adopt a different type of local combinatorial search algorithm, of which there are many different kinds, some of which are listed here? There are many others, and we are not going to talk about those very much.

So let's start by talking about greedy hill climbing, which is the basis for most of the Bayes network learning algorithms that are in use today.

So how does greedy hill climbing work? We start with a given network. Initially this is often just the empty network, and the reason for starting with the empty network is that it induces sparsity: we don't end up with too complicated a network, one that's likely to overfit. We could also start with the best tree, which we get from one of the efficient tree-learning algorithms that we talked about. We can also start with a random network; this is a way of exploring the space more broadly, instead of starting off from a fixed starting point. We can start from multiple random starting points, or, in cases where we have an expert with some knowledge of the domain, we might use prior knowledge in order to initialize the process.

Having initialized, we now iteratively try to improve the network. So we consider the different steps that we can take, based on the operators that we defined as being legitimate, and for each of them we consider the score that we obtain from all possible changes. And then we apply the change that most improves the score; this is the greedy component. That is, we look at all possible changes and we greedily pick the one that improves the score the most. That gives us a new network, which we can now again update by again considering possible next-step changes that we could apply to this new network. And this process continues until we get stuck, that is, when none of the modifications that we have at our disposal improves the score. When that happens, we are at what is called a local maximum, that is, a place in the space of networks where no change gives rise to an improvement in the score.
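The loop just described can be sketched as follows. This is an illustrative implementation, not a fixed API: it assumes a graph stored as a {child: set_of_parents} dict and a black-box score function over structures, and it applies the add/delete/reverse operators discussed above:

```python
import itertools

def is_acyclic(graph):
    """Check a {child: set_of_parents} graph for cycles via DFS coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    def visit(v):
        color[v] = GRAY
        for p in graph[v]:
            if color[p] == GRAY or (color[p] == WHITE and not visit(p)):
                return False  # back edge found: cycle
        color[v] = BLACK
        return True
    return all(visit(v) for v in graph if color[v] == WHITE)

def neighbors(graph):
    """All acyclic graphs one edge addition, deletion, or reversal away."""
    for x, y in itertools.permutations(graph, 2):
        g = {v: set(ps) for v, ps in graph.items()}
        if x in g[y]:
            g[y].discard(x)
            yield g                       # delete edge x -> y
            g2 = {v: set(ps) for v, ps in g.items()}
            g2[x].add(y)
            if is_acyclic(g2):
                yield g2                  # reverse edge x -> y
        else:
            g[y].add(x)
            if is_acyclic(g):
                yield g                   # add edge x -> y

def greedy_hill_climb(score, variables):
    """Start from the empty network; repeatedly apply the single operator
    that most improves the score; stop at a local maximum."""
    current = {v: set() for v in variables}
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for g in neighbors(current):
            s = score(g)
            if s > best_score:
                best, best_score = g, s
        if best is None:                  # no improving change: local maximum
            return current, current_score
        current, current_score = best, best_score
```

With a decomposable score one would cache family scores rather than rescoring whole graphs, but the control flow is the same.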

What are some of the pitfalls of this greedy hill-climbing process? Well, the first is that this local maximum is potentially only a local maximum, that is, one that is not the global maximum. So if we pictorially draw the space of possible networks, and we're going to draw it as continuous even though it's discrete, what we have is, sort of, a space that looks like this. And if we start out from a point over here and we just take small hill-climbing steps, we might end up in a position that is a very poor local maximum compared to the global maximum, which could be considerably better. And this is quite common in the kind of combinatorial search that we use in this type of situation.

A second problem is what's called a plateau. A plateau, if we go back to this kind of visualization that we have here, is a case where, looking around us, there's a whole bunch of networks that all have effectively the same score. And while some of the directions might eventually, after a certain number of steps, give rise to a hill, others might not. And the problem is that we don't know which direction to go, because they all look exactly the same in terms of the score.

This happens quite often in the case of Bayesian network structure learning, because whenever you have a network, for example like this one, there are other networks that are equivalent to it, for example the one where we just reverse this edge, so that we have a parent with two children as opposed to a chain. This network is equivalent to this one, I-equivalent, and therefore it's going to have the same score. And in general, we're going to have a lot of neighbors when we have more complicated graphs: a lot of networks that are equivalent and therefore have the same score. And moving around between them, it's not clear which of those is going to eventually give rise to a move that breaks us out of the plateau, into a network that actually has a higher score.
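To see why I-equivalence creates plateaus, note that I-equivalent structures achieve the same maximum likelihood, and, when they have the same number of parameters, the same penalized score. A small sketch (the helper and data format are illustrative, not from the lecture):

```python
import math
from collections import Counter

def log_likelihood(data, graph):
    """Maximum-likelihood log-likelihood of a discrete DAG given as
    {child: tuple_of_parents}; data is a list of dicts."""
    ll = 0.0
    for var, parents in graph.items():
        joint, marg = Counter(), Counter()
        for row in data:
            pa = tuple(row[p] for p in parents)
            joint[(pa, row[var])] += 1
            marg[pa] += 1
        # sum over families of N(pa, x) * log(N(pa, x) / N(pa))
        ll += sum(n * math.log(n / marg[pa]) for (pa, x), n in joint.items())
    return ll
```

The chain A → B and its reversal B → A encode the same independencies (none), so they get identical likelihoods and identical parameter counts; a score-based search cannot tell them apart.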

So these are significant issues in terms of using greedy hill-climbing search for optimizing the score. The issue of local maxima and plateaus also relates to the question of the search operators that we used. We talked about edge addition, edge deletion, and edge reversal, and the question one might ask is: why on earth would we need edge reversal? After all, edge reversal can be subsumed by an edge deletion followed by an edge addition, so why do we need to add a whole new set of operators into the pool that we have to evaluate? This, it turns out, has to do with the greedy nature of our search.

To understand that, let's look at this example. Let's assume that this is our graph, G*, that generated the data, and now let's assume that we're going to try to learn a network from the distribution derived from this graph. So here we have, initially, our empty network, which we're going to start out with. And we have two strong correlations in this network: the one between A and C and the one between B and C. Let's say that we pick the one between B and C as being stronger, and therefore we're going to add it first. But now there is complete symmetry between the edge that might go from C to B and the edge that might go from B to C; these two networks are equivalent, and they're going to have the same score, and so we can arbitrarily choose the edge that goes from C to B, because it's as good as the other one.

Now that that edge is in, we might go ahead and add the other ones. Say we are going to add the edge from A to C. And the question is: now what? The edge from C to B is wrong, and really we'd like to reverse it. But if we don't have an edge-reversal operation, the only way to do that would be to remove that edge and then add in the edge in the other direction. But removing that edge is going to cost us a lot in terms of score, because this was a really good edge; in fact, it was the first one that we added. So removing it is going to cause a hit in terms of the score, and it's going to be a sub-optimal move. It's not going to be a great move. And so if we don't have an edge-reversal operation, this red network, the one that has the edge from A to C and the edge from C to B, is now going to be a local maximum, and we're not going to be able to break out of it by looking just greedily within one step, unless we have an edge-reversal step. So edge reversal is a way of avoiding these very common local maxima, but it doesn't address all local maxima, specifically not the bigger ones that we have also discussed.

So what is the algorithm that people typically use for addressing this problem, without falling into these local maxima quite so frequently? It turns out that a pretty good, simple algorithm is the following. We do use greedy hill climbing, but we augment it with two strategies that attempt to avoid local maxima and plateaus.

The first is what's called random restarts. That is, if you're climbing up and you get stuck, you take a certain number of random steps and then start climbing again. The hope is that if we're at a local maximum that's fairly shallow, then the small number of random steps will move us into the basin of attraction of a better maximum, and we'll continue to climb to a better optimum.

A second strategy, which is useful for both local maxima and plateaus, is what's called a tabu list. A tabu list is a way of avoiding treading the same ground over and over again. So here we have a set of steps that we have taken, for example, adding an edge from A to B, or deleting an edge from C to B; these are the steps that we've recently taken. We keep a list of the k steps that we've most recently taken, and we constrain our search so that it cannot reverse any of these steps. So, specifically, if "add A to B" is on our list, then on our tabu list we have that deleting the edge from A to B is not something we can do while the add step is still in the list of the k most recent steps. And that forces us to make forward progress, rather than doing the same step, reversing it, trying a different version of it, reversing that.

And that turns out to be helpful both in avoiding local maxima and in making some progress in the context of plateaus.

So, with this, how well does it work in practice? Let's first look at a synthetic example.

This is the ICU-Alarm network that we've seen before in the context of Bayes net parameter estimation. What we see here are two learning scenarios. One, the blue line down here, is learning parameters only, with the true structure given. And the green line is where we're trying to do both parameters and structure at the same time, so trying to learn both. On the x-axis we see the number of data instances, M, and on the y-axis the distance, measured as KL divergence, or relative entropy, between the learned network and the true network.

Â 14:35

And what we can see here is that it's certainly the case that when we're trying to learn only parameters, we do better, at least initially; that is, for small numbers of samples, the case where we have the true network performs better in terms of learning a distribution that's close to the right one. But as we get more and more data, even with as few as around 2,000 samples, we're actually pretty close in performance. Now, this is very promising, because it says that the structure learning problem is not intrinsically that much more difficult than that of parameter estimation alone.

This result, however, should be taken with a grain of salt, because the problem of learning from synthetic data, which is what we have here, data that is generated from the network, is actually easier than the problem of learning from a real data set, because synthetic data carries a much clearer and stronger signal about the correlation structure than real data. And so the performance here is probably a little bit misleading in terms of our ability to learn the network structure correctly when the data's not quite as nice as synthetically generated data.

However, there are a lot of applications where, even with real data, Bayes net structure learning has given significant advantages. One of those is this application from Microsoft Research, which aims to predict where traffic jams are going to take place, as well as where we might have surprises in terms of having less traffic or more traffic than normal.

So here the idea is that we have some number of sensors on certain major freeways, and we know, for example, that certain areas of the freeway have a lot of traffic and other areas have less traffic. We're trying to use that data to predict the parts of the road system that don't have these sensors in them, as well as to predict how the traffic will look in 30 or 60 minutes. In particular, we'd like to predict, for example, parts of the road where there's going to be more traffic than we expect, so predicting surprise, as well as good surprises, such as less traffic than we expect.

So this is a scenario where there is some amount of prior knowledge, but it turns out that expert knowledge is not really adequate for understanding how different sources of information, different pieces of the road system, are intertwined with each other. Much better performance was derived by learning a Bayesian network structure from data, one that shows how these different variables, accident reports, the hour of the day, and different areas in the freeway system, are related to each other. And we can see that some of the correlations that were learned might not be quite as obvious as what a human might have inferred. So, for example, consider the influences on the traffic at location 15: the most significant factors are the ones marked here, at 4, 11, and 17. You might not have necessarily predicted that something that's all the way across the bridge in Seattle is the most indicative factor for this particular prediction problem. And so much better performance was derived by applying learning techniques to this problem.

Â 18:28

A very different application of structure learning is in the context of an application that aims to do scientific discovery. In this context, a data set was acquired that measures, for different proteins in a cell, the level of those proteins. This was done at the level of single cells, so there were tens of thousands of measurements, each one corresponding to a protein profile in an individual cell. Now, the goal of this experiment was to discover the control mechanisms by which the level of one protein can influence the level of another: to learn, for example, that a change in the level of a protein such as PKC can influence the level of a protein called PKA, and the level of a protein called RAF, as part of the regulatory network of the cell.

So this data set was acquired, and Bayesian network learning was applied to it in order to try to understand this regulatory network within the cell. What we see here are the edges that were learned by the Bayes net learning algorithm. And we can see that many of the edges, specifically all of the ones that are marked in blue, were ones that were already known in the literature. Now, while it might seem perhaps less interesting to rediscover something that was already known, this is an important validation for the Bayes net learning algorithm, because it shows us that the network is learning something valuable and correct. Furthermore, it shows that, in a single experiment, Bayesian network learning applied to the right kind of data can reconstruct a cellular network that took biologists many, many years to put together.

Now, not all of this is right. So, for example, this pink edge over here is an edge learned in one direction that really should have been learned in the other direction; that is, it is reversed relative to our current biological understanding. And some other edges, specifically the ones drawn as dashed lines, for example this one, or that one, or that one, were omitted from the network; that is, they should have been there but were not correctly learned. So this is definitely not perfect, but nevertheless it's a surprisingly impressive success story in the context of taking a single data set and reconstructing many years of accumulated biology.

What's also important is that some of the edges that were learned, these two red ones, were ones that had only very weak support in the literature. That is, they were known to exist in a different cell type than the T cells from which these data were acquired, but were not known to exist in T cells. And one of these, specifically this one, was subsequently tested in a wet lab and found to be true in T cells as well. And so the network revealed to biologists a connection that had not previously been established in the context of T cells, and it turned out to be of importance, because it corresponds to what's called cross-talk, or communication, between two different biological pathways. And so this is a very nice example of how Bayesian network structure learning can be used in a different paradigm, where not only do we not necessarily have expert knowledge and want to improve performance, but rather discovering the network structure is a goal in and of itself.

To summarize, Bayesian network structure

learning is a useful tool for building better predictive models when the experts don't know the structure. And that can be useful both for making predictions about new instances and also, as we saw in this most recent example, for knowledge discovery, so for these two very different purposes. We've shown that, in general, finding the highest-scoring network structure outside the context of forests or trees is an NP-hard problem, and that it is typically solved using heuristic search. Most commonly this is done using local steps that modify the network through small local perturbations, such as edge addition, deletion, or reversal. In order to avoid the local optima that one would get from this, we typically use not just simple hill climbing, but modify it using both a tabu list, which avoids, sort of, undoing recent changes, as well as random restarts, so as to explore different parts of the space. While this is a surprisingly successful strategy, it turns out that there are actually much better algorithms out there that make much larger changes in the space simultaneously. These are computationally more expensive and harder to implement, but they can make much larger moves in the space, and therefore, especially for large spaces where the signal isn't immediately clear, they can find better structures than a simple greedy search.
