0:00

We talked about how the problem of learning a general Bayes net structure

Â can be viewed as a combinatorial optimization problem in which we are

Â trying to find a high scoring network structure and how that's problem, that

Â problem is usually solved by some kind of heuristic search over the space of Bayes

Â net structures. Let's talk now, about the computational

Â cost of this algorithm and how we can use the property of decomposability, which we

Â also previously used in the context of tree learning to considerably reduce the

Â computational cost of this of this algorithm.

So as a reminder, the heuristic search procedure that we discussed iteratively moves over the space of networks. If this is the network that we're currently at, we need to evaluate several different moves away from it, which are usually local moves: adding, deleting, or reversing an arc. We then score each of these successor networks, see which has the highest score, and take that move. One can also have other algorithms that don't necessarily make the greedy move at each point, but the basic idea is the same.

So what is the computational cost of this algorithm? Let's do the naive computational analysis, that is, what we would get from the naive implementation. How many operators do we need to evaluate in each search step? In a Bayesian network with n nodes, we have n(n - 1) different possible directed edges. Now, each of those edges is either present in the graph or not. If the edge is present, we can either delete it or reverse it; if it's absent, we can add it. So effectively, we have either two or one possibilities for each of those n(n - 1) possible edges, which means we have O(n^2) possible operators to evaluate at each step of the algorithm.
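As a rough sketch of this enumeration (the graph representation and operator names here are illustrative, not prescribed by the lecture), the candidate moves for each ordered pair of nodes can be listed like this:

```python
from itertools import permutations

def candidate_operators(nodes, edges):
    """Enumerate the O(n^2) local moves: for each of the n(n - 1)
    ordered pairs (u, v), propose deletion and reversal if the edge
    u -> v is present, and addition if it is absent. Acyclicity of
    the resulting graph is checked separately."""
    ops = []
    for u, v in permutations(nodes, 2):   # all n(n - 1) ordered pairs
        if (u, v) in edges:
            ops.append(("delete", u, v))
            ops.append(("reverse", u, v))
        else:
            ops.append(("add", u, v))
    return ops
```

For three nodes and one existing edge this yields 3 * 2 = 6 pairs, hence seven operators: two for the present edge and one addition for each of the five absent ones.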

What is the cost of evaluating each of the candidate successors that we would get by applying one of these operators? Reminding ourselves that, because of the decomposability property, the score has multiple components, one for each variable in the network: we have n different components, and for each of those, we have to compute the sufficient statistics and the resulting entry in the score corresponding to that variable. Computing sufficient statistics requires a traversal over the training data, which takes O(M) time, where M is the number of training instances. So altogether, evaluating one candidate requires O(nM) time. Now, we haven't talked about this, but one also needs to make sure that the graph resulting from the operator is in fact acyclic, so we need to do an acyclicity check. In general, that requires O(m) time, where little m is the number of edges in the graph. So altogether, if we sum up all of these pieces of the cost, we end up with a computational cost of O(n^2 (nM + m)) per search step, where the little-m term is usually dwarfed by M times n. Now, even for a network that isn't necessarily that large, say 50 to 100 variables, with 10,000 training instances, this can get really, really large, and is generally intractable in a lot of situations.
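The acyclicity check mentioned above can be sketched as an iterative depth-first search over the candidate graph; the three-color scheme below is one standard way to do it in O(n + m) time (this is an illustrative implementation, not one given in the lecture):

```python
def is_acyclic(nodes, edges):
    """Check that a directed graph has no cycle via iterative DFS with
    three colors; runs in O(n + m) time for n nodes and m edges."""
    children = {u: [] for u in nodes}
    for u, v in edges:
        children[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {u: WHITE for u in nodes}
    for root in nodes:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(children[root]))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for child in it:
                if color[child] == GRAY:  # back edge: a cycle exists
                    return False
                if color[child] == WHITE:
                    color[child] = GRAY
                    stack.append((child, iter(children[child])))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK       # all descendants explored
                stack.pop()
    return True
```

A search step would run this on the graph produced by each candidate operator before accepting it.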

So how can we improve on this? Let's see how we can exploit the decomposability property to get significant computational savings in the search process.

Let's first look at a single local operator, such as the one that adds an edge between B and D in this network, where such an edge did not exist before. Consider the score of the original network, which, because of the decomposability property, is the sum of the family scores of the individual variables. So we have a component for the score of A relative to its empty set of parents, the score of B relative to the empty set, C relative to its parents A and B, and D relative to its parent C. Let's compare that to the score of the network following the move. We can see that the score of the new network is actually very similar to the score of the original network, because, by the same decomposability property, we can break the score up into these small components, and since most families haven't changed, those components of the score are the same. Specifically, we have the exact same score for A relative to its empty parent set, the same score for B, the same score for C; only the last score, for D, is now different, because we've modified the family of D. But that immediately suggests that we don't really need to recompute the earlier components of the score, because they haven't changed; we only need to compute the last component, corresponding to D.

In fact, we can do better than that. Instead of considering the absolute score, we compute what's called the delta score, which is the difference between the score of the network following the change and the score of the original network. As we can see, that difference is just the difference between the scores of the two families for D: the {B, C} family in the new network versus the {C} family in the original network. And that delta score is something we can compute by looking at just a single family, ignoring the rest of the network.

The same thing happens for other local operators. For example, if we consider the deletion operator that deletes the edge between B and C, and we look at the delta score, that delta score only cares about the family of C, because that's the only family to have changed. And it's going to be, again, a difference: the score of C with the single parent A minus the score of C with the family {A, B}. So again, only one family gets affected by this change.

For an edge reversal, the situation is a little bit more complicated, because we can see that by flipping the edge from B to C and making it go from C to B, there are actually two families that are affected, the B family and the C family. But that's still just two, as opposed to the entire network, and so that delta score is going to have two components: one is the delta score for the C family, and the second is the delta score for the B family. In either case, we only end up affecting one family, in the case of edge addition or edge deletion, or at most two families, in the case of edge reversal.
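These delta-score computations can be sketched as follows, assuming a hypothetical `family_score(var, parents)` function that returns the score component for one variable given a parent set (its name and signature are illustrative, not from the lecture):

```python
def delta_score(family_score, var, old_parents, new_parents):
    """Delta score when a local move changes one variable's family:
    unchanged families cancel out of the difference, so we compare only
    the changed variable's new family score against its old one."""
    return (family_score(var, frozenset(new_parents))
            - family_score(var, frozenset(old_parents)))

def reversal_delta(family_score, u, v, parents):
    """An edge reversal of u -> v into v -> u changes two families:
    v loses parent u, and u gains parent v; the total delta is the sum
    of the two family deltas."""
    return (delta_score(family_score, v, parents[v], parents[v] - {u})
            + delta_score(family_score, u, parents[u], parents[u] | {v}))
```

For the addition of B as a parent of D, for instance, the delta is just the D-family score with parents {C, B} minus the D-family score with parent {C}.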

And so, that means that we only have to consider a much smaller fraction of the network when computing the delta score.

A second use of the decomposability property comes up when we consider multiple consecutive steps of the search algorithm. So let's imagine that in the previous step, we decided to take the step that added the edge from B to D, and now we have to consider the next step in the search. Well, what are the operators that are conceivable in this next step? For example, one such operator is to delete the edge from B to C. And notice that this edge-deletion operation is in fact an operator that we also considered in the previous step of the search, before we decided to add the edge from B to D. Now, is this operator the same? Notice that the family of C hasn't changed between those two cases: in both of them, when we're considering the move, C has the parents A and B. And so, the delta score for that particular move is not going to change either. Specifically, if in this case the delta score was the score of C given the family {A} minus the score of C given the family {A, B}, we had exactly that same delta score in the previous step. That is, these two delta scores, in the previous step and the new step, are exactly the same, and so there's really no point in recomputing it.

So which scores do we need to recompute? The only scores that we need to recompute are the ones that were affected by the step that we just took. Specifically, if we took a step that modified the family of D, then any step that involves an additional change to the family of D will have a different delta score, because the family being compared is now different. However, moves involving families that were not affected by the step don't need to be recomputed. So to summarize, we only need to recompute delta scores for families that changed in the last move.

So let's summarize the computational cost of this procedure. What's the cost per move? Having decided to take a move, we have only one or two families that were affected by that move. That means that only O(n) delta scores need to be recomputed, because for a given family, there are only O(n) possible edges that impinge on that family. And for each of those, we need to compute sufficient statistics, which takes O(M) time. So altogether, we have O(nM) as the cost of executing a move, which is two orders of magnitude in n better than the O(n^3 M) that we had for the original naive computation.

Now, this tells us the cost after we pick a move. What about the cost of picking a move? Naively, we might say that there are O(n^2) possible operators that we can consider in any given move, and so we need to evaluate each of them and pick the best one. But in fact, we can do considerably better by the use of clever data structures. Specifically, we can maintain a priority queue of operators, sorted by their delta scores. Now, when we rescore the O(n) operators affected by the last move, we need to update the scores of those operators and reinsert them into the priority queue in their appropriate place. But that is a computation that requires only O(n log n) time, because there are only O(n) of them. And once we have done that, the best operator is at the top of the queue, which we can then identify and take in constant time. And so, this priority queue takes us from O(n^2) time for traversing the set of operators down to something that requires O(n log n).

And so, altogether, we've actually saved a considerable cost in both of these steps.

It turns out that one can get an even higher level of computational efficiency based on a different observation. In most network learning runs, the plausible families, the ones that have some chance of being selected, are variations on a theme. That is, for a given variable A, there might be some number of variables B1, B2, up to Bk, for a very small k, that are reasonable candidates to be selected as parents of A. So how do we exploit that property to get computational savings? It turns out there are two different ways that one can utilize this.

The first is the fact that, because the same set of variables is constantly being considered in candidate families, we can take sufficient statistics that we computed in one step and reuse them in a different step if we cache them, because we're likely to encounter the same family more than once. We might encounter B as a possible parent of A, and A as a possible parent of B, and for both of these, the sufficient statistics over A and B are going to be utilized. And so, if we compute these sufficient statistics once and then cache them, we don't have to recompute them again.
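A minimal sketch of such a cache, assuming the training data is a list of M dicts mapping variable names to values (the representation is an assumption for illustration):

```python
from collections import Counter

def make_suff_stats(data):
    """Return a cached sufficient-statistics function. The joint counts
    for a variable set such as {A, B} are computed once with an O(M)
    pass over the M training instances, then reused by every move (and
    both directions of an edge) that needs the same set."""
    cache = {}
    def counts(variables):
        key = frozenset(variables)        # {A, B} and {B, A} share a key
        if key not in cache:
            order = sorted(key)           # fixed column order for tuples
            cache[key] = Counter(tuple(row[v] for v in order)
                                 for row in data)
        return cache[key]
    return counts
```

Evaluating "B as a parent of A" and later "A as a parent of B" then triggers only one pass over the data for the {A, B} counts.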

That turns out to make a huge difference in the computational cost of this algorithm. The second way this observation can be used is that, if we can identify in advance the set B1 up to Bk of reasonable parents to consider for A, we can restrict the candidate set in advance and not consider other parents at all, which reduces our complexity from O(n) to O(k), where k is some bound on the number of plausible parents that we're willing to consider. Now, this is a heuristic, in the sense that this restriction can actually change the outcome of our algorithm; it's not just a way of reducing the cost. But it turns out to be a very good approximation in practice.
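This restriction can be sketched as a variant of the operator enumeration, assuming a precomputed map from each variable to its k plausible parents (the map and operator names are illustrative; reversals are omitted for brevity):

```python
def restricted_operators(nodes, edges, plausible_parents):
    """Like the full O(n^2) enumeration, but only propose adding u -> v
    when u is in v's precomputed set of k plausible parents, cutting the
    candidate parents per variable from O(n) to O(k). Deleting existing
    edges is always allowed. Note this is a heuristic: it can change
    the search outcome, not just reduce its cost."""
    ops = [("delete", u, v) for (u, v) in edges]
    for v in nodes:
        for u in plausible_parents[v]:
            if (u, v) not in edges:
                ops.append(("add", u, v))
    return ops
```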

To summarize: it turns out that even a fairly simple heuristic structure search, such as greedy hill climbing, can get expensive when n is large, because the naive implementation has a cubic dependence on n. But we can exploit the decomposability property, which we also exploited in the context of tree learning, to get several orders of magnitude of reduction in the cost of the search. We can also exploit the recurrence of plausible families at multiple steps in the search algorithm to get further computational savings, and restrict the search space to get better computational properties.
