0:00

We previously defined the intuition for the dual decomposition algorithm, which takes an optimization objective and decomposes it into a collection of different slaves, and then uses communication between the slaves to try to get them to agree with each other about the overall MAP assignment. So now, let's take this intuition and convert it into an actual algorithm.

What we have here is that each slave gets its own objective function. Here is the objective function for the i-th slave, the slave corresponding to the variable xi, and here is the objective function that the slave corresponding to the larger set F is trying to optimize. Each of these initially has, as a component, its own local objective function. But in addition, we are going to introduce, for each of those, a set of terms that modify its objective function so as to try to get it to agree with the other slaves. Specifically, we are going to try to get the i slave to agree with the factors F in which i appears, and conversely, we are going to try to get the F slave to agree with the slaves of the variables in its scope.
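Written out, one common way to formalize these modified objectives is the following sketch, in standard dual decomposition notation; the lambda_{Fi} are the agreement terms, and the particular sign convention is an assumption (chosen to be consistent with the lambda updates described next):

```latex
\hat{\theta}_i^{\lambda}(x_i) \;=\; \theta_i(x_i) \;+\; \sum_{F \,:\, i \in \mathrm{Scope}[F]} \lambda_{Fi}(x_i)
\qquad\qquad
\hat{\theta}_F^{\lambda}(\mathbf{x}_F) \;=\; \theta_F(\mathbf{x}_F) \;-\; \sum_{i \in \mathrm{Scope}[F]} \lambda_{Fi}(x_i)
```

Each lambda term appears with opposite signs in the two objectives, which is what lets a single update push the two slaves toward agreement.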

So, let's see how that gets done. We are going to initialize all of these incentive terms, all of these lambdas, to zero. Initially, each slave makes its decision based purely on its own local piece of the objective. But then, as the algorithm evolves, these incentives start to change. So, how does that happen?

At each iteration t, each slave optimizes its current objective. So, the i slave optimizes theta i hat lambda, where lambda is the current set of penalties, and that gives it an assignment which we are going to call xi star. The F slave does the exact same thing: it optimizes its own local objective, which is theta hat F, also using the same set of lambdas, and that gives it x star of F, which is this guy over here.

Now, let's see what happens if there is a disagreement between a large slave F and a variable i in its scope. What we would like to do is incentivize them to agree with each other. And so, what we do is that for all F, and all variables i in F, if they disagree, that is, if the value chosen for xi in the assignment chosen by the factor F is different from the value of xi chosen by the i slave, then we need to change the incentive function. And we do that by changing the lambdas.

Specifically, when we look at the penalty for the pair F,i, which is over here, we decrease, by an amount alpha t which is greater than zero, the preference for xi star, which is the value chosen by the i slave. And if we look at that, we see that this is going to disincentivize the i slave from picking the value that it picked. Conversely, we increase the preference for the value chosen by the F slave, which is this value over here, by the same amount alpha t. Now, that changes the preference of the i slave, but it also changes the preference of the F slave by the exact same amounts, but in the opposite direction, because notice that here we have a negative sign, whereas here we have a positive sign. So it makes the i slave and the F slave move in opposite directions so as to agree with each other. So, what happens as we continue doing

these iterations over time? Well, one can show that under fairly weak conditions on these parameters alpha t, the lambdas are guaranteed to converge. What are these conditions? The alphas need to be big enough that they make a difference, so that if we sum up all of the alpha t's over all t, we end up with a diverging sequence that sums to infinity. On the other hand, they need to be small enough that we eventually achieve convergence.
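These two requirements are the standard step-size conditions for subgradient methods; stated explicitly (the example schedule alpha_t = 1/t is an illustration of mine, not from the lecture, and satisfies both):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \alpha_t^{\,2} < \infty,
\qquad \text{e.g.} \quad \alpha_t = \frac{1}{t}.
```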

And that is ensured by the second condition, on the summation over here. Now, in fact, one can also show that not only do we get convergence, we get convergence to a unique global optimum regardless of the initialization. So, the lambdas always converge.
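Putting the iteration together, here is a minimal runnable sketch on a toy model with binary variables; the data layout and names (`dd_step`, score tables as Python dicts and lists) are illustrative assumptions of mine, not the lecture's notation:

```python
# Sketch of one dual decomposition iteration on binary variables (values 0/1).
# lambdas[(F, i)][v] enters the i slave's objective with a plus sign and the
# F slave's objective with a minus sign, matching the sign convention above.
from itertools import product

def optimize_var_slave(i, theta_i, lambdas, factors):
    # argmax over v of: theta_i(v) + sum over factors F containing i of lambda_Fi(v)
    def score(v):
        return theta_i[v] + sum(lambdas[(F, i)][v] for F in factors if i in F)
    return max((0, 1), key=score)

def optimize_factor_slave(F, theta_F, lambdas):
    # argmax over joint assignments x_F of: theta_F(x_F) - sum_i lambda_Fi(x_i)
    def score(assignment):
        x = dict(zip(F, assignment))
        return theta_F[assignment] - sum(lambdas[(F, i)][x[i]] for i in F)
    best = max(product((0, 1), repeat=len(F)), key=score)
    return dict(zip(F, best))

def dd_step(thetas_i, thetas_F, lambdas, alpha_t):
    """One iteration: every slave optimizes, then disagreements shift lambdas."""
    factors = list(thetas_F)
    x_star = {i: optimize_var_slave(i, th, lambdas, factors)
              for i, th in thetas_i.items()}
    xF_star = {F: optimize_factor_slave(F, th, lambdas)
               for F, th in thetas_F.items()}
    for F in factors:
        for i in F:
            if xF_star[F][i] != x_star[i]:
                lambdas[(F, i)][x_star[i]] -= alpha_t      # push i slave away from its pick
                lambdas[(F, i)][xF_star[F][i]] += alpha_t  # and toward F's pick
    return x_star, xF_star
```

On a toy instance where the variable slaves initially prefer (1, 0) but a strong pairwise factor prefers (0, 0), a single lambda update with alpha_t = 0.6 is already enough to bring all slaves into agreement on (0, 0) at the next iteration.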

Now, the question is, what are the implications of the convergence of the lambdas? What happens at convergence? At convergence, one can show that there is no more change, and so at that point, each of our slaves has its own locally optimal solution over its own variables, the ones that are in its scope. So, is that good enough? Unfortunately, not always. That is, even though we have achieved convergence, the solutions of different slaves may or may not agree on the variables that are shared between them.

However, if they do agree, if we can get all of the slaves to agree with each other, then we can show that the shared solution is a guaranteed MAP assignment. So, if the slaves agree, we have effectively solved our initial optimization problem, which is that of finding a MAP assignment. If that doesn't happen, that is, if the slaves have not agreed on their shared variables, then we need to solve what is called a decoding problem, where we need to somehow put together the different solutions that the different slaves have converged to and come up with a single joint assignment.

So, how do we do that? How do we solve this so-called decoding problem? It turns out that people have constructed a range of different solutions for this, none of which has particularly strong performance guarantees, but they often work quite well in practice. So, for example, if we use a decomposition of the original problem into spanning trees, then each of the spanning trees covers all of the variables, so we can take the MAP solution of any individual tree and say that that is an assignment to all of the variables. So, that's one approach. A second approach, one that is not winner-takes-all, is to have each of our slaves vote on the xi's that are in its scope, and then, for each xi, just do majority voting. So, that's another approach. A different approach uses some kind of weighted averaging of the sequence of messages that are sent regarding each xi through the course of the iterations of the algorithm.

Each of these works reasonably well, but it turns out that often the right solution is to simply generate multiple candidate solutions and then pick the best one. The critical observation here is that the scoring function theta for any assignment is something that we can easily evaluate. So, why not do all of these: use multiple trees, generate a whole collection of x's, evaluate theta of x for each of those candidate solutions, and then pick the best one? And it turns out that when you do that, you generally get solutions that are better in terms of their overall score than any fixed strategy.
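As a sketch of two of these ideas, majority voting over per-slave solutions plus "score every candidate and keep the best" might look like this (function names and data layout are my own illustrations, not an API from the lecture):

```python
# Decoding by scoring: combine slave solutions into candidate full assignments
# (here via majority vote), then evaluate each candidate with the easily
# computable score theta(x) and keep the highest-scoring one.
from collections import Counter

def majority_vote(slave_assignments):
    """For each variable, take the most common value across the slaves' votes."""
    votes = {}
    for assignment in slave_assignments:        # each maps some variables -> values
        for var, val in assignment.items():
            votes.setdefault(var, Counter())[val] += 1
    return {var: counter.most_common(1)[0][0] for var, counter in votes.items()}

def decode(candidates, theta):
    """Return the candidate assignment with the best overall score theta(x)."""
    return max(candidates, key=theta)
```

In practice `candidates` would collect assignments from several heuristics (individual tree MAPs, the majority vote, a weighted average of messages), and `decode` picks whichever scores best.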

Now, the other nice aspect of this analysis comes back to a point that we made earlier in this module, which is that for any lambda, the function that we defined as L of lambda is an upper bound on the value of the true MAP assignment. Now, what does that imply? It means that if you give me any x that you think is a reasonable candidate, so this is a candidate for the MAP assignment.
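The point being set up here is that the score of any candidate x lower-bounds the MAP value, while L(lambda) upper-bounds it, so the two sandwich the true optimum (written here as a sketch):

```latex
\theta(\mathbf{x}) \;\le\; \max_{\mathbf{x}'} \theta(\mathbf{x}') \;\le\; L(\lambda)
```

In particular, if theta(x) happens to equal L(lambda), then x is certified to be an exact MAP assignment.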

Â 12:19

The third issue, which we briefly mentioned before, is how you adjust the step size alpha t so as to move things toward convergence fast, but still slowly enough to actually guarantee convergence. There are some hard and fast rules that guarantee convergence, but they are often quite slow, and there are cleverer rules that still guarantee convergence and tend to work better in practice; this is an active area of research that is ongoing right now. And then finally, we've talked about the decoding problem and the fact that there are many heuristic choices there. And so, that's another design choice that one needs to make when implementing dual decomposition.

So, let's summarize the different aspects of what we've talked about. First of all, let's summarize the algorithm. Dual decomposition is a general-purpose algorithm for MAP inference that works by dividing the model into tractable components, potentially with multiple factors in a single component, or slave. It solves each slave locally and then passes these lambda messages between the slaves to try to get them to agree. We've talked before about the existence of tractable MAP subclasses, and any tractable MAP subclass can be used in this setting as a slave. This allows slaves to be large and richly structured, as opposed to just little tiny factors that are going to be making only very myopic decisions based on minimal information.

One of the nice things about the dual decomposition algorithm is that it has a lot of theoretical backing to support it. So, formally, and we're not going to go into the details of what this means, but just to mention what this is actually doing from a formal perspective: this is a subgradient optimization algorithm, where a subgradient is kind of like a gradient, but not exactly, applied to a problem that is a dual problem to the MAP problem. And if you didn't understand that, that's okay. This is just a pointer for those of you who might have seen some optimization before, to help fit this into that world.
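In that optimization view, the dual objective being minimized over lambda is the sum of the slaves' optimal values, i.e., the function L(lambda) referred to earlier in this module (a sketch in standard notation, where the theta-hats denote the lambda-modified slave objectives):

```latex
L(\lambda) \;=\; \sum_i \max_{x_i} \hat{\theta}_i^{\lambda}(x_i) \;+\; \sum_F \max_{\mathbf{x}_F} \hat{\theta}_F^{\lambda}(\mathbf{x}_F)
```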

But this view does provide some important guarantees, some of which I've mentioned in this presentation. First, it gives us an upper bound on the distance between where we are and the true MAP assignment. Second, it provides us with conditions, such as agreement among all of the slaves, under which we are guaranteed to have actually found a true, exact MAP solution. And there's even some interesting analysis, which I'm not going to present, that tells us when one decomposition of the model into slaves is better than a different decomposition.

Now, in practice, this algorithm has some significant pros and cons. On the pro side, it is very general purpose: you can really use it for any model if you're willing, for example, to decompose it into fine enough slaves. Of the algorithms that are currently out there, it provides some of the best theoretical guarantees in terms of convergence properties. And, because of the way it is structured, it can use very fast, specialized MAP subroutines for solving large model components, like matching problems, for example, or subsets of edges that are submodular, or associative. On the other hand, it's really not the fastest algorithm out there. That is, there are algorithms that, when they work, work considerably faster than the dual decomposition algorithm. And, more so than most, it has a lot of tunable parameters and design choices, which makes it a somewhat finicky algorithm to apply in practice, because one has to play around quite a bit in order to get good performance.
