
Over the course of the last unit we've discussed a range of different inference methods: some exact, some approximate; some solving the problem of computing marginals, some the problem of computing MAP assignments. The question is, if we're faced with a particular task and we've designed a graphical model for it, which inference algorithm should we use? What are some of the choices, and what guidance can we offer about which method to pick?

Â The first question we should answer for ourselves is

Â which task is more suitable for our problem?

Â Are we trying to solve the problem of computing a map assignment, or are we

Â trying to solve the problem of computing marginals?

Â And these offer sometimes not obvious trade-offs in terms of what's right for

Â our application. [SOUND] So, computing marginals by and

Â large is a less fragile way to go, because we don't pick just a single

Â hypothesis. We compute a probability distribution

Â over a range of different options and we can see for example, is our top option a

Â little bit more likely than the second best or a lot more likely than the second

Â best? And.

Â That gives us some guidance as to how robust our inferences are.

Â that also gives us some amount of confidence that we can provide in are

Â answers and maybe give to the user as guidance in terms of how much to trust

Â thee the answers produced by the system. And as we've also seen computing

Â probabilities is important for supporting decision making once we need to integrate

Â for example, with the utility model. On the other hand, map is suitable in its

Â own set of contexts. for example, if we're trying to compute a

Â coherent joint assignment for example, in the context of a speech recognition

Â problem or an image segmentation problem. And sometimes for important that we get

Â something that makes sense as a whole than then we get the best possible

Â answers to individual pieces of the inference problem like individual pixels

Â or individual foams. but in a way that doesn't make at in the

Â context of of the larger problem. And so the notion of coherence is

Â sometimes important and is worth the trade-off of reducing the robustness.

Â We've also seen that from computational perspective map has a range of model

Â classes that are more tractable than in the case of computing marginals.

Â So for computational reasons we might sometimes adopt the use of a map solution

Â just for improved efficiency. And we've also seen that, especially when using

Â approximate inference, the map assignment often provides us with some theoretical

Â guarantees in terms of how close our answers are to the correct answers for

Â example, in the context of dual decomposition we've seen that, where as

Â it's difficult. We get a similar level of confidence in

Â terms of our, the accuracy of approximate inference for competing marginals.

Â So on the other hand, when running a

Â proximate inference, there are again different trade offs for these two

Â classes of problems. So,

Â in when computing marginals one often in gets for approximate inference algorithm

Â the fact that errors just by being soft but just by having soft assignments,

Â errors often are kind of tenuated. And, the source of inaccuracy in one part

Â of inference is often doesn't make it's way significantly into other parts of the

Â model. And so you often get more robust answers

Â in approximate inference, when competing marginals.

Â On the other hand, on the map side, one can at least gauge

Â in many cases, at least in some algorithms such as dual decomposition,

Â whether the algorithm is working or not by looking at the value of an assignment

Â that we get from the algorithm. So, again, somewhat different tradeoffs.

Â And in some applications, people actually try both and see which one works better

Â in terms of the performance and the downstream tasks that one cares about.

Â [SOUND] So what are some algorithms that we've discussed for doing marginal

Â inference. So first is just good old exact inference

Â say, variable elimination and clique tree, and by and large if the problem is

Â small enough that exact inference fits in memory.

Â 4:19

In terms of the sizes of the cleats and the subsets, it's probably good to use

Â exact difference. You run into less problems this way and

Â if it fits in memory, it's probably going to be very fast.

Â If one is not in this fortunate situation, we've discussed.

Â Different types of algorithms that are approximate.

Â there's the range of algorithm's due message passing over the graph, of which

Â loopy message passing is one of the more common ones.

Â But not the only one. And then there is the class of sampling

Â methods that sample from the distribution.

Â And these are different categories of algorithms.

Â And frankly, it's often difficult to tell in advance which of them is going to work

Â better for a given class of models. and you talk a little bit about factors

Â that might favor one of these algorithms over another in a subsequent slide in a

Â couple minutes. What about algorithms for map?

Â Once again, we have algorithms that perform exact inference and in this case

Â its actually a broader spectrum then in the case of computing marginals.

Â So in addition, to cases where we have low tree width, which is the category for

Â which marginal algorithms work we've also seen other examples such as those with

Â associative or regular potentials and multiple other cases.

Â And once again if you can perform exact inference that's really often the best

Â way to go. you're Â·Â going to, it, it's often going

Â to give you the best performance if it's tractable.

Â Then there's the class of methods that are based on optimization.

Â And those can be exact methods which puts us in this category, but more often we're

Â going to find ourselves in the case where we have to use some kind of approximate

Â methods such as a duley composition algorithm that we've talked about/ And

Â those methods often as we've said lend themselves to some kind of at least being

Â able to estimate the performance of our algorithm relative to the optimum answer

Â even if we can't get to the optimum answer.

Â Finally, there is the range of algorithms that perform search over the space.

Â And this can be simple hill climbing, search which is is a fairly

Â straightforward application of standard search methodologies, but also it's,

Â is quite common to use some kind of sampling.

Â Like, Markov Chain Monte Carlo sampling to explore a range of different

Â assignments in the space, and then select among those the ones that have the

Â highest, or the one that has the highest score or

Â the highest yeah the highest log probability.

Â And that is quite commonly used technique, it doesn't provide the same

Â set of guarantees that you might get from the optimization methods.

Â But it's often very easy to implement and so's quite frequently used.

Â So if we're resorting to the use of approximate indifference.

Â What are some of the issues that might make our lives more complicated or might

Â favor one algorithm versus the other? So one complication has to do with a

Â connectivity structure. Just how densely connected the model is.

Â And by and large, the more densely connected the model is, the worse it is

Â for message-passing algorithms. So, message-passing algorithms don't like

Â densely connected models because the messages are propagated over very short

Â cycles, and that can cause both issues with convergence.

Â As well as with ac, with lack of accuracy.

Â So, lack of convergence and lack of accuracy.

Â Sampling methods are less effective by the connectivity structure.

Â The second com complicating factor is the strength of inference.

Â That is the extent to which nodes that are connected to each other have tight

Â coupling in terms of the preference for certain combinations of values.

Â In general, the stronger the influence, the harder it is for both categories of

Â oh of algorithms because it creates strong coupling between variables.

Â Which can complicate both message passing algorithms as well as sampling

Â algorithms, because for example if you think about plain gib sampling it makes

Â it very difficult to move away from the current configuration to a different one.

Â Strength of influence becomes an even worse problem when the influences go in

Â different directions. So for example, if you have loops where

Â one path is tending to make these variables take on one configuration.

Â And a different path is trying to make them take on a different comm combination

Â of configurations. So, for example, as we saw in the

Â misconception that where one path wants a pair of variables to agree on their

Â values and the other wants them to disagree on their values.

Â That really does create significant problems for both classes of methods.

Â And the reason for that, and this I think is at the heart of what makes approximate

Â inference hard, is when you think about the shape of the likelihood function, if

Â you have multiple peaks in the likelihood function that makes life difficult for

Â most approximate algorithms and the sharper these peaks the more complicated

Â things get. And so that is and, and multiple peaks

Â are generated by strong influences that go in opposite direction.

Â And so that's really where a lot of the complexity and approximate inference

Â comes from. Okay, so now what assuming you have a

Â model that has these problematic these problematic issues.

Â Well so how do we address them? First, is to look at the network and

Â identify the problem regions, that is subset of variables that are tightly

Â coupled and maybe are subject to opposing influences.

Â And then we try and think of how we might make the inference in these problem

Â regions. More exact, so if here we have some

Â tightly coupled region with maybe opposing influences. How do we prevent

Â our approximate inference algorithm from Falling into the trap of, of, of this

Â particular area being giving imprecise answers or, or lack of convergence?

Â So for example, for doing cluster graph methods we can put this entire problem

Â region in a cluster. It costs us something on the

Â computational side because we have to deal with larger clusters.

Â But it might be well worth it in terms of the improvement of performance.

Â In sampling methods, you might consider having proposal moves over multiple

Â variables. So, instead of sampling individual

Â variables, we can sample this entire block.

Â Using again, a somewhat more expensive sampling procedure, but again that might

Â end up being well worth while in terms of the overall performance profile of the

Â algorithm. And finally, if we're thinking of this in

Â the context of a math problem, we can put this entire set of variables in a single

Â slave, which again, costs us something, on the computational side, but can

Â significantly improve the convergence profile of the algorithm.

Â So really when we're faced with a problematic model, one that doesn't

Â immediately succumb to traditional inference techniques, it turns out that

Â the design of inference is often a bit of an art.

Â That is we need to study our model in depth, think about how to deal with

Â different pieces of it, and which inference method is best equipped to

Â handle the different pieces. And we often find that complicated

Â models. You can get the best performance by a

Â combination of different inference algorithms, where some pieces are handled

Â using exact inference, such as these larger plots, in the context perhaps of

Â an approximate inference method such as sampling or belief propagation.

Â And so one needs to think creatively about the inference problem in the

Â context of these more problematic models.

Â