So over the course of the last few lectures, we've talked about a number of different inference strategies and how they might be applied to a range of different graphical models. One question that comes up is whether the same strategies can be used for the kinds of template-based models that we defined earlier in the course, models that are defined via, say, dynamic Bayesian networks or via plate models, or one of the other representations that uses repeated structure. So, as we'll see, the answer to that is both yes and no. That is, the same strategies can be applied, but there are also some interesting new challenges that arise from this. So, first, let's remind ourselves of how, for example, a dynamic Bayesian network is specified. As a reminder, a dynamic Bayesian network is specified using an initial time 0 distribution, which is usually a Bayesian network, as well as the 2-TBN that tells us the transition model. And it's certainly the case that one can take those two pieces and construct an unrolled or ground Bayesian network. And once we've done that, that ground Bayesian network is a Bayesian network just like any other. And so whatever inference techniques we have developed for answering queries over any Bayesian network can be used in this context as well. So, specifically, once we unroll the DBN for a given trajectory length, we can run our favorite inference algorithm over the ground network to answer questions such as: what is the location of the car at time 2, given whatever evidence we might like to give it, so observations up to time 2, or observations for the entire duration of the sequence. The same thing happens when we look at plate models. If this is the simple plate model that we've used for our university example, where we have students and courses and grades, then we can unroll that network to produce the ground network over a subset of students and courses, in this case a very simple example with two courses and two students.
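To make the grounding concrete, here is a minimal sketch of the unrolled university network for two students and two courses, queried by brute-force enumeration. All the modeling choices and CPD numbers here are hypothetical simplifications (binary difficulty, intelligence, and grades), not the CPDs used in the course.

```python
import itertools

# A minimal sketch of unrolling the student/course plate model into a
# ground Bayesian network and querying it by brute-force enumeration.
# Two students, two courses; all variables are binary and all CPD
# numbers here are hypothetical.

p_d = [0.6, 0.4]          # P(D(c)): course easy (0) / hard (1)
p_i = [0.7, 0.3]          # P(I(s)): intelligence low (0) / high (1)
p_g = [[0.5, 0.1],        # p_g[i][d] = P(G(s,c) = good | I(s)=i, D(c)=d)
       [0.9, 0.6]]

students = ["s0", "s1"]
courses = ["c0", "c1"]
observed = {("s0", "c0"): 1, ("s1", "c0"): 0}   # 1 = good grade, 0 = bad

def posterior_intelligence(s_query):
    """P(I(s_query) = high | observed grades), by summing the ground joint."""
    scores = [0.0, 0.0]
    for d in itertools.product([0, 1], repeat=len(courses)):
        for i in itertools.product([0, 1], repeat=len(students)):
            p = 1.0
            for c_idx in range(len(courses)):
                p *= p_d[d[c_idx]]
            for s_idx in range(len(students)):
                p *= p_i[i[s_idx]]
            for (s, c), g in observed.items():
                pg = p_g[i[students.index(s)]][d[courses.index(c)]]
                p *= pg if g == 1 else 1.0 - pg
            scores[i[students.index(s_query)]] += p
    return scores[1] / sum(scores)

print(posterior_intelligence("s0"))   # pulled above the 0.3 prior by a good grade
print(posterior_intelligence("s1"))   # pulled below the 0.3 prior by a bad grade
```

Each grade variable in the ground network has its student's intelligence and its course's difficulty as parents, which is exactly the repeated structure the plate model specifies.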
And that unrolled network is, once again, a Bayesian network just like any other. And so we can run inference over that network to answer questions such as: what is our predicted distribution over the intelligence of a given student, given some information about grades? What are some of the challenges, though? So part of the difficulty, or the new dimensions that are offered by some of these models, is the fact that they give rise to new inference tasks. This is particularly true in the context of temporal models, where we might often be interested in what you might call belief state tracking, that is, keeping track of the state of the system as it evolves. Now in some ways, this is just a traditional probabilistic inference task, because it corresponds to asking what is our probability distribution over the state at time t, given the observations that the agent has had access to up to time t. So this over here, what we see as o(1:t), is just a shorthand for o(1), o(2), up to o(t). And so this is an inference task that can be answered using standard probabilistic inference techniques over the unrolled network. But there is a challenge that needs to be addressed if we want to keep track of this probability distribution over the course of a long trajectory without having a potentially unboundedly large network that we have to keep running inference over. So it turns out, fortunately, that one can do this in a dynamic way without keeping track of the huge unrolled network at all points in time. And this comes directly out as a consequence of the Markov properties of the graphical model. So, specifically, we're going to do this as a two-stage process, where the first thing we're going to compute is this sigma(dot t + 1). And the position of the dot indicates that although it's a distribution over the state at time t + 1, it's a distribution that doesn't take into account the time t + 1 observation.
And so this is defined to be, as written here, the probability of the state at time t + 1, given the observations up to time t. And so now we can do fairly straightforward manipulations of this probabilistic expression. So the first thing is we introduce S(t) on the right-hand side of the conditioning bar and sum out over it, so this is now the probability of S(t + 1), given S(t) and the observations up to time t, times the probability of S(t) given the observations up to time t. And now we can apply independencies that arise from the structure of the graphical model, specifically the fact that the state at time t + 1 is independent of everything that occurred before, given the state at time t. This allows us to use conditional independence to simplify the first term to just the probability of S(t + 1) given S(t), which is just our transition model. And the second term, which is the probability of S(t) given o(1:t), is just the previously computed belief state, as it's called, the sigma(t) that we are trying to keep track of. And so this allows us to take sigma(t) and produce sigma(dot t + 1). The second step is now to account for the observation model at time t + 1. So, let's look at what that would involve. So, now, we have sigma(dot t + 1), which we defined on the previous slide, and we want to condition it on the new piece of evidence, the observation at time t + 1, in order to derive the belief state sigma(t + 1). And so here we're just going to take that definition, the probability of S(t + 1) given all of the observations, the ones up through t and the new one, and we're going to apply Bayes' rule to reformulate it. That gives us the probability of o(t + 1), given S(t + 1) and the previous observations, times the probability of S(t + 1), given o(1:t), divided by the probability of o(t + 1), given o(1:t).
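The full two-step recursion can be sketched in code. Here is a minimal NumPy sketch for a small discrete state space; the transition matrix T, the observation matrix O, and the observation sequence are all hypothetical numbers chosen for illustration, not part of the lecture's model.

```python
import numpy as np

# A minimal sketch of belief state tracking for a two-state system.
# All numbers below are hypothetical, chosen for illustration.

T = np.array([[0.9, 0.1],   # T[s, s2] = P(S(t+1)=s2 | S(t)=s)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],   # O[s, o]  = P(o(t)=o | S(t)=s)
              [0.1, 0.9]])

def predict(sigma_t):
    """Step 1: sigma(dot t+1)[s2] = sum over s of P(s2 | s) * sigma(t)[s]."""
    return T.T @ sigma_t

def condition(sigma_dot, o):
    """Step 2: multiply in P(o(t+1) | S(t+1)), then renormalize.
    The denominator P(o(t+1) | o(1:t)) is just the normalizing constant."""
    unnormalized = O[:, o] * sigma_dot
    return unnormalized / unnormalized.sum()

sigma = np.array([0.5, 0.5])        # uniform initial belief state
for o in [0, 0, 1]:                 # hypothetical observation sequence
    sigma = condition(predict(sigma), o)
print(sigma)                        # current belief state; sums to 1
```

Note that only the current sigma is stored between iterations; the unrolled network never has to be materialized.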
This is a straightforward application of Bayes' rule, and once again, we can now examine each of the terms in this expression and see what it corresponds to. Looking first at the first term in the numerator, we can see that this is the probability of the observation at time t + 1 given the state at time t + 1 and a bunch of things that happened in the past. And once again, conditional independence tells us that the stuff that happened in the past is irrelevant and can be removed. The second term is the one that we just computed: it's sigma(dot t + 1). And so, that gives us this expression over here, which is very easy to deal with, because each term in the numerator is either something that is part of our model, the observation model in this case, or something that we've computed before, sigma(dot t + 1). Now, you might wonder about how we get this somewhat scary-looking denominator. But the important thing to remember is that, as in previous examples we've seen, this denominator is simply a normalizing constant, which can be derived by computing the numerator and then normalizing. And so, it's not really scary at all. So going back to an example that we've seen before, let's look at the example of robot localization. This is a model that we've already discussed, so we're not going to describe it again. But now let's see how it might manifest in the context of an actual belief state tracking problem. So, here is a situation where this model might be applied; let's first explain what we're seeing before we see a demo. What you see in the green circle is the robot's true position, which is not known to the robot, because the robot is trying to localize itself. These blue lines are the observed sonar readings that the robot gets at each point in the process. And you can see that the returned lengths of the sonar readings correspond roughly to the distance of the robot from the wall.
But there is some noise here. So for example, sometimes, for whatever reason, the sonar appears to go through the wall, even though it should have given us a reading that is considerably closer, because it should have been reflected from the surface. So sonars are quite noisy, and the visualization that we see reflects that. The red dots that you see are a visualization of the robot's beliefs over where it is. In the initial stage this is a uniform probability distribution, because the robot doesn't know where it is. And as we'll see, this gets clumpier and clumpier over time as the robot's belief state over where it's likely to be becomes more and more definitive. So let's look at what this looks like over time. What we're going to see is the robot's beliefs as it gets more and more observations. We can see that the distribution becomes clumpier and clumpier, and pretty soon the robot is sure that it's in the hallway. But because of symmetries, it's not sure which side of the hallway it's on, so we have these two clumps representing the two modes of the distribution. But now the robot's going to walk into this room. And notice that this room over here is different from the room at the bottom. So right now, the robot has localized itself with certainty, or close to certainty, because only one position is consistent with the observations that it's received over the course of its trajectory. The second challenge associated with these template-based models is computational issues. So sure, we can produce an unrolled Bayesian network and compute a posterior over any subset of the variables using standard inference techniques. But one of the consequences of this template-based construction is that we can produce very large probabilistic graphical models from a very small template.
And large probabilistic graphical models can pose new inference challenges in terms of our ability to scale inference to models of that size. So, specifically, let's look at the ground model that arises from an unrolled DBN, and think back to some of the analysis that we did regarding the complexity of probabilistic inference for a particular probabilistic graphical model. Say we want to run exact inference, for example a clique tree algorithm, over this unrolled network, and we want a clique tree where, say, the time 0 variables are in one clique and the time t variables for some future t are in some other clique, so that they're not all together in one big clique. Then the sepset between those cliques needs to separate these variables over here, the blue variables, the time 0 variables, from the green variables. And if you think about what separation imposes on us in this setting, we can see that the only way to separate, for example, these blue variables over here from the green variables over here is to put into the sepset at least the persistent variables, that is, the ones where there is an edge from the variable at time t to its incarnation at time t + 1. And so the minimal sepset in this context, the smallest sepset that we can construct that would allow us to have different cliques for, say, time 0 and time 2, is something that involves all of these persistent variables in the middle. And since the cost of exact inference is exponential in the size of such sepsets, that can involve a significant computational cost, especially when we have a large number of persistent variables. A different way of understanding this is via the concept of entanglement, if we're thinking of belief state tracking.
So if our goal is to maintain a belief state over, say, the variables at time 3, and we're trying to think about how we can maintain this probability distribution, the sigma at time 3, in a way that doesn't involve maintaining a full explicit joint distribution over the time 3 variables, we quickly realize that, really, we have no choice, because there are no conditional independencies within time 3. Now you might say, how is that? This is a nicely structured model; why wouldn't there be conditional independencies? Well, look at what's going on here. Can we say that weather at time 3 is conditionally independent of failure at time 3? Well, locally, they're not connected to each other within the time slice, but there's certainly an active trail between them that goes, for example, like this: from weather at time 3, to weather at time 2, to weather at time 1, to failure at time 2, to failure at time 3. And so, this is an active trail between these two variables, which means that, when you consider only the variables at time 3, they are not conditionally independent of each other given any of the other time 3 variables. And so, this entanglement process occurs very rapidly over the course of tracking a belief state in a dynamic Bayesian network, which eventually means that the belief state, if we want to maintain it exactly, is just fully correlated, in most cases. And so, this is a computational consequence of the fact that we have a very large Bayesian network. And if you want to think about just how large it can get, we can go back to the real vehicle tracking example that we talked about in a previous lecture, and note that all of these variables over here are persistent variables. And so, belief state tracking, if it were to be done exactly, would have to maintain a full joint distribution over all of these variables, which is likely to be intractable, certainly if we want real-time performance, and probably even without that constraint.
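The entanglement effect can be checked numerically. Here is a minimal sketch with two binary persistent variables, weather W and failure F, where F(t+1) depends on both F(t) and W(t); all the CPD numbers are hypothetical. Propagating the exact joint belief forward from an independent starting point, the time 3 joint no longer factorizes into a product of its marginals:

```python
import numpy as np

# Sketch of entanglement in a tiny DBN with two binary persistent
# variables, weather W and failure F, where F(t+1) depends on both
# F(t) and W(t). All CPD numbers below are hypothetical.

pW = np.array([[0.8, 0.2],            # pW[w, w2] = P(W(t+1)=w2 | W(t)=w)
               [0.3, 0.7]])
pF = np.zeros((2, 2, 2))              # pF[w, f, f2] = P(F(t+1)=f2 | F(t)=f, W(t)=w)
pF[0] = [[0.9, 0.1], [0.4, 0.6]]      # good weather: failures tend to clear
pF[1] = [[0.6, 0.4], [0.1, 0.9]]      # bad weather: failures tend to persist

# Start with W and F independent: a product of uniform marginals.
joint = np.full((2, 2), 0.25)         # joint[w, f] = P(W=w, F=f)

for _ in range(3):                    # propagate the exact joint to time 3
    new = np.zeros((2, 2))
    for w in range(2):
        for f in range(2):
            for w2 in range(2):
                for f2 in range(2):
                    new[w2, f2] += joint[w, f] * pW[w, w2] * pF[w, f, f2]
    joint = new

marg_w = joint.sum(axis=1)
marg_f = joint.sum(axis=0)
product = np.outer(marg_w, marg_f)
print(np.abs(joint - product).max())  # > 0: W(3) and F(3) are correlated
```

Even though W(3) and F(3) are not directly connected within the time slice, the trail through earlier time slices couples them, so the exact belief state cannot be stored as a product of per-variable marginals.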
And so, there has been a lot of work on approximate methods in the context of, say, temporal models, because of the computational issues that they raise. Computational issues like these also occur in plate models. So here is the plate model unrolled for our student example, with the course difficulty variables over here, the intelligence variables on the right, and the grades in the middle. And the grades are all observed, so I've not put in the actual values, because it would just clutter things up. And we've talked about the computational complexity of this bipartite MRF, which is what we really have here, where the variables on the left are connected to the variables on the right. And in general, assuming that the grades are fairly dense, we mentioned before that for exact inference, if you have m variables on the left and n variables on the right, the lowest cost that we can expect is exponential in the minimum of m and n, so the minimum of the number of courses and the number of students. And again, you can come up with special cases where the cost is going to be lower than that, but in general, for a bipartite MRF like this, that's what you're going to get. And assuming that the university has a reasonably high number of courses and presumably even more students, this can become intractable very quickly, giving rise to the need for approximate inference methods. That said, approximate inference methods can be very successful. So let's just conclude with one more example. Here is an example of an undirected template model where we have repeated structure. The task is to classify web pages into categories. This is a project called WebKB that was constructed by Tom Mitchell and his group at Carnegie Mellon University, and what we see here is that we'd like to categorize web pages as corresponding to students, faculty, courses, and so on.
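Before looking at the WebKB results, the exponential-in-min(m, n) point above can be checked with a small simulation of variable elimination on the graph: eliminating a node creates a factor over the node and its current neighbors, and then connects those neighbors. The graph construction and orderings below are illustrative, not from the lecture.

```python
# Sketch of elimination cost in a dense bipartite MRF. For a complete
# bipartite graph K_{m,n}, eliminating the smaller side first keeps the
# largest factor at min(m, n) + 1 variables. Sizes here are illustrative.

def max_factor_size(adj, order):
    """Eliminate nodes in `order`; return the largest factor scope seen."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    largest = 0
    for v in order:
        nbrs = adj.pop(v)
        largest = max(largest, len(nbrs) + 1)   # factor over v and neighbors
        for u in nbrs:                          # fill-in: connect neighbors
            adj[u] |= nbrs - {u}
            adj[u].discard(v)
    return largest

m, n = 3, 5                                     # e.g. 3 courses, 5 students
left = [f"c{i}" for i in range(m)]
right = [f"s{j}" for j in range(n)]
adj = {v: set(right) for v in left}
adj.update({v: set(left) for v in right})

print(max_factor_size(adj, left + right))       # courses first: n + 1 = 6
print(max_factor_size(adj, right + left))       # students first: m + 1 = 4
```

Whichever side is eliminated first determines the largest factor, so the best any ordering can do on this dense bipartite structure is a factor over min(m, n) + 1 variables, which is where the exponential min(m, n) cost comes from.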
And while one can use standard machine learning techniques to categorize web pages individually, say, by the words that they contain, it turns out one can get a significant improvement in performance by considering the connections between the web pages. So specifically, by considering edges, in this case undirected edges, between the categories of pages that are linked to each other. So, for example, we see, and this is a model that was learned from data, that students are quite likely to point to faculty, and faculty are somewhat less likely to point to students. And faculty are even less likely to point to other faculty. And so that gives us some information about the correlations between labels of web pages that are linked to each other, and the improvements in performance that one gets are actually quite considerable: for example, a reduction in error from 18% to a little over 12%. To summarize, inference in template and temporal models can be done in principle by unrolling the model into a ground network and using standard inference methods. However, temporal models specifically raise new inference tasks, such as real-time belief state tracking, which requires a certain adaptation of our methods. And furthermore, a second source of complexity here is that the ground network that we get by unrolling these models is often very large and sometimes quite densely connected, requiring careful design of algorithmic methods and use of approximate methods that can allow us to deal with the increase in scale and complexity.