There are many classes of models that allow us to represent, in a single concise representation, a template over richly structured models that incorporate multiple copies of the same variable, and that also allow us to represent multiple models as a byproduct of a single representation. But one of the most commonly used among those is for reasoning about template models where we have a system that evolves over time. So, let's look at what we have: we want to represent a distribution over template trajectories. The first thing we want to do when representing a distribution over continuous time is, in most cases, not always, to try and forget that time is actually continuous, because continuous quantities are harder to deal with. So, we're going to discretize time. And specifically, we're going to do that by picking a particular time granularity delta, which is the granularity at which we're going to measure time. Now, in many cases, this is something that is already given to us by the granularity of our sensor. For example, if we have a video or a robot, there is a certain time granularity at which we obtain measurements, and so that's usually the granularity that we'll pick. But in other cases, we might want to have a different granularity, so there is a choice here. So, here's our time granularity, for example. And now, we have a set of template random variables, X of t. X of t denotes the value of a particular variable X, where X is a template variable and X of t is a particular instantiation of that variable at the time point t delta, so that we have multiple copies, one for each time point. Now, here's some notation that we're going to end up using later on, so let's introduce it: X of t denotes the variable X at time t, and X(t:t') denotes the set of variables between t and t'. So, a discrete set, in this case, because we've discretized time.
So, a finite set of random variables that spans these two time points, inclusive. Now, our goal is to have a concise representation that allows us to represent this probability distribution over a trajectory of the system of any duration. So, we want to start at a particular time point, usually zero, and then ask: what is the probability distribution over trajectories of arbitrary length? So, how do we represent what is, first of all, an infinite family of probability distributions, because you could look at trajectories of duration 2, 5, 10, a million? That's an infinite family of distributions. And each of these is a distribution over an unbounded number of random variables, because if you have a distribution over a trajectory of length a million, you have to represent a million-dimensional distribution. So, how do we compactify that? How do we make that a much more concise representation? There are different pieces to this. The first of those is what's typically called the Markov assumption. And the Markov assumption is effectively a type of conditional independence assumption. So, it's the same building block that we used to compactify general-purpose graphical models; we're going to use it here in the context of time-course data. So here, we're writing the probability of the set of variables spanning the time from zero all the way to capital T. I haven't made any assumptions yet in this statement; I'm just writing it down and re-expressing it in terms of the chain rule for probabilities. There is no chain rule for Bayesian networks here yet. This is just the chain rule for probabilities.
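As a small illustration of this notation, here is a sketch of how the instantiated copies of a template variable might be enumerated. The `DELTA` value and the string form `X^t` are purely illustrative choices, not anything fixed by the lecture.

```python
# Sketch of the notation: a template variable X instantiated at each
# discrete time step t (time granularity DELTA), with X(t:t') denoting
# the finite set of instantiations between t and t', inclusive.
DELTA = 0.1  # e.g. a sensor measurement period in seconds (illustrative)

def instantiate(name, t):
    """The copy of template variable `name` at real time t * DELTA."""
    return f"{name}^{t}"

def var_range(name, t, t_prime):
    """X(t:t'): all copies of X from t through t', inclusive."""
    return [instantiate(name, s) for s in range(t, t_prime + 1)]

# var_range("L", 0, 3) -> ["L^0", "L^1", "L^2", "L^3"]
```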
And the chain rule for probabilities in this context says that this probability is equal to the probability of X at time zero, times the probability of each consecutive time point t + 1, that is, the state at t + 1, given the state at all previous time points, zero up to t. So, this is not in any way an assumption. This is just a way of re-expressing this probability distribution in the way that time flows forward. You can represent any probability distribution over these random variables in this way. But now, we're going to add this assumption. And this is an assumption; it's an independence assumption. This independence assumption tells me that X of t + 1, that is, the state at time t + 1, the next step, is independent of the past, given the present. So, this is a forgetting assumption: once you know the current state, you don't care anymore about your past, okay? If you make that assumption, we can now go back to this chain rule over here and simplify it. Whereas before, we conditioned on X from time zero to time t, now everything up to t - 1 is conditionally independent of X of t + 1 given X of t, which means that I've allowed myself to keep X of t as the only thing that I'm conditioning on in order to determine the probability distribution of X at t + 1. So, to what extent is this assumption warranted? Is this true? Let's take as an example X equals the location or pose of a robot or an object that's moving. Is it the case that the location of the robot at t + 1, that is, L of t + 1, is independent of, say, L of t - 1, to simplify our lives, given L of t? Is this a reasonable assumption? Well, in most cases, probably not. And the reason is that it completely ignores the issue of velocity: which direction are you moving, and how fast?
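The forgetting assumption above can be sketched in a few lines of code: the sampler for the next state takes only the current state as input, so the joint over a trajectory factorizes as P(X0) times the product of P(X(t+1) | X(t)). The two states and all the probabilities here are made up purely for illustration.

```python
import random

# P(X' | X) as a table over two illustrative states (invented numbers).
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_next(state, rng):
    """Sample X(t+1) given ONLY X(t) -- the past is forgotten."""
    r, cum = rng.random(), 0.0
    for nxt, p in transition[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point round-off

def sample_trajectory(x0, T, seed=0):
    """Roll the chain forward T steps from initial state x0."""
    rng = random.Random(seed)
    traj = [x0]
    for _ in range(T):
        # conditions only on traj[-1]: the Markov assumption
        traj.append(sample_next(traj[-1], rng))
    return traj
```

Note that the function signature itself encodes the assumption: nothing before `traj[-1]` is ever consulted.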
And so, this is a classical example of where the Markov assumption for this particular model is probably too strong an assumption. So, what do we do to fix it? One way to fix it is to enrich the state description, so as to make the Markov assumption a better approximation. Just like any independence assumption, the Markov assumption is always going to be an approximation, but the question is how good of an approximation, and if we add, for example, V of t, which is the velocity at time t, maybe the acceleration at time t, maybe the robot's intent, where its goal is, all sorts of additional stuff into the state, then at that point the Markov assumption becomes much more warranted, okay? And so that's one way of making the Markov assumption true. An alternative strategy, which we're not going to talk about right now, is to move away from the Markov assumption by adding dependencies that go further back in time. That's called a semi-Markov model, and we're not going to talk about that right now. The second big assumption that we're going to have to make in order to simplify the model deals with the question of, well, fine, so we've reduced the model to encoding a probability of X of t + 1 given X of t, but that's still an unbounded number of conditional probabilities. Now, at least each of them is compact, but there's still a separate probabilistic model for every t. And this is where we're going to end up with a template-based model. We're going to stipulate that there is a single probabilistic model, P of X prime given X, where X prime denotes the next time point and X denotes the current time point. And we're going to assume that that model is replicated for every single time point. That is, when you're moving from time zero to time one, you use this model; when you're moving from time one to time two, you also use this model. And that assumption, for obvious reasons, is called time invariance.
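Time invariance can be sketched by scoring a whole trajectory with one shared transition table that is applied at every step, regardless of t. The states and all the numbers are invented for illustration.

```python
# One transition table P(X' | X) scores EVERY step: time invariance.
# All probabilities below are illustrative only.
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
initial = {"sunny": 0.5, "rainy": 0.5}  # P(X0)

def trajectory_prob(traj):
    """P(X0) * prod_t P(X(t+1) | X(t)), with the SAME table at every t."""
    p = initial[traj[0]]
    for t in range(len(traj) - 1):
        # no dependence on t itself, only on the two states involved
        p *= transition[traj[t]][traj[t + 1]]
    return p

# trajectory_prob(["sunny", "sunny", "rainy"]) == 0.5 * 0.8 * 0.2
```

A time-variant model would instead need a different table for each t, which is exactly the unbounded family of conditional distributions the assumption avoids.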
Because it assumes that the dynamics of the system, not the actual position of the robot, but rather the dynamics that move it from state to state, don't depend on the current time point t. And once again, this is an assumption, and it's an assumption that's warranted in certain cases and not in others. So, let's imagine that this now represents the traffic on some road. Well, do the dynamics of that traffic depend on, say, the current time point of the system? On most roads, the answer is probably yes. It might depend on the time of day, on the day of the week, on whether there is a big football match, on all sorts of things that might affect the dynamics of traffic. The point being that, just like in the previous example, we can correct inaccuracies in our assumption by enriching the model. So, once again, we can enrich the model by including these variables in it, and once we have that, the model again becomes a much better reflection of reality. So now, how do we represent this probabilistic model in the context of a graphical model like we had before? Let's now assume that our state description is composed of a set of random variables. We have a little baby traffic system where we have the weather at the current time point, the location of, say, a vehicle, and the velocity of the vehicle. We also have a sensor, whose observation we get at each of those time points, and the sensor may or may not be failing at the current time point. And what we've done here is we've encoded the probabilistic model of this next state. So, W prime, V prime, L prime, F prime, and O prime, given the previous state: given W, V, L, and F. Why is O not here on the right-hand side? It's not here on the right-hand side because it doesn't affect any of the next-state variables. It would be kind of hanging down here if we included it.
But since it doesn't affect anything, we don't choose to represent it. So, this model represents a conditional distribution. Now, we have a little network fragment. And it doesn't represent a joint distribution; it represents a conditional distribution: the conditional distribution of time t + 1 given time t. But in order to represent that, we still use the same tools that we have in the context of standard graphical models. And so, we can write that as the same kind of chain rule that we used before. So, this would be the probability of W prime given W, based on this edge over here, times the probability of V prime, the velocity. The first factor says that the weather at time t + 1 depends on the weather at time t. The second says that the velocity at time t + 1 depends on the weather at time t and the velocity at time t, which indicates a certain persistence in the velocity, as well as the fact that, you know, if it's raining you might slip sideways, so the velocity might change; also, if you're careful, you might slow down if it's raining. So again, there might be an effect of the weather on the velocity. Then there's the probability of the location at time t + 1 given the location at time t and the velocity at time t; and the probability of a sensor failure at time t + 1 given the failure at the previous time and the weather, which indicates that, once the sensor has failed, it's probably more likely to stay failed, but maybe rain can make the sensor behave badly. And then, finally, the probability of the observation at time t + 1 given the location at time t + 1 and the failure at time t + 1. So, there are several important things to note about this diagram that are worth highlighting. First of all, we have dependencies both within and across time. So here, we have a dependency that goes from t to t + 1.
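The chain rule for this fragment can be sketched as a forward sampler that draws each primed variable from its parents in exactly the order of the factorization P(W'|W) P(V'|W,V) P(L'|L,V) P(F'|F,W) P(O'|L',F'). Every variable is binarized here and every CPD number is invented purely to show the structure; none of them come from the lecture.

```python
import random

rng = random.Random(42)

def bern(p):
    """Flip a biased coin with probability p of True."""
    return rng.random() < p

def sample_slice(W, V, L, F):
    """One 2TBN transition for the toy traffic system (made-up CPDs)."""
    Wp = bern(0.9 if W else 0.2)                        # P(W' | W): weather persists
    Vp = bern((0.4 if W else 0.8) if V else 0.2)        # P(V' | W, V): rain may slow you
    Lp = bern(0.9 if (L or V) else 0.1)                 # P(L' | L, V): location follows velocity
    Fp = bern(0.95 if F else (0.1 if W else 0.01))      # P(F' | F, W): failures persist
    Op = bern(0.05 if Fp else (0.9 if Lp else 0.2))     # P(O' | L', F'): intra-slice edge
    return Wp, Vp, Lp, Fp, Op
```

Notice that `Op` conditions on the already-sampled `Lp` and `Fp`, mirroring the within-slice edges into the observation, while every other variable conditions only on unprimed, time t inputs.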
And here, we have a dependency that is within t + 1 alone. What induces us to make a modeling choice like this go one way versus the other? The assumption here is that this is a fairly immediate dependency, so that the observation is relatively instantaneous compared to our time granularity. And so, we don't want that edge to go across time, but rather we want it to be within a time slice, because that's a better reflection of which variable actually influences the observation: is it the current location or the previous location? So, let's give these kinds of edges names. These are called intra-time-slice edges, and these are called inter-time-slice (or between-time-slice) edges. And the model can include a combination of both of these. Another particular type of inter-time-slice edge that's worth highlighting specifically are edges that go from a variable at one time point to the value of that same variable at the next time point. These are often called persistence edges, because they indicate the tendency of a variable to persist in its state from one time point to another. Finally, let's go back and look at the parameterization that we have in this model. What CPDs did we actually need to include? We can see that we have CPDs for the variables on the right-hand side, the prime variables, but there are no CPDs for the variables that are unprimed, the variables on the left. And this is because the model doesn't actually try to represent the distribution over W, V, L, and F. It doesn't try to do that. It tries to represent the probability of the next time slice given the previous one. So, as we can see, this graphical model only has CPDs for a subset of the variables in it: the ones that represent the next time point. So, that represents the transition dynamics. If we want to represent the probability distribution over an entire system, we also need to provide a distribution over the initial state.
And this is just a standard generic Bayesian network which represents the probability over the state at time zero, using some appropriate chain rule. So, nothing very fancy here. With those two pieces, we can now represent probability distributions over arbitrarily long trajectories. We represent this by taking, for time slice zero, a copy of the time-zero Bayesian network, which represents the probability distribution over the time-zero variables. And now, we have a bunch of copies that represent the probability distribution at time one given time zero. And here, we have another copy of exactly the same set of parameters that represents time two given time one. And we can continue copying this indefinitely; each copy gives us the probability distribution of the next time slice given the one that we just had, and so we can construct an arbitrarily long Bayesian network. To make this definition slightly more formal, we define the notion of a two-time-slice Bayesian network, also known as a 2TBN. A 2TBN over a set of template variables X1 up to Xn is specified as a Bayesian network fragment along exactly the same lines that we used in the example. The nodes are of two kinds: the next-time-step variables, X1 prime up to Xn prime, and some subset of X1 up to Xn, the time t variables, that directly affect the state at time t + 1, okay? And because we want this to represent a conditional probability distribution, only the time t + 1 nodes have parents and a CPD, because we don't really want to model the distribution over the variables at time t. And the 2TBN defines a conditional distribution using the chain rule; it looks exactly like the standard chain rule. So, the probability of X prime given X is the product over each variable at time t + 1, so only the prime variables, conditioned on its parents, which may be at time t + 1, at time t, or a combination of both.
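The unrolling step above can be sketched generically: sample the time-zero Bayesian network once, then apply the same 2TBN transition sampler at every subsequent step. `sample_initial` and `sample_transition` are hypothetical stand-ins for the two pieces a DBN is built from.

```python
# Minimal sketch of unrolling a DBN into a ground network of length T.
def unroll(sample_initial, sample_transition, T):
    state = sample_initial()            # one copy of the time-zero network
    trajectory = [state]
    for _ in range(T):
        # the identical 2TBN fragment is copied for every time slice
        state = sample_transition(state)
        trajectory.append(state)
    return trajectory

# Toy usage with deterministic "dynamics" just to show the shape:
# unroll(lambda: 0, lambda s: s + 1, 3) -> [0, 1, 2, 3]
```

The point of the sketch is that `T` is a free parameter: the same two pieces yield a trajectory of any length, which is exactly what makes the representation concise.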
A dynamic Bayesian network is now basically defined by a 2TBN, which we just defined, and a Bayesian network over time zero. So, this is the dynamics, and this is the initial state. And we can use that to define probability distributions over arbitrarily long trajectories using what's called the unrolled network, also called the ground network. And this is exactly as in the example that I showed: the dependency model for time zero is copied from the Bayes net for time zero, and the transition model, the conditional probabilities for the transitions, is copied from the 2TBN. So, before we conclude this lecture, let's look at an example of a dynamic Bayesian network that is more realistic than the simple examples that we've shown before. This is a network that was actually designed for tracking vehicles in a traffic situation. We can see that there are multiple variables here that represent both the position and velocity of the vehicle in an absolute sense, for example, Xdot and Ydot are the velocities, as well as various more semantic notions of location, like whether you're in the lane. There are contextual variables such as left-clear and right-clear, the engine status, for example, as well as what the driver is currently doing, for example the forward action and the lateral action. We can see that there are persistence edges that denote the persistence of various aspects of the state from time t to time t + 1, as well as a variety of these intermediate variables over here that allow us to represent the probability distribution in a more compact way by incorporating variables that do not persist, or at least in this simplified model do not persist. And finally, we see that there are a large number of sensor observations, such as the turn signal, whether the car is clear on the right and on the left, or appears to be clear on the right and left, and so on.
So, this is a much more realistic model of how traffic evolves than the simplified one that we saw before. To summarize, dynamic Bayesian networks provide us with a language for encoding structured distributions over time. And by making the assumptions of Markovian evolution as well as time invariance, we can use a single compact network to encode distributions over arbitrarily long time sequences. But when these assumptions, the Markov assumption and the time invariance assumption, are not quite correct, we might need to redesign the model so as to make them a better approximation to the true underlying distribution, by, for example, adding variables as we showed.