0:32

So, let's look at what we have: we represent a distribution over template trajectories. The first thing we want to do when representing a distribution over continuous time is, in most cases, though not always, to try and forget that time is actually continuous, because continuous quantities are harder to deal with. So, we're going to discretize time. And specifically, we're going to do that by picking a particular time granularity delta, which is the granularity at which we're going to measure time.
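To make the discretization concrete, here is a tiny sketch in Python; the granularity and timestamps are invented numbers, not anything from the lecture:

```python
# Discretizing time: pick a granularity delta and map each continuous
# timestamp onto a discrete index t, so that X^(t) refers to the state
# at time t * delta (e.g. a sensor reporting every 0.1 seconds).
def time_index(timestamp, delta):
    """Map a continuous timestamp to its nearest discrete time index."""
    return round(timestamp / delta)

print(time_index(0.3, 0.1))   # index 3
print(time_index(2.5, 0.5))   # index 5
```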

Now, in many cases, this is something that is already given to us by the granularity of our sensor. For example, when we have a video or a robot, there is a certain time granularity at which we obtain measurements, and that's usually the granularity that we'll pick. But in other cases, we might want a different granularity, so there is a choice here. So, here's our time granularity, for

example. And now, we have a set of template random variables, X(t), where X(t) denotes the value of a particular variable X, X being a template variable and X(t) being a particular instantiation of that variable at the time point t times delta, so that we have multiple copies, one for each time point. Now, here's some notation that we're going to end up using later on, so let's introduce it: X(t) denotes the variable X at time t, and X(t:t') denotes the set of variables between t and t'. Because we've discretized time, this is a discrete, in fact finite, set of random variables that spans these two time points, inclusive. Now, our goal is to have a concise representation that allows us to represent this probability distribution over a trajectory of the system of any duration.

So, we want to start at a particular time point, usually zero. And then, what is the probability distribution over trajectories of arbitrary length? So, how do we represent what is, first of all, an infinite family of probability distributions, because you could look at trajectories of duration 2, 5, 10, a million? So, that's an infinite family of distributions. And each of these is a distribution over an unbounded number of random variables, because if you have a distribution over a trajectory of length a million, you have to represent a million-dimensional distribution. So, how do we compactify that? How do we make that a much more concise representation?

So, there are different pieces to this. The first of those is what's typically called the Markov assumption, which is effectively a type of conditional independence assumption. So, it's the same building block that we used to compactify general-purpose graphical models; we're going to use it here in the context of time-course data. So here, we're looking at the probability of the set of variables spanning the time from zero all the way to capital T. I haven't made any assumptions yet in this statement, so I'm just writing this down.

I'm re-expressing it in terms of the chain rule for probabilities. This is not the chain rule for Bayesian networks yet; it's just the chain rule for probabilities. And the chain rule for probabilities in this context says that this probability is equal to the probability of X at time zero, times, for each consecutive time point, the probability of the state at t + 1 given the state at all previous time points, zero up to t. So, this is not in any way an assumption. This is just a way of re-expressing this probability distribution in the way that time flows forward.
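Written out in the X(t:t') notation introduced earlier, this chain-rule re-expression is:

```latex
P\bigl(X^{(0:T)}\bigr) \;=\; P\bigl(X^{(0)}\bigr)\,\prod_{t=0}^{T-1} P\bigl(X^{(t+1)} \,\bigm|\, X^{(0:t)}\bigr)
```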

Â 4:46

But it's not an assumption; you can represent any probability distribution over these random variables in this way. But now, we're going to add this assumption. And this is an assumption, an independence assumption. This independence assumption tells me that X at t + 1, that is, the state at time t + 1, the next step, is independent of the past given the present. So, this is a forgetting assumption: once you know the current state, you don't care anymore about your past, okay? If you do that, we can now go back to

this chain rule over here and simplify it. Because, whereas before we conditioned on X from time zero to time t, now everything up to t - 1 is conditionally independent of X at t + 1 given X at t, which means that I've allowed myself to make X at t the only thing that I'm conditioning on in order to determine the probability distribution of X at t + 1. So, to what extent is this assumption warranted? Is this true? Let's take as an example X equals the location, or pose, of a robot or an object that's moving. Is it the case that the location of the robot at t + 1, so L at t + 1, is independent of, say, L at t - 1, to simplify our lives, given L at t? Is this a reasonable assumption? Well, in most cases, probably not.
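Under the Markov assumption, the trajectory probability reduces to a product of one-step factors. Here is a minimal Python sketch with a binary state; all the numbers are invented for illustration:

```python
# Markov factorization: P(X^(0:T)) = P(X^0) * prod_t P(X^(t+1) | X^t).
p0 = {0: 0.8, 1: 0.2}                 # initial distribution P(X^0)
trans = {0: {0: 0.9, 1: 0.1},         # P(X^(t+1) | X^t = 0)
         1: {0: 0.3, 1: 0.7}}         # P(X^(t+1) | X^t = 1)

def trajectory_prob(states):
    """Probability of a trajectory when each state depends only on its
    immediate predecessor (the forgetting assumption)."""
    p = p0[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= trans[prev][nxt]
    return p

print(trajectory_prob([0, 0, 1]))  # 0.8 * 0.9 * 0.1
```

Notice that conditioning on the full history would instead require a table for every possible prefix, which is exactly what the Markov assumption lets us avoid.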

Â 8:38

a probability of X at t + 1 given X at t, but it's still an unbounded number of conditional probabilities. Now, at least each of them is compact, but there's still a probabilistic model for every t. And this is where we're going to end up with a template-based model. We're going to stipulate that there is a probabilistic model, P(X' | X), where X' denotes the next time point and X denotes the current time point. And we're going to assume that that model is replicated for every single time point. That is, when you're moving from time zero to time one, you use this model; when you're moving from time one to time two, you also use this model. And that assumption, for obvious reasons, is called time invariance, because it assumes that the dynamics of the system, not the actual position of the robot, but rather the dynamics that move it from state to state, don't depend on the current time point t. And once again, this is an assumption, and it's an assumption that's warranted in certain cases and not in others.

So, let's imagine that this represents now the traffic on some road. Well, do the dynamics of that traffic depend on, say, the current time point of the system? On most roads, the answer is probably yes. It might depend on the time of day, on the day of the week,

Â 10:17

on whether there is a big football match, on all sorts of things that might affect the dynamics of traffic. The point being that, just like in the previous example, we can correct inaccuracies in our assumption by enriching the model. So, once again, we can enrich the model by including these variables in it, and once we have that, the model again becomes a much better reflection of reality.
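Time invariance means one and the same transition model is applied at every step. Here is a small sketch; the weather states and probabilities are invented:

```python
import random

# Time invariance: the SAME one-step model P(W' | W) is reused at every
# time step, no matter which step it is.
transition = {"rain": {"rain": 0.7, "sun": 0.3},
              "sun":  {"rain": 0.2, "sun": 0.8}}

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def sample_trajectory(start, steps):
    """Roll the same transition model forward `steps` times."""
    traj = [start]
    for _ in range(steps):
        probs = transition[traj[-1]]
        traj.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return traj

print(sample_trajectory("sun", 5))
```

If the dynamics genuinely varied with, say, time of day, we could restore time invariance by adding time of day as an explicit variable in the state, as discussed above.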

So now, how do we represent this probabilistic model in the context of a graphical model like we had before? Let's now assume that our state description is composed of a set of random variables. So, we have a little baby traffic system where we have the weather at the current time point, the location of, say, a vehicle, and the velocity of that vehicle. We also have a sensor whose observation we get at each of those time points, and the sensor may or may not be failing at the current time point. And what we've done here is encode the probabilistic model of the next state: so, W prime, V prime, L prime, F prime, and O prime, given the previous state, that is, given W, V, L, and F. Why is O not here on the right-hand side? It's not here because it doesn't affect any of the next-state variables. It would be kind of hanging down here if we included it, but since it doesn't affect anything, we don't choose to represent it.

So, this model represents a conditional distribution. Now, we have a little network fragment, and it doesn't represent a joint distribution; it represents a conditional distribution, the conditional distribution of time t + 1 given time t. But in order to represent that, we still use the same tools that we have in the context of graphical models. And so, we can write it as the same kind of chain rule that we used before. So, this would be the probability of W prime given W, based on this edge over here, times the probability of V prime, the velocity, and so on. The first factor says that the weather at time t + 1 depends on the weather at time t. The second says that the velocity at time t + 1 depends on the weather at time t and the velocity at time t, which indicates a certain persistence in the velocity, as well as the fact that, you know, if it's raining, you might slip sideways, so the velocity might change. Also, if you're careful, you might slow down when it's raining, so again there might be an effect of the weather on the velocity.

Then we have the probability of the location at time t + 1 given the location at time t and the velocity at time t, and the probability of a sensor failure at time t + 1 given the failure at the previous time and the weather, which indicates that once the sensor has failed, it's probably more likely to stay failed, but maybe rain can make the sensor behave badly. And then, finally, the probability of the observation at time t + 1 given the location at time t + 1 and the failure at time t + 1. So, there are several important things to

Â note about this diagram that are worth highlighting.

First of all, we have dependencies both within and across time. So here, we have a dependency that goes from t to t + 1, and here we have a dependency that is within t + 1 alone. What induces us to make a modeling choice like this go one way versus the other? The assumption here is that this is a fairly [UNKNOWN] dependency, so that the observation is relatively instantaneous compared to our time granularity. And so, we don't want that dependency to go across time, but rather we want it to be within a time slice, because that's a better reflection of which variable it is that actually influences the observation: is it the current location or the previous location?
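To illustrate the fragment just described, here is a hypothetical one-step sampler following the same factor order: P(W'|W), P(V'|W,V), P(L'|L,V), P(F'|F,W), and the intra-time-slice factor P(O'|L',F'). All variables are crude booleans and every probability is invented:

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def bern(p):
    """Flip a biased coin: True with probability p."""
    return rng.random() < p

def step(w_rain, v_fast, l_far, f_failed):
    # P(W' | W): weather tends to persist.
    w2 = bern(0.8) if w_rain else bern(0.1)
    # P(V' | W, V): velocity persists, but rain makes slowing down likelier.
    v2 = bern(0.4 if w_rain else 0.9) if v_fast else bern(0.2)
    # P(L' | L, V): you're more likely to have moved far if you were fast.
    l2 = l_far or bern(0.7 if v_fast else 0.2)
    # P(F' | F, W): a failed sensor tends to stay failed; rain can cause failure.
    f2 = bern(0.95) if f_failed else bern(0.1 if w_rain else 0.01)
    # P(O' | L', F'): intra-time-slice edge -- the observation depends on the
    # NEW location and failure status, not on the previous time slice.
    o2 = (not f2) and l2
    return w2, v2, l2, f2, o2

print(step(w_rain=True, v_fast=True, l_far=False, f_failed=False))
```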

So, these kinds of edges, let's give them names: these are called intra-time-slice edges, and these are called inter- (or between-) time-slice edges. And the model can include a combination of both. A particular type of inter-time-slice edge that's worth highlighting specifically are edges that go from a variable at one time point to the value of that same variable at the next time point. These are often called persistence edges, because they indicate the tendency of a variable to persist in its state from one time point to another. Finally, let's just go back and look at

the parameterization that we have in this model. So, what CPDs did we actually need to include? We can see that we have CPDs for the variables on the right-hand side, the prime variables, but there are no CPDs for the unprimed variables on the left. And this is because the model doesn't actually try to represent the distribution over W, V, L, and F; it doesn't try to do that. It tries to represent the probability of the next time slice given the previous one. So, as we can see, this graphical model only has CPDs for a subset of the variables in it, the ones that represent the next time point. So, that represents the transition dynamics. If we want to represent the probability distribution over an entire system, we also need to provide a distribution over the initial state. And this is just a standard, generic Bayesian network which represents the probability over the state at time zero using some appropriate chain rule. So, nothing very fancy here.

Â 16:56

With those two pieces, we can now represent probability distributions over arbitrarily long trajectories. We do this by copying the time-zero Bayesian network for time slice zero, which represents the probability distribution over the time-zero variables. Then we have a copy that represents the probability distribution at time one given time zero, and here we have another copy of exactly the same set of parameters that represents time two given time one. And we can continue copying this indefinitely; each copy gives us the probability distribution of the next time slice given the one that we just had, and so we can construct an arbitrarily long Bayesian network.
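The unrolling procedure can be sketched as follows: start from an initial distribution and multiply in one copy of the same transition model per step. Binary state, invented numbers:

```python
# Unrolling a DBN: an initial distribution P(X^0) plus one copy of the same
# transition model P(X' | X) per time step defines a joint distribution over
# trajectories of any length.
p0 = {0: 0.6, 1: 0.4}
trans = {0: {0: 0.9, 1: 0.1},
         1: {0: 0.2, 1: 0.8}}

def unrolled_joint(T):
    """Enumerate the unrolled network: the joint over all (T+1)-step trajectories."""
    joint = {(s,): p for s, p in p0.items()}
    for _ in range(T):  # append one copy of the transition model per step
        joint = {traj + (nxt,): p * trans[traj[-1]][nxt]
                 for traj, p in joint.items()
                 for nxt in trans[traj[-1]]}
    return joint

j = unrolled_joint(2)
print(len(j), sum(j.values()))  # 8 trajectories; sums to 1 (up to float rounding)
```

Of course, explicit enumeration grows exponentially in T; it is only meant to make the unrolled distribution concrete.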

So, to make this definition slightly more formal, we define the notion of a two-time-slice Bayesian network, also known as a 2TBN. A 2TBN over a set of template variables X1 up to Xn is specified as a Bayesian network fragment, along exactly the same lines that we used in the example. The nodes are of two kinds: the next-time-state variables, X1 prime up to Xn prime, and some subset of X1 up to Xn, the time-t variables that directly affect the state at time t + 1, okay?

Â 18:44

And because we want this to represent a conditional probability distribution, only the time t + 1 nodes have parents and a CPD, because we don't really want to model the distribution over the variables at time t. And the 2TBN defines a conditional distribution using the chain rule; you can tell that it looks exactly like the chain rule. So, the probability of X prime given X is the product, over each variable at time t + 1, only the prime variables, of the probability of that variable given its parents, which may be in time t + 1, in time t, or a combination of both. A dynamic Bayesian network is now basically defined by a 2TBN, which we just defined, and a Bayesian network over time zero. So, this is the dynamics and this is the initial state. And we can use that to define

probability distributions over arbitrarily long trajectories, using what's called the unrolled network, also called the ground network. And this is exactly as in the example that I showed: the dependency model for time zero is copied from the Bayes net for time zero, and the transition model, the conditional probabilities, is copied from the 2TBN. So, before we conclude this lecture,

let's look at an example of a dynamic Bayesian network that is more realistic than the simple ones that we've shown before. This is a network that was actually designed for tracking vehicles in a traffic situation. We can see that there are multiple variables here that represent the position and velocity of the vehicle in an absolute sense, for example, Xdot and Ydot are the velocities, as well as various more semantic notions of location, like whether you're in the lane. There are contextual variables, such as left-clear and right-clear, and the engine status, as well as what the driver is currently doing, for example, the forward action and the lateral action. We can see that there are persistence edges that denote the persistence of various parts of the state from time t to time t + 1, as well as a variety of intermediate variables over here that allow us to represent the probability distribution in a more compact way by incorporating variables that do not persist, or at least do not persist in this simplified model. And finally, we see that there are a large number of sensor observations, such as the turn signal, and whether the car is clear, or appears to be clear, on the right and on the left, and so on.

So, this is a much more realistic model of how traffic evolves than the simplified one that we saw before. To summarize, dynamic Bayesian networks provide us with a language for encoding structured distributions over time. And by making the assumptions of Markovian evolution and time invariance, we can use a single compact network to encode probability distributions over arbitrarily long time sequences.
