0:00

One of the most generally useful classes of sampling methods, and one that's very commonly used in practice, is the class of Markov chain Monte Carlo methods. These are methods that allow us to design an intuitive sampling process that, through a sequence of steps, generates a sample from a desired target distribution that might be intractable to sample from directly.

So what are these Markov chains, and how do we use them? A Markov chain is a method for sampling from a distribution P that is intractable to sample from directly. And we have seen examples of such distributions P: if you'd like to sample from, say, a Bayesian network where we have some evidence, we don't really know how to do that; if you'd like to sample from a Markov network, we don't really know how to do that either, in general. So here we have examples of distributions P, and we'd like to come up with a way of generating even one sample from such a distribution P. A Markov chain gives us a general mechanism for doing that. What the Markov chain does is define an iterative process by which the first sample that you generate is not going to be from the distribution P, but ultimately, as you move along, you get closer and closer to generating a sample from P.

So let's understand what that means. We have a Markov chain, and a Markov chain is defined over a state space, which we are going to use X's to denote. Here is an example of such a state space. This is a very simple state space: it has a zero point in the middle, four negative numbers to the left, and four positive numbers to the right.

And a Markov chain defines a probabilistic transition model T(x → x') which, given that I'm at a given state x, tells me how likely I am to transition to a different state x'. That is a probability distribution, as indicated in this formula: for any x, the sum over the states you can transition to is exactly one, Σ_x' T(x → x') = 1.

So for example, in this case we have our little grasshopper that starts out at state zero. We can see that it has a probability of 0.25 of transitioning to the right, a probability of 0.25 of transitioning to the left, and a probability of 0.5 of not making any progress and staying exactly where it is.

And in fact, that same transition distribution is replicated across the different states in the chain in this example, with the exception of the states at the ends: if the poor grasshopper tries to go to the left when it's at negative four, it's going to hit the wall and end up staying where it is regardless. And so the probability of staying in that case is actually 0.75, corresponding to the two different cases that we just talked about.

But anyway, this is a Markov chain, and you can imagine simulating a random process by which a grasshopper traverses this chain. Initially, it starts out at, say, state zero. And then, as it moves, it selects to move left with probability one quarter, to move right with probability one quarter, and to stay in the same place with probability 0.5. Once it has moved, say to the left, it does exactly the same thing from its new state.

So let's think about the temporal dynamics of this type of process. We can ask: what is the probability that at time step t + 1 the state is equal to some value x'? We can get that by a recurrence relation that looks at the state at time t. If we had previously computed the probability distribution over where the grasshopper might be at time t, we can sum over all possible states x where the grasshopper might have been and ask: if it was at state x at time t, what is the probability that it ends up going to x'? That is,

P(X^(t+1) = x') = Σ_x P(X^t = x) T(x → x').

The term inside the summation gives me a distribution over pairs (x, x'), which measures the probability that at time t the grasshopper is at state x and at time t + 1 it is at x'. And since I'm only interested in asking about time t + 1, I sum out, or marginalize, the time-t state x.

If we go back to our grasshopper example, we can simulate this process, and here are its first few steps. At time zero, the grasshopper is at state zero with probability one. At time one, it's at negative one with probability one quarter, at positive one with probability one quarter, and with probability one half it's stuck at the same state. And now we can simulate the next step. At the next time step, the probability that it manages to move all the way to -2 is 0.25 squared, because that requires two successful moves to the left; symmetrically, reaching +2 requires two successful moves to the right. At state zero you have a sum of several events: the probability that it stayed in the same state twice, which is 0.5 squared, plus the two events that correspond to moving to the left and then back to the right, or to the right and then back to the left, each of which happens with probability 0.25 squared. So this is an example of how you would carry out this computation.

Now, it turns out that for many of these chains (we'll describe the conditions in a moment), as the process evolves, the probability distribution equalizes, which means that the probability that at time t you're at state x' is almost the same as the probability that at time t + 1 you're at x': the distribution over where you might be tends to settle down. And so we can consider what's called the limiting process, the limiting distribution that you would get as you simulate the process for more and more steps.

That limit is typically denoted by π, which is called the stationary distribution. It also has a bunch of other names, but stationary is the most common. And if you plug π in, basically taking out this approximate equality, you can see that we get a condition on π: the probability of x' in the stationary distribution needs to be equal to this summation,

π(x') = Σ_x π(x) T(x → x'),

where π now appears both on the left-hand side and within the summation on the right-hand side.

Â 8:42

So, for example, π(x1) has to be equal to 0.25 times π(x1), because of the self-transition here, plus 0.5 times π(x3):

π(x1) = 0.25 π(x1) + 0.5 π(x3).

That's because you can end up at x1 in one of two ways: either by starting out at x1 and staying there, or by starting out at x3 and moving to x1, which happens with probability 0.5. Similarly for π(x2): you can end up at x2 either by being at x2 and staying there, which is this transition, or by being at x3 and moving to x2, which happens with probability 0.5.

And so this is a set of three simultaneous equations in three variables. By itself this is an underdetermined system, because you could multiply all of the π's by a factor of 100 and it would still be a solution. But we can add to that the one constraint that the sum of all of the π's has to be one, because π is a probability distribution. And once you do that, you end up with a unique solution to the system of linear equations, and it's not difficult to plug those π's back into the system and confirm that this is indeed the stationary distribution that satisfies these equations.
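
To see how the normalization constraint pins down a unique solution, here is a sketch that solves such a system numerically. The lecture only reads off two of the equations, so the transition matrix below fills in the remaining entries (x1 going to x3 with probability 0.75, and x2 staying put with probability 0.5) purely as illustrative assumptions:

```python
import numpy as np

# A hypothetical 3-state transition matrix, T[i][j] = P(x_{i+1} -> x_{j+1}),
# chosen to be consistent with the two equations in the text:
#   pi(x1) = 0.25 pi(x1) + 0.5 pi(x3)
#   pi(x2) = 0.5  pi(x2) + 0.5 pi(x3)
# The x1 -> x3 (0.75) and x2 -> x2 (0.5) entries are assumptions.
T = np.array([[0.25, 0.0, 0.75],
              [0.0,  0.5, 0.5],
              [0.5,  0.5, 0.0]])

# The stationary condition pi = pi T alone is underdetermined (any scalar
# multiple of a solution also solves it), so append sum(pi) = 1 and solve.
A = np.vstack([T.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(pi, [0.25, 0.375, 0.375])  # unique stationary distribution
assert np.allclose(pi @ T, pi)                # and it is a fixed point
```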

By the way, for the grasshopper example that we showed on the previous slide, the stationary distribution is the uniform distribution. So this has to be one of the dumbest ways of generating a sample from the uniform distribution, but it does in fact eventually generate a sample from something that is very close to the uniform distribution.
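
That claim is easy to check numerically (a sketch using my own encoding of the chain): iterate the one-step update many times, and the distribution over the grasshopper's position flattens out to 1/9 on each of the nine states:

```python
# Iterate the grasshopper chain's one-step update many times; the
# distribution over positions converges to the stationary distribution,
# which for this chain is uniform over the nine states.
STATES = list(range(-4, 5))

def step(p):
    p_next = {x: 0.5 * p[x] for x in STATES}       # stay with prob 0.5
    for x in STATES:
        for nxt in (x - 1, x + 1):                 # move left / right
            target = nxt if nxt in p_next else x   # wall: stay instead
            p_next[target] += 0.25 * p[x]
    return p_next

p = {x: (1.0 if x == 0 else 0.0) for x in STATES}  # start at state 0
for _ in range(2000):
    p = step(p)

# Every state ends up (very nearly) equally likely.
assert all(abs(prob - 1 / 9) < 1e-9 for prob in p.values())
```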

So when does a Markov chain converge to a stationary distribution? I said many of them do, but it turns out that not all of them do. One condition that guarantees convergence to a stationary distribution is called regularity: a Markov chain is regular if the following condition holds. And notice the order of the quantifiers: there exists a fixed integer k such that, for every pair of states (this is the universal quantifier), the probability of getting from x to x' in exactly k steps is greater than zero. Notice what that means: you pick the k first, so you can't have a different k for different pairs of states. You can pick k to be a thousand, or a million; it doesn't matter for this purpose. But you need to pick it first, and then there needs to be a way of getting from x to x' in exactly that number of steps with probability greater than zero. It turns out that this is a sufficient (not a necessary, but a sufficient) condition to guarantee that the Markov chain converges to a unique stationary distribution regardless of the start state. So it converges, and it converges to a single stationary distribution that's characterized by the equations we saw on the previous slide.
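
For a finite chain, this definition can be checked by brute force (a sketch of my own, not from the lecture): raise the transition matrix to successive powers and look for a single k at which every entry of T^k is strictly positive.

```python
import numpy as np

def is_regular(T, max_k=100):
    """True if some single power T^k (k <= max_k) is entrywise positive,
    i.e. one k works for *every* pair of states simultaneously."""
    P = np.eye(len(T))
    for _ in range(max_k):
        P = P @ T
        if np.all(P > 0):
            return True
    return False

# The grasshopper chain (self-loops everywhere, fully connected) is regular.
n = 9
G = np.zeros((n, n))
for i in range(n):
    G[i, i] = 0.5
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            G[i, j] = 0.25
        else:
            G[i, i] += 0.25          # bounced off the wall

assert is_regular(G)

# A deterministic 2-cycle is not regular: its powers alternate between the
# swap and the identity, so no single k connects every pair of states.
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
assert not is_regular(C)
```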

Â 12:10

Now, what are some simpler sufficient conditions for regularity? Because this one is a little bit hard to check directly. One sufficient condition that people often use in practice is, first, that every pair of states x and x' is connected by a path of probability greater than zero, and second, that every state has a self-transition, so you can always transition from a state to itself. If you think about it, that means that if you take k to be the diameter of this graph, that is, if you set k to be the distance between the furthest pair of states, then you can get between every pair of states using exactly k steps: you can take fewer than k steps and then just stick around for a while until you hit k, because of the self-transitions. So this is a way of guaranteeing that, for this value k, you can get from every state to every state.
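
A small sketch of that diameter argument for the grasshopper chain (the encoding is my own): the furthest pair of states is -4 and +4, at distance 8, and because every state has a self-transition, T^8 already has strictly positive entries everywhere.

```python
# The grasshopper chain has diameter 8 (from -4 to +4).  Self-transitions
# let any shorter path be padded with "stay" moves, so k = 8 steps connect
# every pair of states with positive probability, i.e. T^8 > 0 entrywise.
n = 9  # states -4 .. 4

T = [[0.0] * n for _ in range(n)]
for i in range(n):
    T[i][i] = 0.5
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            T[i][j] = 0.25
        else:
            T[i][i] += 0.25          # bounced off the wall

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = T
for _ in range(7):                   # P = T^8
    P = matmul(P, T)

assert all(P[i][j] > 0 for i in range(n) for j in range(n))
```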

And that's what people typically do: they add these self-transitions to guarantee regularity.

So to summarize: we've defined the notion of a Markov chain, which defines a general dynamical system, or an iterative sampling process, from which we can sample trajectories that traverse the state space of the chain. Under certain conditions, such as the regularity condition that we defined (which is sufficient, although not necessary), this iterative sampling process is guaranteed to converge to a stationary distribution in the limit, that is, after we generate enough samples. And so this provides us with a general approach to sample from a distribution indirectly: if we have a distribution from which it's intractable to sample, this gives us an alternative mechanism.
