0:03

So, we have just proved that the Gibbs sampling scheme indeed gives you a correct way of sampling from the desired distribution: the underlying Markov chain indeed converges to the target distribution. But let us look at a small demo of how it can work in practice.

So, let's look at this simple two-dimensional distribution, which looks like a Gaussian. Note that we don't actually need Gibbs sampling here. First of all, this distribution is two-dimensional, so the rejection sampling scheme can also be applied, because the dimensionality is not large; and also, this is a Gaussian, so we can sample from it efficiently without any fancy techniques. But this is only for illustration purposes: Gibbs sampling works even in million-dimensional spaces with complicated distributions.

So let's start with the initialization point (0, 0). On the first substep of the first iteration, Gibbs sampling suggests that you find the conditional distribution of X1 given X2 = 0, and it will look like this: if we know that X2 is 0, then X1 should be somewhere around 5 or 6, because of the shape of our Gaussian distribution. It has more probability mass in that region than in other regions for this particular value of X2. Let's sample a point from that conditional; it can look like this, for example. Okay, so this is the first substep.

On the next substep, we have to compute the conditional distribution of X2 given that X1 equals the value we have just generated. So X1 is around 6, and the conditional distribution of X2 will look like this. Again, it is just the slice of our Gaussian distribution along the line X1 = 6. So let's sample from that and get a point like this. This is the result of iteration one, the next point in our Markov chain.

Note that we have not converged yet. We have not even found the region of space where the density of our distribution is high; we are still somewhere in a low-density region. So we cannot use these samples yet, because they are not from the true distribution. So let us proceed.

On the next iteration, we again sample from the conditional distribution of X1 given that X2 is around 1, which is somewhere here. Okay, let's sample from it and move to this point; this is the end of substep one of the second iteration. On substep two, we will generate a point like this, again from the corresponding conditional distribution. So one more substep ends the second iteration, and finally, if we repeat this process for, say, 10 more iterations, we will get points like these, which already look like samples from the Gaussian.
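The two substeps of each iteration can be sketched in a few lines of Python. For a bivariate Gaussian, each conditional is itself a one-dimensional Gaussian with a closed-form mean and variance; the mean, standard deviations, and correlation below are illustrative placeholders, not the parameters of the Gaussian in the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([3.0, 2.0])     # hypothetical mean vector
sigma = np.array([2.0, 1.0])  # hypothetical standard deviations
rho = 0.8                     # hypothetical correlation

def gibbs(n_iters, x_init=(0.0, 0.0)):
    """Gibbs sampling for a 2D Gaussian, one coordinate at a time."""
    x1, x2 = x_init
    samples = []
    for _ in range(n_iters):
        # Substep 1: draw x1 from p(x1 | x2), a 1D Gaussian.
        m1 = mu[0] + rho * sigma[0] / sigma[1] * (x2 - mu[1])
        x1 = rng.normal(m1, sigma[0] * np.sqrt(1 - rho**2))
        # Substep 2: draw x2 from p(x2 | x1), using the fresh x1.
        m2 = mu[1] + rho * sigma[1] / sigma[0] * (x1 - mu[0])
        x2 = rng.normal(m2, sigma[1] * np.sqrt(1 - rho**2))
        samples.append((x1, x2))
    return np.array(samples)

chain = gibbs(2000)
```

After discarding a few hundred initial points as burn-in, the empirical mean and spread of the remaining points approach those of the target Gaussian, just as in the demo.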

So, this was a demo of how Gibbs sampling works; let's summarize its properties. First of all, the idea of Gibbs sampling is to take one complicated problem, generating samples from a multidimensional distribution that you may know only up to a normalization constant, and reduce it to a sequence of one-dimensional sampling problems. This is very nice; it makes everything more efficient.

But note a difference from the first video this week, where we discussed schemes that spend some time generating a point and are then done, giving you one independent sample. Here we have a chain of samples, so you cannot use even the first sample until the chain has converged. You have to wait, for example, for the first five hundred iterations and throw those samples away just to get the one sample you care about.

Another positive feature of Gibbs sampling is that it is really easy to implement: just a few lines of code in Python. And it is almost universally applicable: you can almost always sample from the one-dimensional conditional distributions, and this way you can sample from the multidimensional one.
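This "one dimension at a time" structure can be written as a generic loop. Here `sample_conditional` is a hypothetical user-supplied function that draws coordinate `i` from its full conditional given the current values of all other coordinates; the toy usage with independent normals is just a placeholder.

```python
import random

def gibbs_step(x, sample_conditional):
    """One full Gibbs iteration over a list of coordinates."""
    for i in range(len(x)):
        # Resample coordinate i from its 1D conditional, using the
        # freshest values of all the other coordinates.
        x[i] = sample_conditional(i, x)
    return x

# Toy usage: each conditional ignores the other coordinates, so this
# just produces independent standard normals.
x = [0.0] * 5
for _ in range(10):
    x = gibbs_step(x, lambda i, x: random.gauss(0.0, 1.0))
```

For a real model you would replace the lambda with a sampler for the model's actual conditional distributions.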

The negative feature is that it can give you highly correlated samples. If you recall the demo, each sample was close to the previous one. You cannot take large steps in the Gibbs scheme; you are always taking local steps, because you move one dimension at a time. This means your samples are similar to each other, which in turn means that you are not as efficient at estimating expected values as you would wish.

Say we waited out the first 1,000 iterations of Gibbs sampling and threw those points away, since the Markov chain had not converged yet and they were useless to us. Then, starting from iteration 1,001, we collect samples and write them down to use for estimating expected values. We may think we have collected 1,000 samples, but they are so close to each other that, in effect, it is as if we had only, say, one hundred independent samples. The efficiency of estimating expected values with correlated samples is much lower than the efficiency of estimating the same thing with uncorrelated samples, and this can be a problem.
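This "1,000 samples that behave like 100" effect can be quantified with the effective sample size. The sketch below uses a simple rule of thumb, summing autocorrelations until the first non-positive lag; it is a rough illustration, not a production-quality estimator.

```python
import numpy as np

def effective_sample_size(x):
    """Rough ESS estimate: n / tau, with tau from positive autocorrelations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    acf = np.correlate(centered, centered, mode="full")[n - 1:]
    acf = acf / acf[0]          # normalize so that acf[0] == 1
    tau = 1.0                   # integrated autocorrelation time
    for r in acf[1:]:
        if r <= 0.0:            # stop at the first non-positive lag
            break
        tau += 2.0 * r
    return n / tau
```

On a strongly autocorrelated chain this returns a number far below the raw sample count, matching the intuition above; on independent samples it stays close to the raw count.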

So it is an efficiency problem. One other negative feature is that, because of the small steps, it can be hard to converge; convergence of Markov chains is sometimes called mixing. You have to take lots of steps to find the high-density region of your probability distribution, and you have to throw all of those points away, because the chain has not converged yet and you cannot use them. So, because of these small local steps, you have to wait longer for convergence than you would if the steps were large.

And finally, as we have already discussed, this method is not parallelizable: if you want to sample from, say, a one-million-dimensional space, you still have to sample each dimension one at a time, which can be really slow. So in the next video, we will discuss some other schemes for building Markov chains that can mitigate some of these problems.
