0:03

So, let's summarize what we learned so far.

We discussed Markov Chain Monte Carlo methods. The Monte Carlo part is about approximating expected values by sampling, and the main question here is how to sample: how can we generate samples from a complicated distribution which we may know only up to a normalization constant, as usually happens in probabilistic modeling?

To do that we discussed two main methods, both coming from the Markov Chain Monte Carlo family. The idea of Markov Chain Monte Carlo is to build a dynamic system, called a Markov chain, such that if you simulate it long enough, its states start to look like samples from the desired distribution. We discussed two ways to build a Markov chain that converges to the distribution you want to sample from.

The first method here is Gibbs sampling, which reduces the problem of sampling from a multidimensional distribution to a sequence of one-dimensional samplings. One-dimensional samplings are usually much easier: first of all, we have efficient all-purpose methods for one-dimensional sampling, like rejection sampling. Second of all, sometimes you can even compute the normalization constant, because integrating things in one dimension is much easier. And finally, if your original multidimensional distribution is structured enough, you can sometimes find analytical solutions to this one-dimensional sampling problem.

But we discussed a downside of Gibbs sampling: it generates highly correlated samples. This means that it converges slowly, and you have to use lots and lots of samples to approximate your expected value accurately enough.

The second method we discussed here is Metropolis-Hastings, and it gives you a whole family of Markov chains which all converge to the desired distribution. So it gives you more freedom: you can choose. But with this freedom comes great responsibility: you have to choose wisely. If you choose a Markov chain that is not well suited for your purpose, for example one that converges slowly or again produces highly correlated samples, then you will waste lots of resources and the method will not be very efficient.

So let's put these Markov Chain Monte Carlo methods in context and compare them to the variational inference we covered in the previous week. Again, Monte Carlo is about approximating expected values by substituting them with averages over samples obtained from your distribution. And the main property here is that this approximation is unbiased: the longer you wait, the more accuracy you get, and you can make the error as small as you want if you're willing to wait long enough.

If you want to express this property a little bit more mathematically, you can write it down like this. This average, this approximation, is a random variable itself, right? So if you repeat the process of obtaining samples and then averaging them, you'll get another approximation: a random variable, giving different approximations on different samples. An estimate like this is called unbiased if its expected value is equal to the thing you want to approximate. So on average your approximation is correct, and if you wait long enough this approximation will converge to the right answer.
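Written out, the property described above is the standard unbiasedness statement for a Monte Carlo estimate:

```latex
% Monte Carlo estimate of the expectation from samples x_1, \dots, x_N \sim p(x):
\hat{f}_N = \frac{1}{N} \sum_{i=1}^{N} f(x_i)
% Unbiasedness: the expected value of the estimate equals the target,
\mathbb{E}\big[\hat{f}_N\big] = \mathbb{E}_{p(x)}\big[f(x)\big]
% and, by the law of large numbers, \hat{f}_N \to \mathbb{E}_{p(x)}[f(x)] as N \to \infty.
```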

And this is something that is not true for variational inference. In variational inference, we try to approximate the distribution we want to sample from, or want to work with, by a distribution from some simple family, for example the family of factorized distributions. And of course, over iterations, variational inference becomes better and better, because you find a better and better approximation of your distribution within this class of factorized distributions. But you can't get your error to zero, because you can never approximate your complicated distribution arbitrarily well with a factorized one.
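A concrete instance of this error floor (a standard textbook example, not one from the lecture): for a two-dimensional Gaussian with correlation $\rho$, even the best factorized Gaussian approximation keeps a strictly positive KL divergence:

```latex
p(x) = \mathcal{N}\!\left(0,\ \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right),
\qquad
q(x) = q_1(x_1)\, q_2(x_2)
% The mean-field optimum sets each q_i = N(0, 1 - rho^2), leaving the gap
\min_{q}\ \mathrm{KL}\big(q \,\|\, p\big) = -\tfrac{1}{2} \ln\!\left(1 - \rho^2\right) > 0
\quad \text{for } \rho \neq 0.
```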

So a typical picture describing this is as follows. You have your problem and you try to solve it with Markov Chain Monte Carlo, and it works and gives you a solution that gets better and better with time, with the error approaching zero. Then you use variational inference: it usually gives you a nice solution much faster, but it stagnates. It can't get past some critical error value and can never get you to zero.

So, to put Markov Chain Monte Carlo even more in context, we can recall the table we had at the end of week three about variational inference: the table of approximate methods for solving probabilistic modeling problems that we have covered so far. First of all, there is full Bayesian inference, where you model all your parameters and latent variables as latent variables. So you don't train anything in the usual sense: you're not trying to find the best values for the parameters, but instead you're trying to marginalize out everything you don't know. This one theoretically gives the best results, but for most practical models this full Bayesian inference isn't feasible, and you have to make some approximations.

And one of them is mean-field variational inference, where you assume that your posterior distribution is factorized, and you try to approximate your posterior distribution within this factorized family as well as you can.

Now we have just added one more method here: we can approximate full Bayesian inference by sampling from the posterior distribution, and using those samples to estimate the values you care about. So we can use Gibbs sampling or Metropolis-Hastings here to sample a few points from the posterior over all the latent variables, and then use them to approximate expectations.

Then, if you can't do that, or if you don't want to because it is too slow, you can use the expectation maximization algorithm, which basically tries to find the maximum likelihood estimate of the parameters theta. So it doesn't treat theta as a latent variable; instead it treats theta as parameters, and tries to find the best theta by using some iterative scheme.

And this expectation maximization algorithm is also sometimes intractable: sometimes it's very hard to do the E or the M step. In these cases you can again apply mean-field variational inference to expectation maximization: on the E step, you approximate your posterior, the variational distribution q, with a factorized one. Or, again, you can do Markov Chain Monte Carlo for EM. So on the E step, instead of working with the distribution q directly and trying to find expected values with respect to it, you can acquire samples from the posterior over the latent variables, and then use those samples to approximate the expected value, and on the M step maximize this approximation instead of the original expected value.
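A minimal sketch of this Monte Carlo EM idea, for an assumed toy model (illustrative only; here the per-point posterior is Gaussian and can be sampled directly, whereas a harder model would need Gibbs or Metropolis-Hastings on the E step):

```python
import random

# Toy model: observations y_i ~ N(t_i, 1) with latents t_i ~ N(theta, 1).
# E step: draw samples of each t_i from p(t_i | y_i, theta),
#         which in this model is N((theta + y_i) / 2, 1/2).
# M step: maximize the sample-average log joint over theta, which here
#         just sets theta to the mean of the sampled latents.
def monte_carlo_em(y, n_iters=50, n_mc=20, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(n_iters):
        t_samples = [rng.gauss((theta + yi) / 2, 0.5 ** 0.5)
                     for _ in range(n_mc) for yi in y]   # E step by sampling
        theta = sum(t_samples) / len(t_samples)          # M step
    return theta

rng = random.Random(1)
y = [rng.gauss(2.0, 2.0 ** 0.5) for _ in range(500)]  # synthetic data, true theta = 2
theta_hat = monte_carlo_em(y)
print(theta_hat)  # close to the maximum likelihood answer, mean(y)
```

The sampling noise on the E step makes each update slightly random, but the iteration still settles near the maximum likelihood estimate.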

As a final summary: Markov Chain Monte Carlo is a method that allows you to do training or inference in probabilistic models, and it's really easy to implement. It's also really easy to parallelize, at least in the sense that if you have 100 computers, you can run 100 independent Gibbs samplers, for example, one on each computer, and then combine the samples obtained from all these servers.
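The parallelization idea can be sketched like this (illustrative; each "computer" is just an independent chain with its own seed, here a simple random-walk Metropolis-Hastings sampler targeting an assumed standard normal):

```python
import math
import random

# One independent random-walk Metropolis-Hastings chain ("one server").
def run_chain(log_p, n, seed, scale=1.0):
    rng = random.Random(seed)
    x, lp, out = 0.0, log_p(0.0), []
    for _ in range(n):
        prop = x + rng.gauss(0.0, scale)
        lp_prop = log_p(prop)
        if lp_prop >= lp or rng.random() < math.exp(lp_prop - lp):
            x, lp = prop, lp_prop
        out.append(x)
    return out

log_p = lambda x: -x * x / 2          # standard normal, up to a constant

# Run 10 chains with different seeds (in practice: one per machine, in
# parallel) and pool their samples into a single estimate of E[x^2] = 1.
pooled = []
for seed in range(10):
    pooled += run_chain(log_p, n=20000, seed=seed)[2000:]  # drop burn-in
est = sum(x * x for x in pooled) / len(pooled)
print(est)  # close to 1
```

Because the chains are independent, each one needs its own burn-in, but after that all pooled samples contribute to the same unbiased average.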

And finally, it gives you unbiased estimates. So in principle, if you wait long enough, you can get the error as low as you want.

On the negative side, these Markov Chain Monte Carlo methods are sometimes slower than the alternatives. But because they're so widely applicable, sometimes they're just the only choice. So, in the next videos we will cover a few practical examples of how you can apply Markov Chain Monte Carlo to your problems.
