0:03

Hi, and welcome to week four of practical Bayesian methods.

This week, we're going to cover Markov chain Monte Carlo, which is kind of a silver bullet of probabilistic programming, because it allows you to train, or to do inference on, almost any model without too much trouble. We'll discuss some extensions of this technique, that is, how to make it faster on particular problems by exploiting their structure, and we'll also discuss some limitations, like the speed of the method.

So, let's start the discussion of Markov chain Monte Carlo with the Monte Carlo part. Monte Carlo originated in the Manhattan Project, which gave us the atomic bomb, and it was a quick and dirty answer to the question: what to do until the mathematicians arrive.

The idea of Monte Carlo simulation is this: if you have a complicated model of some problem and you want to predict something about it, then instead of thinking carefully, writing down equations and solving them, you may just simulate your model lots and lots of times and then use, for example, the average of the simulations as a kind of smooth prediction of what will happen.

Probably the reason this thing was invented in the Manhattan Project is that it was one of the first really huge scientific projects where there were already lots of integrals to approximate and these kinds of simulations to run, and where, second of all, we already had computers in some reasonably useful state, so we could actually simulate stuff not mentally but automatically. And the Monte Carlo algorithm turned out to be pretty efficient, and it made it onto the list of the top 10 algorithms of the 20th century.

The name Monte Carlo comes from the city of Monte Carlo, which is famous for its casinos, and probably also because everything in the Manhattan Project had to have a code name.

So let's see a small example of what I'm talking about: Monte Carlo applied to a really simple problem. Say you want to approximate the value of pi, and you know that pi is the area of the unit circle, a circle with radius one. We can use this information to approximate it. The idea is that if you consider the square from zero to one on both axes, then the quarter of the circle that falls inside this square has area exactly pi divided by four; it's just one fourth of the circle. And its area equals an integral of an indicator function that says, at each point, whether that point is in the circle or not. So if we integrate this function over the square, we get the area. Now, how can we compute this integral? Analytically, it's hard. But we can throw lots and lots of points onto this unit square, look at the fraction of points that fall in the quarter circle, and use that fraction as an approximation of what the integral really is.

You can see that this is maybe not the best method to approximate the value of pi: here, for example, we spent something like 3,000 samples, so we sampled 3,000 points, and pi was approximated as 3.1, which is not the best accuracy you can get with this much computation. But anyway, the method is really simple; it's like two lines of code in almost any programming language.
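As a concrete sketch (the function name, sample count, and seed are my own choices, not from the lecture), the estimator described above can be written as:

```python
import numpy as np

def estimate_pi(n_samples=3000, seed=0):
    """Monte Carlo estimate of pi from the fraction of points in the quarter circle."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_samples)  # points thrown uniformly
    y = rng.uniform(0.0, 1.0, n_samples)  # onto the unit square
    inside = x**2 + y**2 <= 1.0           # indicator: inside the quarter circle?
    return 4.0 * inside.mean()            # the quarter circle has area pi / 4

print(estimate_pi())
```

With more samples the estimate tightens, but only at a rate proportional to one over the square root of the sample count, which is why 3,000 points still give a fairly crude answer.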

And this is the general theme of Monte Carlo simulations. Monte Carlo usually gives you something that is really easy to program, really universally applicable to lots and lots of problems, and very amenable to parallelization: if you have, say, 100 computers, you can easily parallelize this thing to be 100 times more efficient, which is not the case for many other algorithms that are much harder to parallelize. On the other hand, the Monte Carlo method can sometimes be slow compared to the alternatives. Usually, if you sit down carefully for a weekend, write down lots of equations and think, you may come up with a better way to solve your problem than plain Monte Carlo simulation. Sometimes not; sometimes it's just a quick and dirty approach that gives you a first approximation. But anyway, it's very useful.

To be a little bit more general about this family of techniques, let's say we want to approximate the expected value of some function f with respect to some distribution p of x. We sample N points x_1 through x_N from this distribution p of x, and use the average of the f of x_i's as the approximation. So here, we are using the sample average instead of the expected value. This estimator has lots of nice theoretical properties: it's unbiased, and if you sample enough points, you get closer and closer to the actual true answer, at least with high probability, and there are guarantees on how fast you will converge to it. So, this is Monte Carlo.
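To make this concrete, here is a minimal sketch; the choice of p as a standard normal and f(x) = x squared is my own illustration, not from the lecture:

```python
import numpy as np

# Approximate E_p[f(x)] with p = N(0, 1) and f(x) = x**2.
# The true expected value is 1.0, the variance of a standard normal.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)  # x_i ~ p(x)
estimate = np.mean(samples**2)                # (1/N) * sum_i f(x_i)
print(estimate)  # close to 1.0
```

The same two lines work for any f and any p you can sample from, which is what makes the method so universally applicable.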

And you may ask a question like: why do we care? Why do we need to approximate expected values? Well, in probabilistic programming there are a few reasons to do that; let's consider two of them. First of all, suppose you want to do full Bayesian inference, like we covered a little bit in week one. Then, instead of having some probabilistic model and finding the single best value of its parameters, you treat the parameters as latent variables. And then for a new object x, if you want to predict its label, for example, and you have some training data set x_train and y_train, you may want to integrate out this latent variable like this.

So imagine, for example, that x is an image and you want to predict whether it's an image of a cat or a dog. Then p of y given x and w may be, for example, a neural network with weights w, which takes the image x as input and outputs a distribution over labels y. Then, instead of using just this one neural network with some kind of optimal or best set of weights w, you consider all possible neural networks with this architecture. So, for each possible value of the weights w, you take that neural network, pass your image through it, write down its answer p of y, and then average all these predictions with weights given by the posterior distribution over the network weights, p of w given the training data set. So it's like an infinitely large ensemble of neural networks, where the weights of the ensemble are given by the posterior distribution p of w given the training data set. And this p of w given the training data gives you some idea of how well each w behaves as the weights of your neural network: this value will be higher for reasonable weights and lower for weights that don't make sense. This way, you're doing full Bayesian inference, and you get some of the benefits of probabilistic modeling; for example, you can predict the uncertainty in your predictions, because instead of just a fixed value of the weights, you have a distribution over weights.

And this integral is, of course, intractable: with a complicated function like a neural network here, you can't compute it analytically. So you have to do some approximation. One way is to note that this integral is just the expected value of the neural network's output with respect to the distribution over the weights, and so we can approximate it with Monte Carlo, provided we can sample from p of w given the training data set. So this is one example where Monte Carlo can be useful for approximating expected values in probabilistic modeling.
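Here is a sketch of that ensemble averaging, with a toy logistic "network" standing in for the real one, and posterior samples faked with a normal draw, since we haven't yet seen how to produce real ones; everything here is illustrative:

```python
import numpy as np

def predict(x, w):
    """Toy stand-in for the network's p(y=1 | x, w): a logistic unit with weights w."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

rng = np.random.default_rng(0)
# Pretend these came from the posterior p(w | x_train, y_train);
# a real run would obtain them with MCMC.
posterior_samples = rng.normal(0.0, 1.0, size=(1000, 2))
x_new = np.array([0.5, -1.0])

# p(y | x, data) ~= (1/N) * sum_i p(y | x, w_i): an ensemble over weight samples
p_y = np.mean([predict(x_new, w) for w in posterior_samples])
```

Each weight sample contributes one "member network" to the ensemble, and the Monte Carlo average over members approximates the intractable integral over all weights.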

And you may ask: how can we find this posterior distribution of the weights, p of w given the training data set? Well, the posterior is proportional to the joint distribution, that is, the likelihood, p of y_train given x_train and w, times the prior, p of w, divided by some normalization constant. The numerator here is easy, because you defined the model yourself: you could have, for example, defined the prior p of w to be normal, and the likelihood is just the output of your neural network, so you can compute the numerator for any values of y, x, and w. But the denominator, the normalization constant Z, is much harder: it contains an integral over all possible weights, so you can't easily compute it. So it's kind of a general situation where you know the distribution you want to sample from, but only up to a normalization constant.
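The usual workaround is to evaluate only the unnormalized part, typically in log space. A minimal sketch, using a toy logistic-regression likelihood and an assumed standard normal prior (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def log_unnormalized_posterior(w, X, y):
    """log p(y | X, w) + log p(w): the numerator of the posterior, in log space.
    The unknown normalization constant Z is never computed."""
    logits = X @ w
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))  # Bernoulli likelihood
    log_prior = -0.5 * np.sum(w**2)  # standard normal prior, constants dropped
    return log_lik + log_prior

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
value = log_unnormalized_posterior(np.zeros(2), X, y)
```

Samplers like the ones in the following videos typically need only ratios of this unnormalized density between two points, so the constant Z cancels and never has to be known.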

Another example where this kind of expected value approximation can be useful in probabilistic modeling is the M-step of the expectation maximization algorithm, which we covered in week two. On the E-step of expectation maximization, we find the posterior distribution of the latent variables t given the data set and the parameters. And on the M-step, we want to maximize, with respect to the parameters theta, the expected value of the logarithm of the joint probability, where the expectation is taken with respect to this posterior distribution q. So, here again, if the posterior distribution q is too hard to work with and we can't compute this integral analytically, we can do a few things.

First of all, we can restrict q to lie in some simple class, like factorized distributions. That's what we did in the previous week, week three, on variational inference: we approximate the posterior, and then compute the integral analytically for the simplified posterior. Or we can keep the exact posterior, but sample a set of latent variable values from it, approximate the expected value by the average over these samples, and then maximize this approximation instead of the true expected value.
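A sketch of that second option, a Monte Carlo M-step, on a deliberately tiny model with one observation, one latent variable, and a single location parameter theta; the whole setup is my own toy example, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2.0
# Pretend these are samples t_i ~ q(t) produced on the E-step.
t_samples = rng.normal(1.5, 0.5, size=5000)

def expected_log_joint(theta):
    """Monte Carlo estimate of E_q[log p(x, t | theta)] for the toy model
    p(x | t, theta) = N(x; t + theta, 1), p(t) = N(t; 0, 1), constants dropped."""
    return np.mean(-0.5 * (x - t_samples - theta) ** 2 - 0.5 * t_samples**2)

# M-step: maximize the estimate over theta (grid search keeps the sketch simple).
grid = np.linspace(-2.0, 2.0, 401)
theta_hat = grid[np.argmax([expected_log_joint(th) for th in grid])]
```

For this quadratic objective the maximizer is x minus the mean of the t samples, so theta_hat lands near 0.5; in a real model you would use a proper optimizer instead of a grid.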

So these are two examples of where you may need expected values in probabilistic modeling, and how Monte Carlo simulation can help you there. In the following videos, we're going to use this formulation of the problem: we want to learn how to sample from a complicated probability distribution which we know only up to a normalization constant. And we will always write p of x, where x stands for the parameters we want to sample. So in the first example, full Bayesian inference, we had p of w given some data, and in the second example we had p of the latent variables given, again, some data. But in all cases, whatever the ingredients, we will just write p of x, meaning the distribution we want to sample from.
