0:03

Okay. So, we decided to model our distribution over x by using a continuous mixture of Gaussians. So, let's develop this idea.

To define this model fully, we have to define the prior and the likelihood. And let's define the prior to be just the standard normal, because, why not? It will just force the latent variables t to be around zero and with unit variance.

And for the likelihood, we decided that we will use Gaussians, right? With parameters that depend on t somehow. So, how can we define these parameters, this parametric way to convert t into the parameters of the Gaussian?

Well, if we use a linear function for mu of t, with some parameters w and b, and a constant for sigma of t, where this Sigma zero can be a parameter or maybe just the identity matrix, it doesn't matter that much, we get the usual probabilistic PCA (pPCA) model.

And this probabilistic PCA model is really nice, but it's not powerful enough for our kind of natural image data.
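As a concrete picture of the pPCA generative process just described, here is a minimal numpy sketch; the dimensions and parameter values are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, D = 2, 5                    # latent and observed dimensions (illustrative)
W = rng.normal(size=(D, d))    # linear map, a model parameter
b = np.zeros(D)                # bias, a model parameter
sigma0 = 1.0                   # constant noise scale (could also be a parameter)

# One draw from the pPCA generative model:
t = rng.normal(size=d)                        # prior: t ~ N(0, I)
x = W @ t + b + sigma0 * rng.normal(size=D)   # likelihood: x | t ~ N(W t + b, sigma0^2 I)
```

So the mean of the Gaussian is linear in t, which is exactly the limitation discussed next.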

So, let's think about what we can change to make this model more powerful. If a linear function is not powerful enough for our purposes, let's use a convolutional neural network, because it works nicely for image data.

Right? So, let's say that mu of t is some convolutional neural network applied to the latent code t. It gets the latent code t as input and outputs an image, or rather the mean vector for an image. And sigma of t is also a convolutional neural network, which takes the latent code as input and outputs a covariance matrix Sigma. This defines our model in some kind of parametric form.

So we have the model like this. And let's emphasize that we have some weights of the neural network, w; let's write them in all parts of our model definition so we do not forget about them, because we are going to train the model with respect to these weights.

So, p of x given the weights of the neural network w is a mixture of Gaussians, where the parameters of the Gaussians depend on the latent variable t through a convolutional neural network.

One problem here is that if, for example, your images are 100 by 100, then you have 10,000 pixels in each image, and that's pretty low resolution, not high-end in any way. But even in this case, your covariance matrix will be 10,000 by 10,000. And that's a lot.

So we want to avoid that; it's not so reasonable to ask our neural network to output a 10,000 by 10,000 matrix.

To get rid of this problem, let's just say that our covariance matrix will be diagonal. Instead of outputting the whole large matrix Sigma, we'll ask our neural network to produce just the values on the diagonal of this covariance matrix. So we will have 10,000 sigmas here, for example, and we will put these numbers on the diagonal of the covariance matrix to define the actual normal distribution, conditioned on the latent variable t. Now our conditional distributions are factorized: they are Gaussians with zero off-diagonal elements in the covariance matrix. But that's okay.

A mixture of factorized Gaussians is not itself a factorized distribution, so we don't have much of a problem here.
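To make this final model concrete, here is a minimal sketch, with a tiny fully connected network standing in for the convolutional one; all the sizes and weights are illustrative assumptions, not the lecture's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d, D = 8, 64   # latent and pixel dimensions (tiny, for illustration)

# A tiny one-hidden-layer network stands in for the CNN decoder here;
# its weights play the role of the model parameters w.
W1 = rng.normal(scale=0.1, size=(32, d))
W_mu = rng.normal(scale=0.1, size=(D, 32))
W_logsig = rng.normal(scale=0.1, size=(D, 32))

def decode(t):
    """Map a latent code t to the parameters of p(x | t, w)."""
    h = np.tanh(W1 @ t)
    mu = W_mu @ h                    # mean vector, one value per pixel
    sigma = np.exp(W_logsig @ h)     # diagonal of the covariance (kept positive)
    return mu, sigma

t = rng.normal(size=d)               # prior: t ~ N(0, I)
mu, sigma = decode(t)
x = mu + sigma * rng.normal(size=D)  # likelihood: x | t ~ N(mu, diag(sigma^2))
```

Note that the network outputs D numbers for the diagonal instead of a D by D matrix, which is the whole point of the diagonal restriction.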

Now we have our model fully defined, and we have to train it somehow.

The natural way to do it is to use maximum likelihood estimation, so to maximize the density of our dataset given the parameters, the parameters of the convolutional neural network.

This density can be rewritten as an integral where we marginalize out the latent variable t. Since we have a latent variable, let's use the expectation maximization algorithm; it was specifically invented for this kind of model.
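As a concrete picture of the quantity we are trying to maximize, here is a naive Monte Carlo sketch of that marginalization integral on a hypothetical one-dimensional linear-Gaussian toy model; the model and all the numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy likelihood: p(x | t, w) = N(x; w*t, 1), scalar t and x.
w_true = 2.0
x_obs = 1.5

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

# p(x | w) = integral of p(x | t, w) p(t) dt, approximated by an average
# over samples from the prior p(t) = N(0, 1):
t_samples = rng.normal(size=100_000)
p_x = np.mean(np.exp(log_gauss(x_obs, w_true * t_samples, 1.0)))
log_marginal = np.log(p_x)
```

For this linear-Gaussian toy case the exact marginal is N(x; 0, w^2 + 1), so the estimate can be checked against a closed form; with a CNN inside the likelihood, no such closed form exists.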

And in the expectation maximization algorithm, if you recall from week two, we build a lower bound on the logarithm of this marginal likelihood, p of x given w, and we bound this value from below by something which depends on w and some new variational distribution q.

And then we maximize this lower bound with respect to both w and q, to push the lower bound as high as possible, so it is as accurate, as close to the actual marginal log-likelihood, as possible.

And the problem here is that on the E-step of the expectation maximization algorithm, we have to find the posterior distribution over the latent variables.

And this is intractable in this case, because you have to compute some integrals, and these integrals contain convolutional neural networks inside them, which is just too hard to do analytically.

So EM is actually not the way to go here. What else can we do?

Well, in the previous week we discussed Markov chain Monte Carlo, and we can use this MCMC to approximate the M-step of the expectation maximization.

Right. This way, on the M-step, instead of using the expected value with respect to q, which is the posterior distribution over the latent variables from the previous iteration, we approximate this expected value with samples, with an average, and then we maximize this approximation instead of the expected value.

It's an option; we could do that.
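A sketch of this MCMC-within-EM idea on the same kind of one-dimensional linear-Gaussian toy model; the model, the Metropolis proposal scale, and the burn-in length are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: p(t) = N(0, 1), p(x | t, w) = N(x; w*t, 1).
x_obs = 1.5
w = 1.0   # current parameter estimate

def log_joint(t, w):
    # log p(x, t | w) up to an additive constant
    return -0.5 * t**2 - 0.5 * (x_obs - w * t) ** 2

# Metropolis sampling from the posterior p(t | x, w):
t, samples = 0.0, []
for step in range(5000):
    prop = t + 0.5 * rng.normal()
    if np.log(rng.uniform()) < log_joint(prop, w) - log_joint(t, w):
        t = prop
    if step >= 1000:            # discard burn-in
        samples.append(t)
samples = np.array(samples)

# M-step: maximize the sample average of log p(x, t | w') over w'.
# For this toy model the maximizer is available in closed form:
w_new = x_obs * np.mean(samples) / np.mean(samples**2)
```

Even in this scalar toy case, each M-step needs thousands of Markov chain steps, which is exactly the cost discussed next.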

But it's going to be kind of slow, because on each iteration of expectation maximization you have to run, say, hundreds of iterations of the Markov chain, wait until it has converged, and only then start to collect samples. So you end up with a nested loop: the outer iterations of expectation maximization and, inside them, the iterations of Markov chain Monte Carlo. This will probably not be very fast.

So let's see what else we can do.

Well, we can try variational inference. The idea of variational inference is to maximize the same lower bound, but to restrict the distribution q to be factorized. So, for example, if the latent variable t for each data object is 50-dimensional, then this q_i of t_i will just be a product of 50 one-dimensional distributions. So it's a nice way to go, a nice approach.
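A sketch of such a factorized q, assuming Gaussian factors; the dimensionality and parameter values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 50                       # latent dimensionality per data object
m = rng.normal(size=d)       # variational means, one per dimension
s = np.ones(d)               # variational standard deviations, one per dimension

# A factorized (mean-field) q: a product of d one-dimensional Gaussians.
# Both sampling and density evaluation decompose per dimension.
t = m + s * rng.normal(size=d)
log_q = np.sum(-0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((t - m) / s) ** 2)
```

The log-density of the full 50-dimensional q is just a sum of 50 one-dimensional terms, which is what makes the restriction convenient.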

It approximates your expectation maximization, but it usually works, and it's pretty fast.

But it turns out that in this case even this is intractable. So this approximation is not enough to get an efficient method for training our latent variable model, and we have to approximate even further. We have to derive an even less accurate approximation to be able to build an efficient method for training this kind of model.
