0:00

So far we've talked about MRF and CRF learning purely in the context of maximum likelihood estimation, where our goal is to optimize the likelihood function or the conditional likelihood function. But as for Bayesian networks, maximum likelihood estimation is not a particularly good regimen, and it's very susceptible to over-fitting of the parameters to the specifics of the training data.

So what we'd like to do is utilize some ideas that we exploited also in the context of Bayesian networks, ideas such as parameter priors to smooth out our estimates of the parameters, at least in the initial phases, before we have a lot of data to really drive us into the right region of the space. So in the context of Bayesian networks, that was all great, because we could have a conjugate prior on the parameters, such as the Dirichlet prior, that we could then integrate with the likelihood to obtain a closed-form conjugate posterior, and it was all computationally elegant. But in the context of MRFs and CRFs, even the likelihood itself is not computationally elegant and can't be maintained in closed form. And so, therefore, the posterior is also not something that's going to be computationally elegant.

And so the question is, how do we then incorporate ideas such as priors into MRF and CRF learning, so as to get some of the benefits of regularization? So the idea here, in this context, is to use what's called MAP estimation, where we have a prior, but instead of maintaining a posterior in closed form, we're computing what's called the maximum a posteriori estimate of the parameters. And this, in fact, is the same notion of MAP that we saw when we did MAP inference in graphical models, where we were computing a single MAP assignment. Here, continuing in this thread of Bayesian learning as a form of inference, we're computing a MAP estimate of the parameters.

So concretely, how is MAP estimation implemented in the context of MRF or CRF learning? A very typical solution is to define a Gaussian distribution over each parameter theta_i separately, and that is usually a zero-mean, univariate Gaussian with some variance sigma^2. And the variance sigma^2 dictates how firmly we believe that the parameter is close to zero. So for small variances, we are very confident that the parameter is close to zero and are unlikely to be swayed by a limited amount of data, whereas as sigma gets larger, we're going to be more inclined to believe the data early on and move the parameter away from zero. So we have such a parameter prior over each theta_i separately, and they're multiplied together to give us a joint parameter prior. So the prior over each parameter is going to look like this. This sigma^2 is called a hyperparameter, and it's exactly the same kind of beast as we had for the Dirichlet hyperparameters in the context of learning Bayesian networks.
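As a minimal sketch (in Python, with illustrative function names that are not from the lecture), the zero-mean Gaussian prior over a single parameter, and the joint prior obtained by multiplying the per-parameter priors (adding their logs), might look like:

```python
import math

def gaussian_log_prior(theta, sigma2):
    """Log-density of a zero-mean univariate Gaussian prior N(0, sigma2),
    evaluated at a single parameter value theta."""
    return -theta ** 2 / (2.0 * sigma2) - 0.5 * math.log(2.0 * math.pi * sigma2)

def joint_log_prior(thetas, sigma2):
    """Independent per-parameter priors multiply, so their logs add."""
    return sum(gaussian_log_prior(t, sigma2) for t in thetas)

# A small sigma^2 penalizes the same deviation from zero more heavily:
# the prior is less willing to let the data pull the parameter away from 0.
print(gaussian_log_prior(1.0, 0.1) < gaussian_log_prior(1.0, 10.0))  # True
```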

The alternative prior that's also in common use is what's called the Laplacian parameter prior. And the Laplacian parameter prior looks kind of similar to the Gaussian, in that it has an exponent whose magnitude grows as the parameter moves away from zero. But in this case, the exponent depends on the absolute value of theta_i and not on theta_i squared, which is the behavior that we would have with the Gaussian. And so this function looks as we see over here, with a much sharper peak around zero, which effectively corresponds to a discontinuity in the derivative at theta_i = 0. And we have again such a Laplacian prior over each of the parameters theta_i, which are multiplied together. Just like the Gaussian, this distribution has a hyperparameter, which in this case is often called beta. And the hyperparameter, just like the variance sigma^2 in the Gaussian distribution, dictates how tight this distribution is around zero, where tighter distributions correspond to cases where the model is going to be less inclined to move away from zero based on a limited amount of data.
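A corresponding sketch of the Laplacian prior (again, a hypothetical helper, not from the lecture), whose log-density is linear in the absolute value of the parameter:

```python
import math

def laplacian_log_prior(theta, beta):
    """Log-density of a zero-mean Laplacian prior with scale beta:
    p(theta) = (1 / (2 * beta)) * exp(-|theta| / beta)."""
    return -abs(theta) / beta - math.log(2.0 * beta)

# The negative log-prior (ignoring constants) is |theta| / beta, so its slope
# is constant (+-1/beta) everywhere except at 0 -- the sharp peak at zero.
beta = 1.0
penalty = lambda t: -laplacian_log_prior(t, beta)
# Moving 0.1 away from zero costs the same near zero as it does far away:
print(penalty(0.2) - penalty(0.1), penalty(1.2) - penalty(1.1))
```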

So now, let's consider what MAP estimation would look like in the context of these two distributions. So here we have these two parameter priors rewritten, the Gaussian and the Laplacian. And now, MAP estimation corresponds to the argmax over theta of the joint distribution P(D, theta); that is, we're trying to find the theta that maximizes this joint distribution. And by the simple rules of probability theory, this joint distribution is the product of P(D | theta), which is our likelihood,

Â 5:25

multiplied by the prior, P(theta). And because of the monotonicity of the logarithm function, this is the same as doing argmax over theta of our log-likelihood function, which is just the log of the likelihood, plus the log of the parameter prior.
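To make the decomposition concrete, here is a sketch of a MAP objective for a tiny log-linear model p(x | theta) ∝ exp(theta · f(x)) over a small discrete space, with a Gaussian prior. The model, the feature matrix, and all names here are illustrative assumptions, not from the lecture:

```python
import numpy as np

def map_objective(theta, feats, counts, n, sigma2):
    """log P(D | theta) + log P(theta), up to the prior's constant term.
    feats:  (num_states, num_features) feature vector f(x) for each state x.
    counts: empirical counts of each state in a data set of size n."""
    scores = feats @ theta                  # theta . f(x) for every state x
    log_Z = np.logaddexp.reduce(scores)     # log partition function
    log_lik = counts @ scores - n * log_Z   # sum over data of theta.f(x[m]) - n*log Z
    log_prior = -np.sum(theta ** 2) / (2.0 * sigma2)  # Gaussian (L2) prior
    return log_lik + log_prior

feats = np.array([[1., 0.], [0., 1.], [1., 1.]])  # f(x) for 3 states
counts = np.array([2., 1., 1.])                    # n = 4 samples
print(map_objective(np.zeros(2), feats, counts, 4, sigma2=1.0))
```

MAP estimation then maximizes this objective over theta; for a CRF, the same decomposition applies with the conditional likelihood in place of the likelihood.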

So now let's look at what these logarithms look like in the context of these two priors. And so here I'm actually drawing the negative log of P(theta), and the negative log of P(theta) for a Gaussian distribution is simply a quadratic. And so we have a quadratic penalty that pushes the parameters towards zero. And this is the same, for those of you who've seen it in a machine learning class, as doing L2 regularization over the parameter theta_i.

If you now do the same for the Laplacian distribution, we're going to get a penalty that pushes the parameter towards zero in a way that depends on the absolute value of the parameter. And this is equivalent to L1 regularization, which again some of you might have seen in a machine learning class.

So this is a linear penalty. Now these penalties, although they both push the parameters towards zero, are actually quite different in terms of their behavior. So if we look at what L2 does: when the parameter is far away from zero, towards the edges, the quadratic penalty is actually quite severe, and it pushes the parameter down dramatically. But as the parameter gets small, closer to the case of theta_i = 0, the pressure to move down decreases. So effectively, it's quite legitimate for the model to have a lot of parameters that are close to zero but not quite zero, and there isn't a lot of push at that point to get the parameters to hit zero exactly. And so that means that models that use L2 regularization are going to be dense; that is, a lot of the theta_i's are going to be non-zero. Conversely, when you look at L1 regularization, you see that there's a consistent push towards zero, regardless of where in the parameter space we are. And so it's in the interest of the model to continue pushing parameters that don't significantly impact the log-likelihood component all the way to zero. And so for L1 regularization, we're going to get models that are sparse, and sparsity has its own value, because the sparser the model, the easier it is to do inference in general: sparsity corresponds, in principle, to removing features from the graphical model, which in turn can remove edges and make inference more efficient. And so L1 regularization, or the corresponding Laplacian penalty, is often used to make the model both more comprehensible as well as computationally more efficient.
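One standard optimization view of why L1 yields exact zeros while L2 only shrinks (a sketch, not something derived in the lecture): a gradient step on the quadratic penalty shrinks a parameter multiplicatively, so it never lands exactly on zero, whereas the proximal (soft-thresholding) step for the absolute-value penalty snaps any sufficiently small parameter exactly to zero:

```python
def l2_step(theta, lam, step):
    """One gradient step on the L2 penalty lam * theta^2 / 2: shrinks theta
    multiplicatively, approaching zero but never reaching it exactly."""
    return theta * (1.0 - step * lam)

def l1_step(theta, lam, step):
    """Proximal (soft-thresholding) step for the L1 penalty lam * |theta|:
    any theta within step * lam of zero is set exactly to zero."""
    t = step * lam
    if theta > t:
        return theta - t
    if theta < -t:
        return theta + t
    return 0.0

theta = 0.05
print(l2_step(theta, lam=1.0, step=0.1))  # small but still nonzero
print(l1_step(theta, lam=1.0, step=0.1))  # snapped exactly to zero
```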

So to summarize, when we're doing learning in undirected models, the parameter coupling that we have in the likelihood function prevents us from doing Bayesian estimation efficiently, that is, maintaining a posterior distribution over the parameters. However, we can still use the notion of parameter priors to avoid the over-fitting that we would get with maximum likelihood estimation. And specifically, we're going to use MAP as a substitute for MLE. The most typical priors used in this context are L2 and L1, or Gaussian and Laplacian, and both of these drive the parameters towards zero, which has the effect of preventing the model from going wild and using very drastic parameter values that allow it to over-fit the specific training data that we've seen. Now, while this behavior is shared between these two priors, there's also a significant difference between them, in that the L1 or Laplacian distribution can be shown, both empirically and provably, to induce sparse solutions, and therefore it performs for us a form of feature selection, or equivalently structure learning, in the context of MRFs and CRFs.
