So far we've talked about MRF and CRF learning purely in the context of maximum likelihood estimation, where our goal is to optimize the likelihood function or the conditional likelihood function. But just as for Bayesian networks, maximum likelihood estimation is not a particularly good regimen, and it's very susceptible to over-fitting of the parameters to the specifics of the training data. So what we'd like to do is utilize some ideas that we exploited also in the context of Bayesian networks, ideas such as parameter priors to smooth out our estimates of the parameters, at least in the initial phases, before we have a lot of data to really drive us into the right region of the space. In the context of Bayesian networks that was all great, because we could have a conjugate prior on the parameters, such as the Dirichlet prior, that we could then integrate in with the likelihood to obtain a closed-form conjugate posterior, and it was all computationally elegant. But in the context of MRFs and CRFs, even the likelihood itself is not computationally elegant and can't be maintained in closed form, and so the posterior is also not going to be computationally elegant. So the question is, how do we then incorporate ideas such as priors into MRF and CRF learning, so as to get some of the benefits of regularization? The idea in this context is to use what's called MAP estimation, where we have a prior, but instead of maintaining a posterior in closed form, we're computing what's called the maximum a posteriori estimate of the parameters. This is in fact the same notion of MAP that we saw when we did MAP inference in graphical models, where we were computing a single MAP assignment. Here, continuing the thread of Bayesian learning as a form of inference, we're computing a MAP estimate of the parameters. So concretely, how is MAP estimation implemented in the context of MRF or CRF learning?
A very typical solution is to define a Gaussian distribution over each parameter theta_i separately, usually a zero-mean, univariate Gaussian with some variance sigma squared. The variance sigma squared dictates how firmly we believe that the parameter is close to zero. For small variances we are very confident that the parameter is close to zero, and we are unlikely to be swayed by a limited amount of data; whereas as sigma gets larger, we're going to be more inclined to believe the data early on and move the parameter away from zero. We have such a parameter prior over each theta_i separately, and they're multiplied together to give us a joint parameter prior. So the prior over each parameter is going to look like this. This sigma squared is called a hyperparameter, and it's exactly the same kind of beast as we had for the Dirichlet hyperparameters in the context of learning Bayesian networks. The alternative prior that's also in common use is what's called the Laplacian parameter prior. The Laplacian parameter prior looks kind of similar to the Gaussian, in that it has an exponential whose penalty increases as the parameter moves away from zero. But in this case the penalty depends on the absolute value of theta_i and not on theta_i squared, which is the behavior that we would have with the Gaussian. And so this function looks as we see over here, with a much sharper peak around zero, which effectively corresponds to a discontinuity in the derivative at theta_i equals zero. We again have such a Laplacian prior over each of the parameters theta_i, which are multiplied together. Just like the Gaussian, this distribution has a hyperparameter, which in this case is often called beta. And the hyperparameter, just like the variance sigma squared in the Gaussian distribution, dictates how tight this distribution is around zero.
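To make the two priors concrete, here is a minimal sketch of the joint log-prior each one assigns to a parameter vector. The function names and default hyperparameter values are illustrative, not from the lecture; the densities themselves are the standard zero-mean Gaussian and Laplacian.

```python
import numpy as np

def gaussian_log_prior(theta, sigma2=1.0):
    """Joint log-density of independent zero-mean Gaussian priors,
    one per parameter: p(theta_i) = N(theta_i; 0, sigma2).
    The joint log-prior is the sum of the per-parameter terms."""
    theta = np.asarray(theta, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - theta**2 / (2 * sigma2))

def laplacian_log_prior(theta, beta=1.0):
    """Joint log-density of independent zero-mean Laplacian priors:
    p(theta_i) = exp(-|theta_i| / beta) / (2 * beta).
    Note the penalty grows with |theta_i|, not theta_i squared."""
    theta = np.asarray(theta, dtype=float)
    return np.sum(-np.log(2 * beta) - np.abs(theta) / beta)
```

A smaller sigma2 (or beta) makes the log-prior fall off faster away from zero, which is exactly the "tightness" the hyperparameter controls.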
Tighter distributions correspond to cases where the model is going to be less inclined to move away from zero based on a limited amount of data. So now let's consider what MAP estimation would look like in the context of these two distributions. Here we have these two parameter priors rewritten, the Gaussian and the Laplacian. MAP estimation corresponds to the arg max over theta of the joint distribution P(D, theta); that is, we're trying to find the theta that maximizes this joint distribution. By the simple rules of probability theory, this joint distribution is the product of P(D | theta), which is our likelihood, multiplied by the prior P(theta). And because the logarithm is a monotonic function, this is the same as doing arg max over theta of our log-likelihood function, which is just the log of the likelihood, plus the log of the parameter prior. So now let's look at what these logarithms look like in the context of these two priors. Here I'm actually drawing the negative log of P(theta), and the negative log of P(theta) for a Gaussian distribution is simply a quadratic. So we have a quadratic penalty that pushes the parameters towards zero. This is the same, for those of you who've seen it in a machine learning class, as doing L2 regularization over the parameter theta_i. If you now do the same for the Laplacian distribution, we get a penalty that pushes the parameter towards zero in a way that depends on the absolute value of the parameter. This is equivalent to L1 regularization, which again some of you might have seen in a machine learning class; it is a linear penalty. Now, although both of these penalties push the parameters towards zero, they're actually quite different in terms of their behavior. So let's look at what L2 does.
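The argument above can be sketched as a penalized objective: up to additive constants that don't depend on theta, the log of the Gaussian prior contributes an L2 penalty and the log of the Laplacian prior an L1 penalty. This is a hedged sketch; `log_likelihood` stands in for the (generally intractable) MRF/CRF log-likelihood, and the function name is my own.

```python
import numpy as np

def map_objective(log_likelihood, theta, prior="l2", sigma2=1.0, beta=1.0):
    """MAP objective: log P(D | theta) + log P(theta), dropping constants.

    Gaussian prior  -> subtract sum(theta_i^2) / (2 * sigma2)  (L2 penalty)
    Laplacian prior -> subtract sum(|theta_i|) / beta          (L1 penalty)

    `log_likelihood` is assumed to be a callable returning log P(D | theta).
    """
    theta = np.asarray(theta, dtype=float)
    if prior == "l2":
        penalty = np.sum(theta**2) / (2 * sigma2)
    elif prior == "l1":
        penalty = np.sum(np.abs(theta)) / beta
    else:
        raise ValueError("prior must be 'l2' or 'l1'")
    return log_likelihood(theta) - penalty
```

Maximizing this objective over theta is exactly the arg max described above: the likelihood term pulls the parameters toward the data, while the penalty term pulls them toward zero.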
When the parameter is far away from zero, towards the edges, the quadratic penalty is actually quite severe, and it pushes the parameter down dramatically. But as the parameter gets small, closer to the case of theta_i equals zero, the pressure to move down decreases. So effectively, it's quite legitimate for the model to have a lot of parameters that are close to zero but not quite zero, and there isn't a lot of push at that point to get the parameters to hit zero exactly. That means models that use L2 regularization are going to be dense; that is, a lot of the theta_i's are going to be non-zero. Conversely, when you look at L1 regularization, you see that there's a consistent push towards zero regardless of where in the parameter space we are. So it's in the interest of the model to keep pushing down parameters that don't significantly impact the log-likelihood component and drive them to zero. And so for L1 regularization we're going to get models that are sparse, and sparsity has its own value: the sparser the model, the easier it is to do inference in general, because sparsity corresponds in principle to removing features from the graphical model, which in turn can remove edges and make inference more efficient. And so L1 regularization, or the corresponding Laplacian prior, is often used to make the model both more comprehensible as well as computationally more efficient. So to summarize: when we're doing learning in undirected models, the parameter coupling that we have in the likelihood function prevents us from doing Bayesian estimation efficiently, that is, maintaining a posterior distribution over the parameters. However, we can still use the notion of parameter priors to avoid the over-fitting that we would get with maximum likelihood estimation. And specifically, we're going to use MAP as a substitute for MLE.
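The different "push" of the two penalties shows up directly in how each one shrinks an isolated parameter. A minimal sketch (the function names are illustrative): the L1 penalty's proximal operator is soft-thresholding, which sets small parameters to exactly zero, while the L2 penalty only scales parameters down multiplicatively, so non-zero values stay non-zero.

```python
import numpy as np

def soft_threshold(theta, lam):
    """Proximal operator of lam * ||theta||_1: any entry with
    |theta_i| <= lam is set to exactly zero; larger entries are
    moved toward zero by a constant amount lam."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def ridge_shrink(theta, lam):
    """Effect of an L2 penalty on an isolated quadratic problem:
    multiplicative shrinkage -- entries get smaller, but a non-zero
    entry never becomes exactly zero."""
    return theta / (1.0 + lam)

theta = np.array([3.0, 0.4, -0.05, 0.0])
sparse = soft_threshold(theta, 0.5)  # small entries become exactly 0
dense = ridge_shrink(theta, 0.5)     # non-zero entries shrink but survive
```

This is the mechanical reason L1-regularized models come out sparse and L2-regularized models come out dense.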
The most typical priors used in this context are the Gaussian and the Laplacian, corresponding to L2 and L1 regularization. Both of these drive the parameters towards zero, which has the effect of preventing the model from going wild and using very drastic parameter values that allow it to over-fit the specific training data that we've seen. Now, while this behavior is shared between the two priors, there's also a significant difference between them, in that the L1 or Laplacian prior can be shown, both empirically and provably, to induce sparse solutions, and therefore it performs for us a form of feature selection, or equivalently structure learning, in the context of MRFs and CRFs.
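As a small empirical illustration of this feature-selection effect, here is a sketch using proximal gradient descent (ISTA) on a least-squares loss with an L1 penalty. The synthetic data, hyperparameters, and the linear model standing in for the intractable MRF likelihood are all assumptions for the demo, not part of the lecture; the point is only that the weights of irrelevant features are driven to exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: only the first 2 of 6 features actually matter.
X = rng.normal(size=(200, 6))
true_w = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

# ISTA: a gradient step on the squared error, followed by the
# soft-thresholding step that is the proximal operator of the L1 penalty.
w = np.zeros(6)
step, lam = 0.001, 5.0
for _ in range(2000):
    grad = X.T @ (X @ w - y)
    w = w - step * grad
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

selected = np.nonzero(w)[0]  # indices of the surviving features
```

With these settings the four irrelevant weights end up at exactly zero, which is the "feature selection" behavior of the Laplacian prior; re-running with an L2 penalty instead would leave all six weights non-zero.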