1:23

And so if we now take that and put it through the log, in order to define the log-likelihood, and then sum that over multiple data instances, we see that we have the following expression: a summation over instances m of log of phi1 of A, B for that instance, plus log of phi2 of B, C for that instance, minus log of Z.

And here, importantly, notice that I've included the parameters of the model, theta, as an argument to the partition function, which explains why we call it a partition function rather than a partition constant. It is a function of the parameters, and as we change the parameters the partition function changes, and that's going to turn out to be very important.

So now let's continue deriving this expression. We can see that, just as in the context of Bayesian networks, we can gather like terms and consider, for example, how many times a particular entry in the factor phi1 of A, B is used: it's used in every instance where the variable A takes the value little a and the variable B takes the value little b. And so we can, again, accumulate like terms, and we end up with a summation over all possible value combinations little a, little b of the log of that factor entry, phi1 of a, b, multiplied by the counts of a, b in the data.

And similarly, we have exactly the same form for the second factor, over B, C, where for every entry in the factor phi2 of B, C, we have a number of occurrences, which is simply the counts of b, c in our data set. And finally, we have this partition function hanging off the end, which is accumulated once for every data instance. So this looks great so far, because, as in the context of Bayesian networks, we've decomposed the likelihood function, or in this case the log-likelihood function rather, into a bunch of these terms, each of which involves one parameter and a count of how many times that parameter is used. So, a beautiful decomposition, and maybe we can just go ahead and optimize each parameter separately.

Except that we have that nasty partition function hanging off the end. And if we look at the form of the partition function, that form is a sum over all possible values of a, b, and c of the product phi1 of a, b times phi2 of b, c. And one important observation is that when you stick this summation into a log, the log doesn't flow through the summation, and so you don't get any nice decomposition of the parameters into isolated pieces. And that is the killer, because the partition function is the term that couples the parameters, and so there's no decomposition of the likelihood into separate independent pieces. There are parts of it that decompose, but the partition function kills that decomposition.

And as a consequence, it turns out that there is no closed-form solution for this optimization problem, unlike in the case of Bayesian networks, where maximum likelihood estimation had a closed form that you can derive directly from the sufficient statistics, which are the data counts.
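To see this coupling concretely, here is a minimal brute-force sketch for the A-B-C chain, with hypothetical log-factor values: the partition function Z sums over all joint assignments, so changing any single entry of phi1 changes Z, and with it the log-likelihood of every parameter.

```python
import itertools
import math

# Hypothetical chain A - B - C with binary variables and two factors:
# phi1(a, b) = exp(t1[a, b]), phi2(b, c) = exp(t2[b, c]).
# The table entries t1, t2 play the role of the parameters theta.
t1 = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): 0.1, (1, 1): 0.8}
t2 = {(0, 0): 0.3, (0, 1): 0.4, (1, 0): -0.5, (1, 1): 0.2}

def partition(t1, t2):
    # Z(theta): sum over all joint assignments of the product of factors.
    return sum(math.exp(t1[a, b]) * math.exp(t2[b, c])
               for a, b, c in itertools.product([0, 1], repeat=3))

Z = partition(t1, t2)

# Bumping a single entry of phi1 changes Z, so the -M log Z(theta) term
# couples every parameter in the log-likelihood.
t1_bumped = dict(t1)
t1_bumped[(0, 0)] = t1[(0, 0)] + 1.0
```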

So, to see this in action, here's a picture of a projection of this probability distribution. Here we have written the probability distribution as a function of two log-linear features: the indicator function of A = 1, B = 1, and the indicator function of B = 0, C = 1. And so one of them is an entry in the first potential, and the other one is an entry in the second potential. And obviously you could have other features than that, but I can only draw a two-dimensional feature space, and so we've focused on just two parameters. And so for the two axes, one axis is one of these parameters, and the second axis is the second of these parameters; these are the thetas that are the coefficients of these indicator functions. And what you see here is the log-likelihood: this is the surface of the log-likelihood. There are several things that can be seen from looking at this surface. First, it seems, and we'll show this in a moment, that there is a single global optimum of this function, which happens right here in the middle of this contour. But we can also see the coupling between the parameters coming up: the function doesn't decompose nicely as a product, or in this case a summation, of a function over one of the dimensions plus a function over the second dimension.

So now let's delve into the math a little bit more and consider what the log-likelihood function looks like in the general case of a log-linear model. So, just as a reminder, here is the definition of a log-linear model: it's defined via the sum of a set of coefficients, which are parameters theta_i, each multiplied by a feature f_i, which could be an indicator feature as we had on the previous slide. But we also talked about other forms of features for log-linear models in a previous module, so we're not going to go back there, but any function here would work. We add up all of these log-linear terms and exponentiate that, and then we have the partition function that makes sure that we have a distribution.
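As a small illustration, here is a sketch of this definition for the two indicator features from the previous slide, with hypothetical coefficients; the partition function is computed by brute-force enumeration, which is only feasible for a toy model like this.

```python
import itertools
import math

# Log-linear model over three binary variables A, B, C.
# Features are the indicator functions from the slide:
# f1 = 1{A=1, B=1}, f2 = 1{B=0, C=1}; theta values are hypothetical.
features = [
    lambda a, b, c: float(a == 1 and b == 1),
    lambda a, b, c: float(b == 0 and c == 1),
]
theta = [0.7, -0.3]

def unnormalized(x, theta):
    # exp(sum_i theta_i * f_i(x))
    return math.exp(sum(t * f(*x) for t, f in zip(theta, features)))

def prob(x, theta):
    # P(x) = exp(sum_i theta_i f_i(x)) / Z(theta)
    Z = sum(unnormalized(y, theta)
            for y in itertools.product([0, 1], repeat=3))
    return unnormalized(x, theta) / Z

# The partition function makes sure this is a distribution.
total = sum(prob(x, theta) for x in itertools.product([0, 1], repeat=3))
```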

Â 18:16

So let's think about how to actually perform this maximum likelihood estimation. Let's go back to the log-likelihood function. Now we're going to multiply it by one over M, so that we don't have a scaling issue where the log-likelihood continues to grow as the number of data instances grows, and that gives us this expression over here, where we note that the M has disappeared as a multiplier from the log partition function, and we now have a one over M term in the linear component as well. And now let's plug in the derivative with respect to theta_i. If we take the derivative of this expression relative to a particular parameter theta_i, we have two pieces. We have already said that the derivative of the log partition function relative to a parameter theta_i is the expectation of f_i, which is the feature that theta_i multiplies, in P_theta.

On the other hand, if we look here at the derivative relative to theta_i, we can see that what we have here is also an expectation: it's an empirical expectation, the expectation or average of f_i in the data distribution. And so the derivative of the log-likelihood relative to theta_i is the difference of two expectations: an empirical expectation, and an expectation in the parametric distribution defined by my current set of parameters, theta. One immediate consequence of this is that a parameter setting theta hat is the maximum likelihood estimate if and only if we have equality of expectations, that is, the expectation relative to the model theta hat is equal to the expectation in the data D. And so the expectation in the model has to be equal to the expectation in the data, and that has to hold for every one of the features in my log-linear model.
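This gradient can be sketched directly for a toy model by brute-force enumeration (the features and data below are hypothetical): each component is the empirical average of f_i minus its expectation under the current parameters.

```python
import itertools
import math

# d/d theta_i of (1/M) log-likelihood = E_D[f_i] - E_theta[f_i].
# Hypothetical features over three binary variables.
features = [
    lambda a, b, c: float(a == 1 and b == 1),
    lambda a, b, c: float(b == 0 and c == 1),
]

def model_expectations(theta):
    # E_theta[f_i], by enumerating the whole joint distribution.
    xs = list(itertools.product([0, 1], repeat=3))
    weights = [math.exp(sum(t * f(*x) for t, f in zip(theta, features)))
               for x in xs]
    Z = sum(weights)
    return [sum(w * f(*x) for w, x in zip(weights, xs)) / Z
            for f in features]

def empirical_expectations(data):
    # E_D[f_i]: the plain average of each feature over the data set.
    return [sum(f(*x) for x in data) / len(data) for f in features]

def gradient(theta, data):
    return [e - m for e, m in zip(empirical_expectations(data),
                                  model_expectations(theta))]

# Hypothetical data set; at theta = 0 the model is uniform, so each
# model expectation is 2/8 = 0.25 while each empirical average is 0.5.
data = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 0, 1)]
g = gradient([0.0, 0.0], data)
```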

Now that we've computed the gradient, let's think about how one might use it to actually do the optimization. It turns out that a very commonly used approach to this kind of likelihood optimization, for example in the context of CRFs, is a variant of gradient ascent. So here we have a likelihood surface, which is in this case a fairly nice function, and what we're going to do is use a gradient process where, at each point, we compute the gradient, take a certain step in that direction, and continue to climb up. And the gradient that we computed on the previous slide is what we're using. Now, it turns out that plain vanilla gradient ascent is not actually the most efficient way to go; typically one uses a variant of gradient ascent called L-BFGS, which is a quasi-Newton method.

So, very briefly, a Newton method is one that constructs a second-order approximation to the function at each point, as opposed to just a linear approximation, which looks only at the gradient. But say we don't want to spend the computational cost, in general, of computing a second-order, or quadratic, approximation to the function: a quasi-Newton method makes an approximate guess at what the step would be if we were to compute a second-order approximation. We're not going to go into the details of L-BFGS here, but there's plenty of literature on it that you can find. Now, in either case, for computing the gradient in this context, the critical computational cost is the fact that we need the expected feature counts. And the expected feature counts come up in two places: when we compute the empirical expectation, relative to the data D, of the feature f_i, but also in the second term of this derivative, relative to the current model, that is, the expectation, relative to the parameter vector theta of the current model, of the feature f_i. Now, this second piece is really the rate-limiting step, because it requires that at each point in the process we conduct inference on the model in order to compute the gradient. And so this is an expensive computation, because it requires that we run inference in a graphical model, which we know to be a not inexpensive step, depending on the complexity of the graphical model.
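As an illustration of the whole pipeline, here is a sketch that fits the same kind of toy log-linear model with SciPy's L-BFGS-B optimizer; the inference step is brute-force enumeration, standing in for the calibration you would run in a real model, and the features and data are hypothetical. At the optimum, the moment-matching condition from earlier holds.

```python
import itertools

import numpy as np
from scipy.optimize import minimize

# Hypothetical features f1 = 1{A=1,B=1}, f2 = 1{B=0,C=1} and data.
features = [
    lambda a, b, c: float(a == 1 and b == 1),
    lambda a, b, c: float(b == 0 and c == 1),
]
data = [(1, 1, 0), (1, 1, 1), (0, 1, 0), (0, 0, 1), (1, 0, 0), (0, 1, 1)]

xs = list(itertools.product([0, 1], repeat=3))
feat_matrix = np.array([[f(*x) for f in features] for x in xs])
emp = np.array([[f(*x) for f in features] for x in data]).mean(axis=0)

def objective(theta):
    # Scaled negative log-likelihood: log Z(theta) - E_D[theta . f],
    # with gradient E_theta[f] - E_D[f]. "Inference" = full enumeration.
    scores = feat_matrix @ theta
    logZ = np.log(np.exp(scores).sum())
    p = np.exp(scores - logZ)
    model_exp = p @ feat_matrix
    return logZ - emp @ theta, model_exp - emp

res = minimize(objective, np.zeros(2), jac=True, method="L-BFGS-B")
theta_hat = res.x

# Check moment matching: E_theta_hat[f_i] should equal E_D[f_i].
p = np.exp(feat_matrix @ theta_hat)
p /= p.sum()
model_exp_hat = p @ feat_matrix
# For this hypothetical data set, theta_hat ≈ [log(4/3), log(2/3)].
```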

So, to make this concrete, let's look at an actual example. Let's return to our example of the Ising model, where the energy function of a set of random variables X1 up to Xn is a sum of a bunch of pairwise terms, w_ij x_i x_j, and a bunch of singleton terms, u_i x_i; and it's an energy function, so it's negated.

So, plugging into the gradient equation that we had before, relative to a particular parameter theta_i, let's now look at what this looks like for the two types of parameters that we have here: the singleton parameters u_i and the pairwise parameters w_ij. For the singleton parameter u_i, if we just go ahead and plug into this expression over here, we're going to have, for the empirical expectation, simply the average of x_i[m] over the data instances m. So, assuming that each variable X_i has its own parameter u_i, we're going to sum up, over all data instances m, the value of the variable x_i, and that's going to be our empirical expectation once we average it over all of the data instances. Now, what about the model expectation, the expectation relative to theta? Well, here we have two cases: we need to consider the probability that X_i is equal to +1 and the probability that X_i is equal to -1. The case where X_i is equal to +1 makes a contribution of +1 to the expectation, and the case where X_i is -1 makes a contribution of -1 to the expectation. And so altogether we have P_theta of X_i = 1 minus P_theta of X_i = -1, and that's because the two cases have these two different coefficients, +1 and -1.

A slightly more complicated version of this comes up when we take the derivative relative to the pairwise parameter w_ij. Here, once again, the empirical expectation is simply the average of x_i[m] x_j[m] over the data instances m, and the probability term has four different probabilities corresponding to the four cases: X_i and X_j are both +1, both -1, or one takes the value +1 and the other -1, and vice versa. The two cases of +1, +1 and -1, -1 have a coefficient of +1, because the product x_i x_j is 1 in those cases, and the other two have coefficients of -1. And so we end up with this expression: P_theta of X_i = 1, X_j = 1, plus P_theta of X_i = -1, X_j = -1, minus the probability of the other two cases.

So, to summarize: we've seen that the partition function in undirected graphical models couples all of the parameters in our likelihood function, and that introduces complexities into this setting that we did not see in the context of Bayesian networks. Specifically, we've seen that there's no closed-form solution for optimizing the likelihood for these undirected graphical models. On the other hand, we have also seen that the problem is a convex optimization problem, which allows it to be solved using gradient ascent, usually the L-BFGS algorithm, and because of the convexity of the function, we're guaranteed that this is going to give us the global optimum regardless of the initialization.

The bad news, the further piece of bad news other than the fact that there is no closed-form optimization, is that in order to perform this gradient computation, we need to perform probabilistic inference, and we need to do that at each gradient step in order to compute the expected feature counts. And we've already seen that inference is a costly thing, and so this really dramatically increases the cost of parameter estimation in Markov networks, say, in comparison to Bayesian networks. On the other hand, the features whose expectations we are computing are always contained within clusters in a cluster graph or in a clique tree, because of the family preservation property. And that implies that a single calibration step suffices for all of the feature expectations at once. And so a single calibration, say of the clique tree, is going to be enough for us to compute all of the expected feature counts, and then we can go from there and compute the gradient.
