0:00

We begin by discussing the simplest learning problem, which is that of estimating a set of parameters. The key building block for parameter estimation is what's called maximum likelihood estimation, so we're first going to define that and then build on it in the context of more interesting problems.

So, let's go back to the simplest possible learning problem, where we are generating data from a biased coin, which is a Bernoulli distribution: the probability that x takes the value one is defined by some parameter theta, and the probability that it takes the value zero is therefore one minus theta. And let's imagine that we're given access to a data set D, which is a bunch of instances x1 up to xM that are sampled IID from the distribution P. Let's remind ourselves what IID means. The first I means that the tosses are independent of each other, and the ID means that they're identically distributed, which means they're all sampled from P. And our goal is to take D and reconstruct theta.

So, let's think about this as a probabilistic graphical model; in fact, it's a probabilistic graphical model that we've seen before, when we first talked about plate models. Here, we have a parameter theta and a bunch of replicates of the same experiment, which are these x's over here, which are sampled from theta independently. And we can see that if you think about this as a graphical model, as written over here, each xi depends on theta, and the xi's are conditionally independent of each other once theta is known. So, they're independent and identically distributed given theta.

And if you think about this as a graphical model, well, it had better have CPDs. The CPD for the mth toss given theta says that x takes the value x1 with probability theta, and the value x0 with probability one minus theta, and that's a completely legitimate CPD. It's a CPD whose parent is theta, the parent of the x variable. Theta is a continuous variable, but we've done those before.
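To make the generative setup concrete, here is a minimal simulation of IID Bernoulli tosses (the function name and seed are my own choices, not from the lecture):

```python
import random

def sample_coin_tosses(theta, M, seed=0):
    """Draw M IID Bernoulli(theta) tosses: 1 with probability theta, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < theta else 0 for _ in range(M)]

D = sample_coin_tosses(0.6, 1000)
print(sum(D) / len(D))  # empirical frequency of ones, close to theta = 0.6
```

With enough tosses, the empirical frequency of ones concentrates around the true theta, which is exactly what the estimation procedure below will exploit.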

Â 2:43

So, now that we've specified this as a probabilistic graphical model, we can go back and think about how maximum likelihood estimation would work. The goal is to find the parameter theta, in the interval [0, 1], that predicts D well. So, the quality of a particular parameter theta is how well it predicts D, which can also be viewed as how plausible D is given a particular value of theta.

So, let's try and understand what that means. We can ask: what is the probability of the D that we saw, given a particular value of theta? And since the tosses, or the instantiations xi, are conditionally independent given theta, we can write this as the product over all of the M instances of the probability of the mth instance given theta.

And so now, let's think about what that is in the context of a concrete example. Let's assume that we have tossed this coin five times, and we have three heads and two tails, or three ones and two zeros. So, if we actually write down what this probability function looks like, we can see that it's going to be

Â 4:13

the probability of getting heads given theta, times the probability of getting tails given theta, times the probability of the second tail, times the probability of heads, times another probability of heads. And the probability of the first head given theta is theta, then one minus theta for the tail, one minus theta, theta, theta. Or: theta to the power of three, times one minus theta to the power of two, and that is exactly the function that is drawn over here.

Now, if we're looking for the theta that predicts D well, we just define that as the theta that maximizes this function. If you draw a line from this maximum down to the bottom, you can see that this function is maximized at a point near 0.6, which, not surprisingly, is the same as the three heads that we saw over the five total tosses.

But generalizing that, let's assume that we have observed, in this context, MH heads and MT tails, and we want to find the theta that maximizes the likelihood. Just as in the simple example that we had, this likelihood function is going to have theta appearing MH times and one minus theta appearing MT times. And so that's going to give us a likelihood function that looks just like the likelihood function that we saw on the previous slide.
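The five-toss example can be checked numerically; a brute-force grid search over theta (a sketch for illustration, not the lecture's method) recovers the maximum at 0.6:

```python
def likelihood(theta, m_h=3, m_t=2):
    """L(theta) = theta**MH * (1 - theta)**MT for the 3-heads, 2-tails example."""
    return theta**m_h * (1 - theta)**m_t

# Evaluate the likelihood on a fine grid of theta values in [0, 1]
# and pick the grid point where it is largest.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
print(best)  # 0.6, matching MH / (MH + MT) = 3/5
```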

And if we think about how we can go about maximizing such a function, then usually we take the following steps. First, it's convenient to think about not the likelihood but rather what's called the log-likelihood, denoted by a small l, which is simply the log of this expression over here, and that has the benefit of turning a product into a summation. So that gives us a simpler optimization objective, but one that has the exact same maximum. And we can now go ahead and do the standard thing for maximizing a function like this, which is differentiating the log-likelihood and solving for theta. And that's going to give us an optimum which is exactly as we would expect: it's the fraction of heads among the total number of coin tosses, and that's the maximum of this log-likelihood function, and therefore of the likelihood as well.

Now, an important notion in the context of maximum likelihood estimation, which is also important when we develop it further, is the notion of a sufficient statistic.
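For the coin-toss likelihood, the differentiation just described can be written out explicitly (a standard calculus sketch, matching the result stated above):

```latex
\ell(\theta) = \log\left[\theta^{M_H}(1-\theta)^{M_T}\right]
             = M_H \log\theta + M_T \log(1-\theta)

\frac{d\ell}{d\theta} = \frac{M_H}{\theta} - \frac{M_T}{1-\theta} = 0
\;\Longrightarrow\;
M_H(1-\theta) = M_T\,\theta
\;\Longrightarrow\;
\hat{\theta} = \frac{M_H}{M_H + M_T}
```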

So, when we computed theta in the coin toss example, we defined the likelihood function as an expression that has this form. And notice this expression didn't care about the order in which the heads and the tails came up; it only cared about the number of heads and the number of tails. And that was sufficient in order to define the likelihood function, and therefore sufficient for maximizing the likelihood function. And so, in this case, MH and MT are what's known as sufficient statistics for this particular estimation problem, because they suffice in order to understand the likelihood function and therefore to optimize it.

So, when in general is a function of the data set a sufficient statistic? It's a function from instances to a

Â 8:02

vector in Rk that has the following property. For any two data sets, D and D prime, and any possible parameter theta, we have that the following is true: if s of D is equal to s of D prime, then the likelihood function for those two data sets is the same. So, what's s of D? s of D is the sum of the sufficient statistics for all of the instances in D.

So, let's make this a little bit more crisp. What we're trying to do is look at a bunch of data sets, for example, all sequences of coin tosses, and we're trying to summarize each data set using a smaller, more compact notion: statistics that throw out aspects of the data that are not necessary for reconstructing its likelihood function.
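As an illustrative check of this order-independence, here is a small sketch (the data and function are hypothetical, not from the lecture): two coin-toss sequences with the same counts of heads and tails have the same likelihood for any theta.

```python
import math

def bernoulli_likelihood(theta, D):
    """Product of per-toss probabilities: theta for a 1, (1 - theta) for a 0."""
    p = 1.0
    for x in D:
        p *= theta if x == 1 else 1 - theta
    return p

# Two different orderings with the same sufficient statistics (3 ones, 2 zeros).
D1 = [1, 1, 1, 0, 0]
D2 = [0, 1, 0, 1, 1]
print(math.isclose(bernoulli_likelihood(0.3, D1),
                   bernoulli_likelihood(0.3, D2)))  # True
```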

So, let's look at the multinomial example, which is the generalization of the Bernoulli example that we had before. So, assume that what we have is a set of measurements of a variable x that takes on k possible values. Then, in this case, the sufficient statistics, just like before, where we had the number of heads and the number of tails, are the number of times each of the k values came up. So, we have M1, M2, up to Mk.

So, for example, if you're looking for sufficient statistics for a six-sided die, you're going to have M1 up to M6, representing the number of times that the die came up one, the number of times it came up two, and so on for three, four, five, and six.

And so, what is the sufficient statistic function in this case? Remember, it's a tuple of dimension k, where k is the number of different values of the variable. And the sufficient statistic for the value xi is a vector with a one only

Â 10:34

in the ith position and zeros everywhere else. So, if we sum up s of x_m over all instances m in the data set, we're going to get a vector where you only get a contribution of one when the mth data instance comes up with that particular value, and so that's going to be M1, M2, up to Mk. And this is a sufficient statistic, because the likelihood function can then be reconstructed as a product of theta i to the power Mi, where this theta i here is the parameter for x equals little xi.
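A minimal sketch of this in code, assuming values are encoded as the integers 1 through k (the data and function names are my own):

```python
from collections import Counter

def suff_stats(D, k):
    """M1..Mk: the number of times each of the k values appears in D."""
    c = Counter(D)
    return [c[i] for i in range(1, k + 1)]

def likelihood_from_stats(thetas, counts):
    """Reconstruct L(theta; D) = prod_i theta_i ** M_i from the counts alone."""
    L = 1.0
    for theta_i, m_i in zip(thetas, counts):
        L *= theta_i ** m_i
    return L

rolls = [1, 3, 3, 6, 2, 3, 5, 1, 4, 6]   # hypothetical six-sided-die data
counts = suff_stats(rolls, 6)
print(counts)  # [2, 1, 3, 1, 1, 2]
fair = [1 / 6] * 6
print(likelihood_from_stats(fair, counts))  # equals (1/6)**10 for a fair die
```

The point is that `counts` is all the likelihood ever needs: the individual rolls and their order can be thrown away.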

Â 11:30

Let's look at a different example: the sufficient statistics for a Gaussian distribution. So, as a reminder, this is a one-dimensional Gaussian distribution that has two parameters: mu, which is the mean, and sigma squared, which is the variance. And it can be written in the following form, which is one that you've seen before. And we can rewrite the exponent in the following way: we can basically blow out the quadratic term in the exponent, and you end up with a likelihood function that has minus x squared times one term, plus x times another term, minus a constant term.

And the sufficient statistics for a Gaussian can now be seen to be x squared, x, and one. Because when we multiply P of x for multiple occurrences of x, we're going to end up adding up the x squared terms for the different data cases, adding up the x's for the different data cases, and the last one is just going to be the number of data cases. So, s of the data set D is going to be the sum over m of x_m squared, the sum over m of x_m, and the number of data cases M. And from that, we can reconstruct the likelihood function.
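Here is a small sketch of the Gaussian sufficient statistics and the resulting estimates (the data set is made up for illustration; the variance is recovered from the statistics via E[x^2] - (E[x])^2):

```python
import math

def gaussian_suff_stats(D):
    """Sufficient statistics for a 1-D Gaussian: (sum x^2, sum x, M)."""
    return (sum(x * x for x in D), sum(D), len(D))

def gaussian_mle(D):
    """MLE: mu is the empirical mean, sigma the empirical standard deviation."""
    sum_sq, sum_x, M = gaussian_suff_stats(D)
    mu = sum_x / M
    var = sum_sq / M - mu * mu       # E[x^2] - (E[x])^2
    return mu, math.sqrt(var)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = gaussian_mle(data)
print(mu, sigma)  # 5.0 2.0
```

Note that only the three statistics ever enter the estimate; the individual data cases are not needed.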

Â 13:13

How do we now perform maximum likelihood estimation? So, as we talked about, we want to choose theta so as to maximize the likelihood function. And if we just go ahead and optimize the functions that we've seen on the previous slides, then for a multinomial, the maximum likelihood estimate turns out to be simply a fraction: for the value xi, theta i is the fraction of xi's in the data set D, which again is a perfectly natural estimate to use. For a Gaussian, we end up with the following as the maximum likelihood estimate: the mean is the empirical mean, that is, the average over all of the data cases, and the standard deviation is the empirical standard deviation.

So to summarize, maximum likelihood estimation is a very simple principle for selecting among a set of parameters given a data set D. We can compute the maximum likelihood estimate by summarizing the data set in terms of sufficient statistics, which are typically considerably more concise than the original data set D. And so that provides us with a computationally efficient way of summarizing a data set so as to do the estimation. And it turns out that for many parametric distributions that we care about, the maximum likelihood estimate has an easy-to-compute closed-form solution given the sufficient statistics.
