Hello, hello everybody, welcome back. Welcome to the fifth and final module for the course. Now in our last module, we were talking about uniformly most powerful tests, which are great. They are best tests; you can't do any better. But in some cases, they don't exist. Often when you have a two-sided alternate hypothesis, they don't exist, but not every time. It's possible to have a two-sided alternate hypothesis and have a UMP test, and we talked about that a little bit in the last module. The other types of tests we've done in this course have been what I would call intuitive or common-sense tests. We said: I want to test about the mean mu, so I'm going to use the sample mean X bar. I want to test about the variance sigma squared, so I'm going to use the sample variance s squared. I know when the alternate hypothesis is true, the statistic should be large or should be small, and so we figured out the right direction for the test. That's what I would call a common-sense, intuitive test. Then we have the Neyman-Pearson lemma, which gave us a recipe to follow to get a best or most powerful test. And in the case where there is no UMP test and your intuition fails you, maybe because you're dealing with a parameter in a distribution that doesn't have a nice intuitive interpretation in terms of the sample, you're going to want to use a GLRT. The acronym G-L-R-T stands for generalized likelihood ratio test. We used the terminology "likelihood ratio" just once in our UMP section, and that was when we looked at a ratio of joint PDFs. That terminology will become more understandable after we talk about likelihoods in this video, and the generalized likelihood ratio test is another test that uses a ratio of joint PDFs, or likelihoods, whatever that means. So we're about to get into it. In this first video, we're not going to do any hypothesis testing; we're going to do a quick and dirty review of maximum likelihood estimation.
If you want to see this in more depth, you should go back to the second course in this three-course specialization. Now if you've taken that, or if you've learned about maximum likelihood estimators anywhere else, feel free to skip this video; I think you'll find it quite boring. But for everyone else, I think we should do it. So let's go. I'm going to start off with a motivating example. My motivating example is this: I have a coin that is possibly unfair. It may not come up heads and tails with probability 1/2 each. I'm not sure what the probability of seeing heads is on the coin, so I'm going to call that unknown parameter p, and I would like to flip the coin a bunch of times and try to estimate p. So p is going to be some number between 0 and 1 inclusive; it can be 0, I guess, if we have a two-sided coin with tails on both sides. But what we're going to do is start flipping that coin, and every time we see a heads, we're going to record a 1, and every time we see a tails, we're going to record a 0. This gives us a random sample from the Bernoulli distribution with parameter little p, which is the same little p we're using here to represent the probability of getting heads on the coin. Now, p is unknown and we'd like to estimate it. We have common-sense ways to do this: it's going to be the proportion of 1's in the sample. But maximum likelihood estimation is sort of a more rigorous and formal way to do things, so you don't have to rely on your common sense, or lack thereof. So here's the idea. You observe the data from your coin flips, a bunch of 0's and 1's. Remember, little p is the probability we see heads on the coin. Now if in your data you have a majority of 1's and not so many 0's, it's going to seem most likely to you that the true value of p is kind of close to 1. And if the majority of your data consists of 0's and there are not so many 1's in the data, that means you saw a lot of tails, and that means you did not see a lot of heads.
So it's going to seem more likely that the true probability p of seeing heads is kind of small and closer to 0. And there are all sorts of other things we can look at. If you see roughly 50-50 in your data, 50% 1's and 50% 0's, then the value of p that is most likely in the interval from 0 to 1 is kind of closer to the center, to 1/2. And we can say this sort of thing about other kinds of proportions: 30% heads and 70% tails, things like that. So this is the idea behind maximum likelihood estimation. You want to observe your data, and then, once you have fixed data, you want to find the value of the parameter that makes that observed data most likely. So let's formalize this. Continuing in our example, the Bernoulli distribution has probability mass function f(x; p), where p is the parameter. It is p to the x times 1 minus p to the 1 minus x, and x can take on the values 0 and 1; the probability mass function is 0 for other values of x. The parameter space here is all values of p between 0 and 1 inclusive. And given that we can write down the probability mass function, we can write down the joint probability mass function, because our Bernoullis are assumed to be independent and identically distributed. So we get to take a product of the single probability mass function with all of the different x's plugged in. When you do that, you end up with p to the sum of the x's, times 1 minus p raised to the n minus the sum of the x's. And all of the x's have to be in the discrete set of 0 and 1; otherwise, the joint probability mass function will be 0. Now, for this discrete example, the joint probability mass function has an argument vector of lowercase x's. And that represents, in the discrete case, the joint probability that capital X1 is observed to be little x1, and capital X2 is observed to be little x2, and so on. And so if you think of the data as fixed, you have the observations, and we want to look at this probability as a function of p, because there is a p in there.
It's p to the sum of the x's times 1 minus p raised to the n minus the sum of the x's. So thinking of the x's as fixed and constant, the maximum likelihood estimator for p is going to be the value of p that makes this joint probability the largest. It is going to be the value of p that makes us seeing these particular observations most likely. Again, this p is called a maximum likelihood estimator. And here's how you would go about determining that value for p. Your first step would be to write out the joint pdf, so you can see the relationship between the p's and the x's. And then I want to think of the data, the x's, as fixed. So if they're fixed, for example, this exponent, the sum as i goes from 1 to n of the x's, is just a constant. And so, thinking of the joint probability mass function as a function of p, and sort of forgetting about the x's (those are just constants in there), I want to give this function a new name. We're going to call it capital L(p), and that is known as a likelihood function. So it's known as a likelihood function, and not the likelihood function, for reasons that will become apparent in a few slides. Our goal is to find the MLE for p, and that means we want to maximize this function as a function of p. So here is our likelihood function, and our goal is to maximize this function as a function of p, thinking of those x's as constant. Now, if you imagine a p axis and some likelihood function that has a maximum, the location of that maximum on the p axis is going to be what we're calling our MLE; it's our estimator, and we're going to denote it by p hat. Note that if I multiply that function by, say, 3, then all the values are going to kind of go up. So this curve is going to look like a higher curve, but the maximum is going to occur at the same place. So I can multiply or even divide the likelihood by a constant.
And so if there was, say, a 3 in front of the joint pdf, I could drop it and that could be my likelihood, or I could keep it and that could be my likelihood, and we're in the regime where the x's are considered constant. So if my joint pdf also had a multiplicative constant in front that was, say, the product of the x's as i goes from 1 to n, that is constant with respect to p and could be dropped, giving us a simpler likelihood. This is why I said a likelihood and not the likelihood. For the likelihood function, you'll be okay if you always take it to be the joint probability mass function, but it might be easier to take it to be the joint probability mass function with some multiplicative constant removed. So we want to maximize this, and that is the goal. In most cases it is easier to maximize the log of the likelihood, because the log is an increasing function. If you have two numbers where one is bigger than the other and you put them into the log function, they may become much further apart or closer together, but the log will maintain the ordering: the log of the smaller number is smaller than the log of the larger number. So we are changing the likelihood, but it's not going to affect where the maximum occurs. And so it's almost always easier to maximize the log of the likelihood. But keep in mind, if you're doing a lot of maximum likelihood estimation, you're not always going to be taking the log. It's almost always easier, but one day it won't be; one day taking the log will make the problem more complicated. And so I want you to always remember that it's not necessary to take the log. So here is the log of this likelihood up here. I'm using the fact that the log of a product is the sum of the logs, and the fact that the log of something to a power is the power times the log of the thing. If I want to maximize this, it's a function of p.
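The claims above, that multiplying the likelihood by a constant or taking its log does not move the location of the maximum, are easy to check numerically. Here's a minimal sketch (the data, 7 heads in 10 flips, is made up purely for illustration) that compares the grid maximizer of a Bernoulli likelihood L, of 3 times L, and of log L:

```python
import numpy as np

# Hypothetical data for illustration: 7 heads out of n = 10 flips.
s, n = 7, 10
p = np.linspace(0.001, 0.999, 9999)          # grid of candidate p values
L = p**s * (1 - p)**(n - s)                  # Bernoulli likelihood

def argmax_p(f):
    """Return the grid value of p where f is largest."""
    return p[np.argmax(f)]

# All three curves peak at the same p (near s/n = 0.7).
print(argmax_p(L), argmax_p(3 * L), argmax_p(np.log(L)))
```

Scaling by 3 shifts the whole curve up multiplicatively, and log is increasing, so neither changes which grid point wins.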
I'm going to take a derivative with respect to p and set it equal to 0. I think the easiest way to solve this equation is not to try to combine the two terms on the left over a common denominator, but instead to note that the common denominator would be p times 1 minus p, and multiply both sides of the equation through by p times 1 minus p. On the right side, where we have 0, we'll still have 0, and on the left side all of the denominators will cancel out, so there will be no more fractions. But solve it however you want. Once you solve for p, you should put capital X's back in the problem, because what we're trying to report is a maximum likelihood estimator, which is a random variable, as opposed to a maximum likelihood estimate. I know I'm exaggerating those words, but estimator versus estimate: the first is a random variable and the second is an observed number. And we want to look at properties of maximum likelihood estimators much like we've looked at properties of, say, a sample mean before we actually observe the sample, so that we can talk about probability and things are not locked in. So for the maximum likelihood estimator, in the end: you write down the likelihood function, and you want to maximize it. It's usually easier to maximize the log of that likelihood function, so you take the log, then you take the derivative with respect to the parameter, you set it equal to 0, and you solve for the parameter. In real life, you should check that you didn't actually minimize the likelihood. But it's a really rare likelihood in statistics where setting the derivative equal to 0 and solving would give you a minimum. With all of the nice known named distributions, the exponential, the normal, the gamma, the beta distribution, the Pareto distribution, you're not going to get a minimum. But in real life, if this is in your research, you might want to take a second derivative to make sure that you maximized and did not minimize the likelihood.
So once you're done, you make sure all of the x's are capital and you throw a hat on it. This is our MLE for p; by the way, it's X bar. And this is the common-sense estimator for p, because looking at the 1's and 0's, X bar is the sum of them divided by n, the total number you have, and that's a proportion: the proportion of 1's you see. So in this case the maximum likelihood estimator is a highly intuitive estimator as well. Okay, so now for continuous X1 through Xn, say X1 through Xn coming from a normal distribution with a mean mu and a variance sigma squared, where both can be unknown or just one can be unknown. Here the idea that the joint probability density function is a joint probability is no longer true. It's a surface in n+1 dimensional space under which volume represents probability, so you have to integrate it to get probabilities. That said, MLEs are defined in the same way: you start with the joint pdf, even if it doesn't represent probability, and you turn that into a likelihood, which may be equal to the joint pdf or may have some multiplicative constants dropped, and you go. So for the example in this video for the continuous MLE, I'm going to take a random sample X1 through Xn from the continuous Pareto distribution with parameter lowercase gamma, and this Pareto distribution looks like this. There are other Pareto distributions. Some people don't have a 1 plus x to the gamma plus 1 in the denominator; instead, they have an x to the gamma plus 1, and the support for this function does not start from zero, it instead starts from one. And there are even more variations of the Pareto distribution, but this is the one I want to use for this example. So our goal is to find the maximum likelihood estimator for gamma, this lowercase gamma, this parameter, which by the way does not have a nice interpretation here. It's not the mean of the distribution or the variance of the distribution; it's kind of like a shape parameter.
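To see the Bernoulli result numerically, here's a small sketch (assuming NumPy; the seed, sample size, and true p are made up for illustration). It maximizes the Bernoulli log-likelihood from the derivation over a grid of p values and compares the maximizer to the sample mean:

```python
import numpy as np

# Hypothetical simulation: 200 flips of a coin with true p = 0.3.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)
n, s = len(x), x.sum()

# Log-likelihood from the derivation: (sum x) log p + (n - sum x) log(1 - p).
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = s * np.log(p_grid) + (n - s) * np.log(1 - p_grid)

p_hat_numeric = p_grid[np.argmax(log_lik)]
print(p_hat_numeric, x.mean())   # agree up to grid resolution
```

The grid maximizer lands on top of the sample proportion of 1's, which is exactly the closed-form answer p hat = X bar.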
And how would you even go about estimating that by looking at a sample? This is one where, again, I think there's no intuitive way to go about it. So, maximum likelihood estimation to the rescue. For the joint pdf, because the x's are IID, you take the individual single pdf, you plug in all the different x's, and you multiply. I did talk before about using capital X's versus lowercase x's. I know you can't really tell from my font, but I'm using lowercase x's here. It's not important at all; these are just steps of algebra. But in the end, when you report the MLE for gamma, make sure you put the capital random X's back in. So here is the pdf with the x i's plugged in, and when you multiply them, you get this as the joint pdf. And I think it's going to be convenient to take that exponent from on each term in the product to outside of the whole product, so we can do that. Our likelihood should either be this joint pdf, or maybe the joint pdf with some multiplicative constants dropped, if that's convenient, because it can make the whole process easier when you have less stuff going on. In this case, there isn't an obvious constant. But there is one if you break up the denominator and make it into the product of the 1 plus x i raised to the gamma, times the product of the 1 plus x i raised to the 1. Then you can pull a 1 over the product of the 1 plus x i out of the entire thing and drop it. I'm going to keep the entire joint pdf as my likelihood, but another likelihood would be something that looks just like this, except with the exponent in the denominator as gamma and not gamma plus 1. I'm going to leave the plus 1 in, and we'll see exactly where it goes and why it doesn't matter whether we left it in there or dropped it. Now for the log of the likelihood. Remember, your goal is to maximize the likelihood, and you don't have to take the log, but it usually makes things easier. The log of the likelihood looks like this.
I used the fact that the log of a quotient is the log of the top minus the log of the bottom, and I again used the fact that the log of something to a power is the power times the log of the thing. So we get this, and the next step I'm about to do is completely optional. You don't have to do this, but I'm going to do it: I'm going to rewrite the log of this product as a sum of logs. You wouldn't be wrong to leave it as the log of a product. But the reason I'm doing it is because if we want to get a maximum likelihood estimator for little gamma in the end, and then do some statistics with it, maybe a hypothesis test with it, we're going to need to know certain distributions when the null hypothesis is true. And it's much easier in statistics to deal with distributions of sums of things rather than products of things. So this is the statistician in me saying, I don't like seeing products anywhere, turn it into a sum; but that was totally optional. So the derivative of the log likelihood looks like this, and look what happened in the second term. In the first term, the derivative of the log of gamma with respect to gamma is 1 over gamma, so that's what you're seeing there. And then in the second term, if you multiply that sum of the logs out across the gamma plus 1, you have gamma times the sum of the logs plus the sum of the logs (with a minus out front; let's ignore that minus). That last piece, the plain sum of the logs, has no gammas in it, so when you take the derivative with respect to gamma, it's going to become zero. And that's why it doesn't matter if you have that plus 1 in there. So, setting this equal to zero and solving for gamma in the final step, I want to make sure all of my x's are capital. I want to make sure my MLE is a random variable, an estimator rather than an estimate, just for the record.
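The derivative above, n over gamma minus the sum of the logs of 1 plus x i, gives the closed-form MLE gamma hat = n divided by the sum of log(1 + X i). Here's a hedged numerical sketch (assuming NumPy; the seed, sample size, true gamma, and the inverse-CDF sampler for this version of the Pareto are all choices made for illustration) that checks the closed form against a grid search on the log-likelihood:

```python
import numpy as np

# Hypothetical simulation from this video's Pareto: f(x; gamma) = gamma / (1 + x)^(gamma + 1), x > 0.
# Its CDF is F(x) = 1 - (1 + x)^(-gamma), so inverse-CDF sampling gives x = (1 - u)^(-1/gamma) - 1.
rng = np.random.default_rng(1)
gamma_true = 2.5
u = rng.uniform(size=500)
x = (1 - u) ** (-1 / gamma_true) - 1

n = len(x)
s = np.log(1 + x).sum()

# Closed-form MLE from setting the derivative of the log-likelihood to zero.
gamma_hat = n / s

# Grid search on the log-likelihood: n log(gamma) - (gamma + 1) * sum(log(1 + x_i)).
g_grid = np.linspace(0.01, 10, 100000)
log_lik = n * np.log(g_grid) - (g_grid + 1) * s
gamma_numeric = g_grid[np.argmax(log_lik)]

print(gamma_hat, gamma_numeric)   # agree up to grid resolution, both near 2.5
```

Note the sum of the logs appears with the full gamma plus 1 exponent here; as the derivation says, the extra "plus 1" piece is constant in gamma and doesn't move the maximizer.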
I can say these words correctly, estimator, estimate, but I'm trying to emphasize those endings. So make everything capital, throw a hat on it, and you've got your MLE. Again, in real life, in important research, you might want to take a second derivative and make sure that you maximized and did not minimize the likelihood. But you don't have to worry about that for this course. It's not going to happen; it rarely happens. You would need a really weird pdf for that to happen. So that was a very brief review of, or introduction to, maximum likelihood estimators. There are more complicated cases to consider, like what do you do if you have two unknown parameters? For example, a normal distribution with unknown mean mu and unknown variance sigma squared: how do you find the maximum likelihood estimators for both of those at the same time? And there are also special cases when the parameters sort of define the support of the distribution. Let's say you have a uniform distribution on the interval from zero to some theta, and you don't know what theta is, and you want to find a maximum likelihood estimator. If you went through the procedure we just went through, took the joint pdf, made a likelihood, took the log of that, took a derivative, and set it equal to zero, you'd find that there are no solutions, and that's because the function is strictly decreasing. This is not clear right now; you'd have to write it out. But once you wrote it out, you would see that the log likelihood is a decreasing function of theta, and so you would have to think about maximizing it at an endpoint, because we do have kind of finite intervals in there. So again, the cases we haven't covered, which you can find in the previous course in the specialization, are multiple parameters, parameters in the support of the distribution, and functions of parameters.
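The uniform case mentioned above can be seen numerically. A small sketch (assuming NumPy; seed, sample size, and true theta are made up for illustration): for a Uniform(0, theta) sample, the likelihood is theta to the minus n for theta at least the sample maximum and 0 otherwise, which is strictly decreasing in theta, so the maximum sits at the boundary, theta hat = max(X i):

```python
import numpy as np

# Hypothetical simulation: 100 draws from Uniform(0, theta) with theta = 4.
rng = np.random.default_rng(2)
theta_true = 4.0
x = rng.uniform(0, theta_true, size=100)

# The likelihood is theta^(-n) only where theta >= max(x); it is 0 below that,
# so we search on a grid starting at the sample maximum.
theta_grid = np.linspace(x.max(), 10, 50000)
log_lik = -len(x) * np.log(theta_grid)       # strictly decreasing in theta

theta_numeric = theta_grid[np.argmax(log_lik)]
print(theta_numeric, x.max())                # the maximizer is the sample maximum
```

The grid search lands on the very first grid point, the sample maximum, confirming that the derivative-equals-zero recipe fails here and the boundary is where the maximum lives.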
If you find the maximum likelihood estimator of a parameter mu, is the maximum likelihood estimator of mu squared your estimator squared? The answer is actually yes, but that's not true of lots of other kinds of estimators, and that is known as the invariance property of MLEs. So next up, we're going to get back to hypothesis testing and we're going to define what is meant by a generalized likelihood ratio test. So I will see you in the next one.