1:02

In addition, if your data is strictly positive, then having additive Gaussian errors presents a problem, because that allows for positive mass on negative values. Now, that may not be a problem, and I often say it's not a problem to have strictly positive data where the errors look Gaussian and there's just very low probability down towards the negative values. That's often okay. But, on the other hand, if the normal distribution is putting a lot of positive probability on negative values, even though you know your response has to be positive, then that's problematic for your model.

Â 1:45

Alternatively, you might try a transformation. A common transformation when your outcome has to be strictly positive is to take the natural log of it. The natural log, in my mind, is perhaps the most interpretable transformation possible. Putting that one to the side, there are lots of other transformations available to try to make our data more normal. For binomial data, for example, people often take a so-called arcsine square root transformation. That often destroys a large amount of the interpretability of our model coefficients, which is a real problem.

Â 2:36

There's another reason to approach generalized linear models rather than transforming the outcome or approximating things with a linear model: it's just nice and pleasant to have a model on the scale on which the data was recorded, without a transformation, that really honors the known assumptions about the data. So if we have binary data, a model that really honors the fact that the data is binary, and doesn't require us to transform it, has a lot of pleasantness to it and makes a lot of internal sense.

Â 3:20

And then I mention here on this last point that the natural log transformation, which is probably the most common transformation, is not applicable for negative or zero values. Now, there are some fixes for that, but they harm some of the nice properties of the transformation, some of the really nice interpretable properties. So, generalized linear models come from a 1972 paper by Nelder and Wedderburn, a famous paper that, if you are a PhD statistician, you've for certain read. A generalized linear model has three components. First of all, the distribution that describes the randomness has to come from a particular family of distributions called an exponential family. This is a large family of distributions that includes things like the normal, the binomial, and the Poisson.

Â 4:14

So that's the random component, right; the exponential family is the random component. Then the systematic component is the so-called linear predictor. That's the part that we're modeling, and we've done this very much so in linear models already: the random component was the errors, and the systematic component was the linear component with the covariates and coefficients. Then we need some way to connect the two, and the link function connects an important quantity from the exponential family distribution to the linear predictor. So there are three things we need. We need the distribution, which is going to be an exponential family for a generalized linear model. We need the systematic component, which you can think of as the linear predictor, the set of regression variables and coefficients. And then we need a link function that links the two. Okay, so let's try an example.

Â 5:17

And we'll go to our familiar example, which is linear models, the subject that we've been covering the entire class up to this point. So, in this case, we are assuming that our Y is normal with mean mu, and this happens to be an exponential family distribution. Then we're going to define the linear predictor, this eta i, to be the collection of covariates, x i, times their coefficients, beta, okay? And in this case, our link function is just going to be the identity link function. It just says that the mu from the normal distribution is exactly this collection, this sum of covariates and coefficients. And so this just leads to the same model that we've been talking about all along. We could just write it again as Yi equal to the mu component, which is summation xi beta, plus the error component. So we've simply written out, in a separate way, the linear model we've talked about. Instead of saying the error is normally distributed, we say the Y is normally distributed, which is a consequence of our other specification. We specify a linear predictor separately. Okay, and then we just connect the mean from the normal distribution to the linear predictor. And you might think this seems like a crazy thing to do, when just writing it out as an additive sum with additive errors seems like such an easy thing to do. But as we move over to different settings, like the Poisson and the binomial settings, it'll be quite a bit more apparent why we're doing this.

So let's take, in my mind, perhaps the most useful variation of generalized linear models: logistic regression. In this setting we have data that are zero one, so binary, and it doesn't make a lot of sense to assume it comes from a normal distribution. The natural, really the only, distribution available to us for coin flips, for zero one outcomes, is the so-called Bernoulli distribution. So we're going to assume that our data, our outcome Ys, follow a Bernoulli distribution, where the probability of a head, the so-called expected value of the Yi, is mu i. Okay, so we're modeling our data as if it's a bunch of coin flips, only the probability of a head may switch from flip to flip, and it's not necessarily 0.5. Okay, so the probability is given by this parameter mu sub i. The linear predictor is still the same; it's just the sum of the covariates times the coefficients. Now the link function in this case, the most famous and most common one, is the so-called logit link function, or the log odds. So in this case, the way in which we're going to get from the mean, from the probability of a head, to the linear predictor is to take the log of the odds. The odds are the probability over one minus the probability, so in this case we have written it as mu over one minus mu. We're going to take the natural logarithm of it.

Â 8:57

So notice, we're transforming the mean of the distribution. We're not transforming the Ys themselves, okay? That's a big distinction, and that's the neat part of generalized linear models. We're assuming that, if we're modeling our data as coin flips, each coin has a probability of getting a head, and that probability, if we transform it in a specific way, relates to our covariates and coefficients, okay? So we can go backwards from the log odds to get back to the mean itself, okay? And so the inverse logit, I guess I often call it the expit, though I don't know how standard that is, is e to the eta i over 1 plus e to the eta i, and that gets you back to mu, okay? So going forward, you take the log of mu over 1 minus mu; that gives us eta. If we take e to the eta over 1 plus e to the eta, that gets us back to mu. And by the way, 1 minus mu, the probability of a tail, is 1 over 1 plus e to the eta.
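That forward and backward mapping is easy to check numerically. Here's a minimal sketch in Python (the helper names `logit` and `expit` are my own, not from the lecture):

```python
import math

def logit(mu):
    """Log odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(mu / (1 - mu))

def expit(eta):
    """Inverse logit: maps a linear predictor back to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

mu = 0.3
eta = logit(mu)                          # forward: probability -> log odds
assert abs(expit(eta) - mu) < 1e-12     # backward: log odds -> probability

# The probability of a tail, 1 - mu, equals 1 / (1 + e^eta).
assert abs((1 - expit(eta)) - 1 / (1 + math.exp(eta))) < 1e-12
```

The round trip recovers the original probability exactly (up to floating point), which is the whole point of using an invertible link.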

Â 10:06

So we could write out our likelihood as the Bernoulli likelihood, right there like this. And I think you can see, then, that it's through this likelihood, like we have talked about in our statistical inference class, that we're going to optimize; we're going to maximize that likelihood to obtain our parameter estimates. Okay, let's go through another example: Poisson regression. I know I'm not pronouncing it correctly, but I like to say it that way. So assume that Y is Poisson(mu i), where mu i is the expected value of each of the Poisson random variables. In this case, mu has to be larger than zero. The Poisson is extremely useful for modeling count data; that's really what it's for, modeling counts. So if you have a bunch of positive counts that are unbounded, right, not like binomial counts, which are bounded by the number of coin flips we take, then Poisson counts are unbounded, and it's a very useful model. Suppose you want to count the number of people that show up at a bus stop. You don't have an upper limit on that, or sure, there is some upper limit, but you don't really know what it is, so you might want to model that as Poisson. Our linear predictor is, again, the same as it was in every case: just the sum of the covariates times their coefficients.

Â 11:36

The link function in this case is the log link; the most common link function for the Poisson case is the log link. And remember, we go from the mean, mu, to the linear predictor, eta, by taking the log of the mean. So again, we're not logging the data; we're logging the mean of the distribution that the data is assumed to come from. And then remember, the inverse of the natural logarithm is e to that thing, so we can go backwards from eta back to mu by taking e to the eta. By doing that, we can simply write out what our likelihood is, and again, the way GLMs work is that they obtain the parameter estimates by maximizing the likelihood.
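As a small sanity check (a sketch of mine, not from the lecture): for an intercept-only Poisson model, the log likelihood, summed over observations, is maximized at the sample mean, and we can confirm that numerically with a simple grid search.

```python
import math

def poisson_loglik(mu, ys):
    """Poisson log likelihood, dropping the log(y!) term, which doesn't involve mu."""
    return sum(y * math.log(mu) - mu for y in ys)

ys = [2, 4, 1, 3, 5, 3]          # made-up counts for illustration
ybar = sum(ys) / len(ys)         # sample mean = 3.0

# Evaluate the log likelihood over a grid of candidate means;
# the maximizer should land (to grid precision) on the sample mean.
grid = [0.5 + 0.01 * k for k in range(600)]
best = max(grid, key=lambda mu: poisson_loglik(mu, ys))
assert abs(best - ybar) < 0.01
```

In real GLM software the maximization is done over the regression coefficients rather than by grid search, but the idea is the same: the estimates are the values that maximize this likelihood.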

Â 12:24

So, I give some technical facts here, and basically we're just saying that the likelihood simplifies quite a bit in all these cases because of the particular link functions that we've shown. But I want to point out this final point, which says that maximum likelihood gives us sort of an equation that we would want to solve, not unlike least squares. So think way back to our initial lectures on least squares. We found our estimates by minimizing the sum of the squared vertical distances between the fitted line and the outcome. Well, if you wanted to minimize that function in an automated way, you might take a derivative, so that function would no longer be squared; the two would come down. And then, to find the root of that derivative, you just get a linear equation. Well, in generalized linear models, by trying to maximize the likelihood, you get a very similar equation that you want to set equal to zero and solve, and that gives you your estimate; I give it right here. And it's basically very similar to the linear model case, only there's a set of weights, and a variance in the denominator, that doesn't go away like it does in the least squares case. So again, this is not for this class; it's just if you're interested in some of the details of the fitting.

Â 13:58

Basically, the point of this slide is to say that it's very similar to what's going on in least squares; just how we get to that point is a little bit more circuitous. For most people, in most settings, this is all going to be transparent to you. You're mostly going to concern yourself with the interpretation of your generalized linear model; you're not going to concern yourself too much with the specifics of how it was fit.

Â 14:45

However, for the Bernoulli case, the variance of a coin flip is p(1-p), and in the notation we're given here, it's mu(1-mu). But remember, our mu depends on i, so what we're saying is that the variance actually depends on which observation you're looking at, unlike the linear model case, where the variance is constant across i. Same thing in the Poisson case: the variance of a Poisson is its mean, so the Poisson has a variance that differs by i. This is a modeling assumption that you can check, right? So if you have Poisson data, say you have several Poisson observations at the same level of the covariates, so the mean should be the same, then the variance of those observations should be roughly equal to the mean. If your data doesn't exhibit that, then that's a problem.
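A minimal sketch of that check, using simulated data since the lecture gives none: draw Poisson counts at a fixed mean and compare the sample mean to the sample variance. (The sampler here uses Knuth's multiplication method, since Python's standard library has no Poisson generator.)

```python
import math
import random

def rpois(mu, rng):
    """Draw one Poisson(mu) variate via Knuth's multiplication method."""
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

rng = random.Random(42)                       # seeded for reproducibility
ys = [rpois(4.0, rng) for _ in range(5000)]   # 5000 counts, all at mean 4

mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)

# For genuinely Poisson data, the variance should roughly equal the mean.
assert abs(var / mean - 1.0) < 0.15
```

If the ratio of variance to mean were far from one, say two or three, that would be evidence of overdispersion, exactly the problem the quasi-Poisson option discussed below is designed to handle.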

So this is an important, practical consideration in generalized linear models: the modeling assumptions often put a restriction on the relationship between the mean and the variance, and that relationship may not hold in your specific data set. So what can you do? Well, there is a way to address this by having a more flexible variance model, even though you lose some of the assumptions of generalized linear models. These are all standard options; look for the so-called quasi options for the family, the distribution. We're going to go through lots of examples, but the point is that in R you'll see that you can fit a Poisson model with its glm function, but then you'll see there's another option called quasi-Poisson. Same thing with binomial: you'll see an option where you'd fit a binomial, but then another option that's quasi-binomial. What that's referring to is a slightly more flexible variance model, in case your data doesn't adhere to the GLM variance structure.

Â 17:18

So, just some odds and ends about the fitting before we go through the specific cases. We're going to do the Poisson case and the binomial case separately; we're not going to go through a full treatment of GLMs, just Poisson and binomial. But these equations have to be solved iteratively. So unlike the linear model, where you can just do strict linear algebra to find the solutions, GLMs actually have to be optimized, which means sometimes the program fails. For example, if you have a lot of zeros in a binary regression, these things can happen.
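To give a feel for that iterative fitting, here is a toy sketch of mine (not the full iteratively reweighted least squares algorithm that R's glm uses): Newton's method for an intercept-only logistic regression. In this special case the answer has a closed form, the logit of the sample proportion, so we can check the iteration against it. Note that if the data were all zeros, the target would be logit(0), minus infinity, and the iterates would run off without converging; that's the kind of failure just mentioned.

```python
import math

def fit_intercept_logistic(ys, steps=25):
    """Maximize the Bernoulli log likelihood over beta0 by Newton's method."""
    beta0 = 0.0
    n = len(ys)
    for _ in range(steps):
        p = math.exp(beta0) / (1 + math.exp(beta0))   # current fitted probability
        score = sum(ys) - n * p                       # derivative of the log likelihood
        info = n * p * (1 - p)                        # negative second derivative
        beta0 += score / info                         # one Newton step
    return beta0

ys = [1, 0, 0, 1, 1, 0, 1, 1]                         # 5 heads out of 8 flips
beta0 = fit_intercept_logistic(ys)
phat = 5 / 8
assert abs(beta0 - math.log(phat / (1 - phat))) < 1e-8
```

With covariates the same idea applies, only the score and information become vectors and matrices, which is why a linear-algebra step sits inside each iteration.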

Â 17:54

But other than that, I think most of the analysis should be pretty familiar to us. If we want to get our predicted response, we're just going to take our estimated coefficients, beta hat, multiply them times our regressors, and that will give us our predicted response. Now, notice this is going to be on the logit scale if you're doing, for example, logistic regression, or on the log scale if you're doing Poisson regression, so you'll have to convert it back to the natural scale if you want it to be on the same scale as the original data. So if you're modeling coin flips, and you get your regression coefficients out and come up with a predicted response, that will be on the logit scale, and if you want it to be back on the scale of the coin flip, between zero and one, you're going to need to take an inverse logit. And again, we're going to go through a lot of examples; I just want to outline these facts before we do the examples.
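As a quick sketch of that conversion with made-up numbers (the coefficients here are hypothetical, not fit to any data): form the linear predictor, then map it back to the natural scale with the inverse link.

```python
import math

# Hypothetical fitted coefficients and one new observation's regressors.
beta_hat = [-1.5, 0.8]    # intercept and slope (made up for illustration)
x = [1.0, 2.0]            # 1 for the intercept, then the covariate value

# Prediction on the linear-predictor scale (logit scale for logistic regression).
eta = sum(b * xi for b, xi in zip(beta_hat, x))

# Logistic regression: the inverse logit maps eta to a probability in (0, 1).
p = math.exp(eta) / (1 + math.exp(eta))
assert 0 < p < 1

# Poisson regression: exponentiating maps a log mean to a positive mean.
mu = math.exp(eta)
assert mu > 0
```

The same pattern holds for any GLM: predict on the link scale, then apply the inverse link to land back on the scale of the data.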

The coefficients are interpreted very similarly to the way our coefficients were interpreted in linear regression. They are the change in the expected response per unit change in the regressor, holding the other regressors constant. The only difference is that now this interpretation is done on the scale of the linear predictor. So in the binomial case it's on the scale of the logit, in the Poisson case it's on the scale of the log mean, and so on. So it's a slightly more complicated interpretation, but again, we're gaining the benefit of modeling our data naturally on its own scale, and we haven't had to transform the outcome at all.

Â 19:37

As for the inference, we also lose the nice collection of closed form, normal inferences that we had. We don't get t distributions anymore. But largely, this is transparent. There's a body of mathematics where statisticians and mathematicians have figured out what the right distributions are for comparing the coefficients. From the output of your GLM you get things like p values, and the coefficients are going to be hypothesis tested and interpreted in the same way as in our linear regression; it's just that the background machinery is a little bit harder. One thing I would say, though, is that all of these results are based on asymptotics, which means they require larger sample sizes. So if you have a GLM setting with a very small sample size, you should be cautious about relying on those results.

Â 20:40

And so many of the ideas from linear models can be brought over to GLMs. This was just a whirlwind overview of GLMs. Now, for the next two lectures, let's dig into the two most important cases: binomial and Bernoulli regression via logistic regression, and Poisson regression. We're going to spend a lot of time with those, and then, if you want further material on GLMs, there are some more advanced classes that you can take. Okay, well, thank you for attending this lecture, and I look forward to seeing you in the next one, where we're going to cover logistic regression for binary outcomes.
