0:01

So let's start with linear versus Poisson regression.

So remember, in GLMs we don't log the outcome itself, or

more generally, we don't take a transformation of the outcome itself;

we take a transformation of the mean of the outcome.

So in linear models,

our outcome is the linear component plus the error, or we could

just write that as: the expected value of the outcome is the linear component.

In a Poisson log-linear model,

it's the log of the expected outcome that is the linear part.

The log of the expected number of web hits per day is b0 + b1 times the Julian date.

We could reverse that process by exponentiating both sides of that equation

and just say that the mean web hits per day is e to

the linear component, okay?

So, that's the main difference:

we're going to assume our data are Poisson distributed with a mean.

And that mean takes this form, e to the b0 + b1 times the regressor.
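As a minimal sketch of the two model forms, with made-up coefficients b0 and b1 and a rescaled date (these are hypothetical numbers, not the lecture's fitted values):

```python
import math

b0, b1 = 0.5, 0.03     # hypothetical coefficients, not fitted values
jd = 40                # hypothetical (rescaled) Julian date

# Linear model: E[Y] = b0 + b1 * jd  (identity link)
linear_mean = b0 + b1 * jd

# Poisson log-linear model: log(E[Y]) = b0 + b1 * jd  (log link),
# so E[Y] = exp(b0 + b1 * jd), which is always positive
poisson_mean = math.exp(b0 + b1 * jd)
```

Note that the log-linear form guarantees a positive mean, which is what you want for counts.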

Okay, and that's the main difference,

though it changes the interpretation a lot.

We get a distribution that's much more believable for

our observed outcomes, okay.

And we get relative interpretations, because everything's logged:

our coefficients are going to be interpreted in a relative sense,

just like when we logged the outcome.

But we're going to avoid problems like taking logs of 0,

like we had on the previous slide, okay.
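One way to see the log-of-zero problem: logging the outcome fails as soon as any count is 0, whereas the log link only ever logs the mean, which is positive. A small standard-library-only illustration with hypothetical counts:

```python
import math

counts = [0, 3, 7, 0, 12]   # hypothetical daily counts containing zeros

# Logging the outcome directly breaks on the zero counts:
try:
    logged = [math.log(y) for y in counts]
except ValueError:
    logged = None  # math.log(0) raises ValueError

# A Poisson GLM instead logs the *mean*, which is strictly positive
# whenever any events occur, so zeros in the data are no problem.
mean_estimate = sum(counts) / len(counts)
log_mean = math.log(mean_estimate)
```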

Now, I want to reiterate: taking logs of your outcome is actually,

often, a very good thing to do.

That's a trick that you should apply, not just for count data, but

in regression in general.

If you have positive data, the log is one of the best transformations you can do;

it's extremely helpful.

The coefficients remain as, if not more, interpretable on the log scale.

So that's a great transformation to do.

Some of the other transformations, like the square root or cube root,

make interpretation much harder.

Â 2:17

So if we look at our model, it's the expected value of the outcome

is e to the beta naught plus beta one times the Julian date.

Well, by the properties of exponents,

we can factor that into e to the beta naught, times e to the beta one times the Julian date.

Now, if we look at what the expected mean would be for

the next day, the Julian date plus one, right,

that would be e to the b0 + b1 (JD + 1), okay?

So divide this by that, and you get e to the b1.
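The cancellation can be checked numerically; the algebra holds for any b0 and b1, so these are just hypothetical values:

```python
import math

b0, b1 = 0.5, 0.03   # hypothetical coefficients
jd = 40              # hypothetical Julian date

mean_today = math.exp(b0 + b1 * jd)
mean_tomorrow = math.exp(b0 + b1 * (jd + 1))

# e^(b0 + b1*(jd+1)) / e^(b0 + b1*jd): the e^b0 and e^(b1*jd)
# factors cancel, leaving exactly e^b1
ratio = mean_tomorrow / mean_today
```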

Â 3:03

So e to our slope coefficient is

interpreted as the relative increase or

decrease in the mean per one-unit change in the regressor, okay?

And so if we exponentiate our coefficients, we're going to be looking at whether or

not they're close to 1.

If we leave them on the log scale, we're going to be looking at whether or

not they're close to 0.

And, again, all of these interpretations extend to the multivariable setting:

e to the beta one is the expected relative increase or decrease in web traffic,

holding the other regressors constant.

Okay?

So I'm hoping at this point that most of this stuff is kind of old hat for you.

Â 3:54

Okay, so here is our fitted Poisson regression model overlaid onto our data.

And you can see it actually fits pretty closely to the linear model,

though it has some nice curvature to it, which is what we wanted.

We could have accomplished that in our linear model by adding a squared term,

of course, but it's nice to note that a simpler model

with fewer coefficients seems to fit the data better.

A concern, often, is that in a Poisson model the variance has to equal the mean.

So the variance, in this case, needs to go up as the mean goes up.

But here, if we plot the fitted values versus the residuals,

it's very clear that the variance is higher for lower mean values.

That's the problem, okay.
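To make the variance-equals-mean property concrete, here is a standard-library-only sketch that simulates Poisson draws (using Knuth's algorithm, since Python's standard library has no Poisson sampler) and checks that the sample mean and variance roughly agree:

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson(lam) variate via Knuth's algorithm."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)
lam = 5.0
draws = [rpois(lam, rng) for _ in range(20000)]

mean = sum(draws) / len(draws)
var = sum((y - mean) ** 2 for y in draws) / len(draws)
# For genuinely Poisson data, mean and variance are both close to lam;
# residual plots flag trouble when the data violate this.
```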

So we need at least some way to account for

the fact that the variance need not equal the mean.

There are a lot of ways to do that, and if you read the book,

one thing that we talk about is the quasi-Poisson model.

This model takes the variance to be a constant multiple of the mean,

rather than equal to the mean.

But even that doesn't appear to hold in this case, because it looks

like we have this issue where there's larger variance for

lower fitted values, when the Poisson model assumes the opposite.

So Jeff actually had some code here using the sandwich package,

which seems like a funny name for a package.

But it comes from the sandwich variance estimator,

made famous by generalized estimating equations, which, by the way, was

a technique that was invented here at Johns Hopkins Biostatistics by two very

well-known professors here, Scott Zeger and Kung-Yee Liang.

At any rate, Jeff wrote some code here for getting model-agnostic standard errors.

And if you read the book, there's a little bit more discussion about this.

This is a more advanced topic than we would like to delve into in this class.

However, it's a very important applied topic as well;

it's not just a theoretical exercise.

So, the main point is to do some residual plots, to understand whether or

not you think your model's assumptions hold.

Try a quasi-Poisson model, because that's a very easy thing to do in R,

if you think the model at least holds at some level, with the variance

being a constant multiple of the mean rather than equal to it.

But if it really fails, like in this case,

then you have to dig into some other solutions.

Â 6:20

So, in this case,

you see here's the confidence interval on the top, if we don't do anything.

If we do the model-agnostic confidence interval, you get the one on the bottom.

In this case, it doesn't actually make that big of a difference between the two.

And, again, these are both, of course, non-exponentiated.

If you want to exponentiate them, which, by the way, I should also mention:

exponentiating a small coefficient like this basically just adds 1.

So, probably, if I were to enter this into R, this would be about 1.002;

just like before, that's about a 0.2% increase estimated at the low end of the interval.
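That shortcut is just the first-order approximation e^x ≈ 1 + x for small x, easy to verify with a hypothetical log-scale bound:

```python
import math

b1_low = 0.002   # hypothetical lower confidence bound on the log scale

exact = math.exp(b1_low)   # close to 1.002
approx = 1 + b1_low        # exactly 1.002

# For small coefficients, exponentiating "basically just adds 1",
# so a log-scale bound of 0.002 is about a 0.2% relative increase.
pct_increase = (exact - 1) * 100
```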

Â 7:20

So how do you handle rates?

And I should say rates and

proportions, because I like to distinguish between rates and proportions.

So this is an instance where you have a count, and then you have some offset

that tells you how large or small the count should be.

So, for example, if I'm counting failures of the nuclear pumps that I mentioned

before, I should have more failures if I monitored them for a longer time.

If I'm counting the number of flu cases,

I should have more flu cases if I'm looking at a larger population, right?

So I should have more flu cases in a bigger city than I would have

in a smaller city.

So in all these cases, the counts that we're interested in come with some

term, whether it's a unit of time, person-time at risk, or total sample size,

and we really want to interpret our count relative to that term.

What we want to do is quite simple in R.

The first thing to note is that we want to interpret not the expected value

of the outcome, but the expected value of the outcome divided by this relative term,

whether it's monitoring time, person-time at risk, or whatever.

So, in this case, Jeff is looking at the number of web hits originating from

Simply Statistics, relative to the total number of web hits, okay?

And he wants to model that as e to the b0 + b1 times the Julian date,

so he wants to model that proportion with this log-linear model.

Well, if you take the log of both sides, and rearrange a little bit,

we see that we get a similar model to what we had before.

We get that the log of the expected outcome is the linear regression part, but

that it also has this log offset with no coefficient.

And that, it turns out, is all you have to do to

add a rate or proportion into a Poisson GLM.

You just take whatever relative denominator, count or time or whatever

it is that you want to consider, and add it as a log offset in your model.

Okay, so the easy way to add this offset into our model

is just to use the term offset = log(visits + 1).

Remember, we have to add the plus 1 because we can't take the log of 0.

Remember, we have to have family = poisson in the model statement here,

and by default it assumes a log link, which is what we want;

we haven't really covered any other kinds.
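The offset algebra, log(E[Y]/n) = b0 + b1 times the date, rearranges to log(E[Y]) = log(n) + b0 + b1 times the date, with log(n) entering with a fixed coefficient of 1. This can be checked numerically with hypothetical numbers:

```python
import math

b0, b1 = -2.0, 0.01   # hypothetical coefficients
jd = 40               # hypothetical Julian date
n_visits = 500        # hypothetical total web hits (the offset term)

# Model for the rate: E[Y] / n = exp(b0 + b1 * jd)
rate = math.exp(b0 + b1 * jd)

# Equivalent count model: log(E[Y]) = log(n) + b0 + b1 * jd,
# i.e. the log offset has no free coefficient
expected_count = math.exp(math.log(n_visits) + b0 + b1 * jd)
```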

Â 10:15

Here, Jeff gives the difference between the glm1 fitted rates,

which were, remember, for the total number of web hits,

versus the glm2 fitted rates,

which were for the relative number of web hits originating from Simply Statistics.

So, these blue points are adjusted for the red points, in a sense.

Â 11:22

Then the Poisson model can't accommodate all of those zeros and still fit the other counts well.

And so this is called zero inflation.

And there are a lot of different ways to handle zero inflation in Poisson data, but

you need to think about it.

In this case, yeah, we might be concerned with handling all of those zeros early on,

though there's a temporal component to the zero inflation in this case,

which makes it even a little bit more challenging to model well.

But Jeff used a package here that actually helps with modeling zero inflation.
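A quick way to spot zero inflation, sketched with standard-library Python and hypothetical counts: compare the observed fraction of zeros with the P(Y = 0) = e^(-mean) that a Poisson with the same mean would predict.

```python
import math

# Hypothetical daily counts with many more zeros than a Poisson would give
counts = [0] * 40 + [1, 2, 2, 3, 3, 3, 4, 4, 5, 6] * 6

mean = sum(counts) / len(counts)
observed_zero_frac = counts.count(0) / len(counts)

# A Poisson with this mean predicts P(Y = 0) = exp(-mean)
poisson_zero_frac = math.exp(-mean)

# observed_zero_frac far above poisson_zero_frac suggests zero inflation
```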
