0:00

Hi, and welcome to our second-to-last lecture.

This lecture is on Poisson GLMs, and I should give some credit to Jeff Leek,

from whom I got much of this content, from an earlier version of this class.

0:15

So modeling count data arises quite frequently in applications.

For example, the number of calls to a call center, or the number of flu cases.

And in each of these cases, the counts are unbounded in the sense that, well,
there might be some theoretical bound on the count,
the total number of people in the world or whatever.

However, we don't really know what that bound is, or that number is really large
relative to the count that we're looking at.

So in addition, count data can come in the form of rates or proportions,
such as the percentage of people passing a test; or, in terms of rates, think
about the number of cases, or something like that, that occur over a unit of time.

My favorite example is from a nuclear pump failure experiment, where we're looking
at the number of times that nuclear pumps failed per unit of time.

So that would be a rate.

1:16

A very common rate that occurs in biostatistics and public health,
where I work, is the so-called incidence rate,
which is the number of newly developed cases per person-time at risk.

Okay, so all of these are instances of counts, and
rates and proportions can also be thought of as counts,
both because the numerator is a count, and because
whatever you're dividing by, whether the person-time at risk, the total time,
the total sample, the number of trials, or something like that,
is a second number that we're going to show you how to deal with as well when
looking at the numerator, the count part, okay?

And all of these can be handled with Poisson GLMs.
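To make the incidence-rate idea concrete, here is a minimal sketch in Python; the clinics, case counts, and person-years are all made up purely for illustration:

```python
# Hypothetical follow-up data from three clinics: newly developed flu
# cases (counts) and the person-years at risk that produced them.
cases = [12, 7, 20]
person_years = [340.0, 210.0, 515.0]

# Incidence rate: total new cases divided by total person-time at risk.
rate = sum(cases) / sum(person_years)
rate_per_1000 = 1000 * rate  # often reported per 1,000 person-years
print(round(rate_per_1000, 1))  # prints 36.6
```

The numerator is a pure count; the denominator is the "second number" that the Poisson machinery will let us handle via an offset later on.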

2:18

Web traffic and all these other things are modeled by Poisson distributions.

A very common use of the Poisson distribution is approximating binomial
probabilities where the success probability is very small and n is
very large, so you can think of that as an instance of an approximately
unbounded count, even though the actual count is bounded.

2:53

occurrences of a different collection of variables.

So if I took a random sample of people, counted the number of people who
had blond hair, brown hair, or black hair, and I cross-tabulated
that with the number of people who had blue eyes, brown eyes, or hazel eyes,
okay, that table of counts is called a contingency table, and
Poisson models are very useful for modeling contingency table data.

They give a very elegant framework for doing that.

3:23

I give the Poisson mass function here, and so
the rate of counts per unit time is lambda, whereas t is the total time.

If X is Poisson with this mean, then its expected value is t times lambda.

So our natural estimate of the rate would be the count over the total time,
X over t, okay?

And it's nice to know in this case that the expected value of
X over t, the expected value of our rate estimate, is exactly lambda,
the rate that we would like to estimate.

So that's a useful property associated with the Poisson.

The variance is equal to the mean, so the variance is t times lambda.

So that's an assumption of our model that we can check, and
we have some potential solutions if it doesn't hold.
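As a quick numerical check of these two facts, that the mean and the variance both equal t times lambda, here is a sketch in Python that sums over the Poisson mass function directly; the particular values of t and lambda are arbitrary choices for illustration:

```python
import math

def poisson_pmf(x, t, lam):
    # P(X = x) when X is Poisson with mean t * lam
    mu = t * lam
    return math.exp(-mu) * mu ** x / math.factorial(x)

t, lam = 2.0, 3.0   # arbitrary total time and rate, so the mean is 6
support = range(60)  # wide enough that the omitted tail mass is negligible

mean = sum(x * poisson_pmf(x, t, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, t, lam) for x in support)
# Both sums come out to t * lam = 6: the mean equals the variance.
```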

4:20

And another interesting fact is that the Poisson tends to a normal
as the mean gets large.

So you can think of this in several ways:
all that has to happen is for t times lambda to get large.

This could occur if t is fixed and
lambda gets large, if lambda is fixed and t gets large, or if both get large.

And in a lot of different applications, the way in which
the mean gets large could vary, but as long as it gets large in some sense,
the Poisson is going to approximate a normal distribution.

And here I show you this via simulation.

I simulate three different collections of Poisson random variables as

4:56

the mean of the Poisson distribution gets larger and larger, and
you can see from the rightmost panel that it's nearly identical
to a normal distribution at that point.

And then, since this isn't the appropriate class to show the mathematics
that the mean and variance are equal theoretically,
one way you could convince yourself is by simulation, or, rather than simulation,
by actually using the density and summing over it in the right way.

So if you're interested, try that experiment, and it will show you
that the mean and variance are equal; try it for a bunch of different scenarios.

Or you could just believe me, or you could take, for example, Mathematical
Biostatistics Boot Camp One or Two, my other classes,
where we cover how to do the actual mathematics for this.
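If you want to run that simulation yourself, here is one way to sketch it in Python; the sampler, the seed, and the particular means are my own choices, not code from the class:

```python
import math
import random
import statistics

def rpois(lam, rng):
    # Knuth's method: count uniforms until their product drops below e^(-lam).
    # Fine for moderate means; it underflows for very large lam.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)
results = {}
for lam in (2, 20, 100):  # increasing means, as in the lecture's three panels
    draws = [rpois(lam, rng) for _ in range(50_000)]
    results[lam] = (statistics.fmean(draws), statistics.pvariance(draws))
    # For each lam, the sample mean and sample variance both hover near lam,
    # and a histogram of the draws looks more and more like a normal curve.
```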

5:52

So as an example, let's look at Jeff Leek's web traffic.

So this is his website, biostat.jhsph.edu/~jleek.

And the rate, lambda, in this case is interpreted as the number of web hits per day.

So our unit of time in this case is t equal to 1.

Now, if we wanted to interpret the rate that we estimate as web hits per hour,
we would have to set t equal to 24.

So I hope you understand that, and
if you wanted it to be in minutes, you would need t equal to 24 times 60, and for
seconds it would have to be t equal to 24 times 60 times 60, and so on.

Let's look at the data. I show
here how you can download it, and I convert the date from
a standard character date-time format to a Julian date.

The Julian date, as used here, counts the number of days since January 1st, 1970.
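In Python terms, that conversion is just counting days from that origin; this little sketch, with a made-up visit date, mirrors what the R conversion does:

```python
from datetime import date

epoch = date(1970, 1, 1)  # the origin the Julian date counts from

def to_julian(d):
    # Number of days since January 1st, 1970.
    return (d - epoch).days

example = date(2011, 1, 10)  # hypothetical visit date, for illustration
jd = to_julian(example)      # a plain integer count of days
```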

6:59

So the Julian date is nice to think about because it's just a count,
the number of days, whereas the date is a kind of complicated
format because it's characters.

So when you take the head of the data here,
you see the date, which is in character format.

You see the number of visits, and he's not doing so well on
these early dates, with 0 visits on all of them.

You also see the number of visits that originate from Simply Statistics,
and the Julian date.

So here's a plot of the data set.

The Julian date is on the x-axis and the number of visits is on the y-axis.

Now, we covered in the last lecture some of the shortfalls of
linear regression when you try to model count data, or
in that case, binary data.

So let's not just rehash that same topic; there are some issues with
modeling count data directly with a linear model.

However, as we saw a couple of slides ago, as the mean of the counts gets larger
and larger, our concern over this decreases quite a bit,
simply because the distribution is going to tend to a normal.

So, if you have extremely large counts, this becomes a lot less objectionable.

8:10

So, just for notation: the number of hits, NH, is going to be our outcome, and JD,
the Julian day, is going to be our predictor. This would be a linear
regression model; we can plot it and see the fitted line that we would get.

It has some issues.

Clearly there's some curvature there;
maybe we should have put an x-squared term in.

But that would be our first approach to this, and
honestly it wouldn't be that bad.

But the counts are kind of small, so it's not the best thing in the world,
and the interpretation isn't great for linear models.

We'll see in the next couple of slides
some ways we can tweak linear models to maybe get a slightly better interpretation.

I think of counts of web hits and
things like that as things that you would want to think about on a relative scale,
whereas the linear model really treats them on an additive scale.

So let's think about how we could get
relative interpretations from our linear model.

The first thing we might try is taking the log of the outcome;
here I use the natural log.

9:21

Now let me speak a little bit about the log and what it's accomplishing.

The quantity e to the expected value of the log of a random
variable is what I would call the population geometric mean.

And the reason I would call it the population geometric mean is that the empirical,
or sample, geometric mean is the product of the sample,
the product of the Yi, raised to the one over n power.

9:44

So the way to think about this is: if we take the log of
the product of the yi raised to the one over n power,
we get the arithmetic mean, the ordinary mean, of the log data.

So the geometric mean is just the exponentiated
arithmetic mean of the log data.
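That identity is easy to verify numerically; here is a tiny sketch with made-up positive data:

```python
import math

y = [3.0, 7.0, 1.5, 12.0, 5.0]  # hypothetical positive observations
n = len(y)

# Geometric mean directly: the product of the data, raised to the 1/n power.
prod = 1.0
for v in y:
    prod *= v
geo_direct = prod ** (1 / n)

# Equivalently: exponentiate the arithmetic mean of the logged data.
geo_via_logs = math.exp(sum(math.log(v) for v in y) / n)
# The two agree up to floating-point error.
```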

10:02

And we know that if we collect a lot more data in our sample,
the arithmetic mean will converge to something.

So the population geometric mean is what this quantity,
the product of the data raised to the one over n power, converges to.

So it turns out that when you take the natural log
of the outcome in a linear regression, your exponentiated
coefficients are interpretable with respect to geometric means.

So, for example, e to the beta zero is the estimated geometric mean hits on day
zero, and I should reiterate a point from earlier on in the class:
this intercept doesn't mean that much, because January 1st, 1970 is not
a date that we care about in terms of number of web hits.

So, to make the intercept more interpretable, what we should have done is
subtract the earliest date that we saw from all of
the remaining days in our data set and start counting days from there;
then e to the estimated intercept would be
the geometric mean hits on the first day of this data set.

Okay, so that's a small point, and
it doesn't change the fitted model.

It doesn't change the slope or anything like that to shift the intercept around;
nonetheless, if you want an interpretable intercept, as we know
from earlier on in the class, you have to do something like that.

E to the beta one, on the other hand, is the estimated relative increase or
decrease in the geometric mean hits per day, okay?

11:44

So I should also mention there's a problem with logs.

If you have zero counts, you have to do something, because you can't take the log
of zero, so you need to add a constant.

A very common choice is to add one,
so we take the log of the outcome plus one.

So if we do that, here I fit the linear model of the log of the outcome plus one
versus the Julian date.

We get the intercept, which is kind of irrelevant in this case, as we talked
about before.

And then we get 1.002 for the slope.

This is on the exponentiated scale.

Okay, so
what that means is our model is estimating a 0.2% increase in web traffic per day.

Okay?

And that's a nice interpretation.

If you added other covariates, then that would
be a 0.2% increase per day holding the other covariates fixed.

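Since we don't have Jeff's data here, this sketch simulates daily counts that grow by about 0.2% per day and then recovers that rate from a least-squares fit of log(count + 1) on day; everything in it, the counts, the growth rate, and the noise level, is made up for illustration:

```python
import math
import random

rng = random.Random(1)
days = list(range(500))
# Hypothetical web-hit counts growing roughly 0.2% per day, plus noise.
hits = [max(0, int(20 * 1.002 ** d + rng.gauss(0, 3))) for d in days]

# Simple least squares of log(hits + 1) on day.
yl = [math.log(h + 1) for h in hits]
n = len(days)
xbar = sum(days) / n
ybar = sum(yl) / n
beta1 = (sum((x - xbar) * (y - ybar) for x, y in zip(days, yl))
         / sum((x - xbar) ** 2 for x in days))

# Exponentiated slope: the estimated relative change in the geometric
# mean of (hits + 1) per day; a value near 1.002 means about +0.2%/day.
daily_multiplier = math.exp(beta1)
```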