A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods


From the lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatments and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Hi, welcome to Lecture One, Section B, for Statistical Reasoning II. In this lecture section, we'll make things a little more concrete by talking about a specific type of regression, simple linear regression, and we'll consider situations where our predictors are binary or nominal categorical.

So, hopefully by the end of this lecture set you'll understand that linear regression in general provides a framework for estimating means and mean differences, and be able to interpret the estimated slope (or slopes) and intercept from a simple linear regression model with either a binary predictor or a nominal categorical predictor.

So now let's bring in some specifics about that left-hand side I was leaving as an empty box before. For linear regression, the equation is actually relatively straightforward: the regression models the mean value of a continuous outcome as a linear function of the predictor x1.

As noted in the previous section, and this applies to any type of simple regression, x1 can represent a binary predictor, it can be modified to represent a nominal categorical predictor, or it can represent a continuous predictor. So we have a lot of flexibility in what our predictor choices look like when we approach this problem in a regression framework.

So, just to clarify: what we will be doing is estimating our regression results from a sample from some larger population. To indicate that the intercept and slope quantities we get are just estimates of some underlying population-level quantities that we can't directly observe, I'm going to dress these up and put hats on them, both to mark them as estimates and to keep the notation uniform and comparable to what you'll likely see in textbooks and papers. Even though we're modeling the mean of the variable y, which would usually be represented with a bar over the top, we'll frequently write this as y hat equals beta naught hat plus beta one hat times x1 (ŷ = β̂₀ + β̂₁x₁), where y hat is analogous to y bar: just the mean of the y's.

So, for any given value of x1, we can estimate the mean of y via this equation. And remember, this slope is what we defined generically as comparing the outcome on the left-hand side for any two groups who differ by one unit in x1. Since our left-hand side is now the mean of a continuous variable, this slope compares the mean value of y for two groups who differ by one unit in our predictor x1. Hence it will have a nice interpretation: the slope is interpretable as a mean difference between two groups.

So let's look at the first example to get some data on the table and look at some real results. This is data on anthropometric measures from a random sample of 150 Nepali children who were between zero and 12 months old, so less than a year old. A question we might ask is: what is the relationship between average arm circumference and the sex of the child? We've already looked at this before in a t-test context; let's look at how it shapes up as a regression.

The arm circumference values range from 7.3 to 15.6 centimeters in this group of children of mixed ages, and a little over half, 51%, of this sample is female.

Here's a box plot display of the data we'll be looking at. Remember, we have a binary predictor, and one way to handle that is to code one group as a zero and the other group as a one. So our males we will arbitrarily code as zeros, and our females as ones. Looking at this box plot display, you can see a couple of things about the distributions. First of all, the values for males tend to be more variable than the values for females, and they tend to sit a little higher: for example, the median for males is at least slightly larger than the median for females, as is the 75th percentile.

This is called a scatterplot display, and it's not particularly useful for this type of situation where the predictor is binary, but I wanted to introduce it now for reasons that I'll show in a few slides. It is not as informative as the box plot. What it does is plot all 150 individual measurements of arm circumference versus the sex of the child, coded as zero for males and one for females. So each point at zero represents one male and his particular arm circumference on the vertical axis, and each point at one represents one female and her particular arm circumference on the vertical axis.

So here our y is arm circumference, a continuous measure, and our x, or x1, is not continuous but binary: male or female. As we've laid out previously, how are we going to handle sex as an x in regression? Well, it only takes on two categories, and one possibility, which is arbitrary, is to code it as a zero for male children and a one for female children. So let's take that approach for the moment. The equation we will estimate using these data looks like this: we estimate the mean arm circumference, y hat, as a linear function of sex through this equation. So we'll end up with an estimated intercept and slope.

So just to be clear and reiterate something: notice this equation is only estimating two values. We only have two groups of children as defined by their predictor: females and males. So we're only ultimately estimating two mean values. For female children, the estimated mean arm circumference, y hat, is the intercept plus the slope for sex times one, since females are coded one. So the mean for females is the sum of the intercept and the slope beta one hat. For males, it's simpler: the beta one hat term drops out because males are coded zero, and we get just the intercept. So if we were to take the difference in average arm circumference between females and males, we'd be left with beta one hat. This estimates the mean difference in arm circumference for female children compared to male children. It's still a slope, estimating the mean difference in y for a one-unit difference in x1, but the only possible one-unit difference when our x is binary is between those who are coded one and those who are coded zero.
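To see this concretely, here's a minimal sketch showing that least squares with a binary 0/1 predictor recovers exactly the two group means: the intercept is the mean of the group coded zero, and the slope is the difference in group means. The numbers below are made up for illustration; they are not the Nepali children data from the lecture.

```python
import numpy as np

# Hypothetical arm-circumference values (cm), for illustration only;
# these are NOT the Nepali children data from the lecture.
males = np.array([12.0, 12.8, 13.1, 11.9, 12.7])    # coded x1 = 0
females = np.array([12.1, 12.5, 12.3, 12.6, 12.0])  # coded x1 = 1

y = np.concatenate([males, females])
x1 = np.concatenate([np.zeros(males.size), np.ones(females.size)])

# Least-squares fit of y = b0 + b1 * x1
X = np.column_stack([np.ones_like(x1), x1])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

print(b0)  # equals males.mean(): the intercept is the x1 = 0 group mean
print(b1)  # equals females.mean() - males.mean(): the mean difference
```

Swapping which group gets coded one flips the sign of the slope and changes which group mean the intercept represents, which previews the coding question raised a few slides from now.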

So here, done with the aid of the computer and based on these actual data is the resulting equation.

We estimate the mean arm circumference to be equal to 12.5 plus a slope of negative 0.13 times x1, sex, which is a one for females. So the slope, as we've just clarified, equals negative 0.13, and from the previous slide we know this is the estimated mean difference in arm circumference for female children compared to male children. In other words, female children have lower arm circumference by 0.13 centimeters, on average, relative to males.

The intercept is equal to 12.5 centimeters and that estimates the mean arm circumference for male children.
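Plugging the two possible values of x1 into the fitted equation gives both estimated group means; here's a quick sketch using the estimates just reported:

```python
# Estimates reported in the lecture: intercept 12.5 cm, slope -0.13 cm
b0, b1 = 12.5, -0.13

mean_male = b0 + b1 * 0    # x1 = 0 for males -> just the intercept
mean_female = b0 + b1 * 1  # x1 = 1 for females -> intercept plus slope

print(mean_male)    # 12.5 cm, the male group mean
print(mean_female)  # 12.37 cm, the female group mean
```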

You might say, well, you've only got two groups here. Is that slope really a slope? Does it really describe the slope of a line? Well, we only need two points to establish any line in space, and what we're estimating are two means. In fact, the slope is the slope of the line that connects the mean for the group coded zero, males, to the mean for the group coded one, females. And this slope, though it's hard to see given the scaling of this predictor, is that difference of negative 0.13 between the mean for females and the mean for males. The coding choice we made for our sex predictor is completely arbitrary; there's no reason females have to be one and males have to be zero. So what I'd like you to ponder, and I'll come back to this in the review exercises: for this arm circumference and sex analysis, what would the values of the intercept and slope be if sex were instead coded as a one for males and a zero for females?

Let's look at another example, data from 2011 hospitalizations: data from the nearly 13,000 members of Heritage Health who had a cumulative length of stay of at least one day in the hospital in 2011. The question we might have is: what is the relationship between average length of stay and age at first claim? As the data are represented, this predictor is binary: age is either less than 40 or greater than or equal to 40. I'm arbitrarily going to make it a one if they're less than 40, and a zero if they're greater than or equal to 40. In these data, the average length of stay for everyone was 4.3 days, with a standard deviation of 4.9 days and a range from one day total in 2011 to 41 days total. And 29% of the observations in these data were from persons whose first 2011 hospital stay happened when they were less than 40 years old.

So here's a box plot display of these data, which we've already looked at, ostensibly in Stat Reasoning One. The distributions of length of stay are right-skewed for both groups, and the distribution shifts up for those who are greater than or equal to 40 relative to the distribution for those who are less than 40. There's a lot of crossover, but visually, at least, those who were 40 or older when they were first hospitalized in 2011 tended to have slightly longer lengths of stay. That's a little hard to see in the visual, so we'll ultimately want to quantify it and see if it holds. So we could fit this regression equation, where we relate average length of stay to our predictor x1, coded as I noted before: a one if the subject was less than 40 years old when they first went into the hospital in 2011, and a zero if they were greater than or equal to 40 years old.

So this slope of negative 2.1 is the estimated mean difference for those who were coded one versus zero: the estimated mean difference in length of stay for persons who were less than 40 at their first hospital claim in 2011 compared to persons 40 or older. So the younger group had an average length of stay 2.1 days less than the older group. The intercept estimates the mean length of stay when x1 is zero; that's the group who was 40 or older at their first claim, and the average length of stay for that group was 4.9 days. So what would be the estimated average length of stay for the younger group? Well, we take the average length of stay for the older group, 4.9, and add the difference between the two of negative 2.1, and that gives us a mean length of stay of 2.8 days for the group who was less than 40 when they entered the hospital in 2011.

Let's look at another example. Sometimes regression scenarios include predictors which are not continuous, not binary, but multi-categorical. These are things, especially in the nominal world, like a subject's race (white, African-American, Hispanic, Asian, or some other classification), or say their city of residence among four different places: Baltimore, Chicago, Tokyo, and Madrid, for example.

So how can we handle this type of situation when we have a nominal categorical variable in a regression framework? We're going to explore this using an example based on the academic physician salary results we've looked at previously in Stat Reasoning One.

So this is the study in which data were collected on 800 U.S. academic physicians, including information about their yearly salary. A lot of additional information was collected, including the sex of the physician and other factors. One of those factors was the geographical region of the United States where the job was located: whether it was in the West, the Northeast, the South, or the Midwest.

So, the question that we might have to start is: do average salaries differ by geographical region, and if so, what is the magnitude of these differences?

So can we do this analysis as linear regression? Previously, we would have thought of this as an analysis of variance, where we're comparing mean salaries across four groups. Can we set up an analysis of variance as a regression? And if so, how can we handle a predictor that takes on four categories, like this predictor, region of the United States?

Well, as a first approach, you might say, let's arbitrarily give each region a numerical value. Just for example, we'll make x1 a one if the job is in the western part of the United States, a two if it's in the Midwest, a three for the South, and a four for the Northeast. This is totally arbitrary; you could do this differently. Then we estimate an equation that relates the mean salary to region by this formulation, where the single predictor x1 takes on a value of 1 through 4. This is not a good idea.

That coding I just put out is completely arbitrary; you could have come back and coded it as a one for the Midwest, a two for the Northeast, et cetera. And regardless of how we code it, we would be treating this as an ordinal categorical variable when there's no logical ordering to the categories.

Furthermore, this type of coding assumes that the mean salary difference between regions is incremental. For example, under the coding I put forward, the model assumes, when it estimates things, that the difference in average salaries between physicians in the South and the West, who differ by two units in x, is twice the difference between physicians in the Midwest and the West. That's a strong assumption: it's forcing an ordinality on these data that may not be there. In my consulting collaborations, I've seen people run models like this where they just take the predictor as is, stick it in coded one, two, three, or four, and pay no attention to the fact that it's nominal and not ordinal. That can obviously have an effect on the results they get. It's easy to get caught up, when you're running models, in just throwing things in and hitting the button on the computer, but sometimes you want to pause and think about what you're actually doing. So how can we handle this better when we have something that's not inherently ordinal in nature, but is categorical? Well, it's what we set up in the first section, but now we'll do it specifically: we designate one region as the reference region, and I'm going to arbitrarily say the West.
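Before building the better coding, the forced-increment problem with the 1-through-4 approach can be made concrete in a few lines. The coefficient values below are hypothetical, chosen only for illustration; they are not estimates from the salary data.

```python
# Hypothetical values for illustration; not estimates from the salary data.
b0, b1 = 100_000, 5_000

# Under the 1-4 coding (West=1, Midwest=2, South=3, Northeast=4), the
# model forces mean(region k) = b0 + b1 * k: all four means sit on a line.
means = {k: b0 + b1 * k for k in (1, 2, 3, 4)}

south_vs_west = means[3] - means[1]    # a 2-unit difference in x1
midwest_vs_west = means[2] - means[1]  # a 1-unit difference in x1

# The South-vs-West difference is forced to be exactly twice the
# Midwest-vs-West difference, whatever the data actually look like.
print(south_vs_west, 2 * midwest_vs_west)
```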

And we make binary indicators for each of the three other regions. You could do this differently, but the ultimate conclusions would be exactly the same. So I'm going to make three indicators for the other three regions: x1, equal to one if the physician works in the Midwest and zero if they do not; then a variable called x2, equal to one if they work in the South and zero if not; and an x3, equal to one if they work in the Northeast and zero otherwise, that is, if they work in any of the other three regions.

So here's a table showing these x values for each region. The West is the reference region; it doesn't get its own indicator, and its value for each of the three indicators is zero, because the West is not the Midwest, not the South, and not the Northeast. The Midwest's indicator is x1: it takes on the value of one when the observation comes from somebody who works in the Midwest, and that observation has zeros for the other two indicators. If they're from the South, they're not in the Midwest, so the indicator for the Midwest is a zero, the indicator for the South, x2, is a one, and the indicator for the Northeast is a zero. And similarly for those from the Northeast: they're not in the Midwest, so that's a zero; they're not in the South, so that's a zero; but they are in the Northeast, so x3 is a one.

So we can now fit the regression model, and it looks like this. This is a fancy equation to estimate only four mean salaries, but it does so in a linear equation framework. Let's pause on this for a minute. What is the intercept estimate? Well, it's the estimated mean salary when all of our x's are zero, and from the previous slide we saw that all x's are zero when we're looking at the reference group, the West. So here the intercept has meaning: it is the estimated mean salary for physicians from the West. And then each slope, or each coefficient of the x's, estimates the mean salary difference between the region with that corresponding x value of one and the reference region, the western states. For example, just to write this out: if we looked at physicians in the Midwest, their value of x1 is one and their values of x2 and x3 are zero. The model estimates the mean salary for physicians from that group as the intercept plus the slope for x1: beta naught hat plus beta one hat. For physicians who work in the West, all their x's are zero, so, as we've established before, their estimated mean salary is just the intercept. So the difference in average salaries between physicians who work in the Midwest and those who work in the West is simply the slope for that indicator of working in the Midwest.
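The indicator table can be sketched as a small helper function; the region names follow the lecture, but the function itself is just an illustration of reference-cell coding.

```python
# West is the reference region: it gets no indicator of its own,
# so its (x1, x2, x3) pattern is all zeros.
def dummy_code(region):
    """Return the (x1, x2, x3) indicators with West as the reference."""
    return (int(region == "Midwest"),
            int(region == "South"),
            int(region == "Northeast"))

# Reproduces the table from the slide: one row per region.
for region in ("West", "Midwest", "South", "Northeast"):
    print(region, dummy_code(region))
```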

So here's the resulting equation. The intercept is $194,474. The slope for the Midwest is $4,416. The slope for the South is negative $35; you can think of this as plus negative 35, so beta two hat, the slope for the South, equals negative $35. And the slope for the Northeast is negative $2,322. So let's make sense of this, starting with physicians in the West.

Their average salary is $194,474 annually. Physicians in, for example, the Midwest: their average salary is $194,474 plus the slope for the Midwest of $4,416. We can work this out, but this indicates that those physicians in the Midwest make $4,416 more per year than those in the West.

For the Southern physicians, their estimated mean salary is this intercept again, plus the slope for the indicator of the South, negative 35. So, on average, physicians in the South make $35 less per year

than those physicians in the West. And you can write this out for verification, but if we were looking at physicians in the Northeast, their average salary is $2,322 less than the average of $194,474 made by physicians in the West.

So, in summary, simple linear regression is a method for estimating the relationship between the mean value of an outcome y and a predictor x1 via a linear equation. When x1 is binary, the slope estimate, generically called beta one hat, estimates the mean difference in y for the group with x1 equal to one compared to the group with x1 equal to zero. The intercept estimates the mean value of y for the group with x1 equal to zero, the reference group. When x1 is a nominal categorical variable (this can also be done with ordinal categorical variables, and we'll look at examples of that for some of the other regressions), you need to designate one category to be the reference group and make separate binary x's for all other categories; the slope for each of those binary indicators estimates the mean difference between the group whose indicator value is one and the reference group.
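The region-by-region salary arithmetic walked through above collapses to a few lines; the coefficients are the estimates reported in the lecture.

```python
# Coefficient estimates reported in the lecture (reference region = West)
b0 = 194_474                 # intercept: mean salary for the West
slopes = {"Midwest": 4_416, "South": -35, "Northeast": -2_322}

# Each region's estimated mean salary is the intercept plus its slope;
# the reference region's mean is the intercept itself.
means = {"West": b0}
for region, slope in slopes.items():
    means[region] = b0 + slope

print(means)
```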
