0:00

Hi and welcome to the class on multivariable regression examples.

Let's start this discussion with a famous dataset from a well-known statistics textbook: the Swiss fertility data. It's in the datasets package, so I want to do require(datasets), and then data(swiss); then if you do ?swiss, it'll bring up the help file.
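The loading steps just described look like this (the str and head calls are my addition, just to peek at the data):

```r
# Load the Swiss fertility data that ships with base R's datasets package
require(datasets)
data(swiss)

# ?swiss brings up the help file; these give a quick look at the data itself
str(swiss)   # 47 observations of 6 numeric variables
head(swiss)
```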

I have a somewhat better-formatted version of this on the slide. What's in this data set is a standardized fertility measure and some socio-economic indicators for each of 47 French-speaking provinces of Switzerland, from about 1888.

So the variables are: Fertility; Agriculture, the percentage of males involved in agriculture as an occupation; Examination, the percentage of draftees receiving the highest mark on a military examination in the province; Education, the percentage of the population educated beyond primary school; Catholic, which in this region at this time is just the percentage Catholic as opposed to Protestant; and Infant.Mortality, the percentage of live births in the province who live less than one year.

So that's the data set, and we're interested in looking into what explains fertility in these provinces. Before we go any further, let's do some basic scatter plots of the data.

1:42

This library here, GGally (G-G-Ally is probably how it's pronounced), is a collection of add-on tools for ggplot2. In particular, it has a ggpairs command that does an equivalent of base R's pairs plot, and I think you'll understand what it does in a second. It's a little bit of a slow plot.
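A minimal call, assuming GGally is installed (install.packages("GGally") if not); the lower = list(continuous = "smooth") argument is one way to get the fitted smoothers in the lower panels:

```r
library(GGally)    # add-on tools for ggplot2, including ggpairs
library(ggplot2)

# Scatterplot matrix of all six swiss variables: scatter plots with smoothers
# below the diagonal, pairwise correlations above it
g <- ggpairs(swiss, lower = list(continuous = "smooth"))
g
```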

2:07

Let me zoom in on the plot here. What you see is that, for example, fertility is on the x-axis for all the plots in this first column. Agriculture is on the x-axis, or horizontal axis, for all of the plots in its column, and on the vertical y-axis in this plot. So hopefully you see how it works: Examination, for example, is on the y-axis for these two plots, and then on the x-axis for these three plots, and so on.

The corresponding upper triangular part of the matrix gives the correlation between each pair of variables. Take, for example, fertility and agriculture. Here's the scatter plot; you see it right here.

2:53

Here is the fitted smoother, which turns out to be fairly linear in this case, with a confidence band around it. And it suggests that the correlation between the two, the empirical correlation, is 0.35. So in the subsequent slides, let's investigate the relationship of agriculture, the percentage of the province that works in the agricultural industry, with fertility.

3:26

Let's fit fertility as an outcome. The first model I want to fit includes all of the variables as predictors, and I'm doing this more for illustration than to mimic a real data analysis. In a real data analysis, your first model would probably be fertility with agriculture by itself: the marginal relationship of interest, without considering the other variables. But I just want to show you how to fit all of the variables at once.

If you do Fertility tilde period, the period adds all of the remaining variables in the data frame, linearly (no squared terms, no interaction terms), all at once. Saying data = swiss tells it that the data set I want is the swiss data set.

4:20

So that's the run of my lm. Notice, you often want to assign the output of lm to a variable so you can use it later, but this runs quickly enough that I'm just printing the output. Now, summary of the lm output gives a much more interesting result; in particular, it gives you this table of all the coefficients: their estimates, standard errors, t values, and p values, along with a lot of other information. If you just want to grab that coefficient table, you can do $coefficients on the output of the summary statement, and there's the table by itself, for example if you want to use it in some subsequent code.
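The fit and the coefficient-table extraction just described:

```r
# Fertility against all remaining variables, entered linearly (the "." shorthand)
fit <- lm(Fertility ~ ., data = swiss)

# Just the coefficient table from the summary: one row per coefficient, with
# columns Estimate, Std. Error, t value, and Pr(>|t|)
summary(fit)$coefficients
```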

5:03

So let's focus on this number, -0.17. This number is interpreted as follows: we expect a 0.17 decrease (a decrease because it's negative) in standardized fertility for every 1% increase in the percentage of males involved in agriculture, with the other variables held fixed.

5:36

This next column, the standard error, 0.07, talks about how precise that estimate is. Oh, and before I move on, I should mention that it's important to note that percent agriculture is expressed as a percentage rather than a proportion. A percentage is the proportion times 100, so using one rather than the other changes the scale of the coefficient by a factor of 100. It's important to note whether these numbers are percentages or proportions, because that makes a big difference in the scale of the coefficient.

Okay, so the standard error talks about how precise this coefficient is, the statistical variability of that coefficient, and we get 0.07. Suppose we wanted to test the hypothesis that the coefficient for agriculture, beta_Agri as I have it on the slide, is zero. Then we would take the estimate, subtract off the hypothesized value, which in this case is zero, and divide by the standard error of the estimate. So the t statistic is nothing other than the estimate divided by the standard error.

We could do that ourselves, of course, but R very conveniently goes ahead and gives it to us. It's -2.448, which is simply obtained by dividing -0.17 by 0.07. For that t statistic, we can calculate the probability of getting one as extreme as that: as small as -2.448 or smaller, and, because we're doing a two-sided test, we double that probability to get the p value. You can go through that t calculation yourself if you'd like.

7:14

The degrees of freedom are n minus the number of coefficients, including the intercept. But again, R does that on our behalf, and the p value works out to be 0.018. So by standard thresholding rules, with a type one error rate of, say, 5%, that would be statistically significant.
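The by-hand calculation described above can be written out like this, reproducing the t value and p value that R reports:

```r
fit  <- lm(Fertility ~ ., data = swiss)
ctab <- summary(fit)$coefficients

est <- ctab["Agriculture", "Estimate"]    # about -0.17
se  <- ctab["Agriculture", "Std. Error"]  # about  0.07

# t statistic: (estimate - hypothesized value) / standard error, with the
# hypothesized value being 0
tstat <- (est - 0) / se                   # about -2.448

# degrees of freedom: n minus the number of coefficients, intercept included
df <- nrow(swiss) - length(coef(fit))     # 47 - 6 = 41

# two-sided p-value: double the one-tail probability
pval <- 2 * pt(-abs(tstat), df)           # about 0.018
c(tstat = tstat, pval = pval)
```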

7:46

Now let's contrast that model with a model having just agriculture as a predictor. The previous model had all of the other variables in as predictors; let's look at one that just considers the association between fertility and agriculture. I'm running it and just grabbing the coefficient table, so we're not looking at the rest of R's output, but when you do this on your own, you'll want to look at the rest of the summary output from lm.

What's interesting is that the agriculture coefficient is about the same magnitude, 0.19 instead of 0.17, but it has changed signs. Instead of agriculture having a negative effect on fertility, it has a positive effect on fertility.
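The marginal fit just described, with only the coefficient table shown:

```r
# Unadjusted (marginal) association: Fertility against Agriculture alone
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients

# The Agriculture estimate is now about +0.19: roughly the same magnitude as
# the adjusted -0.17, but with the opposite sign
```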

So adjusting for these other variables changes the actual direction of the effect of agriculture on fertility. This is the impact of the so-called Simpson's paradox. I think it should really be called Simpson's perceived paradox, because there's nothing paradoxical about the possibility that accounting for other variables changes the nature of the relationship between a predictor and a response. It actually makes quite a bit of sense that that would be the case.

8:58

Notice, of course, that in both cases the agriculture coefficient is strongly statistically significant. So what I'd like to do in the next slide is create, via simulation, an example where an effect can reverse itself, so that maybe it'll get us thinking about how this could happen. In the end, regression is going to be a dynamic process where you have to think about which variables to include, and make scientific arguments, if you will. If there hasn't been randomization to protect you from confounding, you're going to have to go through a dynamic scientific process of putting confounders in and out of the model and thinking about what they're doing to your effect of interest, in order to evaluate it.

So let's invent a little simulation to illustrate how this can happen. There's a variety of ways it can happen, and this is just one, but this way you can see it happen as we code it, from a real generating process where you control everything.

So I'm going to assume that I have 100 data points, n = 100, and my second regressor, x2, is just going to be the values 1 to n. So if you look at x2, it's just 1 to 100. Think of x2 as something that we might measure regularly, like days: we measured 100 consecutive days.

Then x1 is a variable that depends on x2 and on random noise. Let's just make something up: x1 depends linearly on x2, so it grows linearly with time. Let's say, hopefully, x1 is your bank account. It's your money, and it goes up linearly with time, but there are all these random fluctuations that impact your spending, so your money doesn't always just go up. It goes up and down sporadically, but there's this linear trend upward.

10:50

Okay, so there's my x1, and it looks different: if I were to plot x1, for example, it goes up, but it's got all this random noise around the line. Now let's say y is happiness, or a measure of happiness. So what happens with y? The true generating model is that y is negatively associated with x1. Uh-oh, so your happiness is negatively associated with your money; I guess "mo' money, mo' problems" is what that's suggesting. But then y is positively associated with x2, so it goes up with time, and then there's some random normal noise.

This is the correct model, because we're using it for simulation, so we know it's the actual truth: y, the outcome we're simulating, depends negatively on x1 with a coefficient of minus 1, and positively on x2 with a coefficient of plus 1. Okay, so let's simulate our y and fit it on x1 alone.

11:55

And what we see is that we get this enormous coefficient, 95, which is clearly wrong. It's nowhere near the negative 1 that it's supposed to be, or that we would hope it would be; it's in fact quite the opposite. What's happening is that x1 is picking up the residual effect of x2: this big linear component of x2, which is the big driver of y, is getting picked up in x1.

12:20

Okay, but now what happens if we fit the correct model, with x1 and x2 together? There we get the correct coefficients: about minus 1 for x1, and about plus 1 for x2. So there you can see, if you do the correct adjustment, it works out.
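The whole simulation can be reproduced in a few lines. The seed here is my addition, for reproducibility; the lecture doesn't set one, so exact numbers will differ slightly:

```r
set.seed(1234)                           # my addition, for reproducibility
n  <- 100
x2 <- 1:n                                # e.g. 100 consecutive days
x1 <- 0.01 * x2 + runif(n, -0.1, 0.1)    # grows linearly with x2, plus noise
y  <- -x1 + x2 + rnorm(n, sd = 0.01)     # true coefficients: -1 for x1, +1 for x2

coef(lm(y ~ x1))        # x1 slope is huge and positive (around 95): wrong sign
coef(lm(y ~ x1 + x2))   # about -1 for x1 and +1 for x2: the truth recovered
```

The marginal fit goes wrong because x1 is nearly 0.01 * x2, so x2's large effect on y gets attributed to x1; including x2 removes that shared component.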

And you can imagine why this would happen. What is regression doing? It's taking x1 and removing the linear effect of x2. So this part right here, the 0.01 * x2, gets removed, and the uniform noise, the part of x1 that's unrelated to x2, gets related to the part of y that's unrelated to x2 as well. So let's do some plots to highlight this, just to show how it works a little bit.

13:06

So I'm going to create a data frame, because ggplot likes data frames, and I can't remember if I loaded ggplot2, so let me reload it. Then my plot is going to use this data frame, dat. In the aesthetic, I conveniently named my outcome variable y, my x variable is going to be x1, the important variable, and my color is going to be x2, okay?
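A sketch of that plot, regenerating the simulated data first so the block is self-contained (seed is my addition):

```r
library(ggplot2)

set.seed(1234)                           # my addition, for reproducibility
n  <- 100
x2 <- 1:n
x1 <- 0.01 * x2 + runif(n, -0.1, 0.1)
y  <- -x1 + x2 + rnorm(n, sd = 0.01)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# y against x1, with points colored by x2 to reveal the confounding
g <- ggplot(dat, aes(x = x1, y = y, colour = x2)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, colour = "black")
g
```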

13:58

And you can see the line that it's fitting is not incorrect: there is a clear positive linear relationship between x1 and y. However, you can see with x2, which is the color, that there's also this clear positive gradient. As y goes up, so does x2, and as x1 goes up, so does x2. So you can see the sort of confounding that's happening here.

So why don't we now see what happens if we plot the residuals? In my data frame, I created the residuals from regressing y on x2, and then the residuals where I removed x2 from x1. So let's see what happens when we do the residual plot. The slope of the line in this plot will be the x1 coefficient from our linear model fit where we include both x1 and x2.

15:01

And here's the plot. Now you can see that for the residual of y and the residual of x1, there's a clear negative linear relationship, and if you stare at it long enough, you realize that the slope of this line should be around negative 1.
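The residual construction just described, again regenerating the data so the block stands alone (seed is my addition). The residual-on-residual slope equals the adjusted x1 coefficient exactly:

```r
set.seed(1234)                           # my addition, for reproducibility
n  <- 100
x2 <- 1:n
x1 <- 0.01 * x2 + runif(n, -0.1, 0.1)
y  <- -x1 + x2 + rnorm(n, sd = 0.01)

# Remove the linear effect of x2 from y and from x1
ey  <- resid(lm(y ~ x2))
ex1 <- resid(lm(x1 ~ x2))

# The slope of ey on ex1 matches the x1 coefficient from the joint model
coef(lm(ey ~ ex1))["ex1"]       # about -1
coef(lm(y ~ x1 + x2))["x1"]     # identical
```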

15:26

Okay, so that's what the linear model is doing: it's removing x2 from both x1 and y, and that's why you get the correct relationship when you fit both variables. Now, this doesn't mean that throwing every variable into your regression model is the right thing to do. There are consequences to throwing extra, unnecessary variables into your regression relationship.

15:57

Remember, our agriculture effect reversed itself after we included the other variables in the model, and you'll find that this happens quite a bit when education and examination, in particular, are included. Educational attainment is negatively correlated with the percentage working in agriculture, a correlation of -0.64, and you would kind of expect that, given where and when the data were collected. In addition, as I mentioned earlier, education and examination are more or less measuring the same thing: the correlation between those two variables is 0.7.
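The two correlations quoted above can be checked directly:

```r
# Education vs. percentage working in agriculture
cor(swiss$Agriculture, swiss$Education)    # about -0.64

# Education vs. examination: nearly redundant measures
cor(swiss$Education, swiss$Examination)    # about  0.70
```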

16:36

So the question arises: is the positive association between agriculture and fertility by itself, when estimated via ordinary linear regression, artifactual for not having accounted for these other variables? Education, by the way, has a stronger effect on fertility.

16:56

So, at a bare minimum, if anyone ran a linear regression with the percentage of the province working in agriculture as the predictor and fertility as the outcome, and claimed a positive causal relationship between agriculture and fertility, that conclusion would definitely be suspect. First of all, with observational data like this, it's always suspect to make strong causal conclusions without putting in a lot of extra work and thought about your assumptions. But even if you were just willing to claim an association between agriculture and fertility, that also would be suspect, because you can so easily break that association, and reverse it, by the inclusion of other, very reasonable variables.

17:42

I also want to show you, really quickly, what happens if you include a completely unnecessary variable in the model. By completely unnecessary, I mean a variable that is simply a linear combination of the other variables you've already included.

18:04

That's different from including a variable that is just random noise, for example; that's a different issue, and we'll talk about it later. But right now, let's see what happens if we include a variable that, in the sense I'm describing, is completely unnecessary. For example, I can create this variable z that's nothing other than agriculture plus education, added together. That has no added value: we've already included education and agriculture in our joint model. So let me show you what happens when you fit all of the variables plus this additional variable z.

What you see is that our agriculture coefficient hasn't changed; it's -0.17 just like before. And our z variable is given an NA: what R does is take the last completely unnecessary variable included in the model and give it an NA coefficient.
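The redundant-variable demonstration in a few lines:

```r
# z is an exact linear combination of two regressors already in the model
z <- swiss$Agriculture + swiss$Education

# Fit all the swiss variables plus z
fit <- lm(Fertility ~ . + z, data = swiss)
coef(fit)

# Agriculture is still about -0.17, and z comes back NA: R drops the last
# variable that is a linear combination of the others
```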

18:57

So that's an important thing to remember: if you see an NA in R after you fit a linear model, the most likely culprit is that you've included a variable that is either numerically or exactly a linear combination of the other variables.
