0:03

So you might be surprised to find out how flexible linear regression models are. For example, you can fit factor variables as regressors and come up with things like analysis of variance, if you've heard of that before, as a special case of linear models.

Let's go through an example where we have one covariate, X, equal to zero or one, and let's see what happens when we put that into a linear regression model. So here I have my model: Y, my outcome, is beta-naught, an intercept, plus X times beta one, plus an error term. Here my X only takes the value zero for, let's say, people in a control group and one for, say, people who received a treatment.

0:44

Then, for the people who received the treatment, the group of people whose X value is one, their mean is beta-naught plus beta one. For the people who are in the control group, those people whose covariate X is zero, their mean is beta-naught.

1:01

If you were to fit this, as you would expect, the estimated mean for the treated group is just the mean of the people who were treated. So beta-naught hat plus beta one hat works out to just be the mean for the treated group. Similarly, beta-naught hat, by itself, works out to be the mean for the control group.
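As a minimal sketch of that claim, the code below fits a regression on a zero/one regressor and checks the fitted coefficients against the raw group means. The lecture does not use a data set at this point; the built-in `mtcars` data and its 0/1 column `am` are stand-ins chosen only for illustration.

```r
# Stand-in data (an assumption, not from the lecture): mtcars, where am is 0 or 1
data(mtcars)

fit <- lm(mpg ~ am, data = mtcars)
b <- coef(fit)

# Raw group means for the X = 0 and X = 1 groups
mean0 <- mean(mtcars$mpg[mtcars$am == 0])
mean1 <- mean(mtcars$mpg[mtcars$am == 1])

# beta-naught hat matches the mean of the X = 0 group
all.equal(unname(b["(Intercept)"]), mean0)

# beta-naught hat plus beta one hat matches the mean of the X = 1 group
all.equal(unname(b["(Intercept)"] + b["am"]), mean1)
```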

1:34

So that's just a nice way to be able to fit a two-level factor variable as a linear regression variable. Not only do the fitted values tell you about the means for both of the groups, but it gives you inference for comparing the two groups automatically. The t test for beta one, by the way, is exactly identical to a two-group t test where you assume a common variance, if you happen to have taken the inference class.
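The equivalence with the equal-variance two-group t test can be checked numerically. This is only a sketch: `mtcars` and its 0/1 column `am` are again stand-ins that do not come from the lecture.

```r
# Stand-in data (an assumption, not from the lecture)
data(mtcars)

# Coefficient table from the regression on the 0/1 regressor
lm_coef <- coef(summary(lm(mpg ~ am, data = mtcars)))

# Two-group t test assuming a common variance
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The t statistics agree up to sign (t.test subtracts the groups
# in the opposite order), and the p values agree
all.equal(abs(lm_coef["am", "t value"]), abs(unname(tt$statistic)))
all.equal(lm_coef["am", "Pr(>|t|)"], tt$p.value)
```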

2:02

We can extend this to more than two levels. For example, imagine you had a three-level variable: you have some outcome, but you wanna compare it across U.S. political party affiliation. In this case, let's say you were only considering those that were Democrats, Republicans, or registered Independents.

2:22

Well, you can do that by having a variable X1 that's one for Republicans and zero otherwise, and a variable X2 that's one for Democrats and zero otherwise. I'll tell you in a minute why we omit the X3 that would be one for Independents and zero otherwise; it would happen to be redundant.

Okay, so if a person is a Republican, then their mean is gonna be beta-naught, plus the first X term is gonna be one, so plus beta one, and the second X term is gonna be zero. So their mean will be beta-naught plus beta one. If the person is a Democrat, then it's gonna be beta-naught, then X1 will be zero, so that term will drop out, then X2 will be one, so it'll be plus beta two. So for the Democrat, their mean from the regression model would be beta-naught plus beta two. And then if they're an Independent, both of these X terms will be zero and it'll just be beta-naught.

And that's why we can't include a third term, right? Because, in the way that we've set up the variable, if we know that you're not a Republican and not a Democrat, then you must be an Independent in our data set. And so it would be redundant to have a third variable in there; it wouldn't have any new information. Here we have three means, Republican, Democrat, and Independent, and three parameters, beta-naught, beta one, and beta two. If we were to add an extra parameter, it would kind of break the model. And I'll show you in R what happens when you do that in a minute.
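Written out, the three-level model and the group means just described look like this. The subscript i for the observation and the symbol epsilon for the error term are notational assumptions; the lecture only states the model in words.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i \\
E[Y_i \mid \text{Republican}]  &= \beta_0 + \beta_1 \quad (X_{i1} = 1,\ X_{i2} = 0) \\
E[Y_i \mid \text{Democrat}]    &= \beta_0 + \beta_2 \quad (X_{i1} = 0,\ X_{i2} = 1) \\
E[Y_i \mid \text{Independent}] &= \beta_0 \quad (X_{i1} = 0,\ X_{i2} = 0)
\end{align*}
```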

3:55

So if we look at our means here, if we compare beta-naught, the mean for the Independents, with the mean for the Republicans, so we subtract those two, we get beta one. So beta one compares Republicans to Independents. And beta two, similarly, compares Democrats to Independents. Then, of course, beta one minus beta two compares Republicans to Democrats.

So what happens is, by omitting the regression variable for the Independents, the intercept became the value for the Independents, and all of the other coefficients are interpreted relative to Independents. Beta one, the one in front of the Republican covariate, is now interpreted as the change between Republicans and Independents. Beta two, the one in front of the Democrat covariate, is now interpreted as the change between Democrats and Independents. And this was all a consequence of having omitted the regressor for Independents.

If we had included the regressor for Independents and excluded the one for Republicans, then the intercept would be for Republicans, the coefficient in front of the Democrat one would be Democrats versus Republicans, and the coefficient in front of the Independent one would be Independents versus Republicans. We'll go through some more examples just to illustrate how this works.

And R kinda does this automatically for you if you include it as a factor variable. It picks one of the levels to be the reference level. So let's go through some examples, and hopefully that'll shore this up. But the main point I'd like to get across is, whenever you're dealing with factor variables in linear models, what you set as your reference level has a big effect. The coefficients are interpreted quite differently depending on how you set them up and what you set as your reference level.
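Here is a small sketch of how the choice of reference level changes the coefficients. The data are made up purely for illustration (six hypothetical people with invented outcome values); the lecture does not provide party data.

```r
# Hypothetical toy data: the outcome values are invented for illustration
party <- factor(
  c("Independent", "Independent", "Republican", "Republican",
    "Democrat", "Democrat"),
  levels = c("Independent", "Republican", "Democrat")  # first level is the reference
)
y <- c(1, 3, 5, 7, 10, 12)  # group means: Independent 2, Republican 6, Democrat 11

# Independent as the reference level
fit_ind <- lm(y ~ party)
coef(fit_ind)
# (Intercept) = 2, partyRepublican = 4, partyDemocrat = 9

# Republican as the reference level instead
party2 <- relevel(party, ref = "Republican")
fit_rep <- lm(y ~ party2)
coef(fit_rep)
# (Intercept) = 6, party2Independent = -4, party2Democrat = 5
```

The fitted group means are the same in both fits; only the meaning of the intercept and of each coefficient changes.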

5:46

Okay, so let's go through an example in R, where we look at a factor variable and see how R is treating it. So, I want to make sure I require the datasets package. We've already loaded that in this lecture, but let's just do it again to remind ourselves. And then I have this data, InsectSprays, and then I'm requiring the stats package. I don't know if that's technically necessary for what I'm doing. If you do help on InsectSprays, it gives you the help file for this data set. The outcome, count, is a numeric insect count, so presumably the number left after applying the spray, and the spray factor is the type of spray, okay? And then it gives some examples of working with this data, but we don't need that cuz we're gonna build our own examples.
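A minimal sketch of the setup steps just described. The exact commands typed in the lecture are not shown in the transcript, so these calls are an assumption about one reasonable way to do it.

```r
# Both packages ship with R and are normally attached already
library(datasets)
library(stats)

data(InsectSprays)

# In an interactive session, ?InsectSprays opens the help file
str(InsectSprays)            # count is numeric, spray is a factor
levels(InsectSprays$spray)   # the spray types
head(InsectSprays)           # first few rows
```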

So let's first plot some of the data. I want to do a ggplot, and I've already loaded ggplot2, but just to remind you, in case you're restarting your R session from earlier, you want to make sure that you require ggplot2. There, it's loaded. And then I have my ggplot, and my data is InsectSprays. For my aesthetic, my Y is the count, the number of insects, and my X is the spray. They don't give you too much information about the sprays, but there are a couple of different sprays that they use. And then I want to fill the objects I'm creating with the factor variable spray. So there I've created my ggplot.

And then I wanna do a violin plot. A violin plot is kind of like a histogram, but tilted on its side and then repeated on both sides so it looks a little like a violin. Well, it looks like a violin if your data cooperates. Otherwise, it looks like a blob.

7:39

Okay, there's our violin plot. And then I wanna set my labels. And then if you wanna actually see the plot, you gotta bring it up. Okay, so, here's my violin plot.
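The plotting steps described above can be sketched as follows. This assumes the ggplot2 package is installed; the axis label text and the outline colour are guesses, since the transcript does not show the exact code.

```r
library(ggplot2)
data(InsectSprays)

# y is the insect count, x is the spray, and each violin is filled by spray
g <- ggplot(data = InsectSprays, aes(y = count, x = spray, fill = spray))

# Add the violin layer (the outline colour is an assumption)
g <- g + geom_violin(colour = "black")

# Set the axis labels (the label text is an assumption)
g <- g + xlab("Type of spray") + ylab("Insect count")

# Nothing is drawn until the plot object is printed
print(g)
```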

So you see there are six sprays, labeled A, B, C, D, E, and F. And you can see the insect counts, so I presume they applied the spray to numerous batches of insects. It's unfortunate they're not telling me whether the count is the number alive or the number dead.

8:22

We don't know if this is a better spray or a worse spray. But let's talk about how we can test the difference between different factor levels in this case using linear models. And then, at the end, I'll talk about some shortcomings of the approach that I'm proposing here. But here's a violin plot. And let me just do head of InsectSprays to show you what the data looks like. You see we have a bunch of counts and then the spray label; it's a very simple data set. And so, let's look at what happens when we include spray as a predictor in a linear model with count as the outcome.

9:00

So, let's fit our model. What we're fitting is: our outcome is the count, the number of insects, and our predictor is the spray, which spray was used, as a factor variable. It's already a factor variable. And then I give it the data set. And then, here, I just want the summary of the output from lm. Again, normally you wanna assign your lm to a variable so you can keep it for later. And then, just to keep the printing a little bit self-contained, I'm grabbing the coefficient table.
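A minimal sketch of the fit just described. The variable name `fit` is an assumption; the lecture prints the coefficient table directly without assigning the model.

```r
data(InsectSprays)

# Outcome is count; spray is already a factor, so lm builds the
# indicator variables and uses the first level (A) as the reference
fit <- lm(count ~ spray, data = InsectSprays)

# Just the coefficient table from the summary
summary(fit)$coef
```

The intercept comes out at 14.5 and the sprayB coefficient at about 0.833, the values the discussion goes on to use.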

10:11

And if you look over here at our plot, that seems about right. Look at our violin plot: 14.5 seems about right for spray A. And for spray B, it seems reasonable that it would be changed just a little bit from spray A. Now, spray C looks like it has a much lower count, okay? And look, its coefficient is about minus 12. And that looks about right: spray A is at 14.5, and somewhere around two seems about right for spray C. So that's exactly what this coefficient is saying. This negative 12 here is the difference, spray C minus spray A.

11:13

If I were to take the average count for those with spray A, I would get 14.5. If I were to take the average count for spray B, I would get 14.5 plus 0.833. So, I'd like now to show you how I can hard code the same model and not rely on R to pick the reference level. Remember, what I did last time is count was my outcome and my factor variable spray was my predictor.

11:52

And what R does is it picks the spray level that's the lowest alphanumerically, so in this case spray level A, to set as the reference level. So let me show how you can hard code that manually. Here count is my outcome. And then I'm gonna create a variable using the I function, which in lm performs the operation inside the model statement. So here I just wanna look at the instances where the spray is equal to B, and then I multiply that by 1 to change it from Boolean to numeric. And then here's a variable that's 1 when spray is C and 0 otherwise. And here's a variable that's 1 when the spray is D and 0 otherwise. And here's one for E, and here's one for F. So I've included all of them except A, so I've forced A to be my reference level.

And I'm going to run this model, and it should give me the same result as R did. It's just that now I've shown you exactly how R is creating the regression variables. So let me just remind ourselves what R gives us when we let it handle the factor variable by itself. And then let me do the same thing where I've created my own factor variables. And you can see 14.5, 14.5, 0.833, 0.833; it's identical. So this is what R is doing behind the scenes. And let's keep exploring this, because this is kind of an important point. If you mess this up with factor variables, you get very incorrect conclusions.
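The hard-coded version can be sketched like this. The names `fit_factor` and `fit_manual` are assumptions introduced for the comparison.

```r
data(InsectSprays)

# Let R handle the factor variable itself
fit_factor <- lm(count ~ spray, data = InsectSprays)

# Hand-built 0/1 indicators for every level except A, so A is the reference.
# I() makes lm evaluate the arithmetic inside the formula; multiplying by 1
# turns the logical comparison into a number.
fit_manual <- lm(count ~
                   I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
                   I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
                   I(1 * (spray == 'F')),
                 data = InsectSprays)

# Side-by-side coefficients: the two columns match
cbind(factor = coef(fit_factor), manual = coef(fit_manual))
all.equal(unname(coef(fit_factor)), unname(coef(fit_manual)))
```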

13:58

And the reason for that is because it's redundant. We have six means, right? For six sprays. And we have seven parameters: an intercept, and now I've tried to put in six regression parameters. I have six means to fit seven parameters. It can't do that, so it's gonna drop one of them.
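A sketch of what that redundancy looks like in practice, assuming the indicator for spray A is added last to the hand-coded model (the transcript does not show the exact call). `lm` cannot estimate seven parameters from six group means, so it reports `NA` for the redundant term.

```r
data(InsectSprays)

# Intercept plus an indicator for every one of the six sprays: one too many
fit_all <- lm(count ~
                I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
                I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
                I(1 * (spray == 'F')) + I(1 * (spray == 'A')),
              data = InsectSprays)

# The last coefficient, the spray A indicator, comes back as NA
coef(fit_all)
```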

14:20

Now, what if, instead of my coefficients being interpreted relative to a reference level, I want my coefficients to be the mean for each of the groups? Well, you can do that, but you have to remove the intercept. So, watch what happens when I say count is my outcome and spray is my predictor, but I remove the intercept.

14:47

What happens is that now I get a different set of coefficients, one for each spray level. So it includes A, B, C, D, E and F. It hasn't dropped any levels. And it can do that now because it has six parameters and six means to work with. And these are exactly equal to the means for each spray in the data. So if I were to just go ahead and calculate the means for each spray, it works out to be the same numbers: 14.5, 15.3, 2.08 in both, and so on.
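A sketch of the no-intercept fit and the matching group means. Using `- 1` in the formula to drop the intercept and `tapply` to compute the means are assumptions; the transcript does not show which commands were used.

```r
data(InsectSprays)

# "- 1" removes the intercept, so every spray level gets its own coefficient
fit_noint <- lm(count ~ spray - 1, data = InsectSprays)
coef(fit_noint)

# Empirical mean count for each spray
group_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
group_means

# The coefficients are exactly the group means
all.equal(unname(coef(fit_noint)), as.vector(group_means))
```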

Now, I want to emphasize that this model is no different from my model that included an intercept. Let me go back to my model with the intercept just to illustrate this. It's just that the coefficients have a different interpretation. In the model where I fit count on spray but included the intercept, my intercept, 14.5, is interpreted as the mean for spray A. And you can see that it's exactly the empirical mean for spray A when I calculate the mean. And then the spray B coefficient, we talked about earlier, was the comparison between the reference level, spray A, and spray B. So if I add these together, 14.5 and 0.833, I should get the mean for spray B. And that's what you see: 14.5 plus 0.833 gets me 15.33, and so on. If I add 14.5 and minus 12.42, I get 2.08. If I add 14.5 and negative 9.58, I get 4.92, and so on.

So this model, where I've included an intercept, has all the same information as the model where I omitted the intercept; the only difference is how the coefficients are interpreted. In the model with the intercept, the intercept is interpreted as the sprayA mean and all the other coefficients are interpreted as differences from sprayA. If I fit it without the intercept, then I get the mean for each spray, and if I want differences, then I have to subtract the coefficients.
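The two parameterizations can be checked against each other directly. This is a sketch; the object names are assumptions.

```r
data(InsectSprays)

fit_int   <- lm(count ~ spray, data = InsectSprays)      # with intercept
fit_noint <- lm(count ~ spray - 1, data = InsectSprays)  # without intercept

# Rebuild the six group means from the intercept model:
# the intercept alone for A, intercept plus coefficient for B through F
b <- coef(fit_int)
means_from_int <- c(b[1], b[1] + b[-1])

# Same numbers as the no-intercept coefficients
all.equal(unname(means_from_int), unname(coef(fit_noint)))

# And the fitted values of the two models are identical
all.equal(fitted(fit_int), fitted(fit_noint))
```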

16:49

And you usually want one of them to be a reference level because then you can do tests. In the model with the intercept, my p values for the t tests are testing whether or not A is different from B, A is different from C, A is different from D, and so on, whereas the p values from the model without the intercept are just testing whether or not those means are different from 0, which is a very different test. Is the mean count for sprayA zero is what that one is testing, whereas in the model with the intercept, the sprayB row is testing whether sprayA is different from sprayB.

So what I'm trying to illustrate is that how you play around with factor variables in lm is very important in terms of how you interpret the results. It's not just a conceptual or theoretical thing to worry about; it's a very practical thing to worry about. What your intercept means changes dramatically depending on what your reference level is and whether or not you include an intercept.
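To see that the two sets of p values answer different questions, compare the sprayB row of each coefficient table. This is a sketch; the object names are assumptions.

```r
data(InsectSprays)

with_int <- coef(summary(lm(count ~ spray, data = InsectSprays)))
no_int   <- coef(summary(lm(count ~ spray - 1, data = InsectSprays)))

# With an intercept, the sprayB row tests whether B differs from A
with_int["sprayB", c("Estimate", "Pr(>|t|)")]

# Without an intercept, the sprayB row tests whether the mean for B is zero
no_int["sprayB", c("Estimate", "Pr(>|t|)")]
```

The first p value is large, since sprays A and B have similar mean counts, while the second is tiny, since the mean count for B is far from zero.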

17:51

Now let me show you how you can relevel. In this case sprayA was my reference level, but you can very easily relevel it. So say you want sprayC to be your reference level. Here I just use the relevel command on the InsectSprays spray variable, so that the reference level is C, and I've created a new variable with that, spray2. And now I'm gonna do my linear model where my outcome is my count and spray2 is my predictor instead of spray. This is the one that has C as the reference level. And then R knows not to use the level with the lowest alphanumeric letter, but instead the reference level that I set. And when I do it, notice sprayA is present and sprayC is gone, cuz now it's the reference level. My intercept is interpreted as the mean for sprayC, and you can see 2.0833, that's exactly the mean for group C. And then this coefficient, 12.42, is interpreted as the comparison of sprayA to sprayC. This 13.25 is the comparison of sprayB to sprayC. So if I want to test sprayA versus sprayC, I look at this p value; if I wanted to test sprayB versus sprayC, I would look at that p value.
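A minimal sketch of the releveling step. The variable name `spray2` comes from the lecture; `fit_c` is an assumption.

```r
data(InsectSprays)

# New factor with C moved to the front as the reference level
spray2 <- relevel(InsectSprays$spray, "C")
levels(spray2)

# Same model, but every coefficient is now a comparison against spray C
fit_c <- lm(count ~ spray2, data = InsectSprays)
summary(fit_c)$coef
```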

So let me just recap, since this is a very important point.

19:10

If we include a factor variable like spray in R, then R automatically includes an intercept and treats the first level of the factor as the reference level. So the intercept is interpreted as the mean for that reference level; in our example, the intercept is interpreted as the mean for sprayA.

19:57

As for the tests, the test for the intercept will be a test for whether or not the mean for sprayA is zero. The tests for all the other levels, the sprayB, sprayC, and other coefficients, will be tests of the comparison versus sprayA.

20:22

If you remove the intercept, then all the tests would be for whether or not the mean is zero: the one for sprayA would be whether or not the sprayA coefficient is different from zero, the one for B would be whether or not the sprayB coefficient is different from zero, and so on, which may or may not be relevant. Usually you want the comparisons, and that's why R's default is to pick one of the levels and treat it as the reference level. However, if you want a different one, say you want B to be the reference level, you just need to use the relevel command. Or, if you wanna get involved a little bit more in linear models, then you need to go into how you calculate standard errors in more general settings, but that's a little bit more advanced. For right now, if you want to do a comparison, say between B and C, then my current suggestion is just to relevel so that B is the reference level, and the coefficient for C will then be comparing spray B and spray C.
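The B versus C comparison suggested above can be sketched as follows. The names `spray_b` and `fit_b` are hypothetical, introduced only for this example.

```r
data(InsectSprays)

# Make B the reference level
spray_b <- relevel(InsectSprays$spray, "B")
fit_b <- lm(count ~ spray_b, data = InsectSprays)

# The row for C is the estimate, standard error, t value, and p value
# for the comparison of spray C with spray B
coef(summary(fit_b))["spray_bC", ]
```

The estimate is the mean for C minus the mean for B, which is minus 13.25 in this data set.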

21:10

I wanna give some caveats about the data analysis that I presented; it's not exactly a complete data analysis. I think modeling the data as if they are normally distributed is perhaps problematic. They're count data, so they're bounded from below by zero. I think it would probably be a little more natural to model this data as if it was Poisson, or at least overdispersed Poisson, or something like that, which we're going to cover in our GLM part of the class.
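For orientation only, here is a sketch of what a Poisson model for the same data might look like. The lecture does not fit this model; the call below is an assumption about one standard way to do it with `glm`.

```r
data(InsectSprays)

# Poisson regression with the same factor predictor (log link by default)
pfit <- glm(count ~ spray, family = poisson, data = InsectSprays)

# Coefficients are on the log scale, so exponentiating the intercept
# gives the fitted mean count for the reference level, spray A
exp(coef(pfit)["(Intercept)"])

# Rough overdispersion check: Pearson chi-square over residual degrees
# of freedom; values well above 1 suggest overdispersion
sum(residuals(pfit, type = "pearson")^2) / df.residual(pfit)
```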

21:37

In addition, the variance is clearly not constant. What I mean by the variance is the variance around the mean, and it's clearly not constant as our regression models would assume. So this is a potential problem. Our means are probably correct, our estimates are probably correct, but our inferences are assuredly not. So that creates an issue that needs to be handled at some level. Later on in the class we'll talk about some things for handling this, and for some of the rest you may have to take further statistical inference classes to deal with the more advanced topics, like when the variances are unequal, which they call heteroscedasticity.
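A quick way to look at the nonconstant variance is to compute the sample variance of the counts within each spray. This is a sketch; the lecture makes the point from the plot rather than from code.

```r
data(InsectSprays)

# Sample variance of the counts within each spray group
group_vars <- tapply(InsectSprays$count, InsectSprays$spray, var)
round(group_vars, 2)

# Ratio of the largest to the smallest group variance
max(group_vars) / min(group_vars)
```

The group variances differ by an order of magnitude, which is at odds with the constant-variance assumption behind the usual linear model inference.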
