Okay. So we talked a little about the impact of excluding important aggressors right, that results and bias. And now let's deep down a little bit further in this other phenomenon, that if we include an important aggressors, that causes so called variance inflation. So I'm going to show this with a little simulation. Okay, so in this case I have a sample of size 100, and I'm going to do 1,000 simulations, so let me define those. Get some space down here, so we're just looking at this new set of code, okay. So I'm going to define those, and then we're going to define three regressors, they're all just three independent random normal vectors. And then what I'm going to do is simulate data, where the outcome y only depends on x1. An error, and then doesn't depend on the other two variables. And then I'm going to look at the standard error for the x1 variable for the model that just includes x1, that's this one. The model that includes x1 and x2, that's this one, and the model that includes x1 and x3. And I'm going to do that over and over again. And I'm going to look at the standard deviation of the simulated estimated coefficients. Now the reason I'm doing this by simulation, is because the variance inflation occurs on the actual standard errors, not the estimated standard errors. It's actually a little complicated, in that if I put another variable into a regression model, it doesn't necessarily inflate the observed standard error. And the reason is because the variance is estimated rather than the actual variance. And that has a kind of conflicting property. So you don't necessarily see the variance go up when you include a regressor in an regression, an unnecessary regressor into a regression model. You have to look at it through simulation. Let's go ahead and do this. This is sort of the ideal setting, where the three regressors don't have anything to do with one another. Now what you see is in fact. So the first model this is the standard error, the standard deviation of the beta coefficients I got crossed across all simulations. Here's the second one. It actually went down a little bit with the second one. Now we know theoretically that can't happen. But there is some Monte Carlo error when I do my simulation. So let;'s just say, these two are about the same. And, the third one is also about the same, 0.032 as opposed to 0.031. So nothing that bad. The variance inflation by including the extra variables did almost, was negligible. And the reason that is the case, is because these other to variables, x2 and x3, have nothing to do x1. I simulated them independently. Let's look at a case now, where I've simulated them in such a way that they depend heavily on x1. So I'm going to generate my x1 as I did the last time, just as a random normal vector. But my x2 depends on x1, but also some noise. And my x3 is heavily dependent on x1, very dependent on x1, okay? So now, let's repeat the simulation. And see what happens to the variability and the estimated coefficients. Okay, now we see huge amounts of variance inflation, especially for this third model where we include x2 and x3, right. We see the standard air has gone up by a factor of five, okay. And that goes to the general rule, If the variable that you include is highly correlated to the thing that you're interested in, right? You are going to inflate the variance more. And so, you need to concern yourself with putting highly correlated variables into your regression model, unnecessarily, okay? So if I have diastolic blood pressure in my model, and I put systolic blood pressure on my model, which is basically the same, relatively the same thing, they're very correlated. If I did that unnecessarily, that's going to make my diastolic blood pressure, standard air, it's going to increase it. So yeah, imagine for example another example is if you had height in a model and then you also included weight or something like that. Something that's highly related. Height and weight tend to track pretty well with one another. And so if that was necessary to include, then you just have to bite the bullet and deal with the extra standard deviation, the extra standard error that you get. However, if it's unnecessary, then you don't want to include it. So again, the main point is, is the more correlated the covariants are to the regressors that you're interested in, the worse off you're going to be in terms of paying a penalty for increased standard deviation. So that goes to show there's no free launch. If we omit variables that are important, we get bias. If we include variables that are unnecessary, we inflate standard errors. Especially if we include variables that are unnecessary that are uncorrelated with our variable of interest. And this again reiterates why randomization is so important, right. So it randomizes across these other variables, the known and unknown knowns that we would otherwise need to include in the model, so that we can only include the treatment of fact of interest. And things that we include even if, if we feel like we have to, they should be relatively balanced, or relatively uncorrelated with the treatment affect because it's been randomized. And so we won't get this kind of variance inflation that we would get otherwise. So again, that's why randomization is such an important quality. Now lets talk about variance inflation factors. So we don't know the actual sigma, so when we include, so we can't actually calculate the exact variance inflation. That depends on the actual true residual variance. However, if we can compare ratios of them, the sigma will dropout and then we can create things like ratios and variance inflation that accurately represent how much the standard deviation is inflated or not. So, there's a particular measure that is very commonly used in regression, called the variance inflation factor or VIF. And all this is, is the comparison, this ratio, the variance inflation that occurs from what you actually have. So looking at the standard error for the data that you have, versus the ideal setting where all the other covariates that you're interested in are orthogonal. To the covariant or let's just say uncorrelated. So remember, that was the ideal case, right? So if everything else was uncorrelated to what you were interested in we didn't see much variance inflation. This looks at the case that you're in, versus the ideal case where what you're interested in is uncorrelated with all the other variables. That's the so called variance inflation factor. And here I give a little simulation. I'm not going to show it. Where I just show you that the ratios can get estimated, and I show you a little simulation. That they can get estimated pretty well so, let's actually look with this Swiss data, let's look at the variance inflation you, so what I'm going to do is fit the model What happens if there's some residual effect of the first aspirin when you give the second one? I'm going to fit the model with all the variables included, so when you do a ~., it includes all the variables. And then I'm going to do vif(fit), that gives me my variance inflation factors. Oh, but, you have to remember to load this library, c, a, r., okay? Let's load that, and then now let's try it again. I would rather than the variance inflation factors, I would look at the inflation factors for the standard deviations, which is the square root. So let's look down here. So on the first line I have the variance inflation factor by itself and then in the line below it, I have the square root of it which I tend to prefer. So let's look at the first line just because it's the more normal thing, the variance inflation, so for the agriculture 1 is 2. What does that mean? It means the standard error for the agriculture effect is double of what it would be if it were orthogonal to all the other regressors, okay? And we see that some of them are quite high, examination is quite high, 3.67, so almost 4 times than the instance when if it were orthogonal to all the other regressors. And we know that education, I'm sorry, examination and education are highly related, so that's why these two have very high variance inflation factors relative to some of the other ones. And that, and you can see, take a variable like infant mortality, which has a very low variance inflation factor. Infant mortality is largely going to be unrelated to any of these other variables, by and large, and so you can see that it doesn't have a large variance inflation factor because it is pretty much unrelated to the other variables anyway, okay? So that's gets you a little bit of headway into the world of looking at what happens when you include unnecessary variables, and one of the ways to look at that is so-called variance inflation factors, and they're incredibly useful little tools and I hope you get a chance to use them. So up to this point all that we've talked about is the impact of unnecessary or necessary included variables on our regression coefficients. But there's one other parameter that we estimate in a regression model, the residual variant, sigma squared. So what happens if we have assuming iid errors in our linear model is additive, and that that part of the model is correct, then we can mathematically describe the impact of omitting unnecessary variables or including unnecessary variables. And it follows along the same lines as what we discussed before, if we underfit the model, in other words, we omit important variables, then the variance estimate is biased. Why is that? Because we've attributed to variation things that are actually systematically explained by these co-variants that we've omitted. On the other hand, if we either correctly fit the model, we include all the right terms, or if we overfit the model, then the variance estimates in unbiased; however, the variance of the variance estimate gets inflated if we include unnecessary variables. So it's actually kind of the same rule, if we omit, as we've discussed before with coefficients, if we omit variables then we get biased, if we include variables, then we get a less reliable estimate, so it's roughly the same impact going on. So let's talk about model selection in general, automated model selection. I'm going to fit the model with all the variables included, so when you do a ~., it includes all the variables. It really, I think, in one point was a statistical topic but it's really moved in to the realm of machine learning for the most part. And I would say, though, even for relatively simple linear regression models, the space of model terms that you have to search among explodes really quickly when you start including interactions and polynomial terms like the square of a regressor and so on. If you have a lot of regressors and you're interested in how do I reduce this space, then there's a lot of factor analytic and things like principal components, those kind of techniques that are available to you, to reduce your co-variant space down to size. Now however, those come with consequences. Your principal components or factors that you obtain might be less interpretable than the original data that you're interested in again. Again, this is probably better served in a multivariable class, a multivariate class, or a class on machine learning. But for us we're going to mostly consider the case where we have a relatively small number of regressors, and we're going to pick through them with a highly interactive process between the analysts, the data, and the scientific context. Another thing I would mention is that good design can often eliminate the need for a lot of this model discussion. We've talked a lot about how randomization can really prevent a lot of the problems that we're talking about with making our variable of interest unrelated to nuisance variables that we're not interested in, or in nuisance variables that we don't even know about; however, there's other aspects of design that can serve the same purpose. For example, if we stratify and randomize within strata. The classic example of this, when this was developed, was R.A. Fisher was working in field crop experiments and they needed, let's say you were trying a different kind of seed, you might block on different areas of the field that you were going to plant in and randomize the different seeds to those areas. So you might have two different kinds of seeds, but they will have been distributed in a systematic way that is fair across the field, but also then, within that design, there will also be some randomization. This topic of experimental design is a pretty broad topic. Another great example is in biostatistics, the field that I work in the most, a very common kind of design is a cross-over design, and in that case, you try to use every subject as their own control. So let's say, for example, you're interested at looking at two different kinds of aspirin, and you might give the aspirin to one group of people, and then the aspirin to another group of people, and the other aspirin to a different group of people. Let's say they have different gels or whatever that determine how much it, how it gets absorbed in your stomach. So if those two groups aren't the same, either the randomization wasn't very good and there was some certain imbalance that you just got unlucky about, or if the study was just observational, then the comparison of those two groups might be biased by whatever differentiates the groups rather than group one receiving one kind of aspirin and group two receiving a different kind of aspirin. On the other hand, if you can give a person one kind of aspirin and then later on give them a different kind of aspirin when they have another headache, that would compare each person to themselves, right? Control block on the person so to speak. So that's a design strategy. Now there's some nuance with this design strategy as well. What happens if there's some residual effect of the first aspirin when you give the second one? So maybe you could handle that with some sort of wash out period, long wash out period or something like that. But anyway, the point of that design is to make it, so that you're comparing people with themselves to control and everything that's intrinsic to the person. These across time periods control for that by giving both aspects to each person. Maybe you would randomize the order in which they received them that's called a crossword design. At any rate, the broader point that I'm trying to make is, it's often the case that good thoughtful experimental design can really eliminate the need for some of the main considerations that you would have to go through in model building if you were to just collect data in an observational fashion. [SOUND] The last thing I would say is there's one automated search model technique that I like quite a bit and I find it very useful and it's the idea of looking at nested models. So, I'm often interested in a particular variable and I'm very interested in how the other variables that I've collected will impact it. So, I'm interested in a treatment or something like that. Some important variable, but I'm worried that my treatment groups and imbalanced with respect to potentially some of these other variables. So what I'd like to look at is the model that just includes the treatment by itself and the model that includes the treatment and let's say, age. If the ages weren't really balanced between the two treatment groups and then one that looks at age and gender, if maybe the genders between the two groups weren't really balanced and then so on. And this idea of creating models that are nested, every successive model contains all the terms of the previous model leads to a very easy way of testing each successive model. And these nested model examples are very easy to do, so I'm just going to show you some code right here on how you do nest and model testing in R. So I fit three linear models to our SWF dataset, the first one just includes agriculture. Let's pretend that, that's the variable that we're interested in and then the next one includes agriculture and examination in education. I put both of those in, because I'm thinking they're kind of measuring the same thing. But now after this lecture, I'm concerned over the possibility that they're too much of measuring the same thing, but let's put that aside for this time being. And then the third model includes Examination + Education + Catholic + Infant.Mortality. So, all the terms. So now, I have three nested models and I'm interested in seeing what happens to my effect as I go through those three models. The point being, in this case, you can test whether or not the inclusion of the additional set of extra terms is necessary with the ANOVA function. So I do anova fit1, fit1 and fit5. That's what I named them, one, three, five. And then you see down here, what you get is a listing of the models. Model 1, model 2, model 3 and then it gives you the degrees of freedom. That's the number of data points minus the number of parameters that it had to fit. The residual sums of squares and the excess degrees of freedom of going Df is the excess degrees of freedom of going from model 1 to model 2 and then model 2 to model 3. So we added two parameters going from model 1 to model 2, that's why that Df is 2 and then we added two additional parameters going from model 2 to model 3. So the two parameters we added from going from model 1 to model 2 is we added examination and education, they're two regression coefficients. Going from model 2 to model 3, we added Catholic and Infant.Mortality there to regression coefficients. With this residual sum to squares and the degrees of freedom, you can calculate so-called S statistic. And thus, get a P value. This gives you the S statistic and the P value associated with each of them, then here it shows that yes, the inclusion of education examination Information appears to be necessary when we're just looking at agriculture by itself. Then I look at the next one it say, yes, the inclusion of Catholic and Infant.Mortality appears to be necessary beyond just including examination, education and agriculture. So if the way in which you're interested in looking at your data naturally falls into a nest model search as it often does, I think when you're interested in one variable in specific. As in this case, I think this would be a pretty natural way of thinking about the series of analyses, then some kind of nested model searches a reasonable thing to do. It doesn't work if the models that you're looking at aren't nested. For example, if I had the first model or model 2 had an examination, but not education and the third model had education, but not examination. This wouldn't apply, you'd have to do something else. And there, I think you get into the harder world of automated model selection with things like information criteria. So I would put all that stuff off to our prediction class and just leave you this one technique that's useful in the one specific instance where you've decided to kind of look along a series of models, each get increasingly more uncomplicated, but including the previous one. So, I hope in this lecture that you've gotten a couple of model selection techniques that you can use. I hope you've also learned that there are some basic consequences that occur, if you include variables that you shouldn't have or exclude variables that you should have. These has consequences to your coefficients that you're interested in, they have consequences to your residual variance estimate. We didn't even touch on some other aspects of [INAUDIBLE] model that could occur, such as absence of linearity and other things like that, non-normality and so on. So again, it's generally necessary to take your model that's the a grain of salt, because more than likely one aspect of your model is wrong. And I'll leave you then with this famous quote by George Box who vary famously said, all models are wrong, some models are useful. And I think that's a very credo to go along with that yes, for sure your model is wrong, but it might be useful in the sense of being a lens to teach you something useful and true about your data set. So that ends today's lecture and I look forward to seeing you for the next one.