In this video, we'll motivate the need for model selection methods in the context of linear regression. Sometimes we have a strong theory that dictates which regression model we should use. Physics often has theories about the way certain variables relate to one another, and other sciences, like biology, also have laws or strong theories. In the social sciences this is possible too, but maybe a little less common. If we don't have such a theory, then we're confronted with a choice: how do we select a model that best explains the data we have and makes the best predictions? The natural question arises, how do we decide which model to use? In answering this question, we really have to think about avoiding two extremes, two different problems. The first extreme is having a model that's too small, with too few predictors. In this case you would have a model that underfits the data. A model that underfits just doesn't capture enough of the systematic variability in the data to be useful. Underfit models are biased: they leave out variables that should be taken into account, so they fail to explain the systematic variability. You'll have a misspecified model, and it will make poor predictions and offer poor explanations. The other extreme, the other thing we want to avoid, is having a model that is too big and overfits the data. A model that overfits tries to take the random variability into account and model it as if it were systematic variability. That's a problem: it tries to provide an explanation, a model, for something that isn't really explainable, because it's random. These models take too many variables into account, so their estimates are highly variable and unstable, and again the structural part of the model is misspecified, which leads to poor predictions.
It might be intuitive to just use the model that has all of the possible predictors in it, the full model. Sometimes people will call this the kitchen sink model: everything gets thrown into the model. But this often leads to overfitting and is thus undesirable. We need some way of selecting a model, and ideally the model selection process will help us avoid the two extreme cases of underfitting and overfitting. Now, if you remember back to the beginning of this course when we introduced R squared, the coefficient of determination, you might be thinking: R squared is a goodness-of-fit metric, so why don't we use R squared to help us select among models? One idea might be to fit many models and use R squared as the way of selecting; the best model would be the model with the highest R squared. The problem with this idea is that R squared always increases when you add predictors, so the R squared method will always pick the model with the largest number of predictors. That will likely result in a model that overfits, right? One that has too many predictors and is modeling some of the random variability instead of capturing just the systematic variability. One way to see why R squared always increases is this: whenever you add a predictor to a model and perform least squares, adding one predictor to the ones you had before can only make the residual sum of squares go down, and if the RSS goes down, R squared goes up. We might also think that t-tests and F-tests can help with model selection, and in some simple cases that's true. We've looked at such simple cases, where we have a choice between two models, a reduced model and a full model. An F-test can help out in that case, and a t-test helps us in an even more simplified case, where we're just deciding whether to leave one predictor in or take it out.
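To see that claim numerically, here is a small sketch (a hypothetical helper using numpy, not code from this course). We fit nested least squares models to a response that is pure noise: R squared still creeps upward with every predictor we add, even though none of the predictors is real.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
y = rng.normal(size=n)            # pure noise response: no predictor matters
X = rng.normal(size=(n, 10))      # ten irrelevant candidate predictors

def r_squared(X, y):
    """R^2 from an ordinary least squares fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

# R^2 for nested models using the first k predictors, k = 1..10
r2 = [r_squared(X[:, :k], y) for k in range(1, 11)]

# adding a predictor can only lower the RSS, so R^2 never decreases
assert all(r2[i] <= r2[i + 1] + 1e-12 for i in range(len(r2) - 1))
```

Picking the model with the highest R squared here would select all ten noise predictors, which is exactly the overfitting problem described above.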
But when we have a very large model space, a space with many predictors, it's not at all clear why the two models specified in an F-test are the most reasonable ones. There might be many, many more, and it's not clear why we would specify an F-test with just those two models when we have so many other possibilities. Consider an example: we have a model with 100 potential predictors, and the goal is to find the best model built from some subset of these predictors. The number of possible combinations of predictors is enormous. This is especially true if you consider functions of any of the predictors (maybe you think a predictor squared should enter the model) or interaction terms, the way two predictors might interact with each other. At that point it's really apparent that F-tests and t-tests just aren't up to the task of searching through that many possible models. When we think about model selection, we really need to think about stepping through the space of possible models in some smart, incremental way. In some situations the model space has a hierarchical structure, and that can make stepping through it easier. One example: suppose a predictor enters the model as a polynomial, so there are some higher-order terms like squared, cubed, or even higher. There's a general rule that you should not remove the first-order term before you remove the squared term or other higher-order terms; you should remove the higher-order terms first. When you have a hierarchical structure, you have a sense of what to remove when. In other cases, you might have a well-formulated and accepted (at least provisionally accepted) theory in whatever scientific discipline you're working in. That theory might suggest that certain variables should stay in the model: they have explanatory power according to the theory, and they should be there.
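To get a rough sense of how large that model space really is: each of the 100 predictors is either in or out of a subset, so there are 2^100 candidate models before we even consider transformations, and adding all pairwise interactions inflates the number of candidate terms further. A quick computation (Python, purely for illustration):

```python
import math

p = 100

# every predictor is either in or out: 2^p possible subsets
n_subsets = 2 ** p
print(n_subsets)            # 1267650600228229401496703205376, about 1.3e30

# allowing all pairwise interaction terms as additional candidates
n_terms = p + math.comb(p, 2)
print(n_terms)              # 5050 candidate terms

# the subset count over those terms dwarfs any exhaustive search
print(2 ** n_terms > 10 ** 1500)   # True
```

No sequence of hand-chosen F-tests can sift through a space of that size, which is why we need systematic search strategies or selection criteria.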
Such a theory can greatly reduce the space of all possible models, and we should really rely on theory whenever it's possible. If data come to you and you have a model selection problem, it's always a good idea to do some upfront research, understand the scientific considerations that underlie the data you have, and try to use some theory to simplify your model selection process. But without a hierarchical structure and without theory, statisticians and data scientists have opted for other methods to help select models. Very broadly, we'll categorize these methods into two categories. The first is hypothesis testing methods: methods that do use hypothesis tests like the F-test and the t-test, but in ways that help us iterate through the model space. Such methods include backward elimination, forward selection, and stepwise regression. We'll briefly talk about these methods, but they're actually not really statistically justified, so they're not great methods to use for model selection; some statisticians are quite critical of them, in fact. The reason we'll briefly touch on them and show you how to use them is to give you some insight into why they're not well justified. But also, for better or worse, people in the literature in different areas of statistics and data science will use these methods, so you should have some familiarity with them and understand the better options. The second set of methods are what you could call criterion-based methods. These methods rely on some criterion that helps us balance between overfitting and underfitting. They try to find a model that fits well, but they penalize models that are too big, so they don't just lead us to the largest possible model.
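To make the hypothesis-testing family concrete, here is a minimal sketch of forward selection: start from the intercept-only model, and at each step add the predictor whose partial F-test against the current model is most significant, stopping when no remaining predictor clears a significance threshold. The function name `forward_select`, the partial-F stopping rule, and the default threshold are illustrative assumptions, not a prescribed implementation from this course.

```python
import numpy as np
from scipy import stats

def forward_select(X, y, alpha=0.05):
    """Forward selection sketch: greedily add the predictor with the
    smallest partial-F p-value; stop when none is below alpha."""
    n, p = X.shape

    def rss(cols):
        # residual sum of squares for an OLS fit on the given columns
        Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        return np.sum((y - Xd @ beta) ** 2)

    selected, remaining = [], list(range(p))
    while remaining:
        rss_cur = rss(selected)
        best = None
        for j in remaining:
            rss_new = rss(selected + [j])
            df = n - len(selected) - 2        # residual df of the larger model
            f = (rss_cur - rss_new) / (rss_new / df)
            pval = stats.f.sf(f, 1, df)
            if best is None or pval < best[1]:
                best = (j, pval)
        if best[1] >= alpha:                  # nothing significant left to add
            break
        selected.append(best[0])
        remaining.remove(best[0])
    return selected

# tiny simulated example: y truly depends only on the first two columns
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
sel = forward_select(X, y)
print(sel)   # picks up columns 0 and 1 first
```

Notice the greedy structure: each step conditions on the variables already chosen, and the repeated testing is exactly what undermines the usual statistical guarantees of the F-test, which is one reason statisticians are critical of these procedures.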
You can really think of this as an operationalization of something like Occam's razor, which says that the simplest adequate explanation is best. Criteria like the AIC or BIC allow us to balance simplicity against goodness of fit: they favor models that fit well, but they penalize complexity, and so they provide this balance. It's also worth noting that there's an adjusted R squared that allows us to compare models of different sizes; it adjusts R squared in a certain way so that we get some balance between goodness of fit and model complexity. Now, these methods aren't perfect, of course, but they can be helpful in choosing a reasonable regression model. In the lessons to come, we'll dive into some of these methods.
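As a preview of how these criteria trade fit against size, here is a small sketch (the helper `fit_metrics` is hypothetical) computing the Gaussian AIC and BIC, up to an additive constant, together with adjusted R squared for an OLS fit. Both criteria reward a small residual sum of squares but add a penalty that grows with the number of estimated coefficients; smaller values are better.

```python
import numpy as np

def fit_metrics(X, y):
    """Gaussian AIC, BIC (up to an additive constant) and adjusted R^2
    for an OLS fit with an intercept; k counts estimated coefficients."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    k = Xd.shape[1]
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    r2 = 1 - rss / np.sum((y - y.mean()) ** 2)
    aic = n * np.log(rss / n) + 2 * k             # penalty: 2 per coefficient
    bic = n * np.log(rss / n) + k * np.log(n)     # penalty: log(n) per coefficient
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)     # shrinks R^2 for model size
    return aic, bic, adj_r2

# y truly depends on only the first two of ten candidate predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=200)

aic2, bic2, adj2 = fit_metrics(X[:, :2], y)   # true 2-predictor model
aic10, bic10, adj10 = fit_metrics(X, y)       # kitchen-sink model
print(bic2, bic10)   # BIC strongly favors the smaller, correct model here
```

Unlike raw R squared, these quantities can get worse when useless predictors are added, which is what makes them usable as model selection criteria.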