[MUSIC] Since I've seen you last, you've chosen your outcome of interest from the dataset that interests you the most, and you've had a chance to think about your research question and what the predictors for this outcome might be.

A common problem that researchers face when developing a model is how to select the variables to include. Ideally we'd like to include all variables of interest, but we're limited by the number of observations in the dataset. There has to be a sufficient number of observations per predictor variable for the model parameters to be appropriately estimated. If we include too many, we risk overfitting the model. Overfitting occurs when so many variables are included that the model fits the data too closely, and we end up explaining random error rather than real relationships. So in this situation, when there are more variables of interest than the sample size will allow, we need to decide how to select the variables for inclusion.

Automated approaches to this problem exist, and a common one is stepwise regression. There are three different algorithms that can be used in a stepwise approach, and the principle behind them is the same: identify the predictors with the greatest influence on the outcome, and use that as the criterion for inclusion in the model. With forward selection, the approach starts with no variables in the model and adds variables one at a time, as long as each addition makes a significant improvement to the current model. Improvement can be measured through various criteria, and often the p-value is used. With backward selection, as you may guess from the name, we start with the full model, including all of the variables, and then the least significant variable is removed at each iteration. And finally, the stepwise approach is based on forward selection, but after adding each new variable, the model is tested to see whether any of the previously included variables can be removed.
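To make the forward selection algorithm concrete, here is a minimal sketch in Python. It is an illustration of the p-value-based rule described above, not a recommendation to use it: at each step it refits the model with each remaining candidate, adds the candidate with the smallest p-value, and stops when no candidate reaches the chosen threshold. The function names (`ols_pvalues`, `forward_select`), the 0.05 threshold, and the simulated data are all my own choices for the example.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return two-sided p-values for the
    predictor coefficients (intercept excluded)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - k - 1)           # residual variance
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)        # coefficient covariance
    t = beta / np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(t), df=n - k - 1)
    return p[1:]                                   # drop the intercept

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection: at each step, add the candidate whose
    coefficient has the smallest p-value when added to the current
    model; stop when no candidate falls below alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {j: ols_pvalues(X[:, selected + [j]], y)[-1]
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Simulated example: six candidate predictors, but only columns 0 and 3
# truly affect the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)
print(forward_select(X, y))
```

With strong true effects like these, the procedure will pick up columns 0 and 3, but note that each pure-noise column still has roughly a 5% chance of slipping in per test, which is exactly the selection-bias problem discussed next. A stepwise variant would add a removal check on the already-selected variables after each addition.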
So whilst these approaches are an advancement on manual screening of variables at the univariable level, they still suffer significant drawbacks, only a few of which I've listed here. These problems originate from the fact that we expect variability in our sample estimate of a population parameter. By only selecting variables that reach a specified threshold in one particular sample, we will be biased towards selecting those that, by chance, are significant in that sample. This results in the regression coefficients and the R-squared value being biased high, and in the standard errors and the p-values being biased low.

In addition, the algorithms only really work well if all the variables are completely independent, and we know that in practice most variables are correlated to some extent. So the results will in part depend on the order in which the variables are considered for inclusion. If there's some degree of collinearity between two or more variables, it's the first of those variables to be included that will explain the overlapping variance, and the second variable will seem less influential.

If you read the published literature, you will see that stepwise regression is still commonly used in pockets of research, despite these significant limitations. But just because it is used doesn't mean it's any good or that you should use it. My advice is to steer clear of univariable screening approaches and of automated methods. Instead, think about the problem. In the next lecture, we'll look at how to develop a modelling strategy, in which I'll encourage you to use a more thoughtful approach to variable selection. [MUSIC]