[MUSIC] In this lecture, we're going to look at multiple linear regression. It's a simple extension of what you've been doing, but there are a couple of additional aspects that need to be considered. You've so far determined that both age and lung function are predictors of walking distance. Lung function explains much more of the variability in the observations, as its adjusted R-squared is 21%, whereas for age it's just 4%. The next natural question is: if we use both of these variables, will it improve our model for walking distance? Well, we can examine whether it does by fitting a multiple regression model, and the instructions for doing this in R follow this lecture.

On the left-hand side, you can see the results from the original univariable regressions, and on the right-hand side, the results for the multiple regression model, and you can see both coefficients have reduced slightly: FEV1 has changed from 74 to 71, and age has changed from -3.1 to -2.5. So which of these three models would you choose, and why? The coefficients have changed value because they're now adjusted coefficients. FEV1 in the multiple model is now adjusted for the effect of age, so you can interpret it as the change in walking distance expected for a one-unit increase in FEV1, keeping age constant.

If the predictors were entirely independent, then the estimates in the multiple model would be the same as in the univariable models, but in practice variables are often correlated to some extent. If you think about age and lung function, we would expect some kind of association, even if it is only a weak one. But if the correlation between two predictors is strong, this can cause a real problem when modelling. This phenomenon is called collinearity, and it refers to a strong linear relationship between two predictors. It causes a problem because the two variables are both explaining the same variance in the observations, which means the variance can't be partitioned between the two competing predictors.
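The course's R instructions follow the lecture; as a language-agnostic illustration, here is a minimal pure-Python sketch of the same idea. All the data below are simulated with invented numbers (they only loosely mimic the lecture's variables), and the helper `ols` is a hypothetical least-squares fit via the normal equations, not the course's `lm()` output. The point it demonstrates is that when two predictors are correlated, each univariable slope absorbs some of the other variable's effect, so the adjusted slopes in the multiple model differ.

```python
import random

random.seed(1)

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.
    X is a list of rows; an intercept column of 1s is prepended."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(xtx[r][c]))
        xtx[c], xtx[p] = xtx[p], xtx[c]
        xty[c], xty[p] = xty[p], xty[c]
        for r in range(k):
            if r != c:
                f = xtx[r][c] / xtx[c][c]
                xtx[r] = [a - f * b for a, b in zip(xtx[r], xtx[c])]
                xty[r] -= f * xty[c]
    return [xty[i] / xtx[i][i] for i in range(k)]

# Simulated data: age and FEV1 are mildly correlated, and both
# influence walking distance (true adjusted slopes: 70 and -2.0).
n = 200
age = [random.uniform(45, 80) for _ in range(n)]
fev1 = [4.5 - 0.03 * a + random.gauss(0, 0.4) for a in age]
mwt = [300 + 70 * f - 2.0 * a + random.gauss(0, 40) for f, a in zip(fev1, age)]

b_fev = ols([[f] for f in fev1], mwt)     # univariable FEV1 model
b_age = ols([[a] for a in age], mwt)      # univariable age model
b_both = ols(list(zip(fev1, age)), mwt)   # multiple model: adjusted slopes

print("FEV1 alone:", round(b_fev[1], 1))
print("age alone: ", round(b_age[1], 1))
print("adjusted:   FEV1 =", round(b_both[1], 1), " age =", round(b_both[2], 1))
```

Running this, the univariable slopes are larger in magnitude than the adjusted ones, because each predictor is partly standing in for the other, just as the lecture's FEV1 coefficient shrank from 74 to 71 once age was adjusted for.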
As an example, let's take a quick look at walking distance as a predictor of quality of life. It's a slightly artificial example, but it demonstrates the point well. Using SGRQ as the outcome, we fit two regression models: one using the first six-minute walk test as a predictor, which is the variable MWT1, and the second using MWT2. You can see the results of the regression models here, and they show that walking distance is predictive of quality of life: the further a person walks, the better their quality of life. You'll notice that both p-values are highly significant. But we expect the two walk tests to be correlated, because the first and the second are measuring the same thing, and you can see that in the scatter plot of the data.

If we include both of these variables in one model, what do you think will happen? Well, the model won't physically explode, but metaphorically it does. On the right-hand side, you can see the results from the multivariable model, and what you'll notice now is that where both variables were highly significant predictors of quality of life before, neither of them is now. The standard error of both coefficients is huge, and as a result the 95% confidence intervals are wide. So this multivariable model has given us an entirely misleading impression of the relationship between walking distance and quality of life, as it indicates they're not associated. This is why we need to take care when fitting models: had we not examined our data, we could easily have fitted this model and not realised that anything was amiss. You might have noticed there is one thing we've not done this time, and that's check the assumptions; I'll leave that for you to do as an exercise.

So in this lecture, you've seen how to extend the model to include more than one variable, and how this provides estimates that are adjusted for the other predictors.
I've also demonstrated the importance of examining the data to identify potential problems such as collinearity. So all the variables we've considered so far have been continuous. Next, I'll show you how to include binary and categorical predictor variables. [MUSIC]