[MUSIC] In the previous video, we saw what regression analysis is and how it deals with metric data, basically interval data. In this video, we're going to dig deeper into regression analysis and consider the different things we can do with it. So, just to recap. In linear regression, we try to measure the effect of your independent variables, remember the xs, on your dependent variable, which is the y, and then you have certain error terms, which are the es. The first task is to estimate the unknown or unobserved terms, a and b, which are the intercept and the coefficients. So, what method do you use to estimate these a and b terms? The simplest method is called ordinary least squares, or OLS. The idea behind ordinary least squares is to find the values of a and b for which the sum of squares of the error terms is smallest. So the key concept is: how do you minimize the errors overall, in order to capture the real effect of the xs on the ys? Now, once you have these a and b terms estimated, how do you dig deeper into these coefficients? The first concept is how to assess the fit of the model, that is, how effectively you have captured the causal effect of the xs on the ys. This fit depends on a few things. The most important one is called the R squared, which is the proportion of the variance of y that is explained by the regression in terms of the independent variables, the xs. R squared takes a value between 0 and 1, and R squared increases even when you add an irrelevant independent variable to the set of variables you already have. Of course, the more independent variables you have, the better you expect the model to fit. Unfortunately, this can lead to overfitting the data when you include irrelevant independent variables.
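The OLS idea described above, estimating a and b by minimizing the sum of squared errors and then computing R squared, can be sketched in Python. The data values here are purely illustrative, not from the lecture:

```python
import numpy as np

# Hypothetical data (illustrative values only): one independent variable x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS closed-form estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# R squared: the share of the variance of y explained by the regression
y_hat = a + b * x
ss_res = np.sum((y - y_hat) ** 2)     # sum of squared errors (minimized by OLS)
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation of y around its mean
r_squared = 1 - ss_res / ss_tot
```

With these made-up numbers the fit is nearly perfect, so r_squared comes out close to 1; with noisier data it would be lower.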
It's also important to look at the predictive accuracy of the model, beyond the R squared: whether you can predict better with a larger set of independent variables. So how do you do predictions? Once you know the as and bs, we can predict y for any set of values of x. How do you do that? Think again about the regression equation, which is y in terms of all the independent variables, the xs, plus an error term. What prediction does is let you ask how changing one or two of the xs affects your y. Why do you do predictions? Mainly because you want to know how your model does out of sample. That is, how well does your model predict observations which are not in your estimation sample? Usually, for out-of-sample prediction, we use a hold-out sample and compare the predicted values of the dependent variable with the actual values. A second use of prediction is what-if analysis: for example, what will the sales, your y, be when we set the price, an x, to a certain level? A third use of prediction is optimization: again for example, what price or price range of the xs is expected to give you a certain level of sales or a certain level of profits? A further use of regression is hypothesis testing. Remember, for categorical data we talked about how to do hypothesis testing; for metric data, we also want to know how to test our hypotheses. So again, look back at the regression equation, that is, y as a function of your xs, with a and b as your intercept and your coefficients. The hypothesis test is going to measure whether there is a statistically significant effect of an independent variable, your x, on your dependent variable, your y.
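The hold-out idea above, estimating on one part of the data and comparing predictions against actual values on the rest, can be sketched as follows. All numbers are hypothetical, and the what-if price of 10 is just an example value:

```python
import numpy as np

# Hypothetical data split into an estimation sample and a hold-out sample
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_train = np.array([3.2, 5.1, 6.8, 9.2, 10.9, 13.1])
x_hold = np.array([7.0, 8.0])
y_hold = np.array([15.0, 16.9])

# Estimate a and b by OLS on the estimation sample only
b = (np.sum((x_train - x_train.mean()) * (y_train - y_train.mean()))
     / np.sum((x_train - x_train.mean()) ** 2))
a = y_train.mean() - b * x_train.mean()

# Predict the hold-out observations and compare with the actual values
y_pred = a + b * x_hold
rmse = np.sqrt(np.mean((y_hold - y_pred) ** 2))

# What-if analysis: predicted y if x (e.g. price) were set to 10
y_what_if = a + b * 10.0
```

A small rmse on the hold-out sample suggests the model predicts well out of sample; a large one suggests overfitting to the estimation sample.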
For example, let's think about x as price and y as sales. In this case, your null hypothesis is again the one where you do not expect any effect: that is, your coefficient b is equal to 0. You usually use a t statistic, that is, a statistic from a t distribution with certain degrees of freedom, to carry out this hypothesis test. Your degrees of freedom measure how many independent terms you estimate compared to the number of observations you have. You also have to compute the standard error of the coefficient. Finally, as we talked about earlier for categorical variables, we evaluate the p value based on the statistic. If the p value is really small, that is, less than your level of significance, remember the alpha we discussed, you reject the null, and so you conclude that there is evidence of a causal effect of the x variable on your y, and the effect is significant. There are certain issues, however, with linear regression, which you have to be careful about when running this kind of model. The first one is multicollinearity. What is multicollinearity? When independent variables are correlated with one another, then including all of them together can lead to unreliable coefficient estimates. So you have to be careful that your independent variables are not multicollinear. One solution is to drop independent variables which are very highly correlated with one another. Another solution to multicollinearity is to combine the related variables using something like a factor analysis, which we discussed under categorical data. Then, you also have to ask: is the relationship between your xs and your ys linear or non-linear? Because, remember, we can only use linear regression when you have a linear relationship. However, you will not always be in a situation with a linear relationship. You can have a non-linear relationship.
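The test of the null hypothesis b = 0 can be sketched as below: compute the standard error of b, form the t statistic, and get a p value from the t distribution with n - 2 degrees of freedom (two estimated terms, a and b). The data are hypothetical, and scipy is assumed to be available for the t distribution:

```python
import numpy as np
from scipy import stats  # assumed available, for the t distribution

# Hypothetical data: x = price, y = sales (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([9.8, 9.1, 8.3, 7.2, 6.4, 5.1, 4.3, 3.2])

# OLS estimates of b (slope) and a (intercept)
n = len(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

# Standard error of b, with n - 2 degrees of freedom (two estimated terms)
df = n - 2
s2 = np.sum(residuals ** 2) / df
se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))

# Test H0: b = 0 with a t statistic; reject if the p value is below alpha
t_stat = b / se_b
p_value = 2 * stats.t.sf(abs(t_stat), df)
```

Here the p value comes out far below a significance level of alpha = 0.05, so we would reject the null and conclude that price has a significant (negative) effect on sales in this made-up sample.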
How do you deal with that? Or, how do you check whether the relationship is linear or not? You plot your residuals, which are the differences between your actual ys and your predicted ys, against the xs, your independent variables, to check for systematic deviations. If the residuals do not show much of a pattern, then you're fine. However, if they do, then instead of linear regression, you probably have to use non-linear regression. The third problem with linear regression analysis is the possibility of outliers: whether there are any extreme values of the residuals which we have to take into consideration. The final thing to consider when running a regression analysis is prediction outside the data interval. Remember, with a regression model you can always predict out of sample, but extrapolating beyond the range of the sample can be very dangerous: for example, there is limited variation in the independent variable out there, and the extrapolated effects can be insignificant or misleading. So be careful when doing a regression analysis and then trying to predict out of sample or extrapolate. These are the basic ideas when you run a regression analysis, more specifically a linear regression analysis. You could extend these ideas to non-linear regression analysis as well, which is more complicated, but the method is essentially very similar. Thank you. [MUSIC]
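The residual check described above can be sketched with a deliberately non-linear example (made-up quadratic data): fitting a straight line anyway leaves a clear pattern in the residuals, which is exactly the signal that linear regression is the wrong model here.

```python
import numpy as np

# Hypothetical non-linear data: y grows quadratically in x
x = np.arange(1.0, 9.0)
y = 0.5 * x ** 2 + 1.0

# Fit a straight line by OLS anyway
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

# With a truly linear relationship the residuals would scatter randomly
# around 0; here they trace a U shape (positive at both ends of the x
# range, negative in the middle), revealing the curvature the line missed.
```

In practice you would plot the residuals against x and look for this kind of shape; a plain scatter with no visible structure supports the linear specification.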