Recall the concept of simple linear regression, where the idea was to find the best fitting line through a group of points. The main goal of regression analysis is to test the validity of possible causal relationships. In other words, how does one variable influence another variable? How does X affect Y? With this lesson, you'll be able to discuss how regression can be used to make predictions, like how changing price will affect your sales. We're also going to talk about errors, so that you'll be able to list the three main assumptions about errors: errors are normally distributed, errors are homoscedastic, and errors are not autocorrelated. Finally, you will be able to identify what violations of these assumptions look like on a graph.

Besides measuring relationships, regression is also used to make forecasts and predictions. So assuming that you discovered a particular relationship, you can make what-if statements to predict the outcomes. For example, what would be the impact of advertising on sales? From there, we can use the parameter estimates, or the results from the regression, to make what-if statements. What if advertising increases by 10%? What would sales look like? Similarly, if price decreases by 5%, how much would sales increase by? So remember, what we've looked at is Y = a + bX + error, where Y was the dependent variable and X was the independent variable that we want to use to explain Y's variation. The predictor, again X, would be used to predict the dependent variable.

Now, I'm going to explain a bit more about error terms. The error terms are supposed to capture all the other factors that could make the dependent variable vary beyond the proposed model Y = a + bX. Errors are usually seen as a statistical construct based on some assumptions. Understanding these assumptions is important to understand the validity of your results. These assumptions can be violated, and if they are violated, that means that there's something wrong with your analysis and that you need to find a solution to address it. There are three main assumptions that are made about errors: errors are normally distributed, there's homoscedasticity in the errors, and there is no autocorrelation between errors.

Let's start with the normality assumption. The normality assumption means that the errors are normally distributed, meaning they follow the normal distribution, which is the basis of many results in statistics. In this sense, it means that we can make statements about the probability of various outcomes, and in particular about the distribution of the forecast errors. What are the forecast errors? The forecast errors are the difference Y - Y-hat, where Y-hat = a + bX, and a and b are the parameters that you estimated through the regression. Here's a visualization of what this looks like. You can take actual Ys and then take predicted Ys, and if your errors are normally distributed, you should see an error frequency distribution that has two characteristics. One, you should see one big peak somewhere, meaning that the distribution will be unimodal. Two, you should have symmetry around this value. Visually, this is how you could identify normally distributed errors. There are also some statistical tests that you could apply to test for normality of the errors after you've made your estimation. One such test is called the Jarque-Bera test.
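To make this concrete, here is a minimal sketch in Python using numpy and scipy. The advertising and sales numbers, the variable names, and the 10% scenario are made-up assumptions for illustration, not figures from the lecture. The sketch fits Y = a + bX, uses the estimates to answer a what-if question, computes the forecast errors Y - Y-hat, and runs the Jarque-Bera test on them.

```python
# A minimal sketch with synthetic data (illustrative assumptions, not lecture data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical example: X is advertising spend, Y is sales
X = rng.uniform(10, 100, size=200)
Y = 50 + 2.5 * X + rng.normal(0, 10, size=200)   # assumed true relationship plus noise

# Estimate a and b by fitting the best line through the points
b, a = np.polyfit(X, Y, deg=1)          # polyfit returns the slope first, then the intercept
print(f"estimated a = {a:.2f}, b = {b:.2f}")

# What-if statement: what would sales look like if advertising rose by 10%?
current_ad = 60.0
predicted_now  = a + b * current_ad
predicted_plus = a + b * current_ad * 1.10
print(f"predicted change in sales: {predicted_plus - predicted_now:.2f}")

# Forecast errors: Y - Y_hat, where Y_hat = a + bX
Y_hat = a + b * X
errors = Y - Y_hat

# Jarque-Bera test for normality of the errors
jb_stat, jb_pvalue = stats.jarque_bera(errors)
print(f"Jarque-Bera statistic = {jb_stat:.2f}, p-value = {jb_pvalue:.3f}")
```

A very small p-value from the Jarque-Bera test would be evidence against the normality assumption; with the well-behaved synthetic errors above, you would expect a large p-value.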
Now, another assumption that is important is one called homoscedasticity of errors. It basically means that the variance of the forecast errors is constant, which means that the variance of the errors does not vary over time or across variables. Now let's look at this graph. If your data are normally distributed, and if the assumption of homoscedasticity is respected, you are going to have this kind of picture. The X axis is going to be the predictor, and Y is going to be the value of the dependent variable that you try to explain. The red line is going to be the regression line, Y = a + bX, and the dots are going to be the observed values around the forecast. If the model is correct, the forecast errors should be roughly the same for any value of X. Now let's compare this to another graph. If you don't have homoscedasticity, you have an issue. It means that the variance of your errors is increasing. Here's an example where, as X is increasing, the errors around the mean are increasing as well. So that would be a visual indication that you don't have homoscedasticity of errors.

The last assumption is that errors do not display any obvious pattern. If you see an obvious pattern in the errors, then your errors are autocorrelated. So visually, if there isn't any problem, your chart will look something like this. On the X axis, you will have the independent variable, and on the Y axis, you will have the dependent variable that you are trying to predict. You can see here that there is no obvious pattern in the errors. However, in this next graph, X again being the independent variable and Y being your dependent variable, you have this big diagonal, which is the regression line a + bX, and the errors follow some kind of systematic pattern that looks like a wave. So that would be a visual indication that your errors are autocorrelated.

So to sum it all up, errors capture all the factors that could make the dependent variable vary beyond the model Y = a + bX. If the errors are normally distributed, are homoscedastic, and are not autocorrelated, then you're okay. If the errors violate any of these assumptions, then you have a problem.
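In the same spirit, here is a minimal numeric sketch of the two remaining checks, again in Python with synthetic data. The numbers, the split-the-sample comparison, and the lag-1 correlation are illustrative assumptions rather than methods from the lecture: comparing the error spread for small versus large X mirrors the fanning-out picture of heteroscedasticity, and the correlation between consecutive errors would pick up a wave-like, autocorrelated pattern.

```python
# A minimal sketch with synthetic data (illustrative assumptions, not lecture data).
import numpy as np

rng = np.random.default_rng(0)

X = np.sort(rng.uniform(10, 100, size=300))
# Deliberately heteroscedastic example: the noise standard deviation grows with X
Y = 50 + 2.5 * X + rng.normal(0, 0.3 * X)

b, a = np.polyfit(X, Y, deg=1)
errors = Y - (a + b * X)

# Homoscedasticity check: compare the error spread for small vs. large X
half = len(X) // 2
low_std, high_std = errors[:half].std(), errors[half:].std()
print(f"error std for small X = {low_std:.2f}, for large X = {high_std:.2f}")
# A large gap between the two, as in this example, is the fanning-out pattern in numbers.

# Autocorrelation check: correlation between each error and the previous one
lag1 = np.corrcoef(errors[:-1], errors[1:])[0, 1]
print(f"lag-1 autocorrelation of errors = {lag1:.2f}")
# A value far from 0 would suggest a systematic, wave-like pattern in the errors.
```

In practice you would also plot the errors against X, which is exactly the visual check described above.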