In this video we will define the least squares line and also talk about how to calculate and interpret the slope and the intercept of the line. We can't simply add up all of the residuals, because some of them are going to be negative and some of them are going to be positive, depending on whether the model is over- or underestimating certain data points. So we need to come up with a more clever approach. One option is to minimize the sum of the magnitudes, in other words the absolute values, of the residuals. Another option is to minimize the sum of squared residuals; this is what we call least squares, and it is the option that we're going to be sticking with. So, why least squares? This is indeed the most commonly used approach, and it's also easier to compute, both by hand and using software. But most importantly, in many applications a residual twice as large as another is more than twice as bad. We used the same idea when we calculated the standard deviation earlier in the course. This is the general form of the least squares line: we have our explanatory variable x, which gets multiplied by the slope beta 1, and we also have an intercept, where the line intersects the y axis. And finally we have y hat, which stands for the predicted value of the response variable. Before we talk more about how we actually come up with this line, let's first focus a little bit on the notation. Once again we have data from a sample, and we are going to be using that sample to estimate unknown population parameters. The unknown population parameter for the intercept is beta 0, and its point estimate counterpart, the observed value, is b0. So once again, we're using the Greek alphabet / Latin alphabet approach for denoting our parameters and point estimates. Similarly, for the slope, the parameter is beta 1, and the point estimate is b1. So how do we estimate the regression parameters? Let's start with the slope.
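To make the discussion of residuals concrete, here is a small Python sketch with made-up observed values and predictions (the numbers below are hypothetical, not from the course data set). It shows why the raw sum of residuals is misleading, and what the two alternative criteria, sum of absolute values and sum of squares, look like.

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([10.0, 12.0, 9.0, 14.0])
y_hat = np.array([11.0, 10.5, 10.0, 13.0])

residuals = y - y_hat            # some positive, some negative
print(residuals.sum())           # -> 0.5: negatives cancel positives, so this undersells the error
print(np.abs(residuals).sum())   # -> 4.5: sum of absolute values, one alternative criterion
print((residuals ** 2).sum())    # -> 5.25: sum of squared residuals, the least squares criterion
```

Notice that squaring penalizes the larger residual (1.5) proportionally more than the smaller ones, which is exactly the "twice as large is more than twice as bad" idea.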
Remember, earlier we said that this is a least squares line; in other words, we're minimizing the sum of squared residuals. To minimize the sum of squared residuals, we could actually use a little bit of calculus and calculate the slope and the intercept using that approach. However, since this is not a calculus-based course, we'll instead introduce some shortcut formulas. We can calculate the slope b1 as the standard deviation of y divided by the standard deviation of x (you may have heard of this ratio as rise over run), times R, the correlation coefficient. Let's illustrate this with an example. The standard deviation of percentage living in poverty is 3.1%, and the standard deviation of percentage of high school graduates is 3.73% in our data set. Given that the correlation between these variables is -0.75, what is the slope of the regression line for predicting percentage living in poverty from percentage of high school graduates? First, let's parse through the information given in the problem. We're told that the standard deviation of percentage living in poverty is 3.1%, so we can say that sy is 3.1%, because, remember, poverty is our response variable. We are also told that the standard deviation of percentage of high school graduates is 3.73%, so sx is 3.73%. And we are given the correlation coefficient as negative 0.75. Putting all of these into the formula for the slope, b1 = (sy / sx) × R, we simply need to plug in the numbers: 3.1 divided by 3.73, times negative 0.75, gives us a slope of negative 0.62. Note that the sign of the slope is always going to be equal to the sign of the correlation coefficient. Conceptually speaking, this is true because we're clearly seeing a negative relationship between the two variables, so it makes sense that the slope is negative. Mathematically speaking, remember that the standard deviation is the square root of the variance, and it's a measure of variability.
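The worked example above can be reproduced in a couple of lines of Python, plugging the given summary statistics into the shortcut formula:

```python
# Slope of the least squares line: b1 = (s_y / s_x) * R
s_y = 3.1    # standard deviation of % living in poverty (response)
s_x = 3.73   # standard deviation of % high school graduates (explanatory)
r = -0.75    # correlation between the two variables

b1 = (s_y / s_x) * r
print(round(b1, 2))  # -0.62
```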
So the standard deviations of y and x are always necessarily going to be positive numbers. In the first part of the equation, we're dividing two positive numbers, which is always going to yield a positive result. And then we're multiplying by a value that could be negative or positive, depending on the direction of the relationship between the two variables. So mathematically speaking as well, the sign of the slope is always going to be the same as the sign of the correlation coefficient. Calculating the slope by hand is clearly very simple, and actually sometimes unnecessary as well, because often we don't calculate these values by hand but simply use computation. What is really important is how we interpret this number, negative 0.62: for each percentage point increase in the high school graduation rate, we would expect the percentage living in poverty to be lower on average by 0.62 percentage points. There are a few things to remember here. First, the interpretation of the slope is about the relationship between your explanatory and your response variable; in other words, how do we expect the response variable to change as we increase the explanatory variable by 1 unit? When we're interpreting these, we also want to make sure that if we are dealing with an observational study, such as the one we have here, we avoid causal language. This is why we're saying that we expect this to happen on average, as opposed to interpreting this value as something like: if you increase the high school graduation rate by 1 percentage point, you would be able to decrease poverty by 0.62 percentage points. Next, we want to estimate the intercept, and remember that the intercept is where the regression line crosses the y axis. For this, we're going to make use of the property that the least squares line always goes through (x bar, y bar); in other words, it's always going to go through the means of x and y.
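The sign argument above can be checked directly: since sy / sx is always positive, the slope inherits its sign from the correlation alone. A minimal sketch, reusing the summary statistics from the example (flipping the sign of r is purely for illustration):

```python
# b1 = (s_y / s_x) * r: the ratio of standard deviations is always positive,
# so the sign of the slope is determined entirely by the sign of r.
def slope(s_y, s_x, r):
    return (s_y / s_x) * r

print(slope(3.1, 3.73, -0.75) < 0)  # True: negative correlation -> negative slope
print(slope(3.1, 3.73, 0.75) > 0)   # True: positive correlation -> positive slope
```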
We know that we can write the linear model as y hat equals b0, the intercept, plus b1 times x. All we need to do now is plug x bar and y bar into our equation, because we know that the line has to go through this point, and rearrange things a bit to get the formula for the intercept. By rearranging, we can see that the intercept can be calculated as the average value of the response variable, minus the slope (which we already calculated in the first step) times the average value of the explanatory variable. Let's see how we can do that. Given that the average percentage living in poverty is 11.35% and the average percentage of high school graduates is 86.01%, what is the intercept of the regression line for predicting percentage living in poverty from percentage of high school graduates? One value that's given to us is the average value of the response variable, 11.35%. Another value that's given to us is the average value of the explanatory variable, 86.01%. And we also know that we can calculate the intercept as simply the average value of the response variable minus the slope times the average value of the explanatory variable, and we calculated the slope in the previous step as negative 0.62. So we have all the building blocks; we just need to plug them into the equation, and the intercept comes out to be 64.68%. Once again, the calculation is very simple once you have these givens, and most of the time you're not going to need to do this calculation by hand. What is important is to understand two points: one, that the regression line always goes through the center of the data; and two, how we interpret the intercept. Remember, the intercept is where we said the regression line crosses the y axis. In other words, it's the expected value of the response variable when the explanatory variable is equal to zero.
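The intercept calculation follows the same plug-in pattern as the slope, using the property that the line passes through the point of means:

```python
# Intercept: b0 = y_bar - b1 * x_bar, using the fact that the least squares
# line passes through (x_bar, y_bar)
y_bar = 11.35   # average % living in poverty (response)
x_bar = 86.01   # average % high school graduates (explanatory)
b1 = -0.62      # slope from the previous step

b0 = y_bar - b1 * x_bar
print(round(b0, 2))  # 64.68
```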
So, in context, what we can say is that states with no high school graduates are expected, on average, to have 64.68% of their residents living below the poverty line. Does this seem realistic, that there would be states in the U.S. with absolutely no high school graduates? Looking at the data we have, that actually seems very unlikely. We can see that all the states in the US have high school graduation rates varying somewhere between 75% and about 95%. So mathematically speaking, this is a construct that is important for putting together our linear model; however, in context, it is not a very useful number. Putting the information together from the previous two steps, we can write our linear model as: predicted percentage living in poverty equals 64.68, the intercept, minus 0.62, the slope, times the percentage of high school graduates. If, instead of calculating these values by hand, we had actually used computation, the regression output would look something like this. Depending on the software you're using you might get slightly different formatting, but usually this is the general format for the regression output. We can see our intercept, and we can see that it's slightly off from what we calculated, which is probably simply due to rounding; and we can also see our slope as the estimate associated with the explanatory variable. We are going to talk about what the standard error, the t score, and the p value mean later on in the course. So for now let's just focus on the estimates column, where we can find what we call our parameter or coefficient estimates for the slope and the intercept. To recap, we interpret the intercept as: when x equals 0, y is expected to equal the intercept. As we discussed, this may be a meaningless value in the context of the data, and in those cases it might only be serving to adjust the height of the line.
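With both estimates in hand, the fitted line can be used for prediction. A small sketch (the function name is just illustrative): evaluating the line at the mean graduation rate, 86.01%, returns roughly the mean poverty rate, 11.35%, which confirms that the line passes through the center of the data.

```python
# Fitted model: predicted % in poverty = 64.68 - 0.62 * (% high school graduates)
def predict_poverty(hs_grad_pct):
    return 64.68 - 0.62 * hs_grad_pct

# Predicting at x_bar recovers (approximately, due to rounding) y_bar = 11.35
print(round(predict_poverty(86.01), 2))
```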
The interpretation of the slope is slightly different, since it's about the relationship between the two variables: for each unit increase in x, y is expected to be higher or lower, on average, by the value of the slope. So this basically tells us, as we increase x by one unit, what we expect to happen to y. And once again, depending on the type of study you have, you want to be careful about interpreting the slope in a causal versus a correlational way.