Now, let's look at the same data, but including Palm Beach County. Palm Beach County is this one observation all the way up there. It's called an outlier because its y-value is very far away from the regression line. This is very clear from the residual plot on the right-hand side, where we again see Palm Beach County very far above the horizontal line. Such outliers should be examined because they could represent an interesting phenomenon. They could also simply represent a typo, in which case you may decide to remove the observation. This is another application of the residual plot: it makes it easy to spot such outliers.

In fact, there was quite some controversy in the 2000 presidential election. The reason was that only Palm Beach County used the so-called butterfly ballot, and that ballot was suspected of confusing some voters into voting for Buchanan instead of the Democratic candidate, Al Gore. That may explain why Buchanan got such a large number of votes only in Palm Beach County. Now you can see why regression is quite useful in all kinds of situations. For example, you could use the residual here to estimate how many votes Buchanan got in error.

So far, we have only looked at y-values that are outlying. An x-value that is far from the mean of the x-values is said to have high leverage. The word leverage is used because such a point has the potential to cause a big change in the regression line. Let's look at this toy example plotted here. There are four points. Three of them follow a roughly linear pattern, but the fourth one is quite a bit apart. Moreover, it has a lot of leverage because its x-value is far away from the others. What would happen if we fit the regression without that point? Here is the regression line that we get if we omit this point. We see that this one point has a big influence on the regression line. Such a point is called an influential point.
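To make the refit concrete, here is a minimal sketch in Python using NumPy. The four data points are made up for illustration (they are not the points from the plot), and `fit_line` is a hypothetical helper name. The sketch fits the line once with all four points and once without the high-leverage point, and also prints the residuals from the full fit.

```python
import numpy as np

# Hypothetical toy data (not the lecture's actual points): three points on a
# roughly linear upward trend, plus a fourth point far to the right in x.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([1.0, 2.1, 2.9, 1.0])

def fit_line(x, y):
    """Return the least-squares slope and intercept of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Fit with all four points, then refit without the high-leverage point.
slope_all, intercept_all = fit_line(x, y)
slope_wo, intercept_wo = fit_line(x[:-1], y[:-1])

# Residuals of every point relative to the line fitted on all four points.
residuals_all = y - (slope_all * x + intercept_all)
residual_far = residuals_all[-1]

print(f"slope with all points:        {slope_all:.3f}")
print(f"slope without the far point:  {slope_wo:.3f}")
print(f"residuals from the full fit:  {np.round(residuals_all, 3)}")
```

With these made-up numbers, dropping the single far point changes the slope substantially, which is what being influential means. How large the change has to be before you call a point influential is a judgment call.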
Whether or not a point is influential can only be determined by refitting the regression line without that point. For such an analysis, the residual plot turns out to be not very helpful. The reason is that an influential point may have a residual which is quite small, so it doesn't stand out in the residual plot. The residual is small precisely because the point is influential: it pulls the regression line towards itself. In fact, in this example, you see that the residual is quite small.

Here are some other issues that you should keep in mind when doing a regression. Remember, the main purpose of regression is to predict. Predictions should not be made at x-values outside the range of the x-values that were used for the regression. The reason is that oftentimes the linear relationship only holds over a certain range, and we have no reason to believe that it holds outside the range of the x-values we observed.

Sometimes the data that are given to you come as summaries, such as averages of other data. Those summaries are less variable than individual observations. As a consequence, correlations computed from summaries tend to overstate the strength of the relationship.

Finally, most regression analyses report a number called R-squared. This is simply the square of the correlation coefficient. The interpretation of R-squared is that it gives the fraction of the variation in the y-values that is explained by the regression line. So 1 - R-squared is the fraction of the variation that is left over in the residuals. A higher R-squared means that the regression line does a good job of explaining the variation in the y-values.
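The two descriptions of R-squared, as the square of the correlation coefficient and as the explained fraction of the variation in y, can be checked numerically. The following is a minimal sketch in Python with NumPy. The data are made up for illustration, and the variable names are my own.

```python
import numpy as np

# Hypothetical data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# Least-squares regression line and its residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Variation left in the residuals versus total variation in y.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

# Fraction of the variation explained by the line.
r_squared = 1.0 - ss_res / ss_tot

print(f"r squared (from correlation):   {r ** 2:.6f}")
print(f"R-squared (from residuals):     {r_squared:.6f}")
print(f"leftover fraction, 1 - R^2:     {ss_res / ss_tot:.6f}")
```

For a simple regression of y on a single x with an intercept, the first two printed numbers agree up to rounding error, and the third is the share of the variation that remains in the residuals.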