In this video, we're going to talk about the correlation between two numerical variables. We're going to define what correlation means, and we're going to go through its properties. Here we have a scatter plot of poverty rate versus high school graduation rate. These data are from 2012, and in 2012 the poverty line in the US was defined as having an income below $23,050 for a family of 4. The response variable here is the percentage living in poverty. Note that this is the variable on the y axis. The explanatory variable is the percentage of high school graduates, or the high school graduation rate. The relationship between these variables is linear, negative, and moderately strong.
When we discuss the relationship between two numerical variables, we always talk about the form, where usually we worry about whether it is linear or nonlinear; the direction, whether it is negative or positive; and the strength, going from very weak to extremely strong. One measure of the strength of the association between two numerical variables is correlation. In fact, correlation specifically describes the strength of the linear association between two variables. The key word here is linear: we only measure the linear association using correlation. We denote correlation with R.
Next, we're going to go through the properties of the correlation coefficient. First, the magnitude, or in other words the absolute value, of the correlation coefficient measures the strength of the linear association between two numerical variables. Here we can see three scatter plots, and the strength of the association is going from pretty strong to very weak. For the first plot, the correlation coefficient is 0.97. For the second one it's 0.69, and for the third one it's 0.07. We can see that the higher the magnitude, the stronger the association between the two variables.
Two, the sign of the correlation coefficient indicates the direction of association.
So here we have two scatter plots, one with a positive association where the correlation coefficient is 0.98, and one with a negative association where the correlation coefficient is negative 0.96.
Three, the correlation coefficient is always between -1, which is a perfect negative linear association, and positive 1, which is a perfect positive linear association. A correlation coefficient of 0 indicates no linear relationship. Here we have three scatter plots again. The first one shows a perfect positive linear association. The second one shows a perfect negative linear association. And in the third one the correlation coefficient is 0: as x increases, nothing is happening to y. Therefore there's no linear relationship between these two variables.
Four, the correlation coefficient is unitless and is not affected by changes in the center or scale of either variable, such as unit conversions. So here we have two scatter plots. The data come from mammals: on the y axis we have total number of hours of sleep, and on the x axis we have the body weight of these mammals. The first scatter plot shows the body weight in kilograms and the second scatter plot shows the body weight in pounds. And remember that one kilogram is roughly 2.2 pounds. We can see that the shape of the relationship between the two variables looks very similar. In fact, without the axis labels it would be very difficult to tell the difference between these two plots. For both of these plots the correlation coefficient is -0.34. Now one might ask, is it even appropriate to calculate a correlation coefficient here? Because the relationship between these two variables does not appear to be linear, and that's a very good point. We probably wouldn't want to claim a linear relationship between these two variables. But to demonstrate the fact that changing the units does not affect the correlation coefficient, these plots still serve a purpose.
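To make the unit-conversion point concrete, here is a minimal Python sketch. The numbers are made up for illustration and are not the mammal data from the plots, and pearson_r is a hypothetical helper that uses the standard formula for the sample correlation coefficient, which this video does not present.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values for illustration only (assumption, not the data in the video).
weight_kg = [0.5, 2.0, 10.0, 35.0, 60.0, 250.0]
sleep_hours = [14.0, 12.5, 10.0, 9.0, 8.0, 4.0]

# Change of scale: kilograms to pounds (1 kg is roughly 2.2 lb).
weight_lb = [w * 2.2 for w in weight_kg]
# Change of center: add a constant to every weight.
weight_shifted = [w + 100.0 for w in weight_kg]

r_kg = pearson_r(weight_kg, sleep_hours)
r_lb = pearson_r(weight_lb, sleep_hours)
r_shifted = pearson_r(weight_shifted, sleep_hours)

print(r_kg, r_lb, r_shifted)
```

All three printed values should agree up to floating-point rounding, because rescaling or shifting a variable rescales or leaves unchanged both the numerator and the denominator by the same factor.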
Five, the correlation of X with Y is the same as that of Y with X. What this means is that even if you swap the axes, your correlation coefficient stays the same. In the first plot, we're looking at the total number of hours of sleep on the y axis versus the life span of these mammals on the x axis. For the second plot, we simply swap the variables. So now the response variable is life span and the explanatory variable is the total number of hours of sleep. For both of these, the correlation coefficient is -0.38. So changing the variables around does not affect the correlation coefficient.
Six, the correlation coefficient is sensitive to outliers. Here we have two scatter plots again. In the first one there is no outlier; we have extended the x axis all the way to 50 so that we can do a comparison. In the second one there is one stray point that's an outlier. The first plot shows a pretty strong relationship between the two variables, and in fact the correlation coefficient is close to perfect: it's roughly 0.98. In the second plot we've simply moved one of the data points from the original data set further away from the rest of the cloud, and the correlation coefficient has gone down to 0.68. So we can see that moving even one data point to be an outlier will affect the correlation coefficient greatly, because it is so sensitive to outliers.
Let's take a look at this practice question. Which of the following is the best guess for the correlation between percentage living in poverty and percentage of high school graduates? Note that we haven't provided a formula for the correlation coefficient. There is one, of course, and you will get to use computation to calculate the correlation coefficient, but there's absolutely no reason in this day and age to try to calculate it by hand.
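As a rough illustration of properties five and six, the sketch below computes the correlation in Python rather than by hand. The data are invented for the example (they are not the points shown in the plots), and pearson_r is a hypothetical helper that implements the standard sample correlation formula, which this video does not state.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented, nearly linear data (assumption for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.1, 18.0, 19.9]

# Property five: swapping the roles of the two variables changes nothing.
r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)

# Property six: move only the last point far away from the rest of the cloud.
xs_outlier = xs[:-1] + [45.0]
ys_outlier = ys[:-1] + [30.0]
r_clean = r_xy
r_outlier = pearson_r(xs_outlier, ys_outlier)

print(r_xy, r_yx)
print(r_clean, r_outlier)
```

The first line of output should show two equal values, and the second should show the coefficient dropping once the single point is moved, even though the other nine points are untouched.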
However, given a bunch of choices, we should be able to pinpoint which of the following sounds like a reasonable guesstimate for the correlation between these two variables. First off, we can get rid of 1.5 and -1.5 right off the bat, because we know that the correlation coefficient can only be between negative 1 and positive 1. We also see that the relationship between these two variables is negative. Therefore, any positive correlation coefficient doesn't make sense here. So next we need to choose between negative 0.75 and negative 0.1. Note that negative 0.75 is much closer to negative 1, meaning that it indicates a much stronger relationship. So the question becomes, do we see a strong relationship here, or a pretty weak relationship? Sometimes it helps to look at the negative spaces on our plot, the regions where there are no points. If we were to block those off, it would be a little easier to see that there is indeed a somewhat strong relationship between these two variables, even with all the scatter around the line. Therefore, the correct answer here is -0.75. A correlation coefficient of negative 0.1 would look much more like a random scatter that takes up the entire plot, without leaving any negative spaces for us to get rid of so that we can better see the linear relationship.
Another practice question. Which of the following has the strongest correlation? In other words, for which plot is the correlation coefficient closest to positive 1 or -1? The first plot shows a very strong relationship, but the relationship is not linear. Remember that we can determine the strength of the relationship by looking at how much scatter there is in the data. Because there's very little scatter here, we can see that the relationship is strong.
But once again, we wouldn't expect the correlation coefficient to be very close to positive or negative 1, because the relationship is not linear, and the correlation coefficient measures the strength of the linear relationship. The last plot shows a weaker relationship, and the second to last plot shows an even weaker relationship. So the strongest linear relationship here is option b. This one should be the scatter plot with the correlation coefficient closest to positive 1, since the relationship in this case is positive.
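The point that a strong but nonlinear relationship can have a correlation coefficient far from positive or negative 1 can be checked with a short Python sketch. The two data sets are invented for illustration and are not the plots from the practice question, and pearson_r is a hypothetical helper based on the standard sample correlation formula.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [float(x) for x in range(11)]  # 0, 1, ..., 10

# A curve with no scatter at all: y is completely determined by x,
# but the relationship is U-shaped rather than linear.
ys_curve = [(x - 5.0) ** 2 for x in xs]

# A straight-line trend with some hand-picked scatter (made-up noise values).
noise = [0.8, -1.1, 0.5, 1.3, -0.9, 0.2, -1.4, 1.0, -0.6, 0.7, -0.3]
ys_line = [2.0 * x + e for x, e in zip(xs, noise)]

r_curve = pearson_r(xs, ys_curve)
r_line = pearson_r(xs, ys_line)

print(r_curve, r_line)
```

In this made-up example the curved data have the stronger relationship in the sense of having no scatter, yet the linear data have the coefficient much closer to positive 1, which is the reasoning used to rule out the first plot.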