Now that we have a good grasp of measuring centers of distributions, let's talk a bit about measures of spread, in other words, statistics that tell us about the variability in the data. Take a look at these two curves with the same center. The skinny blue curve is less variable than the wider green curve, since the data in the blue curve are clustered closer to the center, while the data in the green curve are spread further away from it.

One measure of spread is the range, which is simply the difference between the maximum and the minimum values in the data. While it's easy to calculate, the range is not a very reliable measure of the variability of a sample, since it depends on the two most extreme values, the endpoints of the distribution. More reliable indicators of spread measure how close or far the bulk of the data lie from the center of the distribution. The most commonly used such measures of spread are the variance, the standard deviation, and the interquartile range.

The variance is, roughly, the average squared deviation from the mean. We denote the sample variance as s squared and the population variance as sigma squared. To calculate the variance, we first find the difference between the mean and each observation, in other words, the deviation from the mean for each observation. We square each of these deviations and add them up for all observations in the data set. Then we find the average squared deviation by dividing this sum by the sample size minus one, n - 1. We'll talk about why we divide by n - 1 instead of n a little later.

Let's take a look at this example. Given that the average life expectancy is 70.5 years and there are 201 countries in the data set, we can calculate the variance. The first country is 60.3 minus 70.5 away from the mean, and we square this difference. Then we add the squared deviation from the mean for the next country, and the next, and the next, until we reach the last one. Finally, we divide by the sample size minus one, and the variance comes out to be 83.06 years squared.

The units of the variance are the square of the units of the original data, since we squared the deviations from the mean in the calculation. This is actually somewhat annoying, since the result, 83.06 years squared, is not very meaningful.

So why do we square the differences in the calculation of the variance? First, it allows us to get rid of negatives; otherwise, when added together, the negative and the positive deviations would cancel each other. For example, in a symmetric distribution centered at zero, negative 2 and 2 are equally distant from the mean, and if we were to simply add the deviations from the mean, they would cancel each other out. If we first square the deviations, both become positive. But we could also use the absolute value to get rid of negatives, so why didn't we do that here? Squaring the deviations also serves the purpose of increasing larger deviations more than smaller ones. Think about it: negative 2 squared is 4, while 3 squared is 9 and 4 squared is 16. This way, larger deviations are weighed more heavily.

However useful it is to square these deviations, at the end of the day we would still prefer a measure of variability that has the same units as the observed data, and therefore we turn to the standard deviation. This is basically the average deviation around the mean, calculated as the square root of the variance. Once again, we use Latin letters for samples and Greek letters for populations: s for the sample standard deviation and sigma for the population standard deviation. The formula is simple; it's just the square root of the variance.
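As a quick sketch of this calculation in code, here is a minimal Python example. The five life expectancy values are made up for illustration (the full 201-country data set isn't reproduced here), but the steps follow the description above:

```python
import math
import statistics

# A small made-up sample of life expectancies (years)
data = [60.3, 68.9, 72.4, 75.1, 79.8]
n = len(data)

# Step 1: the sample mean
mean = sum(data) / n

# Step 2: deviation from the mean for each observation,
# squared and summed over the whole data set
sum_sq_dev = sum((x - mean) ** 2 for x in data)

# Step 3: divide by n - 1 to get the sample variance, s squared
# (units: years squared)
variance = sum_sq_dev / (n - 1)

# The standard deviation, s, is the square root of the variance,
# back in the original units (years)
std_dev = math.sqrt(variance)

# The standard library agrees: statistics.variance and
# statistics.stdev also divide by n - 1
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))

print(f"s^2 = {variance:.2f} years^2, s = {std_dev:.2f} years")
```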
Calculating the variance and standard deviation by hand for reasonably sized data sets is tedious and prone to errors, so we usually use computation for these tasks. However, understanding variability is essential for doing statistics.

A concept that's often confused with variability is diversity. Let's take a look at this question. Which of the following sets of cars has a more diverse composition of colors: Set 1, with yellow, red, green, purple, and blue cars, or Set 2, with three blue and two purple cars? The answer is Set 1; with each car having a different color, the diversity is higher.

Now let's ask a different question. Which of the following sets of cars has more variable mileage: Set 1, with cars that get 10, 20, 30, 40, and 50 miles per gallon, or Set 2, with three cars that get 10 miles per gallon and two cars that get 50 miles per gallon? This time, the answer is actually Set 2. Remember, distributions where more observations are clustered around the center are less variable, while distributions where more observations are away from the center are more variable. We can take a look at dot plots of these distributions to make that point a little more clear. In the first set, the average gas mileage is 30 miles per gallon and the values range from 10 to 50, but there is one observation at the mean and two others closer to the mean than to the endpoints. In the second set, the average gas mileage is 26 miles per gallon and the values also range from 10 to 50, but there are no observations at or near the mean. Therefore, the average deviation from the mean is higher for this set.

One last measure of spread that we'll discuss is the interquartile range, which is the range of the middle 50% of the data. It can be calculated as the difference between the third and the first quartiles, that is, the 75th percentile minus the 25th percentile. This measure is most readily available in a box plot. Let's take a look back at the box plot of life expectancies. The first quartile, also denoted Q1, is 65, and the third quartile, Q3, is 77, so the IQR is the difference between these two, 12. In describing the distribution of life expectancies, we can say that the middle 50% of countries have life expectancies between 65 and 77 years. The value of the IQR itself, 12, isn't very informative on its own, but it is useful when comparing distributions. The reason the IQR is a more reliable measure of spread in sample data than the range (maximum minus minimum) is that it doesn't rely on the endpoints, which may be unusual observations or potential outliers.
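To make these comparisons concrete, here is a minimal Python sketch using the mileage numbers above. Note that the exact quartile values reported by software depend on the interpolation convention it uses, so they may differ slightly from values read off a box plot:

```python
import statistics

# Gas mileage (miles per gallon) for the two sets of cars above
set1 = [10, 20, 30, 40, 50]  # spread evenly around the mean of 30
set2 = [10, 10, 10, 50, 50]  # clustered at the endpoints, mean of 26

# Sample standard deviations: Set 2 is the more variable one
print(statistics.stdev(set1))  # ~15.81 mpg
print(statistics.stdev(set2))  # ~21.91 mpg

def iqr(data):
    """Interquartile range: Q3 - Q1, the range of the middle 50% of the data."""
    # statistics.quantiles with n=4 returns the three cut points
    # Q1 (25th percentile), median, and Q3 (75th percentile)
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

print(iqr(set1))
print(iqr(set2))
```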