In the last section, we showed that if we were willing to assume normality, based on evidence such as the visual display of a distribution in a histogram, that the sample data, and hence the population, were approximately normally distributed, we could use the mean and standard deviation coupled with the properties of a normal distribution to estimate percentiles and cutoffs. What we're going to see in this section is that those rules only work well when our data come from a normally distributed population, and hence our sample data also approximate a normal distribution. When we don't have that situation, we're going to see some strange results. The reason I'm pointing this out as a warning is that sometimes in statistics classes people get a set of rules stuck in their head but don't think about the conditions under which they apply, and mistakenly apply some of the things we've done, like creating the interval for the middle 95 percent by taking plus or minus two standard deviations, when doing so can yield unrealistic or even impossible results. So, upon completion of this lecture section, you will be able to describe situations in which using only the mean and standard deviation of a distribution of values to characterize the entire distribution will not work well. I want you to recall that z-scores are just a standardized measure of distance; they're nothing special. They become useful and special with approximately normally distributed data, but when data are not approximately normally distributed, the underlying z-scores do not necessarily align with the corresponding percentiles from a normal distribution. So, I'll get you thinking about why the approach we've used can fail, and about what the right approach is to estimating ranges and percentiles for non-normal data.
So again, the normal distribution is a theoretical probability distribution, and no real data are perfectly described by it. We certainly saw some examples of data that were approximately symmetric and bell-shaped, but, if only for one reason, real data can never be truly normally distributed: in a true normal distribution, the range of data values is infinite. Even though the majority of the values under a normal distribution fall within plus or minus three standard deviations, theoretically there is a very low proportion of values that extends out to positive infinity on one side and negative infinity on the other. So, we've seen examples where the distribution of some data is well approximated by a normal distribution; in such situations, we can use the properties of the normal curve to characterize aspects of the data distribution, like we did in the last section. But the distribution of much data will not be well approximated by a normal distribution. In such situations, using the properties of the normal curve to characterize aspects of the data distribution can yield invalid and even nonsensical results. So, let's start with an example we're well familiar with, where we know that the data do not approximate a normal distribution. This is the sample of 12,928 length of stay values, in 2011, for patients in the Heritage Health System. As you've seen from lecture set two, we've looked at this and concluded that these data have a heavy right skew: the majority of values are small, and the less frequent extremes are much larger. Hence we get that right skew, and we see evidence of this in the fact that the sample mean is much larger than the sample median, because the mean is being affected by those large outlying values.
If I were to superimpose a normal curve on this histogram, with the same mean and standard deviation as these 12,928 values, a mean of 4.4 and a standard deviation of 4.7, you can see it's not a very good fit. In fact, a lot of the theoretical normal distribution falls where no data were observed. So now, let's ignore this evidence of right skewness and say, look, we've only got a sample mean and standard deviation, let's just assume normality. Pretend we didn't think about things and look at the data beforehand, and let's estimate the 2.5th and 97.5th percentiles for length of stay in this population using only the mean and standard deviation from our sample of observations. If we pretend the data were roughly normally distributed in the population from which the sample was taken, we could do this by taking our sample mean and adding and subtracting two standard deviations. To get the 2.5th percentile, we would take the sample mean of 4.4 days and subtract roughly two standard deviations, where one standard deviation is 4.7 days; we get a lower bound of negative five days. On the upper end, for the 97.5th percentile, taking that mean of 4.4 and adding two standard deviations of 4.7 each, we get 13.8 days. So, based on this sample data, if we erroneously assumed normality or approximate normality, we would estimate that most of the persons making claims in this health care population had lengths of stay between negative five and 13.8 days in 2011. Certainly that doesn't make any sense; we can't have a length of stay that's negative.
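As a quick sketch of the arithmetic above, using the quoted sample mean of 4.4 days and standard deviation of 4.7 days, the naive normal-based interval is just the mean plus or minus two standard deviations:

```python
# Naive "middle 95%" interval that (wrongly) assumes normality:
# sample mean plus or minus two standard deviations.
mean_los = 4.4  # sample mean length of stay, in days
sd_los = 4.7    # sample standard deviation, in days

lower = mean_los - 2 * sd_los  # -5.0 days: impossible for a length of stay
upper = mean_los + 2 * sd_los  # 13.8 days

print(round(lower, 1), round(upper, 1))
```

The negative lower bound that pops out is exactly the red flag discussed above: the arithmetic is fine, but the normality assumption behind it is not.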
So, you might say, well John, length of stay has to be positive; we know from the histogram that the minimum length of stay was one day, so why don't we just truncate this interval at one and take an interval from one to 13.8 days? That would certainly fix the interval so that it didn't include impossible values, but it doesn't correct things on the other end, because if we look at the empirical 2.5th and 97.5th percentiles of these over 12,000 sample values, in other words, use the computer to order them from smallest to largest, pick off the value that's greater than or equal to 2.5 percent of the values, and also the 97.5th percentile, which is exceeded by only 2.5 percent of the values, we get a range from one day to 21 days. So, even if we truncated the previous interval at one and got the proper lower bound, we would still really underestimate the upper bound. The empirical percentiles are the only way to get a proper representation of the middle 95 percent of the values when we have non-normal data; we can't just use the mean and standard deviation. So, in this example, using the properties of the normal curve to estimate an interval containing the middle 95 percent of length of stay values for the claims population yields useless results. It is better to take the observed 2.5th and 97.5th percentiles of the sample data and report these as an estimate of the middle 95 percent. Based on this sample, we estimate that most, 95 percent, of the persons making claims in this health care population had lengths of stay between one and 21 days in 2011. Now suppose we wish to use these data to estimate the proportion of the claims population with total length of stay greater than five days. We're trying to plan for the future and get an estimate of the percentage of persons who would have longer lengths of stay, where we consider longer to be greater than five days.
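We don't have the raw claims data here, but the empirical-percentile recipe described above (sort the values, then pick off the ones at the 2.5 and 97.5 percent positions) can be sketched on any sample. Below I simulate a right-skewed stand-in with lognormal draws; the distribution and its parameters are illustrative assumptions, not the actual Heritage data:

```python
import random

random.seed(1)
# Illustrative right-skewed sample (NOT the actual Heritage data):
# lognormal draws, rounded and floored at one day.
sample = [max(1, round(random.lognormvariate(1.1, 0.8))) for _ in range(12928)]

def empirical_percentile(values, p):
    """Return the observed value at the p-th percentile (0-100) of the data."""
    ordered = sorted(values)
    # index of the smallest value that is >= p percent of the data
    k = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[k]

low = empirical_percentile(sample, 2.5)    # empirical 2.5th percentile
high = empirical_percentile(sample, 97.5)  # empirical 97.5th percentile
print(low, high)
```

Unlike the mean-plus-or-minus-two-standard-deviations interval, both endpoints here are actual observed values, so they can never be impossible (for example, negative) lengths of stay.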
If we translate this measurement of five days into units of standard deviation, like we might be tempted to do because that's how we did it in the previous section, we can find where five days falls relative to the sample mean length of stay in terms of standard deviations. To do this, we'll first find what we call the z-score. There's nothing magical about a z-score; we can compute one for any type of distribution. We're just measuring how far an observation is from the mean of the distribution in units of standard deviation. So, we take our cutoff of five days, the value above which we want the percentage of observations, and subtract the mean, giving a raw distance of 0.6 days. But of course, we can't determine where 0.6 days falls relative to the other observations unless we standardize it by the standard deviation. Even then we're going to have problems, because our data are not approximately normally distributed. If we do this, we get a measurement of approximately 0.13, a little over a tenth of a standard deviation above the mean of this distribution. I'll let you verify if you wish, or you can just take my word for it, but the probability of getting a result greater than roughly 0.13 standard deviations above the mean of a normal distribution is about 0.45, or 45 percent. So, if we took this approach, we'd estimate that almost half of the persons in our population had lengths of stay greater than five days. But again, we would be applying properties of the normal distribution to data that were decidedly skewed and not roughly symmetric and bell-shaped. If instead we look at some empirical percentiles of the sample data, we actually see, going down here, that the 75th percentile is five days. Based on this chart, we could dig down a little further and get more specific percentiles, like the 74th, and see whether that was five or four.
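The z-score arithmetic and the normal upper-tail probability quoted above can be checked with the standard normal CDF, which the Python standard library lets us build from `math.erf`. This is a sketch using the quoted sample mean and standard deviation:

```python
from math import erf, sqrt

mean_los, sd_los = 4.4, 4.7  # quoted sample summaries, in days
cutoff = 5.0                 # "longer stay" threshold, in days

# z-score: distance from the mean in units of standard deviation
z = (cutoff - mean_los) / sd_los  # about 0.13

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z, via the error function."""
    return 0.5 * (1 - erf(z / sqrt(2)))

# Tail probability IF the data were normal (they are not, which is the point)
p = normal_upper_tail(z)  # about 0.45
print(round(z, 2), round(p, 2))
```

The 0.45 here is only valid under normality; the skewed data's actual tail proportion, read off the empirical percentiles, is about 0.25.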
But just based on this chart, we estimate that approximately 25 percent of the observations have a length of stay greater than five days, not the 45 percent we would have estimated by improperly ascribing the properties of the normal distribution, in terms of distance from the mean and the percentage of observations falling under a portion of the curve. So, based on this analysis, we estimate that about 25 percent of the claims had a total length of stay greater than five days, and this properly estimated percentage is a lot smaller than the estimated 45 percent we got using just the mean and standard deviation. Again, length of stay data are right skewed, and so the proportions that fall within given numbers of standard deviations of the mean are not comparable to what we would find with an approximately normal, roughly bell-shaped distribution. Let's look at one more example where we have a skewed distribution in our sample, as evidence of a skewed distribution in the population from which the sample was taken. Here we have CD4 counts for a random sample of 1,000 HIV-positive patients from a citywide clinical population. You can see the mean of the sample is 280 cells per cubic millimeter, and the median comes in smaller, at 249 cells. So we have evidence of a right skew, perhaps not as extreme as in the length of stay data, but a right tail nonetheless: the majority of the values are on the smaller side, and the extremes are larger in the positive direction. If we used only the sample mean and standard deviation and incorrectly assumed normality, we could estimate the 2.5th and 97.5th percentiles of CD4 counts in this population: the 2.5th percentile estimated by taking the mean minus two standard deviations, and the 97.5th percentile by the mean plus two standard deviations.
If we did that, we would estimate that most, 95 percent, of the population of HIV-positive persons had CD4 counts between negative 116 and 676 cells per cubic millimeter. This doesn't make a lot of sense, because CD4 counts cannot be negative; that's a huge red flag. Since we have access to these 1,000 data points, we can get the actual observed 2.5th and 97.5th percentiles, and these are 11 cells per cubic millimeter on the low end and 722 on the high end. So we've got something logical that corresponds to the range in our observed data, and we would certainly have done better on the lower end. You might say, well, why not just truncate the normal-based interval at one or two or something like that? Even if we did, we might not do so badly on the 2.5th percentile, but we'd still underestimate the 97.5th percentile, which comes in at around 722. So, in summary, while sample means and standard deviations are useful summary measures regardless of the data for which they're computed, they help us understand the center and spread, along with the median, but they don't necessarily tell us more than that. These two quantities do not always characterize the entire data distribution; that works well only when the data are approximately normally distributed. For skewed distributions and others that are not approximately normally distributed, using only the mean and standard deviation to characterize the entire underlying distribution can at best yield incorrect results, and at worst nonsensical results, like negative lengths of stay.
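The CD4 arithmetic works the same way. The standard deviation isn't quoted directly above, but the stated bounds imply it is about 198 cells per cubic millimeter (inferred from (280 − (−116)) / 2; treat that value as an assumption). A sketch of the normal-based interval, and why truncation alone can't rescue it:

```python
mean_cd4 = 280  # sample mean CD4 count, cells per cubic millimeter
sd_cd4 = 198    # assumed: implied by the quoted bounds, (280 - (-116)) / 2

lower = mean_cd4 - 2 * sd_cd4  # -116: impossible, CD4 counts can't be negative
upper = mean_cd4 + 2 * sd_cd4  # 676: well short of the observed 97.5th percentile

# Observed empirical 2.5th and 97.5th percentiles quoted in the lecture
observed_low, observed_high = 11, 722

print(lower, upper)
```

Truncating the lower bound at zero would hide the problem on one end, but the normal-based upper bound of 676 still understates the empirical 722, which is why the empirical percentiles are the right tool here.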