So, in this example,

using the properties of the normal curve to estimate an interval containing

the middle 95 percent of length of stay values for

the claims population data yields useless results.

Better to take the observed 2.5th and 97.5th percentiles

of the sample data and report these as an estimate of the middle 95%.

Based on this sample,

we estimate that most,

95 percent of the persons making claims in this health care population,

had length of stay between one and 21 days in 2011.
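
As a rough sketch of that percentile approach, the snippet below reads the 2.5th and 97.5th percentiles straight off a sorted sample. The `stays` list is a small made-up sample, not the claims data, and `nearest_rank_percentile` is a hypothetical helper. It uses the nearest-rank convention, which is only one of several ways software computes percentiles.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank rule."""
    ordered = sorted(values)
    # Nearest rank: smallest value with at least p percent of the data at or below it.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def middle_95(values):
    """Empirical interval holding roughly the middle 95 percent of the sample."""
    return (nearest_rank_percentile(values, 2.5),
            nearest_rank_percentile(values, 97.5))

# Hypothetical right-skewed lengths of stay in days (illustration only).
stays = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 8, 10, 12, 15, 21, 30, 45]
low, high = middle_95(stays)
print(low, high)
```

Because the bounds are actual observed values, they can never fall outside the range of the data, which is the point of using them for skewed measurements.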

Suppose we wish to use this data to estimate the proportion of

the claims population with total length of stay of greater than five days.

So, we're trying to plan for the future and get an estimate on

the percentage or proportion of persons who would have longer lengths of stay.

We consider longer to be greater than five days.

So, if we translate this measurement of five days to units of standard deviation,

as we might be tempted to do because that's how

we did it in the previous section,

we can find where five days falls relative to the sample mean

length of stay, in terms of standard deviations.

So, to do this,

we'll first find what we call the z-score,

there's nothing magical about a z-score,

we can do this for any type of distribution,

we're just measuring how far an observation is

from the mean of the distribution in units of standard deviation.

So, if we do this, we take our observation of five days, our cutoff,

since we're looking at the percentage greater than that,

and we subtract the mean,

so we get a raw distance of 0.6 days. But of course,

we can't determine where 0.6 days

falls relative to the other observations in the curve

unless we standardize it by the standard deviation.

Even then we're going to have problems because we have data

that's not approximately normally distributed.

So, if we do this we get a measurement that's approximately

0.12, or a little over a tenth of

a standard deviation above the mean of this distribution.

I'll let you verify that if you wish, or you can just take my word for it.

But the probability of getting a result that is

greater than 0.12 standard deviations above

the mean of a normal distribution is 0.45 or 45 percent.
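
For anyone who does want to verify it, here is a minimal sketch of the calculation. The mean of 4.4 days and standard deviation of 5.0 days are not stated in this passage. They are assumptions back-calculated from the 0.6-day raw distance and the z-score of about 0.12. The function names are hypothetical.

```python
import math

def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / sd

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

cutoff = 5.0  # days
mean = 4.4    # assumed: implied by the 0.6-day raw distance
sd = 5.0      # assumed: implied by 0.6 / 0.12

z = z_score(cutoff, mean, sd)
tail = normal_upper_tail(z)
print(round(z, 2), round(tail, 2))
```

Under those assumed inputs, this prints roughly 0.12 and 0.45. That matches the numbers quoted here, but the 45 percent is only meaningful if the data really are approximately normal.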

So, if we took this approach,

we'd estimate that almost half of the persons in

our population had lengths of stay greater than five days.

But again, we were applying properties of the normal distribution to make

this computation on data that were

decidedly skewed and not roughly symmetric and bell-shaped.

So, if we look at some empirical percentiles of the sample data,

we actually see, going down here,

that the 75th percentile is five days.

So based on this chart,

we could dig down a little bit further and get more specific percentiles like

the 74th and see if that was five or if that was four.

But just based on this chart,

we estimate that approximately 25% of

the observations have length of stay of greater than five days,

and not the 45 percent that we would have estimated

by improperly ascribing the properties of

the normal distribution in terms of distance from the mean

and area or percentage of observations falling under a portion of the curve.

So, based on this analysis,

we estimate that about 25 percent of

the claims had total length of stay greater than five days,

and this properly estimated percentage is a lot

smaller than the estimated 45 percent we got using just the mean and standard deviation.

Because again, length of stay data are right skewed,

and so the proportions that fall within units of standard deviation from

the mean are not comparable to what we'd find if the data were

approximately normal, with a roughly bell-shaped curve.
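
The empirical alternative is just a count: what share of the observed values exceed the cutoff? A minimal sketch follows. The `stays` list is hypothetical data made up so that a quarter of the values exceed five days, and `proportion_above` is an illustrative helper, not something from the lecture.

```python
def proportion_above(values, cutoff):
    """Fraction of observations strictly greater than the cutoff."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(1 for v in values if v > cutoff) / len(values)

# Hypothetical right-skewed lengths of stay in days (illustration only).
stays = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 8, 11, 15, 21]
share = proportion_above(stays, 5)
print(share)
```

No distributional assumption goes into this number, so skew in the data cannot distort it the way it distorts the z-score calculation.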

So, let's look at one more example where we have a skewed distribution in

our sample as evidence of

a skewed distribution in the population from which the sample was taken.

So, here we have CD4 counts for a random sample of

1,000 HIV positive patients from a citywide clinical population.

You can see, the mean of the sample is

280 cells and the median comes in smaller, at 249 cells.

We have evidence, perhaps not as extreme as with the length of stay data,

but we have evidence of a right skew here:

a right tail, where the majority of the values are on the smaller side,

and the extremes are larger in the positive direction.

So, if we used only the sample mean and

standard deviation and incorrectly assumed normality,

we could estimate the 97.5th and 2.5th percentiles of

CD4 counts in this population by using just the mean and standard deviation.

Again we'd say, well the 2.5th percentile could be

estimated by taking the mean minus two standard deviations,

and the 97.5th percentile could be estimated by the mean plus two standard deviations.

If we did that, we'd estimate that most,

95 percent of the population of HIV positive persons, had CD4 counts

between negative 116 and 676 cells per cubic millimeter.

This doesn't make a lot of sense because CD4 counts cannot be negative.

So that's a huge red flag.
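
The arithmetic behind that interval can be sketched in a few lines. The sample mean of 280 comes from the slide. The standard deviation of 198 is an assumption, back-calculated from the reported interval of negative 116 to 676. `two_sd_interval` is a hypothetical helper.

```python
def two_sd_interval(mean, sd):
    """Normal-theory estimate of the middle 95 percent: mean +/- 2 SD."""
    return (mean - 2 * sd, mean + 2 * sd)

mean_cd4 = 280  # sample mean, cells per cubic millimeter
sd_cd4 = 198    # assumed: back-calculated from the -116 to 676 interval

low, high = two_sd_interval(mean_cd4, sd_cd4)

# A count cannot be negative, so a negative bound signals that the
# normality assumption does not fit these data.
impossible = low < 0
print(low, high, impossible)
```

A check like `low < 0` is a cheap sanity test for any measurement that cannot go below zero.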

Since we had access to the sample data points,

we could get the actual observed 2.5th and 97.5th percentiles from the 1,000 data points,

and these are 11 on the low end and 722 cells per cubic millimeter on the high end.

So, again, we've certainly got something logical

that corresponds to the range in our observed data, with a lower bound of 11,

and so we would have done better certainly on the lower end.

You might say, well,

why didn't you just truncate this interval here at one or two or something like that?

Well, even if we did that,

we might not do so badly on the 2.5th percentile, but we'd

still underestimate the 97.5th percentile,

which comes in at around 722.
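
To try this comparison on data of your own, here is one possible sketch. It uses synthetic right-skewed values (a lognormal draw with a fixed seed), not the CD4 sample, so the printed numbers will not match the ones on the slide. The choice of lognormal and its parameters is purely an assumption for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic right-skewed values; NOT the CD4 sample from the lecture.
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(1000)]

mean = statistics.mean(sample)
median = statistics.median(sample)
sd = statistics.stdev(sample)

# Normal-based interval, with the lower bound truncated at zero.
normal_low = mean - 2 * sd
normal_high = mean + 2 * sd
truncated_low = max(0.0, normal_low)

# Empirical 2.5th and 97.5th percentiles: the first and last of the 39 cut
# points that split the data into 40 equal-probability groups.
cuts = statistics.quantiles(sample, n=40)
emp_low, emp_high = cuts[0], cuts[-1]

print(f"mean={mean:.2f} median={median:.2f}")
print(f"normal-based (truncated): {truncated_low:.2f} to {normal_high:.2f}")
print(f"empirical percentiles:    {emp_low:.2f} to {emp_high:.2f}")
```

With data this skewed, the untruncated lower bound lands below zero, and truncating it only patches that one end. The upper bound still comes from the symmetric mean-plus-two-standard-deviations rule, so it is worth comparing against the empirical 97.5th percentile printed alongside it.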

So, in summary, sample means and standard deviations are

useful summary measures regardless of the data for which they're computed:

they can help us understand the center and spread,

along with the median as well,

but they don't necessarily tell us more than that.

So, these two quantities do not always help characterize the entire data distribution;

this works only when the data are approximately normally distributed.

For skewed distributions and others that are not approximately normally distributed,

using only the mean and standard deviation to characterize

the entire underlying distribution can give, at best, incorrect results,

and at worst nonsensical results,

like negative lengths of stay.