A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From the lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any sample-based estimate from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations will give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions, and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay, now let's do some review exercises, to go over

what we've covered in this lecture set on confidence intervals.

And while these review exercises will involve hand computations, that's just for

practice and thinking about the process by which we do these intervals.

But what's more important is to think about

the interpretation of the intervals once you've computed them.

And so we'll discuss that as well. So let's start.

I'm going to lay out the questions for you then suggest you pause

the video, work on them at your leisure, and when you're ready to

take a stab at comparing your results to mine come back and review

the rest of the video, where I'll show my take on the solutions.

So the first question, suppose

an independent environmental group computes the

gas mileage for a random sample of 100 new models of the

same car, with the same make and model, in order to

make a statement about the gas mileage of this make and model.

And the results on these 100 cars include the following summary statistics.

The sample mean mileage of these 100 cars is 31.4 miles per gallon.

The sample standard deviation, or the variation in the individual

miles per gallon measurements for these 100 cars, is 1.2

miles per gallon, and the sample median is 31.2 miles per gallon.

So first, I'd like you to assume that if we were to look

at a histogram of the gas mileages for

these 100 cars, it would approximate a normal distribution.

Under that assumption, estimate a range of gas mileage for most, roughly

95%, of the cars of this make and model based on the sample results.

And then I'd like you to do the same thing but without

assuming normality.

Then for c, I'd like you to assume the gas mileage again is normally

distributed, that is, that these 100 measurements on these 100

individual cars approximate a normal distribution.

Estimate a 95% confidence interval for the mean gas

mileage for all cars that are this make and model.

And then d, without assuming normality of the individual gas

mileage values for the 100 cars estimate a 95% confidence interval for

the mean gas mileage for all cars that are this make and model.

And I would like you to ponder what is the difference in the interpretation of the

intervals created in the first two questions a

and b versus those created in c and d?

For this next question we're going to look at an article that was published in

2007 in the American Journal of Public Health, using data taken from

a 2004 random sample of 960 high school students in Haifa, Israel.

The study looked at the association between post-traumatic stress

induced by terrorist attacks, or threats, and substance abuse.

Two of the findings from this study are that 35% knew at least one person who had been

killed in a terrorist attack, and that 10% of the sample

had used marijuana in the 30 days prior to the study.

So I'd like you to estimate 95% confidence intervals for the

proportion of all high school students in Haifa in 2004 who,

a, knew at least one person killed in a terrorist attack,

and b, who had used marijuana in the prior 30 day period.

Now let's turn our attention back to the article on antiretroviral therapy,

early versus delayed, and HIV transmission in serodiscordant couples.

The total amount of follow up time accrued by couples in this study was 3,250 years.

And there were a total of 39 partner to partner HIV transmissions.

Estimate the incidence rate of partner-to-partner

transmission for this sample, and give the corresponding 95% confidence interval.

And then lastly, I'd like you to confirm my

computations for the 95% confidence interval for the proportion of

persons getting screened for colorectal cancer based on our

study results that we looked at in lecture seven b.

Where the proportion in the usual care group who were screened was 26.3%, and

the sample size was 1166.

And then I'd like you to think about this for a minute.

Suppose the researchers had instead reported the proportion who

did not get screened for colorectal cancer in this group.

What is that proportion?

What is the standard error of that estimated proportion?

How does this 95% confidence interval compare to the 95%

confidence interval for the proportion who did get colorectal screening?

All right so now I suggest you turn off the video,

and get busy on your own and when you're ready and

have pondered these things and proffered some solutions why don't you

turn it back on and see how they compare to mine.

Welcome back, I hope you found this to be a useful and thought-provoking experience.

So this first question was the one about gas mileage.

And here this just lays out the data again.

The sample mean of 31.4 miles per gallon, and the sample standard

deviation of the 100 car mileage estimates of 1.2 miles per gallon.

The first thing I asked you to do is, assuming

the data is normally distributed, estimate a range of gas

mileage for most of the cars of this make and model based on the sample results.

So here we're talking about a range of possibilities for individual

observations, in the population of all cars of this make and model.

We're not talking about a range of possibilities for a summary measure,

but for individual elements of the population, individual cars.

And under the assumption of normality we can estimate this by taking the

sample mean plus or minus 2 sample

standard deviations, variation in the individual values.

And if we do that, 31.4 is the mean

plus or minus 2 times the standard deviation of 1.2.

This gives an interval of 29 to 33.8 miles per gallon.

So the interpretation of this interval, under the assumption of

normally distributed data in the population, and then likely in the sample,

is that most cars of this make and model, roughly 95% of cars of

this make and model, will have

individual gas mileages between 29

miles per gallon and 33.8 miles per gallon.
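The arithmetic for part a can be checked with a few lines of Python. This is only a sketch of the mean plus or minus 2 standard deviations rule described above; the function name `approx_95_range` is made up for illustration, and the result is only meaningful if the individual mileages really are roughly normal.

```python
# Summary statistics quoted in the lecture (100 cars of one make and model).
sample_mean = 31.4  # miles per gallon
sample_sd = 1.2     # miles per gallon, variation in individual cars


def approx_95_range(mean, sd):
    """Rough range for about 95% of INDIVIDUAL values, assuming normality.

    Uses mean +/- 2 standard deviations, as in the lecture.
    """
    return mean - 2 * sd, mean + 2 * sd


low, high = approx_95_range(sample_mean, sample_sd)
print(f"Most individual cars: {low:.1f} to {high:.1f} mpg")
```

Note that the sample size does not appear anywhere in this calculation: the range describes individual cars, not a summary measure.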

Then in question b I asked you not to assume normality of the individual car

gas mileage values, and to go ahead and estimate a range of gas mileages for most

of the cars of this make and model based on the sample results. And this is

a trick question,

because we can't do it without further information.

We can't necessarily assume that taking the mean

plus or minus two sample standard deviations will give us a valid interval

because we don't know whether the data

comes from a normally distributed population or not.

And we don't have enough information to assess that.

Had I given you the 2.5th and 97.5th percentiles of these 100 values, then we

could use those to create such an interval,

but without that information, there's no way to

do this.

So this can't be done with the information given.

So then I said assuming the gas

mileage is normally distributed, estimate a 95% confidence

interval for the mean gas mileage for all cars that are this make and model.

This is sort of again a trick question because to use the Central Limit

Theorem, we're making a statement about the sample mean not individual

values in the population, and we don't require that the sample or

population of individual values be normally distributed.

That's one of the beautiful things about the Central

Limit Theorem: even when our individual values are

not normally distributed, we can still make inference and do

a confidence interval on a sample mean using this approach.

So that is sort of a red herring.

So the way we compute the confidence interval for

the true population mean, mean gas mileage for all

cars of this make and model, is take our sample mean

of 31.4 miles per gallon, and add and subtract 2 standard errors.

Not standard deviations of the individual car mileage

values, but two standard errors, where the standard error

reflects the potential variability in gas mileage means

across multiple random samples based on 100 cars each.

And if you do this,

you get a confidence interval that goes from 31.4 plus or minus 2 times 0.12.

If you do the math you get a confidence interval from 31.16 miles

per gallon to 31.64, and you could certainly round that to 31.2 to 31.6.
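For comparison with part a, here is a minimal Python sketch of the confidence interval calculation just described. The function name `mean_ci_95` is hypothetical; the only change from the part a arithmetic is that the standard deviation is divided by the square root of the sample size to get a standard error.

```python
import math


def mean_ci_95(mean, sd, n):
    """Approximate 95% CI for a population MEAN: mean +/- 2 standard errors."""
    se = sd / math.sqrt(n)  # standard error of the sample mean
    return mean - 2 * se, mean + 2 * se, se


# Values from the lecture: mean 31.4 mpg, SD 1.2 mpg, n = 100 cars.
ci_low, ci_high, se = mean_ci_95(31.4, 1.2, 100)
print(f"SE = {se:.2f}; 95% CI for the mean: {ci_low:.2f} to {ci_high:.2f} mpg")
```

With 100 cars the standard error is one tenth of the standard deviation, which is why this interval is so much narrower than the range for individual cars.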

And part d asks the

same thing without assuming normality. Well,

again, for confidence interval creation, you're

making a statement about the population

mean, so that assumption is ancillary and irrelevant.

So the answer would be exactly the same.

We'd use the exact same approach.

The creation of the confidence interval is not

conditional upon the individual level data being normally distributed.

That's why this Central Limit Theorem result is so powerful,

in that we can do inference and make statements about a

population-level quantity, regardless of the

distribution of the data in this population.

So the answer would be exactly the same as it

was in the previous question, because it's really the same question.
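The claim that normality of the individual values is not needed can be explored with a small simulation, in the spirit of the simulations used in this module. The sketch below is an illustration, not the lecture's own code: it assumes a made-up, strongly right-skewed population (exponential with mean 1 and standard deviation 1), draws many samples of 100, and looks at how the sample means behave.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

N_PER_SAMPLE = 100   # same sample size as the gas mileage example
N_SAMPLES = 2000     # number of simulated studies (an arbitrary choice)


def one_sample_mean(n):
    """Mean of n draws from a skewed population (exponential, mean 1, SD 1)."""
    draws = [random.expovariate(1.0) for _ in range(n)]
    return statistics.fmean(draws)


sample_means = [one_sample_mean(N_PER_SAMPLE) for _ in range(N_SAMPLES)]

center = statistics.fmean(sample_means)   # should be close to the true mean, 1
spread = statistics.stdev(sample_means)   # should be close to 1 / sqrt(100) = 0.1

# Share of sample means within 2 theoretical standard errors of the true mean.
theoretical_se = 1.0 / N_PER_SAMPLE ** 0.5
within = sum(abs(m - 1.0) <= 2 * theoretical_se for m in sample_means) / N_SAMPLES

print(f"average of sample means:  {center:.3f}")
print(f"SD of sample means:       {spread:.3f}")
print(f"share within 2 SE of 1.0: {within:.3f}")
```

Even though the individual values are far from normal, the share of sample means within 2 standard errors of the truth should come out near 95%, which is the property the confidence interval relies on.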

So I just wanted to reel you in and have you think about the

difference between these two types of

intervals, because we've

talked about each of them a lot on their own in their

respective lecture sections, but now we should think about comparing them head on.

The first interval is one that you could only create, given the information,

in part a, where you assumed the data came from a normally distributed

population, and could have created in part b if I had given you the

relevant percentiles. The interpretation of that interval

is a range of individual values in the population.

It's an estimated range for the individual car mileage measures in the population.

It gives an interval

that encapsulates most of those cars; that is,

roughly 95% of the cars have mileages between 29 and 33.8 miles per gallon.

This describes something about the variability in individual

values in a population, using the sample results.

The confidence interval is not making the statement about potential

variability in individual values in the population, it's talking about potential

values for one number that summarizes all individual values in the population.

So the confidence interval is a range of values for the true mean

of the population, a single number summary on all values in the population.

So

then I asked you to look at this study, published in 2007, but with

data from a 2004 sample of 960 high school students in Haifa, Israel.

Let's just jump right in.

So I asked you to estimate a

95% confidence interval for the proportion of students

in Haifa in 2004 who knew at least one person killed in a terrorist attack.

And the sample

proportion we were given was 35%. And if you

actually do this, take the 0.35,

sorry for the messy handwriting, plus or minus

2 estimated standard errors, which would be the

square root of the quantity 0.35 times 1 minus 0.35, or

0.65, divided by the sample size, 960. This would

give an estimated interval of 0.35

plus or minus roughly 0.03, or an

interval of 0.32 to 0.38,

or 32% to 38%. So this

interval takes our best estimate for this proportion, 35%, and adds

in the uncertainty associated with taking an imperfect subsample of this population.
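The proportion calculation follows the same pattern as the mean, with a different standard error. Here is a short Python sketch of it; `proportion_ci_95` is an illustrative name, and the plus or minus 2 standard errors rule is the approximation used in this lecture rather than an exact method.

```python
import math


def proportion_ci_95(p_hat, n):
    """Approximate 95% CI for a proportion: p_hat +/- 2 * sqrt(p_hat * (1 - p_hat) / n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat - 2 * se, p_hat + 2 * se, se


# 35% of 960 sampled students knew someone killed in a terrorist attack.
low, high, se = proportion_ci_95(0.35, 960)
print(f"SE = {se:.4f}; 95% CI: {low:.2f} to {high:.2f}")
```

The standard error comes out near 0.015, so 2 standard errors is roughly the 0.03 quoted above.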

So this is pretty interesting.

So this helps us quantify the burden of terrorist

attacks, or at least familiarity with them, among the student population, and

suggests that between 32% and 38% of students from Haifa,

Israel knew somebody who was killed in a terrorist attack.

That's about a third of the population, in terms of order

of magnitude. So while there's some variability in the

possibilities, it gives us a sense of

the magnitude of this problem in that population at that time.

Similarly, let's look at the proportion who'd used marijuana

in the prior 30-day period, and put confidence limits on that.

So the sample proportion was 10%, so to get a confidence interval

we take that 0.10 plus or minus 2 estimated standard errors, the square root of the 10%

who had used it times the 90% who hadn't, over 960.

And we could go through all the math here, but when the dust

settles, we find a confidence interval of 0.08 to 0.12, or 8% to 12%.

So this helps us quantify the burden if you will,

or degree of marijuana use in this high school population.

On the order of about a tenth of the population, somewhere between

8% and 12%, have used marijuana in the prior 30-day period.

Then we went back to the article on antiretroviral therapy,

early versus delayed, and HIV transmission in serodiscordant couples.

And I said, this sample was a mixed sample

because some were randomized to intensive or early antiretroviral therapy and

others weren't, so you can think of it as a mixed population

in terms of whether they got the accelerated treatment or not.

But the total amount of follow-up time accrued by the couples in the study

was 3,250 years, and there were a total of

39 partner-to-partner HIV transmissions, and I asked you to

estimate the incidence rate of partner to partner transmission

for the sample and the corresponding 95% confidence interval.

If you do this, to get the estimated incidence rate

we take 39 events over

3,250 years of follow-up. So with 39 transmissions,

this would give us an incidence

rate of 0.012 transmissions

per person-year

of follow-up.

And if we wanted to get the confidence interval, we take

our estimated incidence rate, and then add and subtract two estimated

standard errors, which would be the square root of 39 over the person

follow-up time, 3,250 follow-up years. I'll cut to the chase here.

But if you do the math, you get

a confidence interval of 0.008 transmissions per year

to 0.0158 transmissions per year.

So this helps us establish bounds for the actual, real

transmission rate in the population from which this sample is taken.
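The incidence rate arithmetic can be sketched the same way. The function below is a hypothetical helper that applies the approximation described here (standard error equal to the square root of the event count divided by the follow-up time); it is not the exact method used in the published article.

```python
import math


def incidence_rate_ci_95(events, person_time):
    """Approximate 95% CI for an incidence rate: rate +/- 2 * sqrt(events) / person_time."""
    rate = events / person_time
    se = math.sqrt(events) / person_time  # estimated standard error of the rate
    return rate, rate - 2 * se, rate + 2 * se


# 39 transmissions over 3,250 years of follow-up, as given in the lecture.
rate, low, high = incidence_rate_ci_95(39, 3250)
print(f"rate = {rate:.3f} per year; 95% CI: {low:.4f} to {high:.4f}")

# The same numbers rescaled to per 100 years of follow-up.
print(f"per 100 years: {100 * rate:.1f} ({100 * low:.1f} to {100 * high:.1f})")
```

Rescaling to a different amount of follow-up time just multiplies the estimate and both endpoints by the same constant.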

And if you actually go to the methods section of this article that we've

looked at before, and go back to what we looked at in

the lectures, they do report these incidence

rates, but per 100 person-years.

So their observed incidence rate, instead of that 0.012 transmissions

per one person-year, is prorated to per 100 person-years, so

they got 0.012 times 100, which is 1.2 per 100 person-years.

And their

95% confidence interval for that went from 0.9 transmissions per

100 person-years to 1.7 transmissions per 100 person-years.

If you took the results of what we got, that 0.008 and took 0.0158 and multiplied

those by 100, we'd get 0.8 transmissions per 100 person-years to 1.6 transmissions

per 100 person-years.

And the reason our results are slightly different is because they've used

those exact computations we've discussed and because there were so few events.

This is somewhat of a small sample size in terms of the

information content, so our results don't exactly agree with what they're reporting.

But our results were at least plausible in this

situation, and pretty much get at the same idea.

Finally, I asked you to confirm my

computations for the 95% confidence interval for

the proportion of persons getting screened for

colorectal cancer, based on the study results.

And hopefully when you did this, when all the

dust settled, you did get the same answer that I did,

which was a confidence interval, with rounding, from 24% to 29%,

0.24 to 0.29. But what I want you to think about,

something we didn't state explicitly in these lectures but

hopefully it's somewhat obvious, is what I asked you: what

if the researchers instead reported the proportion who did

not get screened for colorectal cancer in this group?

So the proportion who did get screened was 26.3%.

So the question is what if they had reported the proportion

who did not get screened instead, which is totally fine.

The proportion who did not get screened, well that

would be one minus the proportion who did get screened.

One minus the 0.263, or 26.3%, which

is equal to 73.7%,

or 0.737. So

notice that regardless of how we report this, the proportion of

yeses or the proportion of nos, the standard

error is exactly the same.

They have this symmetry, because it involves whatever

our proportion is times one minus that proportion.

So the standard errors are exactly the same.

We're encapsulating the exact same uncertainty.

We're just stating it in the opposite way. So if you do the confidence interval

for the proportion in one direction,

the 95% confidence interval for

the true proportion of people who were screened

in this population goes from 24% to 29%, as we talked about on the last slide.

I asked you what the relationship would be had we done the confidence interval the other way.

Well, it turns out, had we done the confidence interval in the other direction,

it would go from 71% to 76%.

Notice that the endpoints switch

and are the complements of the previous ones.

In other words, if 24% of the population had been

screened, then 76% would not have been screened.

And had 29% of the population been screened, then 71% would not be screened.

So there's a one to one symmetry.

If you had the information in one direction,

those who were screened, both the estimate and the confidence

interval endpoints, you could immediately get the information had we quantified

it as the proportion who were

not screened, just by taking the estimate and the confidence interval

endpoints and subtracting them from one.

These are two different ways of saying the exact same information.
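This symmetry is easy to verify numerically. The sketch below uses an illustrative helper, `proportion_ci_95`, with the same plus or minus 2 standard errors approximation used throughout this lecture, and compares the interval for the screened proportion with the interval for its complement.

```python
import math


def proportion_ci_95(p_hat, n):
    """Approximate 95% CI for a proportion: p_hat +/- 2 * sqrt(p_hat * (1 - p_hat) / n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 2 * se, p_hat + 2 * se, se


n = 1166             # usual care group size from the lecture
p_screened = 0.263   # proportion screened

low_yes, high_yes, se_yes = proportion_ci_95(p_screened, n)
low_no, high_no, se_no = proportion_ci_95(1 - p_screened, n)

print(f"screened:     {low_yes:.2f} to {high_yes:.2f} (SE {se_yes:.4f})")
print(f"not screened: {low_no:.2f} to {high_no:.2f} (SE {se_no:.4f})")

# Each endpoint for "not screened" is 1 minus the opposite endpoint for "screened".
print(math.isclose(low_no, 1 - high_yes), math.isclose(high_no, 1 - low_yes))
```

Because p times (1 minus p) is unchanged when p is swapped with 1 minus p, the two standard errors agree and the endpoints mirror each other.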
