A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

188 ratings

From the lesson

Module 2B: Summarization and Measurement

Module 2B includes a single lecture set on summarizing binary outcomes. While at first, summarization of binary outcomes may seem simpler than that of continuous outcomes, things get more complicated with group comparisons. Included in the module are examples of and comparisons between risk differences, relative risks, and odds ratios. Please see the posted learning objectives for this module for more details.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this set of lectures we'll actually focus on how to quantify and

summarize associations dealing with binary outcomes both in single groups and

between groups, to estimate the association via our samples for

the population from which the samples are taken.

It's kind of deceptively simple at first.

The way to summarize a binary sample seems intuitive, and a

single number summary pretty much encapsulates all of the information about

the center, the spread, and the percentiles.

But when we start getting into comparing populations through samples, we'll see that

it's somewhat trickier than you'd expect because there's several different ways of

measuring the association between binary outcomes and groups.

So in this lecture, we're going to look at ways to summarize

samples where we've collected binary outcome information, and

we're going to look at ways to compare information across these samples.

So in section A, this first section, we're going to talk about

something called the sample proportion as our main summary statistic for

samples of collected binary data.

So upon completion of this lecture section A, you will be able to summarize a binary

outcome across a group of individual observations via the sample proportion.

Explain why, with binary data,

the sample proportion is really the only summary statistic,

besides the sample size n, necessary to describe characteristics of the sample.

And then compute the sample proportion based on the results of a study.

So let's look at the first example.

Here are some HIV data.

We're going to look at a sample of 1,000 HIV-positive patients from a citywide

clinical population.

We've looked at things like the CD4 count, etc., in previous lectures.

For these 1,000 patients,

one of the things they measured was the response to antiretroviral therapy.

And what they found in the sample of 1,000 is that 206 of the 1,000 patients

responded.

So what does our outcome data look like at the individual level?

Well, it's either a yes, which could be coded as a 1, or

no which could be coded as a 0, for each of the persons.

They get a 1 or a yes if they did respond and a 0 or a no if they did not.

So our simple summary measure which probably seems intuitive to you,

would be what we call the sample proportion.

And we're going to represent that by the letter p, for proportion,

with a little caret on top or a hat.

And this number is affectionately known, and frequently called p hat.

And it's given by the following, we take the number in the sample who

have our outcome of interest and divide by the total number in the sample.

So in our situation here, we've got 1,000 people in the sample.

206 of them have the outcome, they respond to the therapy.

So our sample proportion of those who respond is 0.206,

which is roughly equal to 0.21 or 21%.

And there we've summarized the response in these 1,000 persons.
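The arithmetic above can be sketched in a few lines of Python. The counts come from the lecture; the variable names are just illustrative.

```python
# Counts taken from the lecture: 206 responders among 1,000 patients.
n_responders = 206
n_total = 1000

# Sample proportion: number with the outcome divided by the sample size.
p_hat = n_responders / n_total

print(f"p_hat = {p_hat:.3f} ({p_hat:.0%})")  # p_hat = 0.206 (21%)
```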

This p hat may be called the estimated proportion,

the estimated probability, or the estimated risk of response.

So proportion, probability, and risk are synonyms used to describe proportions.

Why do we put the hat on it?

Well, not only is it a well dressed p, but we put the hat on it

to distinguish that this is another trick of statistics here where we put things on

top to distinguish them as estimators of something we can't actually observe.

So we put the bar over an x for a sample mean, we put the hat on the p to suggest

that this is the sample-based estimate for some underlying truth that we won't know.

The underlying true population or

proportion, the true population proportion represented by p.

P hat estimates p, which we can't observe directly.

You can think of the sample proportion, this p hat, as just another example of

a sample mean, but here the data we're averaging isn't continuous in nature but

can only be yes or no or 0 or 1.

So generally, binary values are given a value of x=1 for observations that

have the outcome, and 0 for observations that don't, 1 for yeses and 0 for nos.

So with the 206 responders of the 1,000, if we coded these as 1s and

0s, we'd have 206 people whose value was 1 and the remaining 794 whose value was 0.

So if we took the average of these xs,

206 that are equal to 1 and 794 that are equal to 0.

If we sum up those 0s and 1s, we'd just get the 206 who are equal to 1 and

then we average that divided by the number of people in our sample, the 1,000, and

we get this sample proportion.

So the sample proportion could be thought of as a sample mean of 0, 1 data.
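To see this concretely, here is a small Python sketch that builds the 0, 1 data described above (206 ones and 794 zeros) and checks that its ordinary mean matches the sample proportion. The list `x` is a reconstruction from the reported counts, not the study's actual patient-level file.

```python
from statistics import mean

# Reconstructed 0/1 outcomes from the reported counts:
# 1 = responded to therapy, 0 = did not respond.
x = [1] * 206 + [0] * 794

# Summing 0s and 1s just counts the 1s.
n_ones = sum(x)

# The sample proportion and the sample mean of the 0/1 values coincide.
p_hat = n_ones / len(x)
x_bar = mean(x)

print(n_ones, p_hat, x_bar)
```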

However, unlike the information with continuous data,

some of the other characteristics of the sample aren't so interesting.

So, we could quantify the variability in our 0, 1 data.

There is a formula for

the standard deviation of binary data that looks like this.

It's a function of the sample size and

then our actual proportion of values that are yes outcomes.

So the sample variability or sample standard deviation is equal to the square

root of n over n - 1, times p hat, that estimated sample proportion, times 1 - p hat.

But this quantity is not particularly useful in understanding our distribution.

In fact, if we knew p hat, we pretty much understand how variable the 1s and

0s are because we know the proportion of them that are equal to 1.

However, I'm only throwing this formula in here because it will help inform us about

another measure of variability we encounter further in the course.
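As a rough check on the standard deviation of 0, 1 data, here is a Python sketch. It assumes the usual n - 1 divisor for a sample standard deviation, under which the value works out to the square root of n/(n - 1) times p hat times (1 - p hat). The data are reconstructed from the reported counts, and the variable names are illustrative.

```python
import math
from statistics import stdev

# Reconstructed 0/1 outcomes: 206 responders, 794 non-responders.
x = [1] * 206 + [0] * 794
n = len(x)
p_hat = sum(x) / n

# Closed form for 0/1 data, assuming the n - 1 divisor.
s_formula = math.sqrt(n / (n - 1) * p_hat * (1 - p_hat))

# Direct computation from the individual values (also uses n - 1).
s_direct = stdev(x)

print(round(s_formula, 4), round(s_direct, 4))
```

Both numbers come out at roughly 0.40, and both are fully determined once n and p hat are known, which is the point made above.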

What about percentiles of binary data?

Is that really necessary for us to understand what's going on in the sample?

Well, things are simpler with 0, 1 data in terms of summarization for any one sample.

And if you think about it, if we know p hat, the sample proportion,

we know our sample percentiles.

So, for example, in our HIV data, the percent responding we estimated

to be about, I'm just rounding up for purposes here, but 21%.

So we know that the tenth percentile of these data is 0.

We know the 20th percentile is 0,

79% of the values are 0s and 21% are 1s.

So the 60th percentile is 0, the 70th percentile is 0,

the 75th percentile is 0, the 80th percentile is 1.

We know that because we know that about 79% of the values are 0 and about 21% are 1.

So again, p hat tells us all we need to know about these binary data.
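The percentile argument can be sketched in Python as follows. This uses a simple nearest-rank definition of a percentile, which is one of several conventions and an assumption here; the helper name `percentile_nearest_rank` is hypothetical.

```python
def percentile_nearest_rank(sorted_values, k):
    """k-th percentile (k an integer from 1 to 100) by the nearest-rank rule."""
    n = len(sorted_values)
    rank = (k * n + 99) // 100  # ceiling of k * n / 100, in integer arithmetic
    return sorted_values[rank - 1]

# Reconstructed, sorted 0/1 outcomes: 794 zeros followed by 206 ones.
x_sorted = [0] * 794 + [1] * 206

for k in (10, 20, 60, 70, 75, 80):
    print(k, percentile_nearest_rank(x_sorted, k))
```

Every percentile up to the proportion of 0s is 0 and every one above it is 1, so nothing is learned beyond p hat itself.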

How about visual displays?

Are they really useful for any single sample?

Well, let's take a look.

Here is a histogram of the yes, no outcomes response to treatment for

these 1,000 individuals.

Notice there's only two bars, the proportion who did not respond and

the proportion who did.

And then there's a lot of empty space in here.

We could sell advertisement, and we could put in a picture of me.

But what do you think about this visually?

It's not particularly more informative than knowing

that p hat = 0.206 or roughly 0.21.

So as much as I like graphics, they're not particularly useful for

summarizing the results of binary outcomes above and beyond that sample proportion.

If we did a box plot, it almost looks comical, right?

We only have two data values, 0 and 1, if we coded the nos as 0 and

the yeses as 1s to response.

And there's no information in this picture about what proportion are 1s and

what proportions are 0s.

So again, binary data is simpler to summarize for any single sample.

The p hat the sample proportion pretty much gives us the entire story of what's

going on with the sample data.

So p hat is actually an all-encompassing measure that essentially provides all

the information about the distribution of the 0, 1 values in the sample binary data.

So let's look at another example, here's a seminal study in public health,

one of the first randomized trials to look at the impact of giving

antiretroviral therapy to mothers who are HIV positive and pregnant.

And see if there is any effect on the transmission of HIV to the child,

whether there is any reduction or not.

And so here is a little bit of the abstract and

here is the reference from the article.

See in the abstract, they say, maternal-infant transmission is

the primary means by which young children become infected with HIV type 1.

We conducted a randomized, double-blind,

placebo-controlled trial of the efficacy and the safety of AZT,

zidovudine, which I can never say, ergo I'm going to call it AZT,

in reducing the risk of maternal-infant HIV transmission.

HIV infected pregnant women

with CD4 counts above 200 cells per cubic millimetre who had not already received

antiretroviral therapy during the current pregnancy were enrolled.

And you can read the rest about the details here but

they basically randomized the women to get AZT at that point or not.

And then the results, what they actually did was they enrolled 477 pregnant women.

During the study, 409 gave birth to 415 live-born infants.

And then they followed the infants for up to 18 months, and

the HIV infection status within the 18 months was known for 363 births.

And of those 363 births whose HIV status was assessed up to 18

months after birth, 53 infants were infected.

So across both groups, not yet

splitting this out by those whose mothers received AZT and those who didn't.

The overall occurrence of HIV transmission to the infants

in this sample of data was 53 out of the 363 births

whose status was known, or a sample proportion of about 15%.
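The same calculation applies here. Below is a small Python sketch with a hypothetical helper, `sample_proportion`, applied to the counts quoted from the trial (53 infected infants among 363 births with known status).

```python
def sample_proportion(n_with_outcome, n_total):
    """Return the sample proportion (p hat) for a binary outcome."""
    if n_total <= 0 or not 0 <= n_with_outcome <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_with_outcome <= n_total")
    return n_with_outcome / n_total

# Counts quoted in the lecture for both treatment groups combined.
p_hat_transmission = sample_proportion(53, 363)

print(f"{p_hat_transmission:.3f}")  # 0.146
```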

Okay, let's look at another example.

This one's on colorectal cancer screening.

And the study researchers wanted to see whether a stepped up

intervention using electronic health records and automated mailings

increased colorectal screening adherence.

And what they did, was they looked at patients,

about 4,600 patients who are aged 50 to 73 years who were not,

at the time of the study, current for colorectal screening.

Across 21 medical centers, they randomized them in a 4-group, parallel-design, controlled

comparative effectiveness trial with concealed allocation and blinded outcomes.

Meaning that the people in the study and their

healthcare providers didn't know which of the four groups they were assigned to.

And what they were measuring at follow-up was the proportion of the patients

who were current for screening across both years of follow-up, to see if

there was any impact of the stepped up interventions versus the usual care.

And here are the results.

And I'll just read this to you and

then we'll actually look at the p hats here that we were discussing before.

They said compared with those in the usual care group, the standard care group,

participants in the intervention groups were more likely to be current for

colorectal cancer screening in both years with significant increases by intensity.

And then what they do here is report the proportions who are current after

the two year period in each of the groups.

So the usual care group, 26.3%, or a little over a quarter

of them were current after the two year follow-up period.

In the automated group which was one of the intervention groups, about half,

50.8% were, in the assisted group 57.5% and

in the navigated group, so these latter three groups

were different approaches to electronic medical records, reminders, etc.

And in this navigated group about two-thirds, or 64.7% were up to date.

And then they put these things called confidence intervals on, which express

the uncertainty in the estimates, which we'll go into in detail in subsequent lectures.

So if you actually parse this in terms of the way we've talked about,

the summary measures, their p hat, or

proportion of persons who are current for CRC, colorectal cancer screening,

after the two year followup, in the usual care group was 26.3%.

And the p hat for the automated group was 50.8% or 0.508.

Their p hat that they're reporting here as the crucial summary measure for

the assisted group was 0.575 or 57.5%.

And finally, their p hat for the navigated group which are sort

of drawn through here, was 0.647 or 64.7%.

So in summary, what I hope you've gotten out of this

lecture here is that, it's actually relatively simple

to summarize binary outcomes in any single sample.

This one number summary, this p hat or

sample proportion, is one-stop shopping for summarizing this.

But this can be deceptive as we'll see in the next set of lectures.

Because summarizing these outcomes for

any one sample is pretty straightforward by taking the sample proportion.

But when we start to compare these sample proportions across different groups,

like across these colorectal screening type randomized groups here,

we'll see that there's several different ways we can take

the summary proportions and actually numerically compare them.
