Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Techniques

This module is a bit of a hodge podge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Hi, my name is Brian Caffo, and this is Mathematical Biostatistics Boot Camp 2, lecture 10, on case control data.

In this lecture, we're going to briefly talk about case control methods.

We'll talk about an instance where, using retrospective case control data and a so-called rare disease assumption, we can estimate prospective odds ratios.

And then because this is kind of a lot focused on the odds ratio,

I thought I'd talk a little bit about exact inference for the odds ratio.

Okay, so let's talk about retrospective, you know kind of case reference sampling.

And again this is a deep subject, we're going to scratch the surface of it.

So in this case, imagine we wanted to study lung cancer, and here we had some cases and controls.

And we ascertained whether or not they were smokers.

Now there are, conceptually, two ways we could collect this data.

One is, we could follow a bunch of people over time; some of them would smoke and some of them wouldn't, and then we could see who developed lung cancer.

That's very hard.

Right.

I think conceptually, you can all see that that experiment is basically impossible.

A much easier experiment would be to go to hospital records and find a bunch of people that were cases, that had lung cancer.

In this case, we found 709 of them.

And then we also found 709 controls that were at some level comparable.

And then we retrospectively determined whether or not they were smokers.

Now, in this case, 709 is fixed, right, and it's whether or not they were a smoker that has the ability to vary.

Now, I should also say that the most common way to do case control methods would be, for every case, to try and very closely match a control, so that for every case there's a specific matched control.

But in this case, we're not doing that.

Let's say we had a group of case hospital records and a group of control patients, and we figured out a reasonable strategy for getting control patients.

And now the 709 is fixed, so what we want to ascertain is who is a smoker and who is not, and whether or not the cases had a greater proportion of smokers, and to make prospective conclusions from this retrospective data.

So, just in terms of probability:

We cannot estimate the probability of being a case given that you're a smoker directly from the data,

but we can estimate the probability of being a smoker given that you were a case.

Right, so we want to work within that kind of probability rubric.

What is interesting, is we can estimate an odds ratio.

So the odds ratio that we would want to estimate is the odds of being a case given that you're a smoker, relative to the odds of being a case given that you're a non-smoker.

Okay.

So we want the odds of developing lung cancer given that you smoked compared to the odds of developing lung cancer given that you didn't smoke.

Well, it turns out that that odds ratio is exactly equal to the odds of being a smoker given that you're a case, relative to the odds of being a smoker given that you're a control.

So the bottom one we can estimate; the top one we cannot.

So here I just directly go through the calculations.

The odds of being a case given that you're a smoker, divided by the odds of being a case given that you're a non-smoker, is the odds ratio of interest.

Right.

And let me just replace case and non-case with C and C bar, and smoker and non-smoker with S and S bar.

And here I just churn through the calculations.

You can go through these three steps to make sure that you agree.

And look, this works out to be the probability of being a case and a smoker, times the probability of being a non-case and a non-smoker, divided by the probability of being a case and a non-smoker times the probability of being a non-case and a smoker.

So it's sort of like the probability cross-product ratio: the probability of caseness and smokerness times the probability of being a non-case and a non-smoker, divided by the off-diagonal probabilities.

Now, I say this actually proves the result, and I think it does, because honestly, you can just see that if you were to exchange the words case and smoker up at the top, nothing changes when we get down to the bottom line here.

Right.

Because the probability of C and S is the same as the probability of S and C.

So I think you can tell that the two odds ratios are exactly equal, or, if you want to be very particular, you can keep working and get to the other odds ratio.

But to me, this proves the result from the previous page.
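Written out, the calculation described here might look like the following. The symbols $C$, $\bar C$, $S$, and $\bar S$ follow the lecture; the layout of the intermediate steps is a reconstruction, since the slide itself is not part of the transcript.

```latex
\begin{align*}
\frac{\mathrm{Odds}(C \mid S)}{\mathrm{Odds}(C \mid \bar S)}
&= \frac{P(C \mid S)/P(\bar C \mid S)}{P(C \mid \bar S)/P(\bar C \mid \bar S)} \\
&= \frac{P(C, S)/P(\bar C, S)}{P(C, \bar S)/P(\bar C, \bar S)} \\
&= \frac{P(C, S)\,P(\bar C, \bar S)}{P(C, \bar S)\,P(\bar C, S)}
\end{align*}
```

The second line uses $P(C \mid S) = P(C, S)/P(S)$, so the $P(S)$ and $P(\bar S)$ factors cancel. Swapping the roles of $C$ and $S$ only exchanges the two factors in the denominator, which is why the last line is the same for either direction of conditioning.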

And it also reminds you that these are the probability statements.

But we estimate those probabilities and odds ratios from data, and of course the sample odds ratio is the cross-product ratio, n11 times n22 divided by n12 times n21.

And the odds ratio is invariant to transposing the rows and the columns.

So our estimator has this kind of invariance property,

which we would hope, right.

It would be weird if we said that the two odds ratios, in terms of probabilities, were equal, but the sample estimates were not equal depending on which one you were treating as the outcome and which one you were treating as the predictor.

So that's nice.

By the way, the sample odds ratio is unchanged if

a row or a column is multiplied by a constant.
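As a rough check of these invariance properties, here is a minimal Python sketch. The function name `odds_ratio` and the four counts are hypothetical, chosen only for illustration; they are not data from the lecture.

```python
def odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio (cross-product ratio) of a 2x2 table."""
    return (n11 * n22) / (n12 * n21)

# Hypothetical counts, not from the lecture.
n11, n12, n21, n22 = 20, 10, 5, 15

original = odds_ratio(n11, n12, n21, n22)
# Transposing the table swaps the two off-diagonal cells.
transposed = odds_ratio(n11, n21, n12, n22)
# Multiplying the first row by a constant (here 3).
row_scaled = odds_ratio(3 * n11, 3 * n12, n21, n22)
# Multiplying the second column by a constant (here 7).
col_scaled = odds_ratio(n11, 7 * n12, n21, 7 * n22)

print(original, transposed, row_scaled, col_scaled)
```

All four values agree, because the constant appears once in the numerator and once in the denominator of the cross-product ratio and cancels.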

And then the last thing, and this is what we'll talk about: the odds ratio turns out to be related to the relative risk.

So the thing is, if you want odds ratios, we just demonstrated that the odds ratio works out really well,

and you can kind of reverse the conditioning when talking about the odds ratio.

But we'll talk specifically about the relative risk, which is what people often want to estimate, and how it relates to the odds ratio.

Okay, so the odds ratio is here, right?

The probability of smoker given that you're a case divided by the probability of non-smoker given that you're a case, and so on; you can read this top line.

Okay, then we can reverse the odds ratio, right, using the argument from the other page.

So now we have the probability of case given smoker divided by the probability of non-case given smoker, all divided by the probability of case given non-smoker divided by the probability of non-case given non-smoker.

Okay.

Okay.

Then in the next line, everything is just multiplied out.

Denominators are raised up to numerators, and so on.

And then, look at this first term here:

probability of case given smoker, divided by probability of case given non-smoker.

That's the relative risk. Right.

That's if you wanted to know who develops lung cancer, comparing those who smoked to those who didn't smoke.

That's the relative risk,

the ratio of the two probabilities.

And then that's multiplied by these other terms, but I wanted to express them with respect to case status.

So I just 1 minus

[INAUDIBLE]

to the probabilities. And what you can see is that if this ratio that we're multiplying the relative risk by is about 1, then the odds ratio approximates the relative risk.

And it is often the case that these two numbers, 1 minus this number and 1 minus that number, are similar enough if in fact the case is very rare; in other words, regardless of whether or not you smoke, the probability that you'd get this disease, let's say lung cancer, is quite small.

If that's true, the so-called rare disease assumption, then this ratio will be about 1, and the odds ratio will approximate the relative risk.

That's what people mean when they talk about the rare disease assumption: they use the retrospectively collected data, along with the odds ratio, to then approximate the relative risk.
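For reference, the relationship being described can be written as follows. This is a sketch reconstructed from the spoken description, using the lecture's $C$ and $S$ notation; the labels $\mathrm{OR}$ and $\mathrm{RR}$ are shorthand introduced here.

```latex
\begin{align*}
\mathrm{OR}
&= \frac{P(C \mid S)/P(\bar C \mid S)}{P(C \mid \bar S)/P(\bar C \mid \bar S)} \\
&= \frac{P(C \mid S)}{P(C \mid \bar S)} \times \frac{P(\bar C \mid \bar S)}{P(\bar C \mid S)} \\
&= \mathrm{RR} \times \frac{1 - P(C \mid \bar S)}{1 - P(C \mid S)}
\end{align*}
```

When both $P(C \mid S)$ and $P(C \mid \bar S)$ are small, the second factor is close to 1 and $\mathrm{OR} \approx \mathrm{RR}$.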

It's so common that often people don't even really talk about what they're doing.

They just do it.

I think it's so common in the epi literature that it's generally not described in, say, an American Journal of Epidemiology article or something like that.

So now, let me just make the small point that the disease has to be rare among the exposed and the non-exposed, not just rare overall.

So here's a simple example.

Chuck Rodi reminded me of this at one point.

So here we have exposure, yes or no, and disease, yes or no. The counts are 9, 1, 1, and 999.

And let's just assume that this is cross-sectional data,

so all the margins and everything are estimable.

So the estimated probability of disease is about 1%, and the odds ratio works out to be almost 9000.

The relative risk works out to be about 900, so clearly the odds ratio is not estimating the relative risk, and in this case, like I said, because of the sampling I'm assuming the two are directly estimable from the data.

So in this case what happens is that disease is rare overall,

10 out of 1010,

but it's not rare among the exposed.

Right.

So among the exposed, you actually had 9 times as many people having the disease as not.

So at any rate, if you look at the equation, it's clear that both P of C given S bar and P of C given S have to be small in order for the rare disease assumption to apply.

And that's the real criterion.

This is just a numerical illustration in a hypothetical circumstance where we can estimate all the probabilities as well,

and we can show that the two aren't approximately equal to each other.
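The numbers quoted in this example can be reproduced with a short Python sketch. The assignment of the four counts to cells is inferred from the remarks that 10 out of 1010 are diseased and that the exposed have 9 times as many diseased as not; the variable names are illustrative.

```python
# Counts from the example. The cell assignment is an inference from the lecture:
# exposed: 9 diseased, 1 not; non-exposed: 1 diseased, 999 not.
exposed_d, exposed_nd = 9, 1
unexposed_d, unexposed_nd = 1, 999

total = exposed_d + exposed_nd + unexposed_d + unexposed_nd
prevalence = (exposed_d + unexposed_d) / total  # disease is rare overall

risk_exposed = exposed_d / (exposed_d + exposed_nd)          # not rare among the exposed
risk_unexposed = unexposed_d / (unexposed_d + unexposed_nd)  # rare among the non-exposed

relative_risk = risk_exposed / risk_unexposed
odds_ratio = (exposed_d * unexposed_nd) / (exposed_nd * unexposed_d)

print(f"prevalence    = {prevalence:.4f}")
print(f"relative risk = {relative_risk:.0f}")
print(f"odds ratio    = {odds_ratio:.0f}")
```

Here the risk among the exposed is 0.9, so the factor multiplying the relative risk is far from 1 and the odds ratio overshoots it by roughly a factor of 10.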

So let's just recap about the odds ratio.

So an odds ratio of 1 implies no association.

An odds ratio greater than 1 is a positive association.

An odds ratio less than 1 is a negative association.

For retrospective case control studies, odds ratios can be interpreted prospectively.

For diseases that are rare among both the exposed and the non-exposed, the odds ratio approximates the relative risk.

And the delta method standard error for the odds ratio is the square root of the sum of the reciprocals of the four cell counts.

Oh, and just to remind you, that's the standard error for the log odds ratio, not the standard error for the odds ratio.
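A minimal Python sketch of that standard error formula follows. The function name `log_odds_ratio_se` and the counts are hypothetical, used only to show the arithmetic.

```python
import math

def log_odds_ratio_se(n11, n12, n21, n22):
    """Delta method standard error of the log sample odds ratio:
    square root of the sum of reciprocal cell counts."""
    return math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

# Hypothetical counts, not from the lecture.
se = log_odds_ratio_se(20, 10, 5, 15)
print(f"SE of log odds ratio = {se:.3f}")
```

Note that the smallest cell dominates the sum, so a single sparse cell can make the interval for the odds ratio very wide.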

So let's just go through our example.

Here we have our lung cancer cases and controls, smokers yes or no.

Our odds ratio works out to be 3.

The standard error for the log odds ratio works out to be 0.26.

If we want a confidence interval, it's log of 3 plus or minus 2 standard errors; we get 0.59 to 1.61.

We would check whether or not 0 is in that interval. If we exponentiate it, then we would check whether or not 1 is in the interval.

In this case, if we exponentiate it we get 1.8 to 5.0, so 1 is not in the interval, and our estimated odds of lung cancer for smokers are 3 times the odds for non-smokers.
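The interval arithmetic can be sketched in Python from the two quoted summary values alone. One assumption to flag: the lecture says "2 standard errors", but the quoted bounds of 0.59 and 1.61 match the multiplier 1.96, so that is what this sketch uses.

```python
import math

odds_ratio = 3.0   # quoted in the lecture
se_log_or = 0.26   # quoted in the lecture
z = 1.96           # assumption: reproduces the quoted bounds; the lecture says "2"

log_or = math.log(odds_ratio)
lower_log = log_or - z * se_log_or
upper_log = log_or + z * se_log_or

# Exponentiate the endpoints to get an interval on the odds ratio scale.
lower_or = math.exp(lower_log)
upper_or = math.exp(upper_log)

print(f"log scale:        ({lower_log:.2f}, {upper_log:.2f})")
print(f"odds ratio scale: ({lower_or:.1f}, {upper_or:.1f})")
```

The log-scale interval excludes 0 and the exponentiated interval excludes 1, which are the same statement on two scales.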
