A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods


From the lesson

Module 2A: Confounding and Effect Modification (Interaction)

This module, along with Module 2B, introduces two key concepts in statistics/epidemiology: confounding and effect modification. A relation between an outcome and exposure of interest can be confounded if another variable (or variables) is associated with both the outcome and the exposure. In such cases the crude outcome/exposure association may over- or under-estimate the association of interest. Confounding is an ever-present threat in non-randomized studies, but results of interest can be adjusted for potential confounders.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

In this lecture set, we're going to formally define and discuss some ways of

dealing with something we've alluded to and spoken of before.

The idea of Confounding.

So in this set of lectures, we will formally define confounding.

And give some explicit examples of its impact.

Define the idea of adjustment and adjusted estimates conceptually.

And begin a discussion of the analytics or approach to adjustment for confounding.

So first, let's give a formal definition and

some examples of the impacts or potential impacts of confounding.

So in this lecture set, we're going to formally define confounding.

Establish conditions which can result in the confounding of an outcome

exposure relationship, or more generally a relationship between two variables.

And demonstrate the potential effects of confounding on measuring

association via several examples.

So let's just get started with a very over-the-top fictitious example, just to

hammer home with the idea, or to start the discussion about the idea of confounding.

So consider the following results from a fictitious study.

This study was done to investigate the association between smoking and

an outcome, we'll just say a certain disease, in a population of adults, both male and female.

And a random sample was taken and

the subjects were classified as to their smoking status at the time of the study.

And there were 210 smokers, and 240 non-smokers.

And then they were assessed as to whether they had the disease of interest or not.

And this was not a particularly awful disease, but

it was relatively prevalent in this fictitious population.

So, here are the results broken out by disease and smoking status.

And if you analyze the results of this two by two table,

you'll see that if you look at the association between smoking and disease.

You can compare the proportion of smokers who have disease to

the proportion of non-smokers.

This relative risk is 0.93.

Indicating, at least in this sample, a slightly lower risk of disease among the smokers.

Well, we haven't accounted for sampling variability. But at least by this estimate, in this sample,

smoking appears to be protective against disease by a small amount.
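As a quick aside, the crude relative risk computation can be sketched in a few lines of Python. The counts below are hypothetical (the lecture's full 2x2 table isn't reproduced here), chosen only so the result lands near the quoted 0.93:

```python
# Crude relative risk from a 2x2 table.
def relative_risk(diseased_exposed, total_exposed,
                  diseased_unexposed, total_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = diseased_exposed / total_exposed
    risk_unexposed = diseased_unexposed / total_unexposed
    return risk_exposed / risk_unexposed

# Hypothetical counts: 84 of 210 smokers diseased, 103 of 240 non-smokers
rr = relative_risk(84, 210, 103, 240)
print(round(rr, 2))  # 0.93
```

A relative risk below 1 means the exposed group (here, smokers) has the lower observed risk.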

But how can this happen?

This goes against everything we understand about smoking and

its association with morbidities.

So let's just take a look at our data a little bit further.

Let's look at the relationship between smoking and the sex of the person.

So this is new, this is data we collected.

But now I'm showing you a representation of the data that compares smoking status by sex.

And if you look at the smokers, the proportion of smokers who are male is about 76%.

So the majority by a fair amount of the smokers in the sample are male.

But among non-smokers, the proportion who are male, is about 16%.

So right off the bat, we see pretty clearly that there's a strong

association between sex of the person and smoking status in these data.

We go ahead and look at the association between sex and the disease outcome.

If we look at the persons who have this disease,

the proportion of the persons who have the disease who are male is 28%.

And the proportion of males amongst those without the disease is 50%.

So we can see that disease is associated with sex, as well.

And females are more prevalent amongst the diseased.

So what's going on here?

We want to associate disease with smoking, but there's this third variable, sex.

Which seems to be related to disease and smoking.

And because of that, it may possibly be explaining some of the association or

lack of association that we're finding,

when we look directly at the association between disease and smoking.

And ignore this information about sex.

So let's think about this.

The comparison of the disease risk between the smokers and non-smokers is potentially distorted, or nullified, or even negated,

by the disproportionate percentage of males among the smokers.

So, when we are getting a comparison on the relative risk scale of

smokers to non-smokers.

We look at that comparison.

Remember, the percentage of smokers for males is 76%.

So roughly eight out of ten.

So if we look at smokers, for every 10 we look at,

there would be about eight males and two females.

And it's only 16% among non-smokers.

So if we round up,

we'd expect to see only two in ten are males among the non-smokers.

So if we look at this comparison, it's imbalanced heavily in terms of

the sex distribution of the numerator and denominator.

And recall, males in the sample are less likely to be diseased than females.

And, so, if we take this ratio as is, we're getting something distorted

by the fact that the majority of the numerator consists of people who are male.

And are less likely to have the disease.

And that's why we're seeing this association,

this minor negative association between smoking and disease.

So again, the original outcome of interest is disease.

The original exposure of interest is smoking.

And in this sample, sex is related to both the outcome and exposure.

The relationship that we see is possibly impacting the overall relationship between

disease and smoking.

So how can we assess the degree, or

if sex is distorting the overall relationship in any way?

Well, one approach to start is to remove the variability in the sex distributions between the smokers and non-smokers.

By stratifying our data and looking separately at each of the two sex groups.

So let's look at males only.

And if I disentangled this data, and you can't directly do it from the way I

presented before, but I have all the data in a database.

If we actually look at the relationship between disease and smoking only in males.

So this comparison is not corrupted by a different distribution of males and

females in the smoking and non-smoking groups.

because we're only looking at males.

If we look at the relative risk now, in this group of males, of disease for smokers versus non-smokers, it turns out to be 1.8.

We show an estimated 80% increase in the risk of disease for

smokers among the males compared to non-smokers.

And if we do the same thing for females, we get

a relative risk of disease for female smokers to non-smokers of 1.5.

An elevated risk of 50% in the sample of females for smokers to non-smokers.
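The whole crude-versus-stratified story can be sketched numerically. The counts below are illustrative, not the lecture's actual data; they're constructed so that both sex-specific relative risks exceed 1 while the pooled (crude) relative risk falls below 1, which is the reversal pattern the lecture describes:

```python
# Illustrative (not actual) counts: (diseased, total) for each group,
# stratified by sex.
strata = {
    "male":   {"smoker": (30, 160), "non_smoker": (4, 38)},
    "female": {"smoker": (37, 50),  "non_smoker": (99, 202)},
}

def rr(exposed, unexposed):
    """Relative risk from (diseased, total) pairs."""
    (d1, n1), (d0, n0) = exposed, unexposed
    return (d1 / n1) / (d0 / n0)

# Stratum-specific relative risks: both above 1
for sex, counts in strata.items():
    print(sex, round(rr(counts["smoker"], counts["non_smoker"]), 2))

# Crude relative risk: pool the strata and the association reverses
d1 = sum(c["smoker"][0] for c in strata.values())
n1 = sum(c["smoker"][1] for c in strata.values())
d0 = sum(c["non_smoker"][0] for c in strata.values())
n0 = sum(c["non_smoker"][1] for c in strata.values())
print("crude", round(rr((d1, n1), (d0, n0)), 2))
```

Note how the imbalance drives the reversal: most smokers are male, and males have the lower disease risk, so pooling drags the smokers' overall risk down.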

So a recap.

The overall, sometimes we call it the crude or

unadjusted relationship between smoking and disease.

The relative risk was nearly 1, and the risk difference is nearly 0.

So it didn't appear at first pass that there was much of an association between smoking and disease.

But in the sample, smoking appeared somewhat protective.

However, when we looked at the data separately by sex, we saw

increased risk of disease for smokers compared to non-smokers in both groups.

80% and 50%, respectively.

And we're just looking at the estimates for now, so we're not considering statistical significance; we will shortly.

So this is a pretty striking result.

The background combination of the increased prevalence of smoking among the males, and the decreased risk of disease among the males,

Made it look like there was little association between smoking and

disease, when we compared all smokers to non-smokers in the sample.

But the overall association was being heavily influenced by this imbalance in the sex distribution between the exposed and unexposed groups.

And sex was related to the risk of disease.

So what we were seeing was mainly a negation of the overall smoking and

disease relationship because of this sex component.

And when we removed sex from the story and

looked at the association separately by males and females,

we saw a positive association between smoking and disease in both sex groups.

So this example's pretty explicit and contrived, just to illustrate a point.

But it illustrates something, sometimes called Simpson's Paradox.

That the nature of an association can change or

even reverse direction, or disappear, when data from several groups are combined to form a single group.

In other words, when we took the entire sample of males and

females together, we missed the association between smoking and disease.

So an association between an exposure, X, and a disease or outcome.

Let's say more generally, outcome Y.

And more generally,

we really could say an association between any two variables x and y.

We don't even have to explicitly make one the exposure and one the outcome.

This association can be confounded by another variable, sometimes called a lurking or hidden variable, Z.

Or multiple hidden or

lurking variables, that are associated with both the exposure and disease.

And what a confounder, or a set of confounders does,

is it distorts the true relationship between X and Y.

And so this can only happen if our confounder or

confounders, are related to both X and Y.

So in our example we just looked at, sex was related to both the smoking status and

the disease status of the participants in this study.

So when we get this sort of thing going on, if we look at a Venn diagram depiction:

There might be some crossover in the information relating Y to X.

There might be some distortion or crossover because of

the relationship of both, with this third variable or set of variables.

So what's the solution for confounding?

What can we do about confounding?

Well, if we don't know what the potential confounders are, there's not much we can do after the study is completed.

Randomization as a study design, is the best protection against confounding.

Randomization essentially eliminates, and we'll look at a pictorial of this in a minute, the potential links between the exposure of interest and

potential confounders.

You know, Z1, Z2, Z3, through however many confounders we have.

And the nice thing about randomization is it limits the potential

links between confounders we could think of and

measure, and confounders we never considered when planning the study.

But in many cases we've talked about, we can't randomize our exposure of interest.

So if you can't randomize, but

have some sense of what the potential confounders are.

There are statistical methods to help control for confounding.

This is called adjusting for confounders.

This is a tricky thing, though.

Because potential confounders must be known in advance and

measured as part of the study.

And there's always going to be that nagging question of,

did we measure all potential confounders?

So why does randomization minimize the threat of confounding?

So let's look at a situation where we have some outcome.

I'll just generically call it Y.

And some predictor X.

And there's these variables behind the scenes that may confound this association.

In order to confound this association, either distorting it or hiding it through behind-the-scenes relationships,

these potential confounders have to be related to both the outcome and the exposure.

So what do we do when we do randomization?

So let's suppose we're looking at a drug trial, and we're looking at some drug and

its impact on some condition that's been shown to be related to age and sex.

Well, there's nothing we can do to change the relationship in

nature between the condition or disease and age and sex.

But if we're doing a randomized trial, and

randomizing subjects to either get a drug or placebo.

By randomizing them to both groups, we can eliminate any systematic links

between age, sex and other things, and which treatment group they're in.

So randomization eliminates this potential systematic link

between the exposure groups, and these potential confounders.

And remember, in order to confound an association,

the potential variables have to be related to both the exposure and the outcome.

So by getting rid of that link, we're minimizing the threat of confounding.
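A small simulation can make this concrete. Here's a sketch (with an assumed age distribution and an assumed self-selection rule, purely for illustration) showing that a confounder like age is imbalanced between exposure groups when subjects self-select, but balanced when assignment is randomized:

```python
import random

random.seed(0)

# A population where age (a potential confounder) influences who
# self-selects into the exposure group.
ages = [random.gauss(50, 10) for _ in range(10_000)]

# Self-selected exposure: older subjects are more likely to be exposed
self_selected = [age > 50 for age in ages]

# Randomized exposure: a coin flip, independent of age
randomized = [random.random() < 0.5 for _ in ages]

def mean_age(flags):
    """Mean age among subjects where the flag is True."""
    group = [a for a, f in zip(ages, flags) if f]
    return sum(group) / len(group)

# Age is badly imbalanced under self-selection...
diff_self = mean_age(self_selected) - mean_age([not f for f in self_selected])
# ...but nearly balanced under randomization
diff_rand = mean_age(randomized) - mean_age([not f for f in randomized])
print(round(diff_self, 1), round(diff_rand, 1))
```

The self-selected groups differ in mean age by many years, while the randomized groups differ by a fraction of a year: randomization has broken the link between the confounder and the exposure, so age can no longer confound the exposure/outcome comparison.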

So, let's look at another study.

An observational study to estimate the association between arm circumference and height in Nepali children.

And we've looked at these data several times before.

So let's suppose we have 150 randomly selected children between 0 and

12 months old.

And they had their arm circumference, weight and height measured.

In fact, we looked at this recently in the unit on linear regression.

Well, the study is clearly observational.

It's not possible to randomize subjects to height groups.

So, the arm circumference range in these data is 7.3 to 15.6 centimeters.

The height range is as stated here.

And the weight range is given by 1.6 to 9.9 kilograms.

So, as we saw back in the unit on linear regression,

if we fit a linear regression to estimate the association.

We found a positive, and statistically significant, association between arm circumference and height.

But notice, perhaps not surprisingly, that arm circumference is strongly and positively associated with the weight of the child, and height is also strongly and positively associated with weight.

And these lines here are the respective regression lines of arm circumference on weight, and height on weight.

So here's what we get if we actually, and

we'll talk about adjustment in the next section.

But if we actually re-estimate the relationship between arm

circumference and height.

But remove the behind the scenes relationship between arm circumference,

weight and height.

In other words, adjust for weight differences amongst the different height groups.

The association we now get between arm circumference and height is negative.

In other words, when we compare children who differ by height, we've made it such that this comparison is amongst children of the same weight.

And we'll talk about this more in the next session.

But essentially, if we're comparing children of similar weight,

who differ by height.

Then amongst groups of children who are similar in weight,

the relationship between arm circumference and height is negative.

And the estimated regression slope here for height, after pulling out the behind-the-scenes association with weight in these two variables, is negative: -0.16.

And just consider that for a moment.

Think about why that may be.
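Here's one way to see it with a sketch in Python. The data below are synthetic, generated so that weight drives both height and arm circumference while height has a small negative direct effect, mimicking the structure described (these are not the actual Nepali data). The unadjusted slope on height comes out positive, and the weight-adjusted slope comes out negative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 150

# Synthetic data: weight drives both height and arm circumference,
# while height has a small negative direct effect on arm circumference.
weight = rng.normal(5.5, 1.5, n)                        # kg
height = 45 + 4.0 * weight + rng.normal(0, 2, n)        # cm
arm = 5 + 1.5 * weight - 0.16 * height + rng.normal(0, 0.3, n)

def slope_on_height(predictors):
    """OLS coefficient on height from regressing arm on the predictors."""
    X = np.column_stack([np.ones(n)] + predictors)
    beta, *_ = np.linalg.lstsq(X, arm, rcond=None)
    return beta[1]  # height is the first predictor after the intercept

print(round(slope_on_height([height]), 2))          # unadjusted: positive
print(round(slope_on_height([height, weight]), 2))  # weight-adjusted: negative
```

The sign flip happens because, unadjusted, height acts as a proxy for weight; once weight is held fixed, taller children of the same weight are leaner, so the direct height association is negative.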

Here's another study we can look at.

This is a pretty interesting example.

This was a longitudinal study done in South Africa.

It's a birth cohort.

Followed for five years after birth.

And so what they did was collect information on this birth cohort, and follow them up at the five-year mark.

And there was a fair amount of dropout in this study; understandably, if they only measured them once every five years.

But what they wanted to see, for the design of future studies, is whether there was any information they could use to predict who would drop out and who wouldn't.

And maybe customize their follow-up intensity

depending on these characteristics.

So they looked at whether or not the subjects, or the families who were initially selected, participated in the follow-up.

They looked at their medical aid status.

Whether they had public insurance or not.

And what they found in the overall cohort was,

the relative risk of actually participating in the follow up visit.

For those who actually received public insurance or

medical aid compared to those who didn't was 0.7.

So, it looks like medical aid was associated with a reduction in

follow-up participation on the order of 30% based on these data.

But it turns out,

if they actually stratified this by the race of the participants in the study.

And first looked at the black participants, so

they were classifying race as black and white.

If they only looked at the black participants,

the relative risk of follow-up for

those black participants on medical aid versus those not, was equal to 1.

There was absolutely no difference in the proportion who participated in the follow-up after five years, amongst those on medical aid and those not on medical aid.

Similarly, if they looked at all the white participants only and

looked at those who receive public insurance or aid from the government.

Versus those who didn't, the relative risk of follow up among white participants on

medical aid compared to those not, was 1.05.

So, a slightly elevated proportion participating in the follow-up visit among those on medical aid, amongst the white participants.

So let's just pause for a moment and think about what's going on here.

When we looked at everyone all together, there's a negative association between

participation in the follow-up visit and being on medical aid.

But when we stratified and looked separately by race groups,

there was little to no association between participation in the follow-up visit and

medical aid.

So what's going on?

Well, if we go back and look at some characteristics of the sample,

the majority, 91%, were black.

And 26% of the Black families or subjects completed the follow-up,

as opposed to only 9% of the White subjects.

However, only 9% of

the Black subjects had medical aid, compared to 83% of the White subjects.

So what's going on here?

In the initial comparison of medical aid versus no medical aid.

If the numerator

was majority White, and the denominator was majority Black.

But Whites were much less likely, 9% versus 26%,

much less likely to participate in the follow-up visit.

So this comparison was distorted by the disproportionate number of White families among those receiving medical aid.

And the fact that White families were far less likely to participate in

the follow-up.

Let's talk about one more example.

I'm not going to give an example of this per se, but

talk about something that comes up a lot today with genomics, and

sequencing, and gene expression type studies.

It's been discovered to be a big problem that needs to be corrected for, both in the design of the study and in the analysis:

what are called batch effects in lab-based analyses.

Well, lab-based results can be influenced by the technician, the laboratory used, the time of day, the temperature in the lab, et cetera.

So if the goal of a study,

is to ascertain differences in lab measures between groups.

For example, differences in gene expression levels between those with

a disease and those without.

And the group, like disease or

non-disease, is associated with at least some of the above characteristics.

Then there can be confounding.

So, for example, suppose the majority of the diseased subjects under study were analyzed by technician 1, and

the majority of the non-diseased subjects were analyzed by technician 2.

And the study finds a differential in the measured lab result on average,

between these two groups.

That could be because the result differs among those who

are diseased and non-diseased.

But it could be heavily influenced by the fact that the technician was correlated or

associated with the disease group, and the measurements themselves.

So it's just something to think about even in non-population based studies.

In things that seem to be well controlled in laboratory situations,

there can be a threat of confounding.

So in summary, in non-randomized studies, outcome/exposure relationships or

just more generally relationships between two measures of interest,

may be confounded by other variables.

And in order to confound the outcome exposure relationship,

a variable must be related to both the outcome and the exposure.

What we're going to look at in the next two sections, is in the next section we'll

talk about interpreting what's called a confounder adjusted association.

And we'll talk about what that means.

And what comparisons are being made by that association.

We alluded to it in the arm circumference and height example in this section.

But we'll talk about it formally.

And then in section C,

we'll give a little intuition behind the mechanics of adjustment.

And that will set the stage for our next chapter on multiple regression methods.
