An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

A common occurrence in genomic and genetic data is that the outcome is

actually a binary variable rather than a continuous variable.

In that case one option is logistic regression.

So as an example of this, I'm going to illustrate it with a case-control study.

So suppose that you've collected a large number of cases, say a large number of

cases of people with cardiovascular disease and an equal number of controls.

So people that haven't necessarily had cardiovascular disease or

at least not measurably.

And then you genotype them at a bunch of loci.

For one particular locus, imagine that you can either have a C or a T,

then you can build this two-by-two table that says whether you're a C or

a T and whether you're a case or control.

So then the next thing that you might want to ask is,

are those two variables related to each other?

So one option for

doing this is you could just fit the standard linear regression model.

So you could relate the case control status which is either a 0 or

a 1, to the genotype which is equal to 0 or 1 as well if it's a C or

a T based on this linear regression model.

And then you'd have an error term just like you'd have before.

The problem is that here you can imagine getting a model fit and an error term such

that the fit was outside of [0, 1], even though the variable itself is 0 or 1.

Moreover, you could get any continuous number for

this regression over here on the right, and

you actually only had two potential real values on the left-hand side.

So that doesn't make a lot of sense.

So the first step that you could make is to recognize that's not

a continuous variable and instead try to model the probability.

So you could model the probability that you're a case, and

you could model that as a function of the genotype.

Here we've now eliminated this error term because we're not modeling this

continuous variable anymore, we're just modeling a probability.

So we have some model here for that probability.

The only problem is that a probability is always between zero and one.

And so, if you fit a linear regression model you might get values that are larger

than one or smaller than zero.

So another way that you could do this is you could take the log of the probability.

So that if you set the probability that the case is equal to 1,

equal to the variable p, and

you model the log of that probability as a linear function of the genotype.

This works a little bit better because this can have values between

negative infinity and zero which is now more like a continuous variable, and

you'll capture more of it with the regression model.

But you can go even farther, you can actually model the log odds.

So here we're going to say again that the probability that you're a case is p.

So we're going to do p divided by 1 minus p.

So that variable can take on a larger number of values and

the log of that variable can take on any value between minus infinity and infinity.

This now makes sense for a continuous regression model like we have here that's

regressing on the basis of the genotypes.
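
Written out, the sequence of models just described looks roughly like the following. This is a sketch in assumed notation, since the slides are not part of the transcript: $C$ is case status (0 or 1), $G$ is the genotype, $e$ is the error term, $p$ is the probability of being a case, and $b_0$, $b_1$ are the regression coefficients.

```latex
\begin{align*}
C &= b_0 + b_1 G + e
  && \text{(linear regression on the 0/1 outcome)} \\
p = \Pr(C = 1 \mid G) &= b_0 + b_1 G
  && \text{(linear in the probability; the right side can leave } [0, 1]\text{)} \\
\log p &= b_0 + b_1 G
  && \text{(left side ranges over } (-\infty, 0]\text{)} \\
\log \frac{p}{1 - p} &= b_0 + b_1 G
  && \text{(logistic regression; left side ranges over } (-\infty, \infty)\text{)}
\end{align*}
```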

But now, we have a little bit of difficulty because we're modelling

a relationship about the log of a variable versus the genotype.

And so what are the coefficients that we're estimating now?

This coefficient is interpreted as the increase in

log odds of case status given a genotype.

Let's talk a little about odds and log odds.

They're a little bit tricky to interpret.

So let's start off with a simple example.

Imagine that you can have one of three genotypes.

You can either have two copies of the major allele, one copy of the major allele

and one copy of the minor allele or two copies of the minor allele.

Suppose that in this case,

the phenotype that we're after is whether you died or not.

So suppose that there's a 33% chance you die if you have two copies

of the major allele, 50% chance if you're a heterozygote, and

90% if you're homozygous for the minor allele.

So then what we can do is we can calculate the probability of that phenotype for

each of these different genotypes.

The odds then is the ratio of the probability of death

to the probability of not death.

So in this case it's one to two is the odds,

one over two is the odds of death here.

In this case, since it's a 50-50

chance, the odds is actually equal to one, it's just a ratio of one to one.

And in this case, the probability of death is 90%, and

the probability of not dying is 10%.

And so the odds is actually 9 over 1 or 9.

So the odds is a number that can range from zero to basically infinity.

It could be as big as you want depending on what this probability is.

So the log odds is then going to be the log of that number,

it's going to range between minus infinity and infinity.
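
The arithmetic for the three genotypes can be checked with a few lines of Python. This is only an illustrative sketch: the genotype labels are made up, and the 33% from the example is taken to be exactly 1/3 so that the odds come out to one to two.

```python
from fractions import Fraction
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

# Probabilities of death from the example; 33% is treated as exactly 1/3 (assumption)
death_prob = {
    "major/major": Fraction(1, 3),
    "major/minor": Fraction(1, 2),
    "minor/minor": Fraction(9, 10),
}

genotype_odds = {g: odds(p) for g, p in death_prob.items()}
genotype_log_odds = {g: math.log(o) for g, o in genotype_odds.items()}

for g in death_prob:
    print(g, "odds =", genotype_odds[g], "log odds =", round(genotype_log_odds[g], 3))
```

The odds come out as 1/2, 1, and 9, and the log odds are negative, zero, and positive respectively, which is the sense in which a log odds of zero marks the 50-50 point.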

So here's an example of what an odds ratio of two looks like for

a continuous variable.

So on the x axis here, we have the variable x and

it ranges in values from minus three to three.

And then we have the probability of surviving over ten years.

So again, that's just this little value is the probability of death,

if you have a covariate value of minus three.

And then say at this point,

this is the probability of death if you have a covariate value of zero.

And so what you're seeing here is that the change in the probability

of death associated with this covariate follows this logistic curve.

So you can see that this is the characteristic sort of logistic

curve that you get when you're modeling things on this scale.

And so for an odds ratio of two, you can see that there's a relatively linear decline

here in the middle and then at the ends it flattens out.

That's the change in, sort of, the odds.

So you can actually look at what these odds ratios look like for different values.

So imagine that you have an odds ratio of one.

That's basically just this flat line.

That means the odds aren't changing over time.

It's just flat.

Odds ratio of four looks something like this.

It's starting to look like a curve, like that.

And an odds ratio of infinity or an undefined odds ratio,

looks almost like a step function here.

Where at some point, the odds just change.

It goes from basically a probability of zero to a probability

of one at some very specific point.
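
To get a feel for these shapes without the plots, here is a small sketch that tabulates the probability over x from minus three to three for several odds ratios. The function name and the choice of baseline odds of 1 at x = 0 are assumptions made for illustration, and 1000 stands in for a very large odds ratio.

```python
import math

def prob_from_odds_ratio(x, odds_ratio, baseline_odds=1.0):
    """Probability at covariate value x when each unit increase in x
    multiplies the odds by odds_ratio (baseline_odds is the odds at x = 0)."""
    odds = baseline_odds * odds_ratio ** x
    return odds / (1.0 + odds)

xs = [-3, -2, -1, 0, 1, 2, 3]
for odds_ratio in (1, 2, 4, 1000):
    curve = [round(prob_from_odds_ratio(x, odds_ratio), 3) for x in xs]
    print(f"odds ratio {odds_ratio}: {curve}")
```

With an odds ratio of 1 every value is 0.5 (the flat line), moderate odds ratios give the S-shaped logistic curve, and a very large odds ratio jumps from nearly 0 to nearly 1 around x = 0, like a step function.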

So this is the way that we can model continuously the odds or log odds.

So how do we interpret that from the model?

So the log odds is the log of the probability of being a case

divided by 1 minus the probability of being a case.

And the odds is just that exponentiated.

So you can actually get that from the coefficients in the logistic

regression models.

So the b coefficient in a logistic regression model is a value on the log odds scale.

And if you exponentiate that, you actually get the odds ratio.

So again, recall that the log odds can range from minus infinity to infinity and

no effect is defined to be 0.

So if you get a log odds of 0 that's no effect, if you get a log odds of minus 1

or plus 1, then you have a change in the probability in one direction or the other.

The odds on the other hand remember,

if it's a 50-50 chance that's just equal to an odds of 1.

So that means that there's no change.

And so these two values correspond to each other.
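
As a concrete check on this interpretation, the sketch below fits a one-covariate logistic regression to a two-by-two table and compares the exponentiated genotype coefficient with the odds ratio computed directly from the counts. The counts are made up, the coding (0 = C, 1 = T) is an assumption, and the hand-written Newton-Raphson fit is just one simple way to get the maximum likelihood estimates, not the method used in the course.

```python
import math

# Hypothetical 2x2 table (counts are made up for illustration):
# genotype coded 0 = C, 1 = T; outcome is case or control
table = {
    0: {"cases": 40, "controls": 60},  # genotype C
    1: {"cases": 60, "controls": 40},  # genotype T
}

def fit_logistic(table, n_iter=25):
    """Fit logit(p) = b0 + b1 * genotype by Newton-Raphson on grouped counts."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # Fisher information (negative Hessian)
        for g, counts in table.items():
            n = counts["cases"] + counts["controls"]
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))
            resid = counts["cases"] - n * p
            w = n * p * (1.0 - p)
            g0 += resid
            g1 += resid * g
            h00 += w
            h01 += w * g
            h11 += w * g * g
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information * step = gradient)
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(table)
odds_ratio_from_fit = math.exp(b1)

# Odds of being a case within each genotype, then their ratio
odds_c = table[0]["cases"] / table[0]["controls"]
odds_t = table[1]["cases"] / table[1]["controls"]
odds_ratio_from_table = odds_t / odds_c

print("b1 (log odds scale):", round(b1, 4))
print("exp(b1):", round(odds_ratio_from_fit, 4))
print("odds ratio from table:", round(odds_ratio_from_table, 4))
```

Here exp(b0) recovers the odds of being a case for genotype C, and exp(b1) recovers the ratio of the odds between T and C. A b1 of 0 would correspond to an odds ratio of 1, the no-effect value on each scale.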

So this is again, a version of a generalized linear model

which actually is a whole other course in regression modeling.

And we're not talking about that in too much detail here;

we're going to go through an example with R code, but the basic

idea is that you're trying to model things on the natural scale for that variable.

So here's a nice set of lecture notes on generalized linear models

including a whole bunch of R code if you want to learn a little bit more about it.

This is a huge topic and we've only really scratched the surface.

But it's a really commonly used model in things like genome-wide association

studies and so it's worth knowing a little bit about.
