An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 4

In this week we will cover a lot of the general pipelines people use to analyze specific data types like RNA-seq, GWAS, ChIP-Seq, and DNA Methylation studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Throughout most of this course, we've talked about statistical inference for

genomics, but there's also this idea of statistical prediction or

machine learning for genomics.

So recall that the central dogma of inference is basically that we're going to

have this population and

we want to use probability to sample from that population.

So once we get that sample,

we're going to try to say something about this global population.

So this is sort of a population level analysis.

By contrast, you can think of the central dogma of prediction: you take

some sample from a population again and you build that into a training set,

where you have two different kinds of things that you're trying to predict,

and then you use that data to build a prediction function.

And so once you have that prediction function, if you get a new sample and

you don't know what color it is, that function assigns it to one of the two

colors based on some of the properties.

And so prediction is a little bit different problem than inference and

we haven't covered too much about it, but I wanted to cover just a little bit about

some of the key issues that often come up related to prediction in genomics.

So the first thing to keep in mind, is that inference and

prediction can give you very different answers, totally sensibly.

So here's an example. Suppose we want to test for

differences between the values from two different distributions,

and we collect a whole bunch of data.

If you do inference, and you ask are these two populations different?

In this case, they're definitely different from each other,

the distributions are very different from each other.

But they're not necessarily very predictive.

So imagine that I wanted to predict which of the two distributions the data

point came from.

If it came from sort of out here, you might be able to predict, oh,

it's maybe a little bit more likely to come from the light grey sample than

the dark grey sample.

But if it came from here, it's not very predictive at all.

It could just as easily be either the dark grey sample or

the light grey sample.

On the other hand, this is another case where inference would definitely tell you

that there's a difference, just like it would in the previous example, but

here it's much more predictive.

Basically if you have any data point out here,

it's going to be easily assigned to one of the two distributions.

But a data point here in the middle might not necessarily be assigned, but

that's just a very small fraction of the cases.

So the first thing to keep in mind, is that in the case of inference we're

looking for differences that may or may not be predictive.

So if you do, say, a differential expression analysis,

you might identify lots of differences.

Many of those might not necessarily be good for prediction.
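
As a minimal sketch of this point, the simulation below compares two pairs of normal distributions: one pair whose means differ by 0.2 standard deviations and one pair whose means differ by 4. The shifts, the sample size, the seed, and the `compare` helper are all assumptions chosen for illustration; they are not taken from the lecture slides. In both cases the test statistic is large, but only the widely separated pair lets you classify new points well.

```python
import random
import statistics
from math import sqrt

def compare(shift, n=5000, seed=0):
    """Return (z statistic, accuracy on new data) for N(0, 1) vs N(shift, 1).

    Assumes shift > 0 so that the second group lies to the right.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(shift, 1.0) for _ in range(n)]

    # Inference: two-sample z statistic for the difference in means.
    se = sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(b) - statistics.mean(a)) / se

    # Prediction: assign a new point to whichever group mean is closer.
    cut = (statistics.mean(a) + statistics.mean(b)) / 2
    new_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    new_b = [rng.gauss(shift, 1.0) for _ in range(n)]
    correct = sum(x < cut for x in new_a) + sum(x >= cut for x in new_b)
    return z, correct / (2 * n)

z_small, acc_small = compare(shift=0.2)  # overlapping distributions
z_large, acc_large = compare(shift=4.0)  # well-separated distributions
print(f"shift 0.2: z = {z_small:.1f}, accuracy = {acc_small:.2f}")
print(f"shift 4.0: z = {z_large:.1f}, accuracy = {acc_large:.2f}")
```

With a small shift the accuracy should sit only slightly above the 50% you would get by guessing, even though the difference in means is statistically clear.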

So the other thing to keep in mind is the quantities of interest.

So suppose that you're doing genomic tests and you have some disease that you

want to test for, then the quantities that you care about are the case where the test

says you have the disease and you actually do, that's a true positive.

Or the test says that you have the disease but

you don't have the disease, that's a false positive.

Or the case where the test says you do not have the disease and you actually do,

that's a false negative.

And then the case where the test says you do not have the disease and

you actually don't, that's a true negative.

So usually people in genomics talk a lot about false positives when they're

talking about inference.

And they also talk about true positives.

But in prediction you need to sort of carefully balance how these

different potential categories work.

So here's a really simple definition of some of the key quantities.

You might hear about the sensitivity of a test.

That's the probability that you get a positive test given that

you actually do have the disease.

Specificity is the probability of a negative test given that you don't

have the disease.

And then the positive predictive value is, so if I do have a positive test,

how likely is it that I actually have the disease?

Same with the negative predictive value.

And then the accuracy is just the probability that you get the correct outcome.

That's sort of the sum of the true positives here and

the true negatives here divided by the total number of cases.

And so, here you're going to again define all of these sort of things in terms of

the true positives, false positives, false negatives, and true negatives.

So, for example, sensitivity is the TP / (TP+FN).

So these definitions that I'm showing you here, in terms of these quantities,

correspond to the probability definitions that you saw on the previous screen, so

that probability of a positive test given that you have the disease.

Here, the (TP+FN)

are all the cases where you have the disease.

And then you're looking at the fraction of the time where you actually identify them.

So that's TP / (TP+FN).
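
For reference, the quantities described above can be written out in terms of the four counts. This is the standard way to express them; the layout is a summary added here rather than a copy of the slide.

```latex
\begin{align*}
\text{Sensitivity} &= P(\text{positive test} \mid \text{disease}) = \frac{TP}{TP + FN} \\
\text{Specificity} &= P(\text{negative test} \mid \text{no disease}) = \frac{TN}{TN + FP} \\
\text{PPV} &= P(\text{disease} \mid \text{positive test}) = \frac{TP}{TP + FP} \\
\text{NPV} &= P(\text{no disease} \mid \text{negative test}) = \frac{TN}{TN + FN} \\
\text{Accuracy} &= \frac{TP + TN}{TP + FP + FN + TN}
\end{align*}
```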

Okay, so let's use an example.

This just sort of illustrates how any kind of screening can be tricky, but

particularly genomic screening can be tricky.

So, assume that there is a disease, and

only about 0.1% of the population have that disease.

And so we have a test that's 99% sensitive.

That is, if you have the disease, then 99% of the time,

it will say you have the disease.

And it's 99% specific.

So, that means that if you don't have the disease, then 99% of the time

it'll say that you don't have the disease.

So, that seems like a pretty good test.

So, the question is, what's the probability

of a person having the disease given the test result is positive?

In other words, what's the positive predictive value of this test?

So we're going to consider two cases, a general population where the rate of this

disease is 0.1% and then a higher at-risk sub-population.

So in the general population this is what it might boil down to.

So remember, it's a very accurate test.

So if you have the disease, 99% of the time it'll tell you that you have it.

And then if you don't have the disease,

99% of the time it'll tell you that you don't have it.

But these numbers are a little bit sort of unbalanced because almost no one has

the disease.

It's a highly rare disease, so if you actually go calculate the sensitivity and

the specificity, they're both very high, just like we expected.

But the positive predictive value is only 9%.

Why is that?

It's because you're testing a huge number of people that don't have the disease, so

even though only a tiny fraction of those come back as false positives,

there's a large number of them because you tested so many.
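
The 9% figure can be checked with Bayes' rule. The sketch below uses the numbers from the lecture (99% sensitivity, 99% specificity, 0.1% prevalence); the function name `positive_predictive_value` is just a label chosen for this illustration.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence               # diseased and test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but test positive
    return true_pos / (true_pos + false_pos)

ppv_general = positive_predictive_value(
    sensitivity=0.99, specificity=0.99, prevalence=0.001
)
print(f"PPV in the general population: {ppv_general:.1%}")  # about 9%
```

The false positive term is roughly ten times larger than the true positive term here, which is why most positive results come from people without the disease.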

So, it turns out that the positive predictive value or

that the probability you actually have disease, if we tell you you have

the disease, is only 9%, which might not be that great for lots of reasons.

For one, we might give you all sorts of treatments you don't necessarily

want to get.

For two, you might be nervous or scared because we told you you have the disease,

even though it's actually kind of unlikely that you have the disease.

And that's even though the test is, in this case, what a lot of people consider to be a really,

really sensitive and specific test.

Now except for sort of rare disorders and some very specific variations, it's

very rare that you would get these numbers to be this high in a genomics experiment.

Typically the sensitivity and

the specificity are relatively low compared to what we're showing here.

And so this affects even what people consider to be really useful and

quite strong screening tests.

So, for example, when you're looking at mammogram screening,

particularly in young women or same thing for prostate cancer.

If you're sort of doing PSA screening in younger men, it turns out that when you do

this sort of screening, even though the test might be pretty good,

you're just testing so many people, most of whom don't have the disease.

You'll get lots of false positives,

which will lead to potential consequences.

In particular the consequences tend to relate to how much money people

spend on downstream therapy, and how much difficulty they go through for

downstream therapy.

So one way to address this, which is particularly useful for genomics but

also other areas,

is to basically go to a population where there's a higher risk of the disease.

So in this case now, we have again a 99% sensitive and

specific test, but now we've gone to a situation where we're at

a higher risk of that disease in the population overall.

So you can see, now there's 10,000 potential people who have the disease,

compared to 100,000 who don't necessarily have the disease, so

the frequency of the disease in the population is higher.

And so if you do those calculations again, now you have a 99% sensitive and

specific test.

But you don't get overwhelmed by the fact that there's so

many more non-diseased people than diseased people, and

your positive predictive value stays pretty high.
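
To make the at-risk case concrete, here is a small sketch that fills in the four counts from the group sizes mentioned in the lecture (10,000 people with the disease, 100,000 without) and the 99% sensitivity and specificity. The variable names are illustrative, and the counts are expected values rather than observed data.

```python
diseased, healthy = 10_000, 100_000
sensitivity = specificity = 0.99

tp = sensitivity * diseased   # diseased people the test catches
fn = diseased - tp            # diseased people the test misses
tn = specificity * healthy    # healthy people correctly cleared
fp = healthy - tn             # healthy people flagged by mistake

ppv_high_risk = tp / (tp + fp)
print(f"PPV in the at-risk population: {ppv_high_risk:.1%}")  # about 91%
```

The test itself has not changed; only the share of people who truly have the disease has, and that is enough to move the positive predictive value from roughly 9% to roughly 91%.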

So this is one example of the ways that it can be a little bit tricky to do screening

or to use genomics measurements for prediction.

This is again a whole class, so

I'm linking here to one class that I actually teach on Coursera, but

there are other really good prediction and machine learning classes.

And then this is sort of the idea that's underlying precision medicine.

Some of precision medicine is focused on sort of rare diseases and

Mendelian disorders, where it's a little bit more targeted, and

you tend to get much higher sensitivity and specificity.

But particularly for precision medicine for common complex diseases,

these are the issues that will come up.

And so far this has sort of been a major challenge for

genomics, it's sort of an open area where a lot of people are working.
