An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Permutation is one of the most widely used tools for assessing statistical significance in genomic studies, so I thought I'd explain the principle behind permutation, and then we'll talk about how you calculate permutation p-values in a future lecture.

So here we're looking at the erythroid differentiation signature that predicts response to lenalidomide in myelodysplastic syndrome, and again you have to be a little impressed that I got some of those words close to right.

So in this study, you again have gene expression that you measured for many genes, one in each of these rows, and you have a number of patients that you've collected. Some of them are responders, and some of them are nonresponders. And so the idea is that you're going to be comparing the responders to the nonresponders for every gene.

And you might calculate a statistic. For example, you might calculate the t-statistic: the difference in mean expression level between the responders and the nonresponders, standardized by some measurement of their variability.
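
As a rough sketch of that calculation for a single gene: the lecture only says the difference in means is standardized by "some measurement of their variability", so the unequal-variance (Welch) form used here is an assumption, and the function name `t_statistic` and the toy data are hypothetical.

```python
import math
import statistics

def t_statistic(expression, is_responder):
    """Two-sample t-statistic for one gene (unequal-variance form, an assumption).

    expression: expression values, one per patient.
    is_responder: booleans, True for responders, False for nonresponders.
    """
    responders = [x for x, r in zip(expression, is_responder) if r]
    nonresponders = [x for x, r in zip(expression, is_responder) if not r]
    mean_diff = statistics.mean(responders) - statistics.mean(nonresponders)
    # Standard error of the difference, from each group's sample variance.
    se = math.sqrt(
        statistics.variance(responders) / len(responders)
        + statistics.variance(nonresponders) / len(nonresponders)
    )
    return mean_diff / se

# Hypothetical toy data: 4 responders followed by 4 nonresponders.
gene_expression = [5.1, 4.8, 5.5, 5.0, 3.9, 4.2, 4.0, 4.3]
labels = [True, True, True, True, False, False, False, False]
observed_t = t_statistic(gene_expression, labels)
```

In a real study the same function would be applied to every row (gene) of the expression matrix.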

So now that we have this statistic, we want to know how extreme it would be if there were no relationship at all. One way to break the relationship between the response and the gene's expression levels is to permute the labels, that is, to randomly scramble them. So that's what we've done. When we moved from the left-hand side to the right-hand side here, we just scrambled the labels completely at random. And when you do that, there should no longer be a relationship between the response variable and the gene expression measurements, because we've assigned the labels at random.
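
A minimal sketch of the scrambling step, assuming the labels are held in a plain list; the label values and the fixed seed are illustrative choices, not taken from the study.

```python
import random

# Hypothetical labels for 8 patients.
labels = ["responder"] * 4 + ["nonresponder"] * 4

rng = random.Random(0)        # fixed seed so the sketch is repeatable
permuted_labels = labels[:]   # copy, so the original labels are kept
rng.shuffle(permuted_labels)  # scramble the copy in place

# The number of responders and nonresponders is unchanged;
# only which patient carries which label has changed.
```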

And it turns out that it's better to permute the labels than to permute each gene's expression levels separately, because it leaves the relationships among the genes' expression levels intact. And that's good because you might need to model those relationships later on in the modeling process, which we'll talk about later.
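
One way to picture why the relationships stay intact: each patient's values across genes form a profile, and permuting the labels only changes which label is attached to each profile. The sketch below uses hypothetical data and names (`patient_profiles`, `relabelled`) purely for illustration.

```python
import random

# Hypothetical data: each patient has one value per gene (a "profile").
patient_profiles = [
    {"gene1": 5.1, "gene2": 2.0},
    {"gene1": 4.8, "gene2": 2.2},
    {"gene1": 3.9, "gene2": 3.1},
    {"gene1": 4.2, "gene2": 3.0},
]
labels = ["responder", "responder", "nonresponder", "nonresponder"]

rng = random.Random(7)  # fixed seed so the sketch is repeatable
permuted_labels = rng.sample(labels, k=len(labels))  # scrambled copy

# Pair each intact profile with a scrambled label. The gene1/gene2 values
# for a given patient still travel together; only the label has changed.
relabelled = list(zip(permuted_labels, patient_profiles))
```

Shuffling each gene's values separately would instead mix values from different patients within a profile, which is what breaks the gene-to-gene relationships.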

And so the idea here is that you do this permutation and then you recalculate the statistic for each gene. So if you calculated the original statistic, say, for gene one, and it was equal to 2, that would be where the original statistic is. Then you permute the labels many times, and each time you recalculate the statistic. You'd expect the permuted statistics to be centered near 0, because on average there should be no difference between the two groups once you've permuted the labels. And you can see how extreme the original statistic is with respect to those permuted statistics. If it's really extreme, you might think, oh, well, then it's not likely that this statistic comes from this distribution, and if it's not very extreme, you think, oh, well, it might be coming from that distribution.
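
The whole procedure for one gene can be sketched as follows. This is an illustration under stated assumptions, not the study's actual analysis: the data are hypothetical, the t-statistic uses the unequal-variance form, 1000 permutations is an arbitrary choice, and the "fraction at least as extreme" is just one simple way to summarize how far out the original statistic sits.

```python
import math
import random
import statistics

def t_statistic(values, labels):
    """Two-sample t-statistic (unequal-variance form) for one gene."""
    group_a = [v for v, lab in zip(values, labels) if lab == "responder"]
    group_b = [v for v, lab in zip(values, labels) if lab == "nonresponder"]
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Hypothetical toy data for a single gene.
values = [5.1, 4.8, 5.5, 5.0, 3.9, 4.2, 4.0, 4.3]
labels = ["responder"] * 4 + ["nonresponder"] * 4

observed = t_statistic(values, labels)  # the original statistic

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_permutations = 1000    # arbitrary choice for illustration
null_stats = []
for _ in range(n_permutations):
    shuffled = rng.sample(labels, k=len(labels))  # scrambled copy of the labels
    null_stats.append(t_statistic(values, shuffled))

# The permuted statistics should be centered near 0.
null_center = statistics.mean(null_stats)
# Fraction of permuted statistics at least as extreme as the original one.
frac_as_extreme = sum(abs(t) >= abs(observed) for t in null_stats) / n_permutations
```

A small `frac_as_extreme` suggests the original statistic is unlikely to have come from the permutation distribution; a large one suggests it could have.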

So this permutation idea is used all the time in genomics. It's used not just for simple comparisons but also for network comparisons and enrichment comparisons, all over the place. And it assumes that if you switch the labels, the data come from the exact same distribution. So by permuting the labels, we're making the assumption that the labels don't matter: that the gene's expression levels are completely independent of the labels.

And it's not necessarily just a comparison of means. The statistic we calculated, the t-statistic, measures a distance between the two means. But by permuting the labels, we're assuming that the distribution is exactly the same in both groups. So with this permutation approach, an extreme t-statistic can reflect any difference between the groups, including a difference in the variance or in any of the other moments of the data-generating distribution.

So permutation is actually quite a complicated topic. We've covered it only very briefly here, and we'll cover it a little bit more in the assessments. But you can learn more about it in this advanced statistics for the life sciences course.
