An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

Course 7 of 8 in the Specialization Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Permutation is one of the most widely used tools for assessing statistical significance in genomic studies, so I thought I'd explain the principle behind permutation, and then we'll talk about how you calculate permutation p-values in a future lecture. So here we're looking at the erythroid differentiation signature that predicts response to lenalidomide in myelodysplastic syndrome, and again you have to be a little impressed that I got some of those words close to right. So in this study, you again have some genes that you measured, so gene expression that you measured, in each of these rows, and you have a number of patients that you've collected. Some of them are responders, and some of them are nonresponders. And so the idea is that you're going to be comparing the responders to the nonresponders for every gene. And you might calculate a statistic. For example, you might calculate the t-statistic, which takes the difference in mean expression level between the responders and the nonresponders and standardizes it by some measurement of their variability. So now that we have this statistic, we want to know how extreme it would be if there was no relationship at all. And so one way to break the relationship between the response and the gene expression levels is to permute the labels. So one thing that we could do is randomly scramble the labels. That's what we've done here. When we moved from the left-hand side to the right-hand side, we just scrambled the labels completely at random. And when you do that, there should no longer be a relationship between the response variable and the gene expression measurements, because we've assigned the labels at random.
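As a rough illustration of the two steps just described (compute a statistic for every gene, then scramble the labels and compute it again), here is a minimal Python sketch. The expression values are made up for illustration and are not from the lenalidomide study, the function name `t_statistic` is hypothetical, and the unequal-variance (Welch) form of the t-statistic is an assumption, since the lecture only says the difference in means is standardized by some measure of variability.

```python
import math
import random
import statistics


def t_statistic(values, labels):
    """Difference in group means divided by its standard error (Welch form, an assumption)."""
    resp = [v for v, lab in zip(values, labels) if lab == "responder"]
    non = [v for v, lab in zip(values, labels) if lab == "nonresponder"]
    diff = statistics.mean(resp) - statistics.mean(non)
    se = math.sqrt(statistics.variance(resp) / len(resp)
                   + statistics.variance(non) / len(non))
    return diff / se


# Toy data (made up): each row is a gene, each column is a patient.
expression = [
    [5.1, 4.8, 5.3, 5.0, 3.1, 2.9, 3.3, 3.0],  # gene 1: groups clearly differ
    [2.0, 2.4, 1.9, 2.2, 2.1, 2.3, 1.8, 2.2],  # gene 2: no clear difference
]
labels = ["responder"] * 4 + ["nonresponder"] * 4

# Observed statistic for every gene, using the real labels.
observed = [t_statistic(gene, labels) for gene in expression]

# Scramble the labels once. The same scrambled labels are used for every gene,
# so each patient's expression values stay together across genes.
rng = random.Random(0)
permuted_labels = list(labels)
rng.shuffle(permuted_labels)
permuted = [t_statistic(gene, permuted_labels) for gene in expression]

print("observed:", observed)
print("after one permutation:", permuted)
```

Shuffling only changes which patient carries which label; the number of responders and nonresponders stays the same.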

And so it turns out that it's better to permute the labels than to permute each gene's expression levels, because it leaves the relationship between the genes' expression levels intact. And that's good, because you might need to model that relationship later on in the modeling process, which we'll talk about later. And so the idea here is that you do this permutation and then you recalculate the statistic for each gene. So say you calculated the original statistic for gene one, and it was equal to 2; that would be where the original statistic is. Then you permute the labels many times, and each time you recalculate the statistic. You'd expect the permuted statistics to be centered near 0, because there should be, on average, no difference between the two groups once you've permuted the labels. And you can see how extreme the original statistic is with respect to those permuted statistics. If it's really extreme, you might think, oh, well, then it's not likely that this statistic comes from this distribution, and if it's not very extreme, you think, oh, well, it might be coming from that distribution.

So this permutation idea is used all the time in genomics. It's used not just for simple comparisons but for network comparisons and for enrichment comparisons, all over the place. And it assumes that if you switch the labels, the data come from the exact same distribution. So by permuting the labels we're making the assumption that the labels don't matter, that the gene's expression levels are completely independent of the labels. And it's not necessarily just a comparison of means. The statistic we calculated, the t-statistic, measures a distance between the two means. But by permuting the labels, we're actually assuming that the distribution is exactly the same in both groups. So with this permutation approach, the t-statistic can actually pick up any difference, including in the variance or any of the other moments of the generating distribution. So permutation is actually quite a complicated topic. We've covered it just very briefly here, and we'll cover it a little bit more in the assessments. But you can learn a little bit more about it in this Advanced Statistics for the Life Sciences course.
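To make the idea of comparing the original statistic against the permuted ones concrete, here is a small Python sketch for a single gene. It is an illustration under stated assumptions, not the course's own code: the data are made up, the names `t_statistic` and `permutation_null` are hypothetical, 1000 permutations is an arbitrary choice, and the "fraction at least as extreme" summary is just one common way to express how extreme the original statistic is.

```python
import math
import random
import statistics


def t_statistic(values, labels):
    """Difference in group means divided by its standard error (Welch form, an assumption)."""
    resp = [v for v, lab in zip(values, labels) if lab == "responder"]
    non = [v for v, lab in zip(values, labels) if lab == "nonresponder"]
    diff = statistics.mean(resp) - statistics.mean(non)
    se = math.sqrt(statistics.variance(resp) / len(resp)
                   + statistics.variance(non) / len(non))
    return diff / se


def permutation_null(values, labels, n_perm=1000, seed=0):
    """Recompute the statistic after each of n_perm random label shuffles."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and expression
        null_stats.append(t_statistic(values, shuffled))
    return null_stats


# Toy expression values for one gene (made up), four patients per group.
gene = [5.1, 4.8, 5.3, 5.0, 3.1, 2.9, 3.3, 3.0]
labels = ["responder"] * 4 + ["nonresponder"] * 4

observed = t_statistic(gene, labels)
null_stats = permutation_null(gene, labels)

# The permuted statistics should be centered near 0 on average.
null_center = statistics.mean(null_stats)

# Fraction of permuted statistics at least as extreme as the observed one.
frac_extreme = sum(abs(s) >= abs(observed) for s in null_stats) / len(null_stats)

print("observed statistic:", observed)
print("mean of permuted statistics:", null_center)
print("fraction at least as extreme:", frac_extreme)
```

With only eight patients there are just 70 distinct ways to split them into two groups of four, so the permuted statistics can take only a limited set of values; real studies with more samples give a much finer-grained comparison.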
