An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

A unique thing about regression modeling in genomics is that you

often fit many regression models simultaneously.

The reason why is you usually have many measurements, and so

each of those measurements is designed to be able to be correlated with

an outcome that you care about.

So here, for example, is the typical genomics data set.

You have a large number of features in the rows.

So that could be tens of thousands or millions of features,

whether they're SNPs, methylation measurements at CpGs,

gene expression levels, or transcript expression levels.

And then you have some varying conditions.

And usually you have some kind of phenotype like case control status.

You would like to associate each feature with case control status and

you would like to discover those features that are differentially expressed or

differentially associated with those different conditions.

So to do this,

you usually end up with a matrix formulation of this same regression model.

So you can imagine that, for every single row of this matrix,

you'll fit a regression model that has some B coefficients multiplied by a

design matrix holding the variables that you care about,

plus some corresponding error term for just that gene.

And then you would stack a bunch of these up.

So this is a bunch of stacked regressions.

I'm showing it here in mathematical notation on the bottom.

You use matrix multiplication to write down these many regressions.

And then I'm showing it in block format up here.

So you model the data for this gene based on these coefficients multiplied by

these variables, plus this error term right here.

And you do this for every single feature that you're modeling.
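
As a minimal sketch of the stacked regressions described above, the following Python example simulates a small feature-by-sample matrix and fits every row with one least-squares call. The names `Y`, `B`, `S`, and `E`, the sizes, and the simulated case-control design are all assumptions made for illustration; they are not taken from the lecture's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: features in the rows, samples in the columns
n_features, n_samples = 1000, 20

# Hypothetical design: an intercept plus a 0/1 case-control indicator per sample
status = np.repeat([0.0, 1.0], n_samples // 2)
S = np.vstack([np.ones(n_samples), status])        # shape (2, n_samples)

# Simulated coefficients and noise so the example is self-contained
B_true = rng.normal(size=(n_features, 2))
E = rng.normal(scale=0.5, size=(n_features, n_samples))
Y = B_true @ S + E                                 # one regression per row, stacked

# Least squares for all rows at once: solve S.T @ B.T ~ Y.T
B_hat_T, *_ = np.linalg.lstsq(S.T, Y.T, rcond=None)
B_hat = B_hat_T.T                                  # shape (n_features, 2)

# Fitting the first row on its own gives the same coefficients
b_first, *_ = np.linalg.lstsq(S.T, Y[0], rcond=None)
print(np.allclose(B_hat[0], b_first))
```

Each row of `B_hat` holds the intercept and the case-control coefficient for one feature, so the joint fit is the single-regression fit repeated once per row.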

So here's this example where you are looking at gene expression signatures

associated with geography in a particular population in Morocco.

And so the primary biological variable that you care about

in this case might be where the people actually come from.

What's the geography that they come from?

And then you might have a bunch of adjustment variables,

like are they males or females?

What batch do they come from?

All sorts of other variables that you might have.

And so the model actually becomes a little bit more difficult when you're dealing

with such a case because there's all sorts of variables that you obviously need to

model like the location that the people come from, their sex, the batch.

There are also much more subtle effects,

say, intensity-dependent effects in the measurements from the genomic data, or

dye effects or probe composition effects, since this is a microarray.

And then many other unknown variables that you might want to model.

So when you do this, you actually end up with a slightly more complicated model.

Again, this is in colored blocks, the observation version of this.

And so again, you might model the measurements for one gene that are in one

row as a function of the coefficients in one row times the set of variables that

you actually care about, in this case it might be geography.

Plus the coefficients in one row for a set of adjustment variables that you might

care about plus the random variation for that one row.
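
One way to write this model down is sketched below. The symbols are assumptions chosen for illustration, since the slide notation is not part of the transcript: m features, n samples, S for the variables of interest (for example geography), Z for the adjustment variables (for example sex and batch), and E for the random variation.

```latex
% Row i (one gene): primary-variable term + adjustment term + random variation
y_i = b_i S + \gamma_i Z + e_i, \qquad i = 1, \dots, m

% Stacking all m rows gives the matrix form
Y_{m \times n} = B_{m \times p} \, S_{p \times n}
               + \Gamma_{m \times q} \, Z_{q \times n}
               + E_{m \times n}
```

Here p is the number of variables of interest and q is the number of adjustment variables, so each gene gets its own row of coefficients in both B and Γ.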

So now you've got a setting where you're fitting many, many regression models.

You fit them all the exact same way as you fit a single regression model but

now you have to interpret them jointly.

And so there's a couple of different things that are difficult.

One is that you have hundreds, thousands or millions of model fits

at the same time, and for each one we have estimates of the coefficients, the residuals,

and the fitted values, and there can be structure in any of those things.

There can be structure in the estimates, there can be structure in the noise and

there's all sorts of issues that may be due to different values of the covariates

and different unmeasured confounders.
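
To make the point about structure in the noise concrete, here is a small simulated sketch. It fits every row twice, once leaving out a batch variable and once including it, and compares how much of the residual variation sits in a single direction. The data, the `fit_all` helper, and the use of a singular value decomposition as the diagnostic are all assumptions for illustration, not the lecture's own analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_samples = 500, 20

# Hypothetical variables: the group of interest, and a batch we pretend was not measured
group = np.tile([0.0, 1.0], n_samples // 2)      # alternates 0, 1 across samples
batch = np.repeat([0.0, 1.0], n_samples // 2)    # first half vs second half

S = np.vstack([np.ones(n_samples), group])       # modeled design, shape (2, n_samples)
B = rng.normal(size=(n_features, 2))
G = rng.normal(scale=2.0, size=(n_features, 1))  # per-feature batch coefficients
E = rng.normal(scale=0.5, size=(n_features, n_samples))
Y = B @ S + G @ batch[None, :] + E

def fit_all(Y, design):
    """Fit one least-squares regression per row of Y (hypothetical helper)."""
    coef_T, *_ = np.linalg.lstsq(design.T, Y.T, rcond=None)
    coef = coef_T.T
    fitted = coef @ design
    return coef, fitted, Y - fitted

def top_share(resid):
    """Fraction of residual sum of squares carried by the first singular vector."""
    sv = np.linalg.svd(resid, compute_uv=False)
    return sv[0] ** 2 / np.sum(sv ** 2)

# Batch left out of the model: its effect ends up in the residuals
coef, fitted, resid = fit_all(Y, S)
share_omitted = top_share(resid)

# Batch included as an adjustment variable
_, _, resid_adj = fit_all(Y, np.vstack([S, batch]))
share_adjusted = top_share(resid_adj)

print(share_omitted, share_adjusted)
```

When the batch variable is left out, one direction dominates the residual matrix; once it is included, the residuals look much closer to unstructured noise. The same per-row quantities (coefficients, fitted values, residuals) come out of both fits.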

So the key here is that we need to think of linear models as

one tool that can be applied many, many, many times across many different samples.

So there's actually a class on regression models that you can look at, but

I also highly recommend this paper on linear models for microarray data.

While it's talking specifically about microarray data,

and obviously there's new technologies that have been developed since then,

this is a really nice treatment of all the issues that come when you're doing many,

many regression models simultaneously.
