An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

In previous lectures, we've talked about a linear regression model that relates one

outcome to one covariate.

But in general in a genomic study, you might have measured many covariates.

Say, technological covariates like the batch effects or biological variables such

as the demographics of the samples that you collected.

And you might want to adjust for those in your linear regression model.

So we're going to talk a little bit about how to do that with a very simple example

from the Millennium Development Project.

Here we're going to be plotting the percent of children that are hungry.

That's what's going to happen on the y-axis here,

this is the percent of children that are hungry, versus year on the x-axis.

So what we're going to do is we're going to fit our standard

linear regression model here.

So here we're relating the percent that are hungry to a linear function

of the year plus an e term that represents all of the measurement error and

all of the variability that we got from sampling.
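
Written out, the model just described looks like the following. This is a sketch in assumed notation: Hu_i for the percent hungry in observation i and Y_i for the year are labels chosen here, while b0, b1, and e follow the lecture's wording.

```latex
\[
  \mathrm{Hu}_i = b_0 + b_1 \, Y_i + e_i
\]
```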

So one way that we can try to model some of that error that we didn't include in

this initial regression model is to color the samples by whether they're

a male or a female.

So here I've colored the males in red and the females in black.

And so what we can do is fit a new regression model.

And this new regression model says the percent that are hungry is a linear

regression on the year plus a term for whether you're female or male.

F is equal to one if you're a female and zero if you're a male,

like we saw with the categorical covariates, plus

an error term that represents everything that we didn't model using this model.
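
In the same assumed notation (Hu_i for percent hungry, Y_i for year, F_i for the female indicator), the adjusted model can be sketched as:

```latex
\[
  \mathrm{Hu}_i = b_0 + b_1 \, Y_i + b_2 \, F_i + e_i,
  \qquad
  F_i =
  \begin{cases}
    1 & \text{if observation } i \text{ is female} \\
    0 & \text{if observation } i \text{ is male}
  \end{cases}
\]
```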

So here one thing to note is that if the F term is equal to zero,

then this term cancels out, the b2F term cancels out and

you're left with this linear regression model.

Similarly if this F term is equal to one then you still have

a new constant here which can be put into the intercept term and

you still have the same regression model that you're trying to fit here.

So you end up with two regression lines that are parallel to each other when you

fit this sort of regression model.
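
A minimal sketch of fitting this kind of adjusted model by least squares, assuming NumPy. The data are synthetic, and the coefficient values and the names `year`, `female`, and `hungry` are made up for illustration; this is not the lecture's dataset.

```python
import numpy as np

# Synthetic data (purely illustrative, not the course's dataset).
rng = np.random.default_rng(0)
year = np.tile(np.arange(1990, 2010), 2).astype(float)  # 20 years per group
female = np.repeat([0.0, 1.0], 20)                      # F = 1 female, 0 male
true_b0, true_b1, true_b2 = 600.0, -0.29, -1.5          # hypothetical values
hungry = (true_b0 + true_b1 * year + true_b2 * female
          + rng.normal(0, 0.5, year.size))

# Design matrix with columns: intercept, year, female indicator.
X = np.column_stack([np.ones_like(year), year, female])
b0, b1, b2 = np.linalg.lstsq(X, hungry, rcond=None)[0]

# Two fitted lines share the slope b1 and differ only in their intercepts:
# b0 for males and b0 + b2 for females.
years_grid = year[:20]
male_line = b0 + b1 * years_grid
female_line = (b0 + b2) + b1 * years_grid
gap = female_line - male_line  # constant vertical gap, equal to b2

print(f"slope={b1:.3f}, male intercept={b0:.1f}, "
      f"female intercept={b0 + b2:.1f}")
```

Because the indicator only shifts the intercept, the gap between the two fitted lines is the same at every year, which is what makes them parallel.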

So now how do we interpret these different coefficients?

So b0 is the percent that are hungry at year zero for males.

Because the F variable is equal to 0 and the year variable is equal to 0.

b0 plus b2 is the percent hungry at year zero for

the females because now the F variable is equal to 1.

And b1 is the change per year in the percent that are hungry.

So for a one-unit change in year,

what's the change in the percent that are hungry?

And ei again is everything that we didn't measure.

Ideally, it will have a variability distribution that we can model carefully.
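
These interpretations can be read off the model directly. The sketch below assumes the error term has mean zero and uses assumed notation (Hu for percent hungry, Y for year, F for the female indicator):

```latex
\begin{align*}
  E[\mathrm{Hu} \mid Y = 0,\, F = 0] &= b_0 \\
  E[\mathrm{Hu} \mid Y = 0,\, F = 1] &= b_0 + b_2 \\
  E[\mathrm{Hu} \mid Y = y + 1,\, F = f] - E[\mathrm{Hu} \mid Y = y,\, F = f] &= b_1
\end{align*}
```

The last line holds for either value of f, so b1 is the per-year change with sex held fixed.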

So this is the way that you often fit these regression models.

So you have to be a little careful about how you interpret the coefficients once

you've fit adjustment variables especially if you're adjusting for many variables.

The other thing we saw is that there you were fitting parallel lines, but

sometimes you want to fit not just parallel lines.

So the way that you do that is with interaction terms.
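
Applied to the earlier hunger example, one way to sketch a model with non-parallel lines is to add a product term. The coefficient name b3 and the symbols Hu, Y, and F are assumptions made here, not notation from the lecture slides.

```latex
\[
  \mathrm{Hu}_i = b_0 + b_1 \, Y_i + b_2 \, F_i + b_3 \, (Y_i \times F_i) + e_i
\]
\[
  \text{slope for males } (F_i = 0):\; b_1
  \qquad
  \text{slope for females } (F_i = 1):\; b_1 + b_3
\]
```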

So here I'm going to switch to a different example, so

here it's expression plotted on the y-axis and genotype plotted on the x-axis.

And so here are the different genotypes that you could have.

You could have a homozygous minor allele,

heterozygous where the minor allele copy comes from one sample, heterozygous

where the minor allele copy comes from the other sample, or homozygous major allele.

And here you can see, if you fit a linear regression model,

it might fit pretty well.

And if you fit an adjustment variable, it might not necessarily work,

because you would again

get two parallel lines, which don't necessarily fit this data very well.

But what you can do is you can include a term where you multiply the indicator

variables for the RM effect and the BY effect and

then you could get different levels of the means for each of the different values.

So this is fitting an interaction model.
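
A minimal sketch of such an interaction model, assuming NumPy and synthetic data. The two 0/1 indicators `a` and `b`, the coefficient values, and the noise level are hypothetical stand-ins for the two effects mentioned above, not the lecture's data.

```python
import numpy as np

# Two hypothetical 0/1 indicators defining four groups, five samples each.
a = np.array([0, 0, 1, 1] * 5, dtype=float)
b = np.array([0, 1, 0, 1] * 5, dtype=float)

# Additive design (intercept, a, b) and interaction design (adds a * b).
X_add = np.column_stack([np.ones_like(a), a, b])
X_int = np.column_stack([X_add, a * b])

# Made-up coefficients: expression jumps only when both indicators are 1.
rng = np.random.default_rng(1)
true_coef = np.array([1.0, 0.1, 0.2, 2.7])
expr = X_int @ true_coef + rng.normal(0, 0.1, a.size)

def fit(X, y):
    """Ordinary least squares; returns coefficients and fitted values."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef, X @ coef

coef_add, fitted_add = fit(X_add, expr)
coef_int, fitted_int = fit(X_int, expr)

# Residual sums of squares for the two models.
rss_add = float(np.sum((expr - fitted_add) ** 2))
rss_int = float(np.sum((expr - fitted_int) ** 2))

# With the product term, each group's fitted value is that group's mean.
both = (a == 1) & (b == 1)
print("group (1,1) observed mean:", expr[both].mean())
print("additive fit:", fitted_add[both][0],
      "interaction fit:", fitted_int[both][0])
print("RSS additive:", rss_add, "RSS interaction:", rss_int)
```

With two binary indicators, the interaction model has four coefficients for four groups, so it can match each group mean separately; the additive model cannot when the effects do not simply add.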

These things can get very tricky especially when you're dealing with

continuous covariates, particularly the interpretation

of what the parameters mean for each of these values.

And so if you're fitting these more complicated regression models

it's worth taking a long time to think about

exactly what each of the beta coefficients means.

Again, linear models are a whole class, and

adjustment variables are a whole huge subset of that class.

So I'd encourage you if you're interested in this and you have to fit very

complicated linear regression models to take a linear regression modeling course.

The basic thing to keep in mind here is how many levels do you want to fit?

What makes sense biologically?

How do you want to relate?

How many covariates do you want to basically include in your model and

adjust for?

And there are great additional notes again in this Statistics for

the Life Sciences class.
