An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

93 ratings



Course 7 of 8 in the Specialization Genomic Data Science


From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

The most common statistical modeling technique used across almost every area of statistical genomics and genetics is the linear model. I'm going to talk a little bit about what linear models are, but the basic idea is to fit a best line relating two variables. The most common thing you would want to do in genomics is to take some genomic measurement and associate it with some outcome, phenotype, or technical artifact. So the idea is that you're trying to relate these two variables. Suppose the data are written as Y and X; then you can write a line as b0 + b1 times X, and you're trying to minimize the distance between the observed data and the model line as a function of X.
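As a small sketch of this idea, here is a least-squares fit in Python on simulated data; the measurements, the true intercept of 2.0, and the true slope of 0.5 are assumptions for illustration, not values from the course:

```python
import numpy as np

# Toy data: a genomic measurement X and an outcome Y (simulated).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=100)

# Least squares: choose b0, b1 to minimize sum((y - (b0 + b1 * x))**2).
# The minimizers have the closed form below.
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # estimates close to the simulated values 2.0 and 0.5
```

With 100 points and modest noise, the recovered intercept and slope sit close to the values used to simulate the data.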

In general, you can always fit a line through a set of data. The question is whether that line is a good fit or not, that is, whether it was a good idea to do that linear regression. Here's an example of a linear regression that's a really old idea. Basically, they took a bunch of measurements of people, the parents' heights and the children's heights, and plotted those two things. Then you can draw a line that relates the average height of the parents to the height of the children and see that there's a relationship between those two variables.

This is actually a really old idea, but it still works pretty well. It turns out that by Victorian methods, just plotting the average height of the parents versus the average height of the children, you explain more variability than all of the genomic variants we've collected can explain. So linear regression is one of the most powerful techniques around, and I'm going to show you a little example of it using the same data collected in that Galton example. Here, I've plotted the distribution of children's heights, and here, the distribution of parents' heights. For each pair of parents, we take the average of the mother's and father's heights to get the parent height. If I take that information and want to relate those two things to each other, there are a couple of steps you could take. First, you could just look at the children's heights. Suppose you wanted an estimate of the average children's height. You could get one by minimizing the distance between each measurement and some number. If you take whatever that number is and minimize this distance, it turns out that the best minimizer is just the average of the children's heights.
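The claim that the average is the best minimizer is easy to check directly. A minimal sketch, using made-up children's heights and squared distance:

```python
import numpy as np

heights = np.array([62.0, 64.5, 66.0, 68.2, 70.1, 72.4])  # hypothetical heights (inches)

def sum_sq_dist(c):
    """Sum of squared distances from each measurement to a candidate value c."""
    return float(np.sum((heights - c) ** 2))

# Scan a grid of candidate values: none beats the plain average.
candidates = np.linspace(55.0, 80.0, 2501)
best = candidates[np.argmin([sum_sq_dist(c) for c in candidates])]
print(best, heights.mean())  # the grid minimizer lands on the sample mean
```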

The other thing you can do is plot the parents' heights versus the children's heights. Here, I've jittered the points, because some of the parents' and children's heights are measured exactly the same, and you want to be able to see all of the points. Now, if I want to find the relationship between these two things, one way is to look at different subsets of the data. I could look at just this part of the parents' heights, say parents' heights between 64 and 66, and ask, what's the average children's height for those values? Then I could do the same for other values: what's the average children's height for a parent height between 70 and 72, and so forth. Another way is to fit a line through these relationships, so I fit the best fitting line. By the best fitting line, again, I mean the line that minimizes the distance between the observed data and the line that we fit. So here's an equation for a line: the children's height equals an intercept term plus a term related to the parents' height. This is the intercept and this is the slope; we're back to your algebra days. If you make that plot, you see that the line isn't perfect. Not all of the dots fall exactly on the line, so the line doesn't perfectly describe the data. Another way to express this is to expand the equation a little bit. We'll say that the children's height equals the equation from the line we had before, plus some random noise. That random noise is everything we didn't measure. Some people think of it as sampling variability; that's one component of it, but it's also any other sort of bias or variation that you didn't measure in your data set. The line that we're going to fit is the one that minimizes the distance between the children's heights and the equation from the line that we care about. Since we don't know the random variation in general, we just minimize the distance between the children's heights and the line fit.
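Both approaches above, binned averages and a single fitted line, can be sketched on simulated Galton-style data. The intercept of 24.0, slope of 0.65, and noise level are assumptions for illustration, not Galton's actual estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
parent = rng.uniform(62.0, 74.0, size=500)                      # average parent height
child = 24.0 + 0.65 * parent + rng.normal(scale=2.0, size=500)  # child height plus noise

# Approach 1: average the children's heights within bands of parent height.
for lo in (64, 70):
    band = (parent >= lo) & (parent < lo + 2)
    print(f"parents {lo}-{lo + 2}: mean child height {child[band].mean():.1f}")

# Approach 2: fit one least-squares line through all of the points at once.
slope, intercept = np.polyfit(parent, child, deg=1)
print(intercept, slope)  # close to the simulated values 24.0 and 0.65
```

The binned means and the fitted line tell the same story; the line just summarizes every band with two numbers.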

If we do that, we get a line that fits like this. The first thing you need to do is ask yourself: we fit a line to this data, but does it make any sense? One way to answer that is by taking the residuals. What does that mean? Take this line, calculate the distance between every data point and the line itself, and make a plot of those distances. On average, those residuals will come out to zero, but there's going to be a distribution of residuals for every different parent height. So for parent heights between 64 and 66, this right here is the distribution of the residuals; between 70 and 72, here's the distribution of residuals.
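A sketch of that residual check, again on simulated data with an assumed intercept and slope:

```python
import numpy as np

rng = np.random.default_rng(2)
parent = rng.uniform(62.0, 74.0, size=500)
child = 24.0 + 0.65 * parent + rng.normal(scale=2.0, size=500)

# Fit the line, then subtract the fitted values to get the residuals.
slope, intercept = np.polyfit(parent, child, deg=1)
residuals = child - (intercept + slope * parent)

print(residuals.mean())  # essentially zero for a least-squares fit with an intercept

# Spread of the residuals within two bands of parent height; for a good
# fit these should be similar, with no large outliers.
for lo in (64, 70):
    band = (parent >= lo) & (parent < lo + 2)
    print(f"parents {lo}-{lo + 2}: residual SD {residuals[band].std():.2f}")
```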

Ideally, you would like to see similar variability at every different parent height, no big outliers, and residuals centered nicely around zero. That means the line is fitting pretty well. There's actually a whole set of residual diagnostics you can do to check that the line is fitting well, but the things you're definitely looking for are outliers, distributions that are skewed, and any clusters of points that appear to sit away from the line when you're looking at the residuals. You can also color these dots by a whole bunch of other variables and see if there's a diagnosis for why the linear regression isn't working very well. Keep in mind, again, that you can always fit a line, but the line doesn't always make sense. Here is Anscombe's quartet: all of these lines are the same exact line, with the same exact parameters and significance and everything else, so you get the exact same intercept and slope estimates. But here, for example, you see a curvilinear relationship; here you see a crazy outlier, and again a crazy outlier right here. What you're expecting to see when you fit a line is a scatter plot of points, a cloud of points like that. If you see more specific relationships, you know you have to fit a more specific model, one that either accounts for quadratic variation or accounts for outliers, because it's not really just a linear regression line you're supposed to be fitting there. Regression is actually a whole class; I've done one lecture on it and I'll do a couple more, but this is a very quick overview of regression models. If you take the Regression Models course in the Johns Hopkins Data Science Specialization, you'll cover a whole bunch of diagnostics and ideas. We've covered the basics here so you'll know what to fit, but the diagnostics require a lot more intuition and thinking. The basic thing to keep in mind, though, is: does the line fit, and does it make sense? Not just does it fit statistically, but does it make sense to fit a line? There are also great additional notes in the book for this course, in the corresponding class on linear models, and in the class on statistics for the life sciences on edX.
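Anscombe's quartet is easy to reproduce. The four data sets below are the classic published values; despite their very different shapes, all four give essentially the same fitted line, an intercept of about 3.0 and a slope of about 0.5:

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

# One data set is linear, one curvilinear, and two are dominated by a single
# extreme point, yet the least-squares fits are nearly identical.
fits = []
for xi, yi in quartet:
    slope, intercept = np.polyfit(xi, yi, deg=1)
    fits.append((intercept, slope))
    print(f"intercept {intercept:.2f}, slope {slope:.3f}")
```

This is exactly why plotting the data and the residuals matters: the fitted numbers alone cannot distinguish these four situations.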
