An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science


Course 7 of 8 in the Specialization Genomic Data Science


From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

By its nature, genomic data is usually very high dimensional, and so you want to reduce that dimension when visualizing or modeling the data. So here I'm going to do the typical setup steps to get my plotting parameters like I like them and to load the libraries that we'll need. In this case, it's mostly the base packages that we're going to be using, and then I'm going to load in a data set here from this URL. It's actually a combination of two data sets, from the Montgomery and Pickrell papers. They are two different populations measured in two different labs, and that'll be useful for this lecture.

So I'm going to load the data in and, again, extract out the phenotype data, the expression data, and the feature data for this data set, so that we can use that data to make some plots and to do some dimension reduction. Again, just to make it a little bit easier to visualize, I'm going to filter out all the rows where the row mean is less than 100, to reduce the size of the data set. And then I'm going to apply the log transform so that the data will be on a scale that's a little easier to work with.
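The lecture does these steps in R; the filtering and log transform can be sketched in NumPy on a hypothetical genes-by-samples count matrix (the simulated data here just stands in for the Montgomery/Pickrell expression data):

```python
import numpy as np

# Hypothetical stand-in for the expression matrix the lecture loads:
# rows are genes, columns are the 129 samples.
rng = np.random.default_rng(0)
# Alternate low- and high-expression genes so the filter has something to drop.
lam = np.where(np.arange(500) % 2 == 0, 50.0, 200.0)[:, None]
edata = rng.poisson(lam=np.broadcast_to(lam, (500, 129))).astype(float)

# Keep only rows whose mean is at least 100, reducing the size of the data set.
filtered = edata[edata.mean(axis=1) >= 100, :]

# A log2 transform (with a +1 offset to avoid log(0)) puts the counts
# on a scale that's easier to work with.
ldata = np.log2(filtered + 1)
```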

The next thing I'm going to do is center the data. When you do the singular value decomposition, if you don't remove the row means of the data set, then the first singular vector will always reflect the mean expression level, since the mean always explains the most variation in a genomic experiment. We actually want to see variation between samples or between genes, so we're going to remove that mean variation and look at the patterns that differ between genes. Once I have that centered data set, I can apply the svd function to calculate the singular value decomposition.
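In NumPy terms, the row-centering and decomposition steps look roughly like this (hypothetical data in place of the real expression matrix; the lecture uses R's `svd`):

```python
import numpy as np

# Hypothetical genes-x-samples matrix standing in for the log-scale data.
rng = np.random.default_rng(1)
edata = rng.normal(loc=8.0, scale=1.0, size=(300, 129))

# Center by removing the row means, so the first singular vector
# doesn't just reflect each gene's overall mean expression level.
edata_centered = edata - edata.mean(axis=1, keepdims=True)

# SVD: edata_centered = U @ diag(d) @ Vt. With 129 columns there are
# 129 singular values, returned in decreasing order.
U, d, Vt = np.linalg.svd(edata_centered, full_matrices=False)
```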

This singular value decomposition has three parts: the three matrices d, u, and v. d is the diagonal matrix, and the function just returns the diagonal elements of that matrix for you. In this case, the data set we're dealing with has 129 columns, so there are 129 singular values. Then there are the u and v components; the v component has 129 values per column, one for each sample. So v is telling me something about the variation across samples, and u is telling me something about the variation across genes.

The first thing we might want to do is plot the singular values. So we're going to plot the d values, which are the singular values, and I'm going to make those blue. Here I can see the singular values plotted versus their index, ordered from biggest to smallest.

The next thing I want to do is plot the variance explained. To do that, remember that I have to calculate each singular value squared divided by the sum of the singular values squared. Once I've done that, I've calculated the variance explained.
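That calculation, d_k^2 / sum_j d_j^2 for each component k, is a one-liner; here is a sketch on hypothetical row-centered data:

```python
import numpy as np

# Hypothetical row-centered matrix; in the lecture, d comes from
# the svd of the centered expression data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))
X = X - X.mean(axis=1, keepdims=True)

d = np.linalg.svd(X, compute_uv=False)   # singular values only

# Variance explained by component k: d_k^2 / sum_j d_j^2
var_explained = d**2 / np.sum(d**2)
```

The values are nonnegative, sum to one, and inherit the decreasing order of the singular values, which is why the first component sits at the top left of the plot.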

So I can plot that, again in blue, on the same kind of plot, and I can see that the first singular value explains more than 50% of the variance, so it's a highly explanatory component. Then I can make a plot of that component and ask what it could be. Again, I'm going to make a plot that's two panels, so I use par(mfrow = c(1, 2)). Then I'm going to plot the first two eigengenes, or right singular vectors, or principal components. You'll see in a minute that they're not exactly the principal components, but people use the terms sort of interchangeably.

So I plot that first principal component, and then I plot the second one. The first thing people often do is color these by different variables to see if there's something going on. It's very common to plot the first right singular vector versus the second right singular vector. So here I'm going to set it up so that there's a one-by-one plot again. If I make that plot, I can see there's a pattern here, and what people often do is make this same plot but color it by a particular variable. In this case, I'm going to color the PCs by which study they come from, so here I'm setting the color to be the numeric version of the study variable.

So I redraw the previous plot with that coloring, and you can see that along the PC1 axis the two studies have very different values. So it seems that one of the big sources of signal in the data set is which study the samples come from. Another way people often show this is to make a box plot of that first principal component, since that's the one that separates the two studies, versus the study variable. It's always a good idea to show as many of the data points as possible, so you can overlay the data points on top of the box plot by plotting the same singular vector versus a jittered version of the study variable. Coloring it by the study variable, you can see that there's a big difference in that first principal component between the Montgomery and the Pickrell studies.

So that's how you do the singular value decomposition. To do the principal components, you can use the prcomp function and apply it to the same data set. Even though I've been using the two terms sort of interchangeably, they are not quite the same thing. If I plot the first principal component versus the first singular vector, they're not the same, and that's because the data hasn't been scaled in the same way. It turns out that if you do a second round of centering, but this time subtract the column means rather than the row means, then you have a data set that's centered by column instead of by row, and you can calculate the singular value decomposition on that.

When I do that and plot the first principal component versus the first singular vector from the column-centered data, I get that they're identical to each other. Basically, what's going on is that if you column-center the data and then do the SVD, you get exactly the principal components, because the principal components are capturing the variability between the columns. So you can get PCs and SVDs that compute the exact same thing if you do the centering right.
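This equivalence is easy to check numerically. The sketch below (hypothetical data; the lecture uses R's prcomp, which column-centers by default) verifies that the right singular vectors of the column-centered matrix match the principal component loadings, i.e. the eigenvectors of the sample covariance, up to sign:

```python
import numpy as np

# Hypothetical samples-in-columns matrix; small shapes for illustration.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))

# Column-center: subtract each column's mean.
Xc = X - X.mean(axis=0, keepdims=True)

# SVD of the column-centered data.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal component loadings: eigenvectors of the sample covariance.
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # eigh returns ascending order
loadings = eigvecs[:, order]

# The first right singular vector matches the first loading up to sign.
v1, p1 = Vt[0], loadings[:, 0]
v1_aligned = v1 if v1 @ p1 > 0 else -v1
```

The sign flip is needed because singular vectors and eigenvectors are each only defined up to a factor of -1.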

One thing to keep in mind is that outliers can really drive these decompositions. To illustrate that, I'm going to take our centered edata and assign it to a new variable, edata_outlier. Then I'm going to make one of those values really outlying: I'm going to take the sixth gene and multiply it by 10,000. So this is now a very outlying gene with very high values.

Now I'm going to apply the SVD to the outlying data set. If I plot the decomposition of the data set without the outlier versus the data set with the outlier, I can see that they don't match each other anymore. The two decompositions don't necessarily match overall, but you can definitely see that the singular vector from the decomposition with the outlier tracks that outlier quite closely. If you plot the first singular vector from this new decomposition against the outlying gene itself, you can see that they're very highly correlated with each other.

What's happening is that the decomposition is looking for patterns of variation, and if one gene is expressed way higher than the others, it's going to drive most of the variation in the data set, so the singular vector will be very highly correlated with it. So you have to be careful when using these decompositions to make sure that you pick the centering and scaling so that all of the measurements for all of the different features are on a common scale.
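The outlier experiment can be reproduced in a few lines (again a NumPy sketch on hypothetical data; the lecture does this in R): blow up one gene by 10,000 and check that the first right singular vector essentially just traces that gene.

```python
import numpy as np

# Hypothetical centered matrix standing in for the centered expression data.
rng = np.random.default_rng(4)
edata_centered = rng.normal(size=(100, 30))

# Copy the data and make the sixth gene (row index 5) wildly outlying
# by multiplying it by 10,000, as in the lecture.
edata_outlier = edata_centered.copy()
edata_outlier[5, :] *= 10_000

U, d, Vt = np.linalg.svd(edata_outlier, full_matrices=False)

# The first right singular vector now just tracks the outlying gene,
# so its correlation with that row is essentially +/- 1.
corr = np.corrcoef(Vt[0], edata_outlier[5, :])[0, 1]
```

The first singular value also dwarfs the rest, which is exactly the "one gene drives most of the variation" behavior described above.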

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.