An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

From the course by Johns Hopkins University

Statistics for Genomic Data Science

124 ratings

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the lesson

Module 4

In this week we will cover a lot of the general pipelines people use to analyze specific data types like RNA-seq, GWAS, ChIP-Seq, and DNA Methylation studies.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

The most common [INAUDIBLE] analysis is to do an expressive quantitative trade locust

analysis, or analysis, where you're combining gene expression

information with genetic information to try to identify association.

So I'm going to show you an example of that using the matrix package.

So here I'm going to set my parameters like I usually do and

then I'm going to load the packages that we're going to need,

in this case the primary driver of most of this analysis is the MatrixEQTL package.

That means you can see there, but they're sort of strange caps,

you need to all caps for EQTL and caps for Matrix.

So then what I can do is I can basically load some data from this.

I'm going to load snp data, expression data, and covariant data.

The snp data is the genotype information, the expression data the gene

expression information, and the covariants are anything you might want to adjust for.

I'm going to do that by copying these lines here.

These are basically just finding out the location of the package.

So it's the base directory.

And then I'm going to paste together that directory to

basically different text files to basically pull them out.

So if I look at the base directory it's basically finds the base directory on my

computer and then the snip file is going to basically paste that together

with some downstream information that we might need.

Just make that txt.

And then it's going to say paste those together with no separation between them.

And so you can see it creates like a file name that goes straight to the snip.txt.

So then I can read those in with read.table.

So I'm basically going to read in the expression file with read.table

expression.

File name, I'm going to tell it that the separation is a tab separator.

It's got a header.

And it's has a real name that is equal to one.

I knew that because I'd already looked at the files in advance, but in general,

you might have to check and see if they had a header, or if they have a real name.

So, here you go.

So now we see the expression data.

We can do the same thing with the snips data, we can read it in.

Read it in from basically the R package itself, and

again this one has a header and it had real names.

So, I can do the same thing, so

you can see the genotype information is coded as zero one two.

The expression measurements are quantitive.

I can do the same thing for the covariants here.

And so I can look at the covariants which has information on the gender and

other information you might care about for the people, for example, their age.

Okay, so then, basically the idea behind an eQTL analysis is to do a linear

regression relating gene expression information to genotype information.

So we could do that one by one.

So here's how you'd do that.

You could basically take the first genes expression levels

by extracting them out from the first row of the expression matrix.

And you can extract out the first set of snps from the snps matrix.

So now add e1, s1, and now I just want to relate those two things together,

you could use a linear model.

And so, if you look at that You see that there might be a relationship here between

this gene's expression levels and SNIP because it has a relatively

reasonable effect size but the p-values too big.

So it's probably not highly associated.

So then what we can do is we can actually plot the data to see what this looks like.

So what I'm going to do is I'm actually just going to plot the expression levels

vs the genotype, so here I'm plotting expression level vs a jittered version of

the genotype so just so the points don't all land on top of each other and

then I'm going to color it by the genotype.

So you can see here they're colored by genotype and then

I'm going to add an access label so you can see which genotypes they are, and

then you can plot the fitted values versus the observed genotypes, and in

this case I'm going to color that in dark gray, and see here's the fitted values.

You can see that this is the average in the homozygous major the heterozygote,

and the homozygous minor In this case I've assumed an additive model

since I've just indirectly included the snip.

So you can see it's the same difference between the homozygous major and

the het, as the het and the homozygous minor allele.

So now we can set it up to do this on every single case.

because if you wanted to use LM, in this case the data sets are really quite small.

The expression data has only got ten genes in it, but in general,

you might have tens of thousands or hundreds of thousands of genes or

axons that you're testing, and equivalently, millions of snips, and so

you're doing billions and billions of regression models.

If you do that with LM, it will take a long time to do it, so

instead, we can use this matrix eQTL package.

So, the first thing we need to do is, we need to set up some parameters.

And so the P value threshold is at how statistically significant do you need it

a P value to be before it actually saves the output.

So it's basically going to throw away everything that just,

throw away the calculations for everything that are above the threshold and

that'll save a lot of computational time and space.

So you should choose this to be as liberal as you need to catch everything that

you think might potentially be interesting.

Even if It doesn't have to be just the ones that are at the most extreme

significance levels if you want to look at the broader distribution.

The error covariance, if you just set it to the numeric, that's basically saying

that they're after I kind of of the genotypes and the covariance.

There is essentially an independence error model for

the gene expression variation which is usually the most common assumption.

And then it tells you which model,

you have to tell it which model to use, in this case.

It's going to say, use the linear model, so just use the standard

additive linear model rather than trying to do a dominant and recessive model.

Okay, so then once you've got those general parameters set up,

you have to set up the files, matrix CTTO Because it's, it is very, very fast.

But to do that it has to sort of be organized about how it's going to

analyze the data.

And so here we have to create a new sliced data object, both the snips and

the gene expression information.

And so to do that we have to tell it what the file delimiter is.

So this is basically the file you're reading in.

If it's got Tab separated.

Then you give it a file delimiter of tab.

What to miss like if it's at what's the definition of a missing value?

How many rows and values to skip.

So in this case the data set had both a column

set of column names and a set of row names, so we want to skip one line each

because it's just going to test the values that are numeric.

And then the file slice size tells you something about,

it's going to basically break the files up into chunks so that it can be computed

on more easily in R, and so this tells you about how big the chunks are going to be.

Sizes of 2000.

The bigger the chunks, probably, the faster I can compute, but

the slower it is because it has to load more data in, so there's a balance there.

2000 seems like a reasonable compromise here.

In this case, there's no need to chunk it up, because the data sets are so

small, and so then I just load the file name in.

Using all those parameters I just defined.

I do something very similar for the D expression data,

to create another sliced data object.

So here it's 15 snips and the ten genes and in the case I'm not going adjust for

any covariance, you can do the same thing with the covariance but I'm just

going to say It's an empty object, so we're not going to do any adjustment.

So then here's the main command for running a matrix_EQTL, so matrix_EQTL has,

gotta be careful about the capitalizing, capital M And lower case e capital QTL.

Then you pass it the snips slice data object, the gene slice data object,

the covariance slice data object.

If you want to,

you can have it directly output the output to a text file rather than output into r.

This can be useful because sometimes these computations,

especially if you have many genes or many snips can take a long time to run.

And so you might just want to get this running in the background and

then come back to it and have it output the results.

The p value threshold, what model to use, the error code variance.

In this case we want to be able to look at the p value distribution for

all the different snips.

This will slow it down a little bit and require a little more memory,

but it allows you to do a better analysis,

deeper analysis, so I suggest always lead P value equals true, and

then you can tell it whether to calculate the false discovery rate or not.

So this case, it ran very quickly, there's hardly any steps in genes, but

it's still incredibly fast even if you do many analysis.

And so the first thing you can do is you can sort of look at the object

that gets created.

And it saved the p-values for all 150 tests.

So there's 150 tests because there are ten genes and 15 snips.

And so, it's basically testing every possible combination of snips and genes.

So that's 150.

10 times 15.

And these are the p-values from all of them.

Here, they're relatively flat so you see that there's probably not much

statistical significance, but also maybe there's no artifacts.

The ME object that comes out has several components to it.

It tells you how long it took to run and it tells you something about

the parameters that you use to run the analysis, so you can save those as well.

And then the all component tells you the number of EQTLs it calculated.

How many tests it did.

So, if I look at the number of tests,

it calculated 150 tests just like we talked about.

And then you can look at the number of EQTLs it found.

So, that's the number that passed that significance threshold.

And then, the next step is looking at the eQTLs themselves.

Here it is. It tells you which SNP, which gene,

what the statistic was, what the p value was, the false discovery rate, and

the estimate of the association statistic.

So the next step with this is to go back and check.

In general you gotta check to make sure that there are not any artifacts, to make

sure that the plots look reasonable so you can go back and plot the ETTLs after

you've done this and make sure that they really do appear to be associated.

But this is the first step in doing ETTL analysis is actually just doing

all the possible comparisons between all SNIPs and all genes.

Coursera provides universal access to the world’s best education,
partnering with top universities and organizations to offer courses online.