An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

The first thing that you want to do when you get a dataset is explore it.

So I'm going to talk a little bit about how I explore data when I get it,

genomic data in particular.

So the first thing I'm going to do is I'm going to set up the plots so

they'll be prettier when I make them.

And so the way that I'm going to do that is first I'm going to define

some set of colors.

So here I'm going to call them these tropical colors, this is inspired

by the RSkittleBrewer package written by my former student, Alyssa Frazee.

And so then what I'm going to do is I'm going to

tell R that I want it to use that set of colors.

So I can use the palette command to tell R that I would like it to use those colors,

look for those colors when it's making plots.

The other thing that I do is I usually set it up so that the circles are filled in

rather than the open circles that R uses by default when making plots.

So you can do that with par(pch=19).
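
As a rough sketch, the plot setup described above might look like this. The specific color names are an assumption for illustration, since they are not spelled out in the lecture text; palette() and par(pch = 19) are the base R calls named above.

```r
# Assumed color names for illustration; any vector of valid R colors works.
tropical <- c("darkorange", "dodgerblue", "hotpink", "limegreen", "yellow")

# Make these the colors that col = 1, 2, ... refer to in later plots.
palette(tropical)

# Use filled circles (pch = 19) instead of R's default open circles (pch = 1).
par(pch = 19)
```

After this, a call such as plot(1:5, col = 1:5) would draw filled points in the five colors above.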

So the next thing that I'm going to do is I'm going to load in the packages that

you need for this example, and so I'm not going to go through all these.

Library, I'm not going to type all the commands out, but you need the gplots,

devtools, Biobase, RSkittleBrewer, the human org database (org.Hs.eg.db), and

the AnnotationDbi package.

So now that I've got all that loaded I'm going to load in the data as well.

And so here I go, I'm going to load in the data from a connection.

Again, I'm going to make the connection to this

URL, which you can follow along with in the exploratory analysis R Markdown file.

And then once I've done that, I'm going to load the file from the connection.

And that's going to take a second, and

then I'm going to close the connection because I'll be done with that, oops.
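
The open, load, close pattern can be sketched without the real URL, which is not reproduced here. In this illustration a temporary local file stands in for the remote file, and toy_obj is a made-up object; for a remote file, url() would take the place of file().

```r
# Stand-in for the remote file: save a toy object to a temporary .RData file.
toy_obj <- data.frame(sample = c("s1", "s2"), age = c(60, 75))
tmp <- tempfile(fileext = ".RData")
save(toy_obj, file = tmp)
rm(toy_obj)

# Open a connection, load from it, then close it once we are done.
con <- file(tmp, "rb")
loaded <- load(con)   # load() returns the names of the objects it restored
close(con)

print(loaded)
```

Printing the return value of load() is a quick way to see which variable names appeared in the workspace.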

And then I look and I see that I've got the body map expression set here.

And then to simplify my life, I make the variable name a little bit shorter.

So I'm going to assign it to this bm variable, and

then I'm going to extract out the different pieces that I need.

So first I'm going to extract out the phenotype table with pData, and

then I'm going to extract out the expression data with the exprs command,

and then I'm going to extract out the feature data with the fData command.
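
Here is a small sketch of those three extraction steps. It assumes the Biobase package from Bioconductor is installed, and it builds a tiny made-up ExpressionSet (toy_eset) in place of the body map data, so the numbers and names are purely illustrative.

```r
library(Biobase)  # Bioconductor package; must be installed separately

# Toy stand-in for the real data: 4 genes (rows) by 3 samples (columns).
counts <- matrix(c(0, 0, 5, 120, 0, 1, 7, 98, 0, 0, 3, 150),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))
pheno <- data.frame(gender = c("F", "M", "F"), row.names = colnames(counts))
feat  <- data.frame(symbol = c("A", "B", "C", "D"), row.names = rownames(counts))

toy_eset <- ExpressionSet(assayData   = counts,
                          phenoData   = AnnotatedDataFrame(pheno),
                          featureData = AnnotatedDataFrame(feat))

bm    <- toy_eset     # shorter name, as in the lecture
pdata <- pData(bm)    # sample-level (phenotype) table
edata <- exprs(bm)    # genes-by-samples expression matrix
fdata <- fData(bm)    # gene-level (feature) table
```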

So now I have these new variables that I've defined that make it a little easier

to access the data, that'll make it a little easier to make these files.

So the first thing that you want to do is do some really basic checks to

make sure everything seems to be in order, and

so the first thing that I do is I start to make tables.

Particularly I start usually with the phenotype data.

So here I'm going to make a table of the gender information

from the phenotype table.

And I see that there are eight males and eight females, and

then I also start to make cross tabs of, so here, I've made a table of gender.

And I might want to compare that with the race of the samples.

And so you can see here that there's one African American sample

that's also a female, so that seems like a little unbalanced.

And that's something that I'd start looking for

in the phenotype data right away.
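
A minimal sketch of these two tables, using a made-up phenotype table rather than the real one (the column names gender and race and all values are assumptions for illustration):

```r
# Hypothetical phenotype table; values are made up.
pdata <- data.frame(
  gender = c("F", "F", "F", "M", "M", "M"),
  race   = c("african_american", "caucasian", "caucasian",
             "caucasian", "caucasian", "caucasian")
)

# One-way table: how many samples of each gender?
gender_counts <- table(pdata$gender)

# Cross tab: gender (rows) by race (columns), useful for spotting imbalance.
gender_by_race <- table(pdata$gender, pdata$race)

print(gender_counts)
print(gender_by_race)
```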

The other thing I'd do is I'd start looking at the data that I have,

the genomic information.

If it's a new data type in particular,

I want to know what kind of distribution I might expect to see, and so

I usually use something like the summary command to do that.

So if I type summary(edata), then for every column it's going to show me

a summary of the distribution, and so it shows me the quantiles,

min, first quartile, median, third quartile, max, and then the mean.

And so you can see here already from this data that they're very highly skewed data.

The min, first quartile, and median are all 0 and

then there's the gigantic max value, so already you can kind of tell that it's

a skewed distribution and that appears to be true for every sample.

So that's already telling me something about what I need to be doing when I need

to be modeling.
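
To see what that kind of skew looks like in summary output, here is a sketch on a small made-up matrix (two hypothetical samples, mostly zeros with one very large value each):

```r
# Made-up expression matrix: rows are genes, columns are samples.
edata <- cbind(
  s1 = c(0, 0, 0, 0, 0, 2, 15, 4000),
  s2 = c(0, 0, 0, 0, 0, 3, 20, 9000)
)

# One column of output per sample: min, quartiles, median, mean, max.
print(summary(edata))
```

When the median is 0 and the maximum is in the thousands, the mean sits far above the median, which is the signature of a heavily right-skewed distribution.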

The other thing I do is when I'm making tables,

I have to be always reminding myself that the table command,

if I do table on age, it looks just fine, it looks like there are no missing values.

But if you tell the table command to show NA values, it actually will,

and you can see that there are actually three NA values.

So you have to use this useNA parameter to the table function in order to see

the NA values.
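
A short sketch of the difference, using a made-up age vector with three missing values:

```r
# Made-up ages; three of the eight are missing.
age <- c(60, 60, 75, NA, 19, NA, 75, NA)

# Default behavior: NA values are silently dropped from the table.
age_tab <- table(age)

# useNA = "ifany" adds an NA column whenever missing values are present.
age_tab_na <- table(age, useNA = "ifany")

print(age_tab)
print(age_tab_na)
```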

So I do that and then I start to check to see if there are other common missing

variable names, or common missing values.

And so, for

example, I might sum up whether there're any values equal to a blank space.

That's a very common missing value indicator in a phenotype table.

In this case, you see that the value that you get is NA,

that's because there are some NA values in the age variable.

And so I can do na.rm=TRUE, so

that will remove the NA values when it's doing this sum.

And so I can see that there are no values equal to a blank space, so that's good.

I'm sort of checking to make sure the common missing value

codes, like blank spaces, negative nine, values like that, don't appear.
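
The blank-space check can be sketched as follows, again on a made-up age vector. The check for -9 is included only as an example of another common missing value code.

```r
# Made-up ages with two NA values.
age <- c(60, 75, NA, 19, NA)

# Comparing NA to anything gives NA, so the plain sum is NA.
n_blank_raw <- sum(age == " ")

# na.rm = TRUE drops the NA comparisons before summing.
n_blank <- sum(age == " ", na.rm = TRUE)

# Same idea for another common missing value code.
n_neg9 <- sum(age == -9, na.rm = TRUE)

print(c(n_blank_raw, n_blank, n_neg9))
```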

Then the other things that I do is I check to see if there are missing values

in the genomic data.

So if I do is.na on the expression data, it's going to say for

every value it's going to check and see if it's missing or not, and

return FALSE if it's not missing, and TRUE if it's missing.

So is.na is true.

So I can just sum up those values over the whole data set, and

I can see how many missing values I have for the whole data set.

So in this case, there are no missing values in the expression data,

that's really good.
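
A sketch of that count on a made-up matrix, with one missing value then added on purpose to show what a nonzero result looks like:

```r
# Made-up expression matrix with no missing values.
edata <- matrix(c(0, 3, 0, 12, 0, 0, 7, 250), nrow = 4)

# is.na() gives TRUE/FALSE per entry; sum() counts the TRUEs.
n_missing <- sum(is.na(edata))

# Introduce one NA to see the count change.
edata_with_na <- edata
edata_with_na[2, 1] <- NA
n_missing_after <- sum(is.na(edata_with_na))

print(c(n_missing, n_missing_after))
```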

If there were missing values, you might want to isolate it to which genes and

which samples those missing values come from.

And so one way that I do that is I use the rowSums command, so again,

I can apply rowSums to is.na(edata),

so this basically checks to see if each value is NA or not.

And then I take the sum of each row, so that's each gene is a row.

So it tells me the total per row of missing values and then I can make

a table of that and I can see here in this case, 0 is the value for every single row.

So there are 52,580 rows, and all of them have 0 NA values.

I can also do that for the columns as well, and so

basically that's the same thing.

I see how many samples have NA by doing colSums(is.na(edata)) there,

and so now what this is going to do is check to see if the expression data is NA,

and then it's going to take, in each column, the sum and

see how many values there are, and I can make the same table here.

And see again, there are no missing values.
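
Here is a sketch of the per-gene and per-sample checks. Unlike the lecture data, this made-up matrix deliberately contains two NA values so the tables show something other than all zeros; the names gene_na and sample_na are chosen here for illustration.

```r
# Made-up matrix: 4 genes (rows) by 2 samples (columns), with two NAs.
edata <- matrix(c(0, 3, NA, 12,
                  0, NA, 7, 250),
                nrow = 4,
                dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Number of missing values per gene (row) and per sample (column).
gene_na   <- rowSums(is.na(edata))
sample_na <- colSums(is.na(edata))

# Tabulate: how many genes / samples have 0, 1, 2, ... missing values?
print(table(gene_na))
print(table(sample_na))
```

Genes or samples with a nonzero count are the ones to look at more closely.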

Okay, so then after I do that, the next thing that I do is I check and

make sure that the dimensions match in the way that I expect them to.

And so then the first thing that I do, is I check to see what's the dimension

of the feature data, and then what's the dimension of the phenotype data,

and what's the dimension of the expression data?

And so remember that the number of rows of the phenotype data should match

the number of columns of the genomic data.

The number of rows of the feature data should match the number of rows of

the expression data, since feature data describes the genes, and

phenotype data describes the samples.
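
The dimension checks can be sketched like this, on made-up tables with 5 genes and 3 samples. Using stopifnot() to enforce the match is an addition for illustration, not something done in the lecture.

```r
# Made-up data: 5 genes (rows) by 3 samples (columns).
edata <- matrix(0, nrow = 5, ncol = 3,
                dimnames = list(paste0("gene", 1:5), paste0("s", 1:3)))
pdata <- data.frame(gender = c("F", "M", "F"), row.names = colnames(edata))
fdata <- data.frame(symbol = LETTERS[1:5], row.names = rownames(edata))

# Look at the dimensions of each piece.
print(dim(fdata))
print(dim(pdata))
print(dim(edata))

# Samples: rows of pdata must match columns of edata.
# Genes: rows of fdata must match rows of edata.
stopifnot(nrow(pdata) == ncol(edata),
          nrow(fdata) == nrow(edata))
```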

So once I've satisfied myself that the dimensions match and

that there aren't any missing values or that the sort of variable,

once I've kind of gone through all the variables and looked at them,

the next thing I do is I start plotting, and we'll do that in the next video.
