0:02

This lecture is about experimental design, as well as sample size and variability.

So if you remember from the previous lecture, the central dogma of statistics is that we have this big population. And it's expensive to take whatever measurement we want, genomic or otherwise, on that whole population, so we take a sample with probability.

Then on that sample we make our measurements and

use statistical inference to say something about the population.

So we talked a little bit about how that best guess we get from our sample isn't all we get; we also get an estimate of variability.

So let's talk a little bit about variability and

what its relationship is to good experimental design.

So there's a sample size formula that you may have heard of: N is the number of measurements that you can take, or the number of people that you can sample. If you're doing scientific research, you often have to ask for grant money, and so N ends up being the number of dollars that you have divided by how much it costs to make a measurement.

And while this is one way to get at a sample size, it's maybe not the best way.

So the real idea behind sample size is basically to understand variability

1:06

in the population.

And so, here's a really quick example of what I mean by that.

So here are two synthetic, made-up data sets: a data set Y and a data set X. The measurement values are on the x-axis, and the two data sets, Y and X, are separated on the y-axis.

And you can see, I have two lines here, the red line is the mean of the Y values

and the blue line is the mean of the X values.

And so, what you can see is that the means are different from each other but

there's also quite a bit of variability around those means.

Some measurements are lower and some measurements are higher and they overlap.

So the idea is: if the two means are different, how confident can we be about that? Given the variation around the measurements we've taken and the means we have, how confident can we be that these two means are different from each other?

So this comes down to: how many samples do you need to collect, and how much variability do you need to observe, to be able to say whether the two things are different or not?
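The lecture's made-up data sets aren't reproduced in the transcript, but the comparison can be sketched in a few lines. This is a hypothetical Python sketch (the lecture's slides use R), with invented values standing in for the X and Y groups:

```python
import statistics

# Two made-up groups of measurements, analogous to the lecture's X and Y
# data sets (these values are hypothetical, not the ones on the slide).
x = [9.1, 11.3, 8.7, 10.4, 12.0, 9.8, 10.9, 8.2, 11.6, 10.1]
y = [12.8, 14.1, 11.2, 13.5, 15.0, 12.3, 14.6, 11.9, 13.1, 12.7]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)  # the two mean lines
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)    # spread around each mean

# The means differ, but it's the spread around each mean that determines
# how confident we can be that the difference is real.
print(mean_x, mean_y, sd_x, sd_y)
```

The means come out different, but the individual values overlap, which is exactly the situation the plot in the lecture illustrates.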

1:57

So the way that people do this in advance, in experimental design, is with power.

So basically, the power is the probability that

if there's a real effect in the data set then you'll be able to detect it.

So, it depends on a few different things, it depends on the sample size,

it depends on how different the means are between the two groups,

like we saw the red and the blue lines.

And it depends how variable they are, so

we saw that there was variation around the means in both the X and the Y data sets.

So this is actually code from the R statistical programming language.

You don't have to worry about the code in this lecture, but you can see, for example, that if we want to do a t-test comparing the two groups, which is a certain kind of statistical test, the probability that we'll detect an effect of size 5 (that's the delta there), with a standard deviation of 10 in each group and 10 samples per group, is about 18%.
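The slide's R code (presumably R's built-in power calculation for t-tests) isn't shown in the transcript; as a rough stand-in, here is a Python sketch using the normal approximation to the two-sample test. The approximation slightly overstates the exact t-test power, giving about 20% rather than 18% for these numbers:

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF, via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def approx_power(delta, sd, n, z_crit=1.96):
    # Approximate power of a two-sided, two-sample comparison:
    # power ~ Phi(delta / (sd * sqrt(2/n)) - z_crit).
    se = sd * sqrt(2.0 / n)  # standard error of the difference in means
    return norm_cdf(delta / se - z_crit)

# Effect size 5, standard deviation 10 in each group, 10 samples per group:
power = approx_power(delta=5, sd=10, n=10)
print(round(power, 2))
```

Either way, the conclusion matches the lecture: with only 10 samples per group, a real effect of this size is more likely to be missed than detected.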

So it's not very likely that we'll detect the effect even if there is one. But you can also go back and invert the calculation: say, as is customary, we want 80% power.

In other words, we want an 80% chance of detecting an effect if it's really there.

So for an effect size of 5 and a standard deviation of 10, we can back out how many samples we need to collect. In this case, by doing the calculation, we see that we need 64 samples in each group in order to have an 80% chance of detecting this particular effect size.
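Inverting the same calculation gives the required sample size per group. Again, this is a hypothetical Python sketch using the normal approximation rather than the slide's R code, and it lands just under the exact t-test answer of 64:

```python
from math import ceil

def n_per_group(delta, sd, z_crit=1.96, z_power=0.8416):
    # Normal-approximation sample size per group for a two-sided test:
    # n ~ 2 * ((z_crit + z_power) * sd / delta)^2,
    # where z_power = 0.8416 corresponds to 80% power.
    return ceil(2.0 * ((z_crit + z_power) * sd / delta) ** 2)

# Effect size 5, standard deviation 10, aiming for 80% power:
n = n_per_group(delta=5, sd=10)
print(n)
```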

But similarly, you can do that calculation by asking how many samples you need for each group if you're only going to be doing a test in one direction or the other.

So suppose I know that the effect will always go in one direction: expression levels will be higher in the cancer samples than in the control samples.

Then it's possible to collect fewer samples and still get the same power, because you actually have a little bit more information.
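In the same hypothetical normal-approximation sketch, the only change for a one-sided test is the critical value (1.645 instead of 1.96), which is why fewer samples per group are needed:

```python
from math import ceil

def n_per_group(delta, sd, z_crit, z_power=0.8416):
    # Normal-approximation sample size per group; z_power = 0.8416 gives 80% power.
    return ceil(2.0 * ((z_crit + z_power) * sd / delta) ** 2)

# Two-sided: alpha = 0.05 split across both tails, so z_crit = 1.96.
# One-sided: all of alpha in one tail, so z_crit = 1.645.
n_two_sided = n_per_group(delta=5, sd=10, z_crit=1.96)
n_one_sided = n_per_group(delta=5, sd=10, z_crit=1.645)
print(n_two_sided, n_one_sided)
```

Knowing the direction of the effect in advance buys a meaningfully smaller study at the same power.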

Later classes, and statistics classes, will talk more about power and how you calculate it.

But the basic idea to keep in mind is that power is actually a curve. It's never just one number, even though you might hear 80% thrown around quite a bit when talking about power; the idea is that there is a curve.

So in this plot, I'm showing on the x-axis all the different potential sizes of an effect. It could be 0, which is the center of the plot, or it could be very high or very low; and then on the y-axis is power, for different sample sizes.

The black line corresponds to a sample size of 5, the blue line to a sample size of 10, and the red line to a sample size of 20.

So as you can see, as you move out from the center of the plot, the power goes up: the bigger the effect, the easier it is to detect. Also, as the sample size goes up, from the black, to the blue, to the red curve, you get more power as well.
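The shape of those curves can be sketched numerically with the same hypothetical normal approximation, tabulating power over a grid of effect sizes for sample sizes of 5, 10, and 20:

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF, via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def approx_power(delta, sd, n, z_crit=1.96):
    # Two-sided power: sum both rejection regions, so the curve is
    # symmetric for positive and negative effect sizes.
    se = sd * sqrt(2.0 / n)
    return norm_cdf(delta / se - z_crit) + norm_cdf(-delta / se - z_crit)

# Power at several effect sizes for each sample size, sd fixed at 10:
curves = {n: [round(approx_power(d, sd=10, n=n), 2) for d in (-10, -5, 0, 5, 10)]
          for n in (5, 10, 20)}
for n, curve in curves.items():
    print(n, curve)
```

At an effect of 0 every curve drops to the significance level (0.05), and for any fixed nonzero effect the larger sample sizes sit higher, matching the black-to-blue-to-red ordering in the lecture's plot.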

So as you vary these different parameters, you get different power and so

a power calculation is a hypothetical calculation based on what you

think the effect size might be and what sample size you can get.

And so, it's important to pay attention, before performing a study, to the power that you might have, so you don't run the study and end up at the end of the day without detecting any difference, even when there might have been one there.

5:13

So, the variability of a genomic measurement can be broken down into three types. The first is phenotypic variability.

So, imagine you're doing a comparison between cancers and controls.

Then there's variability between the cancer patients and the control patients in their genomic measurements. This is often the variability that we care about; we want to detect differences between groups.

There's also measurement error. All genomic technologies, whether they measure gene expression, methylation, or the alleles in a DNA study, measure with error. And so we have to take into account how well the machine actually makes the measurements, how well we quantify the reads, and so forth.

There's also a component of variation that often gets ignored or

missed which is natural biological variation.

5:55

So for every kind of genomic measurement that we take,

there's natural variation between people.

So even if you have two people who are healthy and have the same phenotypes in every possible way, the same sex, the same age, they eat the same breakfast, there is still going to be variation between them. And that natural biological variability has to be accounted for when performing statistical modeling as well.

An important consideration is that when there are new technologies, there's often a rush to claim that the new technology is much better than the previous one.

One way people do that is by saying that the variability is much lower. That may be true for the technical, or measurement error, component of variability, but it doesn't eliminate biological variability.

So here I'm showing an example of that. There are four plots in this picture that you're looking at.

The top two plots show data that was collected using next generation

sequencing.

The bottom two plots show data that was collected with microarrays, an older technology.

Each dot corresponds to the same sample, so

it's the same samples in all four plots.

And so what you can see for the gene on the left, the one colored pink, is that there's lower variability across people.

So this is true, whether you measure it on the top with sequencing or

on the bottom with arrays.

Similarly, the gene on the right, which I've colored in blue here, is highly variable whether measured with sequencing or with arrays.

So what this suggests is that biological variation is a natural phenomenon that is always a component when modeling genomic data, and it does not get eliminated by technology.