In this lecture, we will go over some practical details on what to
do when you're doing a meta-analysis of GWAS results.
I will discuss the type of plots that we make,
the kind of information that you need to ask from the people that you work with,
and what to do in terms of quality control before and after running your meta-analysis.
Now for the plots,
there's basically one type of plot that we make for each of the significant SNPs,
and that is the so-called forest plot.
What you see here on the horizontal axis is the effect size,
where the vertical line indicates the effect of zero, so no effect.
And each of the horizontal lines indicates a study,
and the little squares are positioned at the effect size of that study,
and the size of the square indicates the weight of each of the studies,
so the w_i that you computed.
The length of the horizontal lines is actually
a measure of the standard error for that study,
and the summary values are collected in the diamond at the bottom:
its position is your meta-analysis beta value,
so your pooled effect estimate,
and the width of the diamond indicates the confidence interval,
so your standard error.
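To make this concrete, here is a minimal sketch of how such a forest plot could be drawn in Python with matplotlib. The study betas, standard errors, and labels are made-up numbers purely for illustration, not values from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-study results (illustrative only)
betas  = np.array([0.12, 0.08, 0.15])   # effect sizes
ses    = np.array([0.05, 0.04, 0.07])   # standard errors
labels = ["Study A", "Study B", "Study C"]

weights   = 1.0 / ses**2                              # inverse-variance weights w_i
beta_meta = np.sum(weights * betas) / np.sum(weights) # pooled effect
se_meta   = np.sqrt(1.0 / np.sum(weights))            # its standard error

fig, ax = plt.subplots()
ypos = np.arange(len(betas), 0, -1)

# One horizontal line (95% CI) and one square per study,
# with the square area scaled by the study's weight
ax.hlines(ypos, betas - 1.96 * ses, betas + 1.96 * ses, color="black")
ax.scatter(betas, ypos, marker="s", s=200 * weights / weights.max(), color="black")

# Diamond at the bottom: centred on the pooled beta, width = 95% CI
ax.fill([beta_meta - 1.96 * se_meta, beta_meta,
         beta_meta + 1.96 * se_meta, beta_meta],
        [0, 0.15, 0, -0.15], color="black")

ax.axvline(0, linestyle="--")                         # the "no effect" line
ax.set_yticks(list(ypos) + [0])
ax.set_yticklabels(labels + ["Meta-analysis"])
ax.set_xlabel("Effect size (beta)")
plt.show()
```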
Now if you do a meta-analysis,
and you collaborate with many people to get data from all kinds of centers,
then you need to ask for the right information.
So what would that be?
If you look, for example, at this table,
you see three studies, A, B,
and C, and three SNPs.
Now for each of these studies and SNPs,
we collected information on n,
the sample size, beta,
the effect size, and the standard error.
Now let's have a look at the first SNP, rs1234.
If you do a meta-analysis on this SNP,
you will see that, unfortunately,
the results are not significant.
Your p-value is 0.45.
So you start thinking,
"What can I do to check whether this is true?"
If you look at this beta value for study C,
it looks slightly suspicious.
Its sign is in the opposite direction.
So now, maybe pause your video,
and think what information you can ask from your collaborators.
Right.
The thing that you're missing is
what the reference allele and the effect allele were.
And as you can see, now that we have asked our collaborators in study C for this information,
they have actually flipped the alleles compared to the other studies.
Now, if we re-code the alleles from A to G instead of G to A,
we have to flip the sign of the effect estimate, of the beta,
and if we re-run the meta-analysis now,
we do get a significant result.
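As a sketch of what this looks like in practice, the snippet below takes hypothetical summary statistics for one SNP, flips the sign of beta for the study whose allele coding is swapped, and re-runs a simple inverse-variance-weighted (fixed-effect) meta-analysis. The numbers, and the fixed-effect pooling itself, are illustrative assumptions rather than the exact pipeline from the lecture.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics for one SNP;
# study C reports the same variant with the alleles swapped
studies = [
    # (effect_allele, other_allele, beta, se)
    ("A", "G",  0.10, 0.04),   # study A
    ("A", "G",  0.12, 0.05),   # study B
    ("G", "A", -0.11, 0.04),   # study C: swapped coding
]

ref_effect, ref_other = "A", "G"    # the coding we harmonize to

betas, ses = [], []
for ea, oa, beta, se in studies:
    if (ea, oa) == (ref_effect, ref_other):
        betas.append(beta)          # coding already matches
    elif (ea, oa) == (ref_other, ref_effect):
        betas.append(-beta)         # swapped coding: flip the sign of beta
    else:
        continue                    # alleles do not match at all -> go back and ask
    ses.append(se)

betas, ses = np.array(betas), np.array(ses)
w = 1.0 / ses**2                    # inverse-variance weights
beta_meta = np.sum(w * betas) / np.sum(w)
se_meta = np.sqrt(1.0 / np.sum(w))
z = beta_meta / se_meta
p = 2 * stats.norm.sf(abs(z))       # two-sided p-value
print(beta_meta, se_meta, p)
```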
Another example.
Here, you see the same situation as before,
but again, there is no significance.
So again, you start scratching your head and think, "What can I ask?
What could be wrong in this case?"
So pause the video and think again.
What are we missing?
Yes, it's the strand.
And if we ask our collaborators in Study C for the strand,
you see that, for them,
the strand was the minus strand.
Now if we re-code this variant in their study to the plus strand,
we have to change the alleles.
So instead of A/T, it becomes T/A.
And as we saw in the previous example,
we can flip them back, change the sign,
and we are happy again because our result is significant again.
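A minimal sketch of the strand flip itself, assuming plain base complementation is all that is needed for this variant:

```python
# Complement each allele to move a minus-strand report onto the plus strand
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def to_plus_strand(effect_allele, other_allele):
    """Return the allele pair as it reads on the plus strand."""
    return COMPLEMENT[effect_allele], COMPLEMENT[other_allele]

# Study C reported A/T on the minus strand; on the plus strand this reads T/A,
# so relative to the other studies the effect allele is swapped and the sign
# of beta has to be flipped, exactly as in the previous example.
print(to_plus_strand("A", "T"))   # ('T', 'A')
```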
So this is just to show you that you need to ask a lot of information and check many,
many things before you run your meta-analysis.
Now here, you see some typical things that you need to ask.
Some things are really required,
others are good as a check.
So for example, you need to know the rs number, or the SNP name,
what reference allele was used,
the genome build, the strand,
how many samples were used,
of course, what the effect estimate was
and its standard error,
what the effect and coded alleles were,
the allele frequency of the effect allele,
the chromosome on which it was,
and its genomic position.
And a few other things that you may want to ask are what
p-value was used as a cutoff for your Hardy-Weinberg equilibrium checks,
whether the SNP was imputed or not,
if it was imputed, what its R-squared value was,
the p-value of the effect,
the QC thresholds that were used for the array,
and whether any genomic control correction was applied,
so the lambda GC that was used.
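One way to keep track of this checklist is to encode it as a simple column check on each collaborator's file. The column names below are hypothetical and would have to match whatever naming convention your consortium agrees on.

```python
# Hypothetical column names for the per-study summary-statistics files
REQUIRED = ["SNP", "CHR", "POS", "GENOME_BUILD", "STRAND",
            "EFFECT_ALLELE", "CODED_ALLELE", "EAF",
            "N", "BETA", "SE"]
OPTIONAL = ["P", "HWE_P_CUTOFF", "IMPUTED", "IMPUTATION_RSQ",
            "ARRAY_QC_THRESHOLDS", "LAMBDA_GC"]

def missing_columns(columns):
    """Report which required fields a collaborator's file is missing."""
    return [c for c in REQUIRED if c not in columns]

# Example: a file that forgot to report the strand and the allele frequency
print(missing_columns(["SNP", "CHR", "POS", "GENOME_BUILD",
                       "EFFECT_ALLELE", "CODED_ALLELE", "N", "BETA", "SE"]))
```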
So now, you're ready for some quality control of this meta-analysis data.
So before you do that,
before you start pooling all the data together,
you actually need to calculate genomic control again,
and make a QQ plot just to verify the things that were sent to you.
And of course, you try to harmonize the QC as much as you can,
and if you find data points
for which you don't know what was done,
with no idea about the QC
or how they were handled,
it's better to exclude those from your meta-analysis.
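As a sketch of this pre-meta-analysis check, assuming each study sends you per-SNP p-values, the lambda GC and the points for a QQ plot could be computed like this:

```python
import numpy as np
from scipy import stats

def lambda_gc(pvalues):
    """Genomic-control lambda: median observed 1-df chi-square divided by
    the median of the chi-square(1) distribution (about 0.456)."""
    chisq = stats.chi2.isf(pvalues, df=1)      # convert p-values back to chi-square
    return np.median(chisq) / stats.chi2.ppf(0.5, df=1)

def qq_points(pvalues):
    """Expected vs observed -log10(p) for a QQ plot."""
    n = len(pvalues)
    observed = -np.log10(np.sort(pvalues))
    expected = -np.log10(np.arange(1, n + 1) / (n + 1))
    return expected, observed

# Example with uniform (null) p-values: lambda should be close to 1
print(lambda_gc(np.random.default_rng(1).uniform(size=10000)))
```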
Now, after you run the meta-analysis, what you can do,
or what you have to do, is to test the assumptions for the models that we use.
So you want to check for heterogeneity,
but you only do that for the significant SNPs.
Otherwise, it will be too much work.
And if you find any heterogeneity,
you will have to explain this in your paper.
One way to test for heterogeneity is Cochran's Q statistic,
which is shown over here.
And as you can see, it's defined as the weight of each study
multiplied by the squared difference between that study's
effect size and the meta-analysis effect,
and then summed over all studies.
Now this statistic is distributed as a chi-square with N minus one degrees of freedom,
where N is the total number of studies.
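Written out in the notation used earlier, with w_i the study weights, beta_i the per-study effect estimates, and beta_meta the meta-analysis effect, the statistic reads:

```latex
Q = \sum_{i=1}^{N} w_i \left(\hat\beta_i - \hat\beta_{\mathrm{meta}}\right)^2,
\qquad Q \sim \chi^2_{N-1} \ \text{under the null hypothesis of no heterogeneity.}
```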
Another way to quantify the same information is using the I squared statistic,
which gives you basically a number between zero and 100, it's a percentage.
And in general, an I squared larger than,
let's say, 50 percent is deemed
large heterogeneity, and that means
that most of the variability is due to this heterogeneity,
not due to chance.
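Here is a small sketch that computes Cochran's Q, its p-value, and I squared for a single SNP from per-study betas and standard errors; the three example studies at the bottom are hypothetical, chosen so that one effect clearly disagrees with the others.

```python
import numpy as np
from scipy import stats

def heterogeneity(betas, ses):
    """Cochran's Q, its p-value, and I-squared (in percent) for one SNP."""
    betas, ses = np.asarray(betas), np.asarray(ses)
    w = 1.0 / ses**2                               # inverse-variance weights
    beta_meta = np.sum(w * betas) / np.sum(w)      # fixed-effect pooled beta
    q = np.sum(w * (betas - beta_meta) ** 2)       # Cochran's Q
    df = len(betas) - 1
    p_het = stats.chi2.sf(q, df)                   # chi-square with N-1 df
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, p_het, i2

# Hypothetical example: three studies, one with an outlying effect
print(heterogeneity([0.10, 0.12, -0.11], [0.04, 0.05, 0.04]))
```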
So you need to check this and go back,
and ask or find out what is wrong with that SNP.
Maybe it was just a badly imputed SNP.
So, once you finish this meta-analysis and you have your test statistics,
you compute the chi-square values,
and then the two-sided p-values.
Again, you check your QQ plot,
so you look for your lambda GC,
you use any known associations as sanity checks, if there are any,
and then you really want to stick to
the nominal genome-wide significance p-value threshold of five times 10 to the minus eight.
Only if you have a population with low LD
could you maybe set a stricter threshold, like two times 10 to the minus eight,
but in general, don't touch that.
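As a sketch of that step, assuming you have the meta-analysis z-scores (beta divided by its standard error) for each SNP, the 1-df chi-square statistics, two-sided p-values, and genome-wide significance flags could be obtained like this:

```python
import numpy as np
from scipy import stats

# Hypothetical meta-analysis z-scores, one per SNP
z_scores = np.array([5.8, 1.2, -6.1])

chisq = z_scores ** 2                   # 1-df chi-square statistics
pvals = stats.chi2.sf(chisq, df=1)      # identical to the two-sided normal p-value
genome_wide = pvals < 5e-8              # the standard threshold
print(pvals, genome_wide)
```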
Then you create your Manhattan plots,
and inspect the top hits in databases like dbSNP,
or you do pathway analysis,
or maybe do some conditional analysis.
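To round this off, here is a minimal sketch of a Manhattan plot built from simulated chromosome, position, and p-value columns; in practice these would of course come from your meta-analysis output.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated results: 500 SNPs on each of 22 chromosomes (illustration only)
rng = np.random.default_rng(0)
chrom = np.repeat(np.arange(1, 23), 500)
pos = np.tile(np.arange(500), 22)
pvals = rng.uniform(size=chrom.size)

# Simple cumulative x-coordinate so the chromosomes sit next to each other
x = pos + chrom * 500
colors = np.where(chrom % 2 == 0, "grey", "steelblue")

plt.scatter(x, -np.log10(pvals), c=colors, s=4)
plt.axhline(-np.log10(5e-8), color="red", linestyle="--")   # genome-wide line
plt.xlabel("Genomic position (chromosomes 1-22)")
plt.ylabel("-log10(p)")
plt.show()
```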
Now with that information,
I think you can confidently run your own multi-center GWAS.