One of the most common applications of sequencing, or high-throughput technology, to genomics that uses statistics is whole genome sequencing, the natural extension of what were previously called genome-wide association studies. So the idea here is we're basically directly measuring variability in the DNA. We want to identify different types of variants, whether it's a single nucleotide variant or a deletion or insertion of a particular region in the genome. One way that people do this is with direct re-sequencing of DNA. The idea is you have a DNA molecule, you fragment that DNA, you sequence it, and then you look for variability: you basically look for variations compared to the standard reference genome and identify whether there's any variability there. Now, more recently people have been asking whether there should be more than one reference genome, and there are other ways to identify and define variability, but for the moment the standard pipeline is to identify those DNA variations that differ from the reference and then quantify how much those variations associate with different outcomes. So basically, once you've got the genome and you've got your fragments, you can look at a position and see, for example, that some fragments have a C and some have a G there, and you might say, oh, this might be a heterozygote for a particular variant that doesn't necessarily appear in the reference sequence. You can do the same sort of thing with a microarray, and in fact it was done very often with microarrays; that's why the genome-wide association study approach was typically carried out on microarrays, and whole genome sequencing is the natural extension of that idea to sequencing. With a microarray you do a similar sort of thing: a digestion step that fragments the DNA, and then a comparison step.
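As a rough illustration of that read-counting intuition, here's a minimal sketch in Python. The counts and the fraction cutoff are hypothetical, purely to show the idea; real callers like GATK and FreeBayes use proper statistical models of base quality and sequencing error rather than a simple threshold.

```python
# Toy illustration of calling a genotype at one position from aligned reads.
# Thresholds and counts are hypothetical; this is not how production callers work.

def call_genotype(ref_base, base_counts, min_frac=0.2):
    """Call hom-ref / het / hom-alt from per-base read counts at one site."""
    total = sum(base_counts.values())
    # Most common non-reference base and its fraction of the reads.
    alts = {b: n for b, n in base_counts.items() if b != ref_base}
    alt_base, alt_n = max(alts.items(), key=lambda kv: kv[1], default=(None, 0))
    alt_frac = alt_n / total if total else 0.0
    if alt_frac < min_frac:
        return ref_base + ref_base          # homozygous reference
    if alt_frac > 1 - min_frac:
        return alt_base + alt_base          # homozygous variant
    return ref_base + alt_base              # heterozygote

# Roughly half the reads carry a C and half a G at this site (plus one
# likely sequencing error), so the simple rule calls a heterozygote.
print(call_genotype("G", {"G": 14, "C": 12, "T": 1}))  # GC
```

In practice the same logic runs at every covered position of the genome, with read and base qualities feeding a likelihood model instead of a hard fraction cutoff.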
With a microarray, you have these fragmented samples and you compare them on probes that probe for the homozygous reference allele, the homozygous variant, and the heterozygote. You identify which one it is, and so you can basically use that to genotype samples. This is the technology that, at least currently, 23andMe and other places like that use to very cheaply genotype lots of people. So the first step is variant identification. If you're using a SNP chip or a microarray, you can use software like the crlmm package in Bioconductor: it compares the intensity levels you've observed for the different genotype probes and then makes a genotype call for each variant. Variant identification with sequencing is a little more complicated, especially for whole genome sequencing, given the very high amount of data being generated. Two very common pipelines used for this are FreeBayes and GATK, which often require fairly heavy computation to identify variants in whole genome sequencing data, which is often very large. The next thing to take into account, as in any standard statistical analysis, is confounders, and one of the most common confounders is population stratification. Basically, if you're looking for associations between variants and disease, the most common confounder is that there's population structure, and that population structure might also be associated with the disease. There are many ways to address this, but here are two concrete examples: the EIGENSTRAT software and the snpStats package in Bioconductor will do this sort of PCA-based adjustment for population stratification. Once you've adjusted for those confounders, you do a set of statistical tests.
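The PCA idea behind that stratification adjustment can be sketched in a few lines. This is a simulation, not the EIGENSTRAT implementation: genotypes are coded 0/1/2 and two artificial subpopulations are given different allele frequencies, so the top principal component picks up the population structure.

```python
import numpy as np

# Sketch of PCA-based stratification adjustment on simulated genotypes.
# Two hypothetical subpopulations differ systematically in allele frequency.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 200
freqs = np.vstack([rng.uniform(0.1, 0.5, n_snps),    # population A
                   rng.uniform(0.5, 0.9, n_snps)])   # population B
geno = np.vstack([rng.binomial(2, freqs[0], (n_per_pop, n_snps)),
                  rng.binomial(2, freqs[1], (n_per_pop, n_snps))]).astype(float)

# Center each SNP, then take the top principal components via SVD.
centered = geno - geno.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :2] * s[:2]            # sample scores on the top 2 PCs

# The first PC separates the two subpopulations, so including it as a
# covariate in the association model absorbs the population structure.
print(pcs[:n_per_pop, 0].mean(), pcs[n_per_pop:, 0].mean())
```

Those PC scores are what you would then carry forward as covariates in the per-SNP regression models.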
Usually this is done SNP by SNP, or variant by variant, and you basically test for an association with the outcome. You might do this with a logistic regression model like we talked about, adjusting for some principal components. At the end of the day you calculate a P-value for every SNP, and then you often make these Manhattan plots where you plot the minus log 10 P-value. That means the smaller the P-value, the higher the value on the chart, since you have the negative there. So you end up with plots where the signals look like spikes that rise above a threshold, and people often use Bonferroni corrections because relatively few signals are typically expected in a disease association study. Once you've identified those variants, you can go down and annotate them and try to determine whether they're the causal variant. This is actually very tricky and a hard thing to do. What you've done so far is just an association study: you've identified variants that are associated with the disease, but you're trying to figure out which one might actually cause it, if you can. So people use software like PLINK, and the Annotating Genomic Variants workflow in Bioconductor, to drill down into particular regions and see which SNPs are most highly associated, what the LD structure around them looks like, and what genes they're near or what regions they're in, and so characterize the variability genomically in the regions you've identified. Now, actually establishing one of them as the causal variant is quite another story and quite a bit harder to deal with, and so a whole bunch of software has been developed to take steps toward that.
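The SNP-by-SNP testing and Bonferroni step can be sketched as follows. For simplicity this uses a 2x2 allele-count chi-square test on simulated data rather than the logistic regression with PC covariates described above; the data, effect size, and SNP count are all made up, with SNP 0 given a real case/control frequency difference so it should clear the Bonferroni threshold.

```python
import math
import random

# Sketch of SNP-by-SNP association testing with a Bonferroni cutoff.
# Simulated allele counts; SNP 0 is truly associated, the rest are null.
random.seed(1)
n_cases, n_controls, n_snps = 500, 500, 100

def chi2_p(a, b, c, d):
    """P-value for a 2x2 table [[a, b], [c, d]] (chi-square, 1 df)."""
    n = a + b + c + d
    x = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(x / 2))   # survival function of chi2, 1 df

pvals = []
for snp in range(n_snps):
    f_case = 0.5 if snp == 0 else 0.3    # SNP 0 has a real frequency shift
    f_ctrl = 0.3
    case_alt = sum(random.random() < f_case for _ in range(2 * n_cases))
    ctrl_alt = sum(random.random() < f_ctrl for _ in range(2 * n_controls))
    pvals.append(chi2_p(case_alt, 2 * n_cases - case_alt,
                        ctrl_alt, 2 * n_controls - ctrl_alt))

# Bonferroni: control family-wise error at 0.05 across all tested SNPs.
threshold = 0.05 / n_snps
hits = [i for i, p in enumerate(pvals) if p < threshold]
manhattan_y = [-math.log10(p) for p in pvals]   # heights on a Manhattan plot
print(hits)
```

The `manhattan_y` values are exactly what gets plotted against genomic position in the spike plots described above.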
So for example the CADD and VariantAnnotation software will basically categorize the variants you've identified into various different classes: whether they're a synonymous or nonsynonymous variant, whether they're in an intronic region or in a splice site. Using that information, they can assign a score that says, this is how deleterious we think this variant might be. Ultimately, what you need to do is downstream experiments to identify the functional variation associated with those genetic variants. Now, here I've talked about one particular type of application of whole genome sequencing, or genome-wide association studies: the population-based inference of disease-causing variants. But there are obviously many other applications, including family-based studies and rare variant studies, which we're not going to be talking about here.
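The categorize-then-score idea can be sketched very crudely. The categories below come from the discussion above, but the variant names and the weights are entirely made up for illustration; real scores like CADD are trained on far richer features than the annotation class alone.

```python
# Toy sketch of annotation-based deleteriousness scoring.
# Weights and variant names are hypothetical, for illustration only.
HYPOTHETICAL_WEIGHTS = {
    "synonymous": 0.1,      # usually doesn't change the protein
    "nonsynonymous": 0.6,   # changes an amino acid
    "splice_site": 0.9,     # can disrupt splicing entirely
    "intronic": 0.2,        # mostly benign, occasionally regulatory
}

def deleteriousness_score(annotation):
    """Look up a crude score for a variant's annotated category."""
    return HYPOTHETICAL_WEIGHTS.get(annotation, 0.0)

variants = [("rs_fake_1", "synonymous"),
            ("rs_fake_2", "splice_site"),
            ("rs_fake_3", "intronic")]
ranked = sorted(variants, key=lambda v: deleteriousness_score(v[1]),
                reverse=True)
print([name for name, _ in ranked])  # splice-site variant ranks first
```

A ranking like this only prioritizes candidates for follow-up; as noted above, it takes downstream experiments to establish function.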