After you've done a differential expression analysis, often you want to be able to combine the results into some interpretable story of what's happened. And so one way to do that is to use gene setting enrichment analysis to identify gene sets that are enriched among the differentially expressed genes. You could also use enrichment in lots of other ways to identify regions in the genome that are enriched for different signals. I'm going to hear folks on gene set enrichment for a differential expression experiments. So, I'm going to set up my parameters as always, and I'm going to load the packages that we're going to need, in this case the two main drivers are going to be DESeq2, and goseq. So I'm going to load those packages, and then I'm going to perform an analysis using this goseq package. So the first thing you might want to look at is it tells you which genomes are supported, in other words, which genomes does it have information on that genome. So, it has these different databases and the species. And that way you can go and look and see is the genome you're using for alignment or for analysis or for what organism your using, does it exist in the database. Because it's going to need to be able to calculate the length of the genes in the genome in order to do the appropriate adjustment. You could also look and see which gene IDs are supported. You can actually build your own supported genome with the goseq package if you have the data available in the right format. You can sort of reformat and add to these lists of available cases. So here I'm going to use an example from the goseq package. I'm going to load that example here. So I'm loading a system file. So basically this is just loading a data set from this package, from the goseq package. And so I've read that in to R, and so this data set actually, so if you look at the first row of this data set, you can see that the genes are labeled within the data set. So we're going to remove that first column. And we're going to label the genes with that column, but we're going to remove it from the data set. So the first thing that I do is I remove that column from the data set. Then I'm going to make the row names of our new expression object to be equal to that first row. So I've made the row names equal to the gene names. And then I'm going to do some filtering here to remove lowly expressed genes. This is particularly important in this case because otherwise we'll end up with some genes that can't be tested. And so, then once you start combining results for genes that can't be tested, it's going to be harder downstream. So in this case, I'm going to have to set the group variable up. It doesn't come with the pre-loaded data set. But the first four samples are control samples, and the last three samples are treated samples. So then I'm going to create a pdata object by just basically building a data.frame with just that one variable in it. And then I can pass that data to DESeq. So again I have to build the DESeqDataSetFromMatrix. I have to use that quite complicated long function to pass it the expression data, the phenotype data and the model we want to fit, in this case the relationship to group. And then I can use DESeq to identify differentially expressed genes. So it's going to take a second to fit that model because it's again fitting a relationship between the mean and the variance with the model fit. And then I can calculate the results using the results function. That, okay. So now I have the differentially expressed results. And one of the things that I see in that differentially expressed results table is I've got a pvalue or an adjusted pvalue for every single gene. So the first thing I'm going to do is I'm going to find those genes that are differentially expressed at an FDR level. 0.05, and so I'm going to do that by looking at the p adjusted. So this is the adjusted pvalues for false discovery rate. So we're going to look at those that are < 0.05 to identify the genes that are differentially expressed at that level. Okay, and then I'm going to remove the ones where there wasn't enough data to check, basically, for differential expression, differentially expressed genes. And then I'm going to apply row names. So, I have to apply the row names to this data set because the function requires the row names to match for the genes to the genomic data that we're going to be downloading in just a minute. So, then I'm going to remove the genes where we can calculate a statistic. Okay, so now what I am going to do is I am going to basically go back and identify in this case the hg19 genome. So what I am going to do is I am going to have to calculate the probability weight function. And so you can do that by passing it the genes, the ones that are differentially expressed. You can pass it the genome that it's going to be using, and then you can tell it that it's going to be using the ensemble gene references. And, so now we need to calculate, basically, a set of weights that are calculated basically proportional to the, sort of the size of the genes when you're doing a gene expression analysis. In RNA sequencing data, the length of the gene actually matters for the levels of expression which relates to power, which means you have to adjust for that when you're doing a GO analysis. So then you can use the goseq package to actually calculate the statistics for enrichment. And so basically you apply the goseq function to the probability wave function that you calculated from these genes. And then you pass it again, the genome that you're looking at. And so it's going to go out, and it's going to try to find the GO annotations from the web. And then it's going to tell you if there are any genes where it couldn't find the annotations. And so then it's going to calculate the pvalues. This is relatively quick compared to, you can also use a non-parametric version of this test by changing some of the parameters. But if you do that, you, it'll be slower because it's basically going to commute the labels and recalculate. So if I look at the result of that, it tells me the GO category names. You can look this up on the Gene Ontology website, how over_represented it is, what the pvalue is, more under_represented. And then it tells you the number of genes that were differentially expressed in that category, the number of genes that are in that category overall. And then it tells you what the category is, so single-organism cellular processes, extracellular region part, and so forth for this differential expression experiment. Okay, if you want to look at an particular category, you can do that too. So, here, it's the exact same test I ran before, the goseq test. I pass it a probability weight function, the genome that we care about. But now if I want to change the categories, if I say only test the molecular function categories of gene ontology. So gene ontology is broken up into biological processes, molecular functions, and cellular locations. And so if you only want to test the molecular functions, you can actually have goseq go and actually only analyse certain subsets of the annotation. So this is helpful if you know for example that you only care about the function of the genes, and you don't really care about where they're located in the cell. And so, then I look at this, and you can see that now that ontology term is molecular function for every single one. So that's a brief introduction to how you identify these categories. Remember to keep in mind that the more specific they are, the more interesting they are. If they're really general purpose categories, it's sometimes a little bit hard to interpret.