Hi, in this lecture we're going to begin discussing gene set enrichment analysis. So let's look again at an example of a gene set library. In this particular case we are looking at a gene set library created from the Gene Ontology. The Gene Ontology is a general effort to associate genes with functions, so it fits perfectly with the idea of having a gene set associated with some biological function. For example, look at the first line in this file; this is the text file open in Excel again. The term is protein secretion, and it has a GO ID. Then it has the genes associated with that term. The second column is reserved for a description. This is the format of the GMT file. Here I'm demonstrating how GMT files can be used for their main purpose, which is enrichment analysis.
So a little bit more about the Gene Ontology. This is an important component of data analysis in computational systems biology. It's an effort to associate genes with function, and the functional terms are organized in a tree that captures the level of detail, or granularity, of each term. Each term is associated with a parent term that is a little bit more general. Genes are attached to the branches of that Gene Ontology tree, where the root of the tree is either cellular component, biological process, or molecular function. The terms become more and more specific as we go down the tree, and genes can be associated with terms at any level of the tree. There are annotators and a consortium that keep assigning genes to terms, as well as establishing the structure of these trees of functional terms.
This is another example of a GMT file, or gene set library. This one was created from the KEGG pathway database. The KEGG pathway database is maintained at Kyoto University in Japan, and it is a manually curated database of cell signalling pathways. And here, the term is a pathway.
It has a KEGG ID, which indicates a pathway in human cells, and then you have the pathway name, and then the genes that are members of that pathway. If we go to the KEGG database website, those pathways are described in images that are drawn as directed graphs, where the nodes are mostly proteins. They are connected based on their functional or physical relationships, which describe the flow of the signal, typically from the membrane. Here you see various growth factors binding to growth factor receptors. Those receptors transduce the information into the inside of the cell through a cascade of cell signalling components, reaching an important component, which is ERK, or MAP kinase, and that gives this pathway its name. The MAP kinase pathway is a classical pathway. The KEGG database is one of the most comprehensive manually curated databases that you'll find for cell signalling pathways in mammalian cells.
So what we've done is take the database and flatten it out by creating gene sets from each of those diagrams, or maps, of cell signalling pathways. So we destroyed those ontologies. For example, in the Gene Ontology we have this tree structure of functional terms, and we just took functional terms at certain levels of the tree and created gene sets from them. We also, sort of, destroyed the pathway information from KEGG, and we created gene set libraries from that. The reason we are doing that is that we can now apply a simple test to compute gene set enrichment for lists of genes or proteins that were identified as differentially expressed in various experiments applied to mammalian cells. In addition, we can start finding relationships between the pathways or between the terms. Here we are concerned with prioritizing functional terms based on their enrichment, given a set of differentially expressed genes or proteins.
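As a minimal sketch of what flattening into gene sets makes possible, the snippet below reads GMT-formatted text into a dictionary and intersects each set with an input list. It assumes tab-separated columns (term, description, then genes), which is the usual GMT convention. The function name `parse_gmt`, the toy terms, and the placeholder gene names are made up for illustration and are not taken from the Gene Ontology or KEGG files shown in the lecture.

```python
def parse_gmt(text):
    """Return {term: set_of_genes} from GMT-formatted text.

    Assumed layout per row: term <TAB> description <TAB> gene1 <TAB> gene2 ...
    """
    library = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank rows or rows with no genes
        term, _description, *genes = fields
        # Normalize case and drop empty cells left by trailing tabs.
        library[term] = {g.strip().upper() for g in genes if g.strip()}
    return library


# Toy library (hypothetical terms and placeholder gene names).
TOY_GMT = (
    "toy_term_1\tna\tGENE1\tGENE2\tGENE3\n"
    "toy_term_2\tna\tGENE2\tGENE4\n"
)

library = parse_gmt(TOY_GMT)

# A made-up input list standing in for differentially expressed genes.
input_genes = {"GENE2", "GENE3", "GENE5"}

# The overlap with each row is the raw material for an enrichment test.
overlaps = {term: sorted(genes & input_genes) for term, genes in library.items()}
print(overlaps)
```

Storing each row as a Python set keeps the overlap computation to a single intersection per term, which is all that the flattened representation needs.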
So, this is the simplest and most commonly applied test to measure enrichment for a list of genes that were identified experimentally, in order to prioritize and rank the terms in those gene set libraries. For each set, we compute a P-value that evaluates the enrichment level of that term with your input list of genes. This is a contingency table problem. You fill the table below with the information that represents the overlap between your input gene list and each set in those gene set libraries; each set is a row in the files that we just looked at. So you're looking for how many genes overlap between your list and each row in those gene set libraries, and that number goes in the top left. Then you need the number of differentially expressed genes that you identified, and the number of genes in the set, that is, the number of genes in that row. And then the sum, or the total number of genes in the gene set library; that number would be the denominator, and it goes in the bottom right square of this contingency table.
Now, using the Fisher exact test we can compute the probability of the overlap by comparing it to the hypergeometric probability of filling this table for random data. By plugging in the numbers, as you see here in an example from MathWorld, you can compute R1 and R2, which are the sums of the rows, and C1 and C2, which are the sums of the columns. Then you can compute N, which is the total over all four squares. Once we have R1, R2, C1, and C2, we can plug them into this formula, which gives us the probability, that is, the chance of seeing that much overlap between the gene set that we query against and our input gene set.
We have implemented this approach in a tool called Enrichr. Enrichr has been a very popular tool. Every day, about 100 to 200 lists are uploaded to the system.
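The contingency table test described above can be sketched in a few lines of Python. This is an illustrative version, not Enrichr's actual implementation. It assumes a one-sided test (over-representation only), and it assumes the rows of the table are "in the gene set / not in the gene set" and the columns are "in the input list / not in the input list". The function names and the toy numbers are hypothetical. `table_probability` is the row and column sum formula for a single table, and `enrichment_p_value` adds up that probability over every table with at least the observed overlap.

```python
from math import comb, factorial


def table_probability(a, b, c, d):
    """Probability of one 2x2 table [[a, b], [c, d]] given fixed row and column sums."""
    r1, r2 = a + b, c + d          # row sums
    c1, c2 = a + c, b + d          # column sums
    n = r1 + r2                    # total over all four squares
    numerator = factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


def enrichment_p_value(overlap, list_size, set_size, background_size):
    """One-sided Fisher exact P-value: chance of at least `overlap` shared genes."""
    total = comb(background_size, list_size)
    max_overlap = min(list_size, set_size)
    # Sum the hypergeometric probabilities of the observed overlap and
    # of every larger overlap that the table margins allow.
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (assumptions): 10 background genes, a gene set of 4,
# an input list of 3, and 2 genes shared between them.
p_exact_table = table_probability(2, 2, 1, 5)
p_value = enrichment_p_value(overlap=2, list_size=3, set_size=4, background_size=10)
print(p_exact_table, p_value)
```

Ranking the terms of a library then amounts to computing this P-value for each row and sorting from smallest to largest. The choice of background size matters a great deal in practice and is left as an explicit argument here.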
What Enrichr also has, which is very powerful, is various types of visualization that generate vector graphic images that are exportable for publications. Now I'm going to show you a live demo of Enrichr. First, you can just type Enrichr into Google. Notice that it's "enrich" and then "r" without the "e". Then you get a link, and this will take you to the website. This is the upload part of the website. Here you put in your list of genes. If you don't have a gene list, you can use the example, and you can click on "try regular example". You can also upload files with gene sets by using the Choose File button. Once you have entered your list of genes, it automatically tells you how many genes you entered. You also have an option to fill in a name for your list. Once you have uploaded the gene list, you press the arrow here, and that executes the enrichment analysis.
So immediately, we are shown the transcription category of gene set libraries. Now we can start looking through the results by clicking on each of those categories. For example, for ChEA, we just click on the word ChEA. Then we get a bar graph that visualizes the enrichment level for each of the terms in the ChEA database. The ChEA database is a manually curated database that we developed in our lab, where we extract, from publications that describe ChIP-chip and ChIP-seq experiments, the targets of mammalian transcription factors. So this is the name of the transcription factor, and this is the PubMed ID, which gives you a link to the paper that was published about the targets of that transcription factor, as well as the organism in which the study was done. You can click on the bars, and that changes the ranking. We have three methods of ranking. One of them uses the Fisher test, and another one is a modified test that we developed. You can also change the colors of the bar graph. You can view the results as a table.
And then when you mouse over the enriched terms, you can get the genes that overlap between the term and the input list. Here you get the P-value and the Z-score, which is computed using a method that we developed. You also get a combined score, which multiplies the log of the P-value by the Z-score. You can also see the results as a grid. Here the entire gene set library is visualized as a grid, and the enriched terms from the library are highlighted on it. If you click anywhere on this grid, you switch to a view that shows you the background GMT file as a network, where highly dense clusters of transcription factors that share targets are in close proximity to each other, and those are the brighter spots on this grid. The enriched terms are visualized as circles. The final visualization is a network of enriched terms. Here, we connect the enriched terms based on their similarity. After you look at each of those categories, you can identify the most interesting enriched terms.
So in the transcription factor category, we have data from ChEA, and TRANSFAC and JASPAR position weight matrices, which are computationally identified binding sites for transcription factors. There are also position weight matrices from the Genome Browser, which are also computationally identified motifs in the upstream regions of genes, based on consensus sequences. The Histone Modification ChIP-seq gene set library was created by processing data from the NIH Roadmap Epigenomics project. Yanko will describe to you later on how the ChIP-seq histone modification data were processed to create this gene set library. MicroRNAs have many target genes, and we took predicted target genes for microRNAs from the database called TargetScan. ENCODE is a large project that profiled transcription factors in various mammalian cells.
And we processed the entire ENCODE database to create a gene set library for the ENCODE transcription factor ChIP-seq data. That will also be covered in Yanko's lecture, which will describe in detail the processing of those data sets. We also processed transcription factor perturbations from the Gene Expression Omnibus, where you have gene expression microarrays after perturbations of various transcription factors. This was done by Kevin Hu, who was a summer student in our lab.
Besides the transcription factor category, you also have a pathways category. We just went over the KEGG pathway database. There is also a very large pathway database called WikiPathways, and then Reactome and BioCarta, which are also major, well-known pathway databases. PPI Hub Proteins is a gene set library that we created from the human protein-protein interaction network, where we took only the proteins that have 50 or more known protein-protein interactions. In this particular example, CDK5 has the most enrichment for genes that interact with it. The KEA database is a kinase-substrate gene set library that we developed in our lab. CORUM is a database of protein complexes. SILAC phosphoproteomics is an experimental technology that uses mass spectrometry to identify changes in phosphorylation on proteins.
In the Ontologies category, we have the Gene Ontology, which I just described to you, and also the MGI Mammalian Phenotype Ontology, which is an ontology that was developed at the Jackson Lab, where they describe various properties of mice after knockdown of various genes. So here, for example, you have the ontology ID and the term; the most enriched term for this gene set is abnormal lipid homeostasis. This particular phenotype was observed for these genes, which are part of our input list.
In the Disease/Drugs category we have up- and down-regulated genes from CMAP, or the Connectivity Map, and we're going to talk about the Connectivity Map later in this course. We also have genes associated with various diseases. There is also a large dataset called GeneSigDB that was borrowed from the group at the Broad that originally developed the idea of gene set enrichment analysis. We also have proteins that interact with various viral proteins. This is from a database called VirusMINT. Here a virus is the functional term, and all the proteins that interact with its viral proteins are listed in each set. And this is a very recent gene set library that we created, which lists drug perturbations that we processed from GEO.
The cell type category has genes that are highly expressed in various cell types. The first two libraries are from the BioGPS Human and Mouse Gene Atlases. Those describe various tissues, for example, that were profiled with microarrays. What we've done is identify the genes that are highly expressed in each tissue. This is the same exercise, comparing various cancer cell lines. So this is comparing gene expression in 500 cell lines and identifying the genes that are highly expressed in each cell line. This was also done for the NCI-60, which is a panel of 60 cell lines maintained by the NCI. In the Miscellaneous category, we have genes that share a chromosomal location, genes that share metabolites, and genes that share structural domains.
Enrichr also has a feature where you can share your results with your colleagues. If you click on this link button, you get a URL. That URL can be saved and then emailed to a colleague to share the results of your enrichment analysis. If you click on the Enrichr logo, you can see that there are some more places to go on the website. In the Dataset Statistics you can see the list of all the gene set libraries and their sources. So this label links to the source of the data.
This one will take you to the Connectivity Map. It also tells you how many terms are in each gene set library, and how many genes are covered by the entire gene set library. You can also use this Find a Gene function. Here, you put in a single gene. For example, if I put in MAPK3, which is also known as ERK1, one of those famous cell signalling components, I can obtain all the sets in the database that this gene belongs to. So, based on this analysis, MAPK3 is regulated by these transcription factors, based on ChEA. If you go to the pathways, it belongs to these cell signalling pathways. When you knock it down, it causes these phenotypes. When you treat cells with these drugs, the expression of this gene is up-regulated. So this concludes the Enrichr demo. There are many other tools you can use for gene set enrichment analysis. There are also more sophisticated tests for enrichment.
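The single-gene lookup shown in the demo can be thought of as an inverted index over all the gene set libraries: instead of going from a term to its genes, you go from a gene to every term that contains it. The sketch below is one hypothetical way to build such an index and is not a description of how Enrichr stores its data. The library names, term names, and gene names are placeholders.

```python
from collections import defaultdict


def build_gene_index(libraries):
    """Map each gene to {library_name: [terms containing that gene]}."""
    index = defaultdict(lambda: defaultdict(list))
    for library_name, gene_sets in libraries.items():
        for term, genes in gene_sets.items():
            for gene in genes:
                index[gene][library_name].append(term)
    return index


def find_gene(index, gene):
    """Return the terms that contain `gene`, grouped by library and sorted."""
    # .get avoids creating an empty entry for genes that are not indexed.
    hits = index.get(gene, {})
    return {library: sorted(terms) for library, terms in hits.items()}


# Toy libraries (hypothetical names), each mapping a term to its gene set.
TOY_LIBRARIES = {
    "toy_transcription": {
        "toy_factor_A": {"GENE1", "GENE2"},
        "toy_factor_B": {"GENE2"},
    },
    "toy_pathways": {
        "toy_pathway_X": {"GENE2", "GENE3"},
    },
}

index = build_gene_index(TOY_LIBRARIES)
result = find_gene(index, "GENE2")
print(result)
```

Building the index once makes every later lookup a dictionary access, at the cost of holding a second copy of the gene-to-term relationships in memory.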