[MUSIC] All right, this week we are going to be going over some more Gene Expression Profiling Data Analysis, and here we're focusing on making sense of the results. So last week, we worked on selecting significantly differentially expressed genes. And once we do that, we're sort of reducing the number of genes that we focus on, considerably. So, instead of looking at all, say, 20,000 or 25,000 genes that are expressed, we are focusing on those that are differentially expressed between a control sample and our treatment. So we still need to organize those data to make sense of them. And the question I'm asking here is, how we can organize expression data. First we'll go over some data visualization tips, then we'll talk about grouping genes or, or samples, with similar expression profiles using hierarchical clustering. Then we'll talk a bit about grouping genes based on the pathways that they fall into, and doing an analysis that way. And then we'll also talk about functional classification according to Gene Ontology or some other kind of categories: MIPS is different but similar system to the Gene Ontology system. Then we'll go over leveraging Gene Expression Data, based on existing online databases of gene expression. So, we'll go over some of those databases. We'll talk about some tools for Co-expression analysis, which is a very powerful way for hypothesis generation. And, then we'll finally finish up with a bit about function prediction using GeneMANIA, which incorporates not only gene expression data, but many other kinds of data for function prediction. So, first some data visualization tips. Oftentimes when we're working with gene expression data, we work in log-space. So, log-space allows the, the visualization, especially of ratiometric results. So consider these two genes. Gene 1 and Gene 2. And over time relative to t0, Gene 1 is decreasing in expression. So say that the level of expression relative to t0, is half, then a quarter, an eighth and sixteenth. And Gene 2 is doing the opposite. It's doubling over time. So its expression goes from 2 to 4 to 8 to 16 times the level at t0. So if we map those values, those ratios on standard scale numerical scale plot, we see the following graphs. So here, Gene 2 is definitely going up. Gene 1 is sort of squished in number space between one and zero. And that's the number space that we have to work with. However, if we log transform those data, log base 2 transform them in this case, we get the values 0, minus 1, minus 2, minus 3, minus 4 for Gene 1. Which are the log2 values of this, log base 2. And for Gene 2, it goes from 0 to 1, to 2, to 3, to 4. And if we plot those values, we see that the responses of Gene1 and Gene2, are equal but opposite in direction, over time. So this kind of visualization is, or this kind of transformation is useful for visualization, especially for ratiometric data. And here the number space is open in this direction, in the negative direction, whereby here again, it's squished between zero and one. It's compressed between zero and one. So this is just an example where we're visualizing some ratiometric data. Black in this case, again, because of this compressed nature, anything that is less than one in terms of the ratio. So the, here we're talking about a treatment to a control sample. And each gene is compared to its expression level, in the untreated sample. The visualization would be black for a small ratio, and yellow for a large ratio. So, we can't really actually see any genes that seem to be decreasing in expression. However, if we do a log transformation of those data and then plot those, using a three color scale genes which are decreased in expression relative to the control, show up as blue. So, we see there are lots of genes that are actually decreasing in expression over the particular treatment in this sample, it's not important what it is. We see that there are several that don't show any change, under certain conditions or these are actually mutants up here. And then we see the, the ones in yellow, which are increasing in an expression, relative to the control. So this log transformation really helps us visualize these kinds of changes in a way that's not possible with untransformed data. So another kind of transformation that we often do, is Median Centering and Normalization. And the reason we do this often, is to help visualize the shape of the change of the vectors over time or under certain conditions, so consider these two vectors. We've got vector A - Gene A - and Gene B, and we see that for Gene A, there's an increase in sample two, a decrease in sample three and so on. And there's a little peak at sample ten, and in the case of Gene B, we see a, a big increase in expression in sample two, decrease in sample three. We see some changes further down the line. And if we plot those before transformation, if we just plot the raw data, and we use a color scale to represent the expression level, where black is equal to zero expression units, yellow equals the maximum expression level in Gene B, 1600 units. We, you, can maybe see some response in the case of Gene B. The response of Gene A is not very well to be seen just because the, again, the magnitude of expression here is quite small. So if we do some transformations, we can Log transform the data again. We could also do some median centering, so median centering simply means to compute the median, for all of the values, and then to subtract the median from each value in turn. So that's what we're doing here, in this second row. Then we can actually compress the vector so that the sum of squares is equal to one instead of being equal to 21.8 in this case. So, the sum of squares is a mathematical function. And the variance then becomes the same for both vectors. And when we do that, we can plot those values using the color scale, the same color scale, so these are just the transformations of the vectors. Here're our vectors that you saw, two slides ago. Here's the log transformed data. The grey line is the log 2 median centered data and the orange line is the log 2 median centered and normalized data. So that the sum of squares equals one. Same thing with vector B. And when we plot those, using a three color scale, where blue represents values below the median, or cases where the expression level's below the median. Black is at the median level, and yellow is above the median level. We see that both gene A and gene B are responding in exactly the same way as each other. And we don't really see that, when we just look at the raw data. So sometimes doing this kind of transformation is useful for visualizing response patterns. So next week (next module), we are going to organize the data by grouping them into groups, with similar expression patterns. And in order to assess whether expression patterns are similar, we need to use a similarity metric. We can use something like the Euclidean distance or the Pearson correlation coefficient, Spearman rank correlation coefficient. In the case of expression data the Pearson correlation coefficient is often used as it discounts the amplitude of the two expression vectors, and identifies those with similar shapes. Similar to the idea that I was talking about in the previous slide. So how hierarchical clustering works in the case of Pearson correlation coefficient: We've got two sets of numbers, So, two vectors of gene expression values. So gene X has expression value, X1 in sample one X2 in sample two and so on. Gene Y has expression levels of Y1, Y2, up to YN in the various samples. And then we just use this equation here to compute the correlation coefficient, the Pearson correlation coefficient r of the two vectors. But in the case of a gene expression data set, we would compute the Pearson Correlation coefficient for our query gene, gene X, against all other genes. So, we would do a bunch of Pearson correlation coefficient calculations. So in the case of Pearson correlation coefficient, one is equal to perfect correlation, where the two vectors are identical in terms of their shape, their response profiles. If it's zero, there's no correlation at all in the similarity of their expression patterns. And if it's minus 1, the response is actually opposite. So, that's the range for the Pearson correlation coefficient. So, to do hierarchical clustering, what we need to do is, we need to compute all possible pairwise distances between the gene expression vectors using say for instance, the Pearson correlation coefficient. We then join the closest neighbors as measured by the Pearson's correlation coefficient in our example here with the branch length reflective of the distance between them, and that depends on the type of linkage that we're, we're talking about. I'll get to that in a sec. We then recompute the distances, and when we do average linkage clustering a node's, coexpression score, is the average of the PCCs between the genes in the node and the other gene vectors. If that, if the average node correlation coefficient is higher than that between the two individual genes, join it to the best gene, otherwise we join the two genes with the best, score. And we repeat this step, these steps, until all nodes and genes are joined. So, what do I mean by Average Linkage Clustering? Basically we're assessing the PCC, the average PCC between all genes in two different clusters. In the case of Complete link, Linkage, we take the most distant score, to represent the score of genes between nodes. And in the case of Single Linkage Clustering, we take the smallest distance. So, here's a worked example. What we have here is, we've got five gene expression vectors. We've got five genes and four values for each of those genes. And if we generate a little heat map of those of that we see that in the case of the first gene, we start off low in, in this wild type sample. Then it goes up, and then it goes back down and mutant to three. In the case of the second gene again, it starts a little bit low, goes up and then comes back down. In the case of the third gene, we're starting a bit higher. We go down and we come back up and so on. And that's what the pattern looks like here. Clearly, we can see that gene four and the first two, probably should be grouped together, if we were to do this hierarchical clustering. So, what we do is, first we compute all by all Pearson correlation coefficient scores, and that's what I've done in this matrix, here. And then we find the one that's the best, the best score. So what is that, that's 0.99 here, between this gene, the fourth gene, and the second gene. So then we join those two, to create a node. The branch length here, there's a little scale up here with a PCC of one on the, the right side of the scale, zero and then minus one on the other side of the scale. So our branch length is very short because this score is almost one. So then, is the average of the coexpression coefficient, these two genes in the node, to all other remaining genes higher than for two non-connected genes? If the answer to that question is yes, then we join that node to the appropriate gene. Otherwise we would connect the two genes with the higher score, to create a new node. So what this means in practice is, we leave the the two that are connected out of the picture, because we've already joined them. That's that score, there. So then, we have three genes remaining and the average of this gene to the two genes in the node, we compute that average PCC score. We do that for the second gene that's remaining, and we do that for the third gene that's remaining. And in those cases, the average PCC scores are 0.96, -0.96, and 0.5. Now, this score here, is better than the score between these two genes, these two genes or these two genes. So, we would connect this gene to our pre-existing node, Node 1. And the value that we would use is 0.96, and that's what we're creating the second node here. Then we, we've connected now three genes. And we have two remaining genes that haven't been added to this, this tree that we're building. And what we need to do again, is to compute the average for the two genes that are remaining, to the three genes in the node, in nodes. And so we ignore those scores, where we've already connected things. We then compute the average score for the remaining genes to the other genes in the node. And that's for the other remaining gene, and the average is 0.53, in the case of the gene highlighted in red, and -0.97 in the case of the gene highlighted in blue. And the winner here, in terms of Pearson correlation score also, the PCC between these two genes is actually only what is it here, it's At4g27450, it's -0.57. So the winner is clearly 0.53 and thus we would connect that gene to Node2, to create Node3. And then finally, we have to add the last gene into our tree and we do that by computing the average score again. And our average here, is -0.875 and that would be the branch length that we'd use, to add that last gene into the tree. And the final output that you would see, is something like this, where we would group these three genes as all showing similar responses, they all have very short branch lengths to reflect high Pearson correlation scores. This gene here, is somewhat different from these ones whereby this last gene shows a very different response profile. So, that's how hierarchical clustering works, to create groups of genes. What we can then do is, we can ask for a given cluster of genes. Are any Gene Ontology categories enriched in my clusters? And here we can use a hypergeometric P-value to to determine that. And we plug four numbers into the, the hypergeometric function. And that's the number of genes in a given list with a particular function. There's the total number of genes in the list, then we have the total number of genes in the population, with a given function. So this is partly dependent on the platform we're using to assess gene expression levels. And then the number of genes in the population. Again, this is number of genes available for sampling. So we've talked about this in the protein- protein interaction lab but what we can do in the case of gene expression data is, start to take different clusters, clusters using different cut off scores and ask if there is an enrichment after multiple testing correction, for certain Gene Ontology categories. So for instance, these are data from yeast under response to different stresses, this slide is from Tim Hughes. You can see enrichment in this cluster of genes involved in methionine metabolism. And this can help us understand the biology of some of these clusters what the genes are, that are involved in the response to certain conditions. There are numerous tools available for online GO Enrichment Analysis, such as FunSpec, AgriGo we'll be using in the lab for many plant species, there's GO Term Finder, GOrilla etc., etc., etc. And the nice thing about the AgriGO output is, that we can see that we can see which categories as they lie within the GO graph which categories are enriched for certain processes. Another kind of analysis we can do with gene expression data maybe could be termed Pathway Analysis. And what we're doing here is, we're overlaying gene expression data onto a pathway that's of interest. So here I actually created this figure in 2003 for starch biosynthesis in rice, developing rice seeds, and their various isoforms of all of these gene products that are expressed at different times during seed development. And what, how important that is for the biology, remains to be determined. But certainly, by laying out the gene expression data in this manner, we can start to gain insight into certain isoforms that might be expressed at earlier time points or at late time points. We can do this in an automated way with tools such as MapMan or Reactome or BioCyc. Here you would upload your gene expression data and overlay the expression data onto such pathways that have been curated by curators. So, what we'll focus on now is actually leveraging gene expression data that are in online expression databases. There are several online expression databases that are available for Arabidopsis. There's one that my own lab runs called the BAR. For human, BioGPS is pretty nice. We've also got a tool for human expression data. For Mouse, there's the Jackson labs expression atlas. For worm, you can go to Wormbase. BioGPS actually has expression data for several organisms as does Genevestigator. In the case of online gene expression databases, for any gene that's on a particular platform, you can ask the question, where is that gene expressed? This is a gene of unknown function that seems to be expressed in developing seeds, as well as in response to cold stress, and we can use that information to generate hypotheses. We've also got this tool, this pictograph tool, for exploring mouse expression data and human expression data, and can, of course, get charts for where and when. As I mentioned, these data can guide wet-lab experiments. This is an example from Dinesh Christendat's lab in the department I'm in, and he's interested in an enzyme called 3-deoxy-D-arabino-heptulosonate-7-phosp- hate synthase, which is involved in synthesizing many plant secondary compounds. So lignan and things that provide UV protection, for instance, to the plant. There are three forms, three versions of this gene, DHS 1, 2, and 3, and, if you look at the phenotypes of the individual knock heads, the individual mutants, DHS 1 mutant, DHS 2 mutant, DHS 3 mutant, you don't see any phenotype. But, if you look at gene expression data, you can see that these genes are all seemingly induced in response to UV stress. So this is a data set from Kilian et al., and lo and behold, if you actually examine the DHS 1, 2, and 3 mutants in response to UV light, they all show phenotype. Compared to when you expose the wild-type plant to UV light, that they don't as well, so gene expression data can help guide you where to look for a phenotype if you can't see phenotype under "normal" conditions. Some RNA-seq data is slowly becoming available online, so, for instance, we use this in bioinformatics methods, one the JBrowse instance from the Mockler lab for exploring some RNA-seq data. The Encode Project has a large number of RNA-seq expression data built into the UCSC Browser. So we can also use these online expression databases to generate hypotheses by doing co-expression analysis and using the guilt-by-association paradigm. This is an example from Yu et al. Here they've shown that a gene RGL2, which is known to be involved in response to plant hormone gibberellic acid, in seeds, is also involved in flower development. So they did a lot of work to show this. You can read the paper. But basically, if we do a co-expression analysis with RGL2, what do we see as the top hits in our co-expression analysis? Well, we see a lot of floral homeotic genes, such as SEPALATA2, SEPALATA3, AGAMOUS, PISTILLATA, we talked about, APETALA3 and APETALA1, but what's the role of this expressed protein? And because we see so many floral homeotic genes in this list, the inference would be that this expressed protein is also involved in floral development, so we can generate hypotheses this way, just using databases of gene expression data. Certainly, this has been a starting point for several groups, and you can merely use this coexpression approach as a primary screen to identify novel gene functions, and we describe that in the paper here, Usadel et al., 2009. So there are several coexpression tools available. So, for plants, there's ATTED and Expression Angler that you'll be using today, Cressexpress and GeneCat and couple of others for mammals. You can use BioGPS, and there's the Correlation tab in the gene expression panel for human, mouse and rat, and so on. There's also a COxPRESdb for human, mouse and rat, and other organisms. That's again from Takeshi Obayashi's group in Japan. So finally, we'll finish up talking about a tool that can integrate multiple kinds of data for function. So not only gene expression data, but information about colocalization, so where if two proteins are in the same compartment, then perhaps they are more likely to be involved in the same biological process. Or if there's some kind of genetic interaction between them, or if they share protein domains, this kind of thing. So, all of these interaction data, broadly defined, can be used to help predict gene function. It's a very easy to use tool. Again, we will be using this in the lab. Paste in a list of genes or single gene, and the output will try to identify genes which share interactions, again, broadly defined, with other genes. And those genes that don't have any function ascribed to them or whose function is sort of generically described as kinase or transcription factor might also be involved with the genes that you've input into the tool. So, we've covered a lot of things today. You'll be exploring hierarchical clustering in today's lab, and we'll also be exploring some online expression databases, some coexpression tools, and we'll also be doing some function prediction with GeneMANIA, with our gene of interest, APETALA3. Hope you enjoy the lab, see you soon.