All right. In this week's class, we are going to be doing some gene expression profiling data analysis, and today's talk will cover an overview of technologies: cDNA microarrays (an older technology) versus short oligonucleotide arrays versus RNA-seq. We'll talk about background adjustment for the two kinds of microarrays, normalization for all three methods of gene expression profiling, and selecting significant genes once we've generated the data. In next week's class we'll go over some methods for clustering, and we'll talk about mining gene expression data, where we can leverage expression data to generate hypotheses about gene function.

Why would we want to do gene expression profiling? Well, sequence similarity assigns a function to only about 40%-60% of the genes identified in a sequenced genome. There are many lineage-specific and species-specific genes. Sequence similarity also won't identify novel functions of proteins, that is, genes that show new functions under different conditions. Genes involved in the regulation, interaction, or integration of pathways are the most difficult to identify in this context, and traditionally these have been identified using genetic or mutant analysis, or biochemically. Many of these genes are expressed at low levels or show only transient expression, and may have been missed by typical molecular biological or genetic methods. So if we can get at gene expression data, this will help us generate hypotheses about function. This is the area of functional genomics, which seeks to devise and apply technologies that take advantage of the growing body of sequence information from many organisms to analyze the full complement of genes and proteins encoded by an organism.
Some of the major approaches undertaken in functional genomics are: determining the expression pattern of all genes, in different tissues and under different conditions; determining the expression and distribution of all proteins that arise from the transcripts of those genes; knocking out all genes and examining the phenotypes of the knock-out organisms, as well as the resulting gene expression patterns; and identifying the interactions among proteins, using two-hybrid analysis and other bait methods such as TAP tagging, which we talked about a couple of weeks ago.

This is a table from Oliver in 2000, and it lays out the various levels of analysis in functional genomics. We have genomics, which covers the complete set of genes of an organism or its organelles. Its status is context-independent: regardless of the cell or tissue type we get the DNA from, the sequence would be the same in all of those cells. There can be epigenetic modifications, and clearly those would be context-dependent. The way we generate genomics information is through systematic sequencing of DNA.

The transcriptome is the complete set of messenger RNA molecules present in a cell, tissue, or organ, and here this is definitely context-dependent, meaning the complement of RNAs varies with changes in physiology, development, or pathology, depending on which tissue, cell type, and so on you're looking at. How can we assess the transcriptome? We can use microarrays; SAGE (serial analysis of gene expression), massively parallel signature sequencing, or digital gene expression analysis; "high-throughput" northern analysis; we can generate ESTs, expressed sequence tags; and the latest method for assessing the transcriptome is RNA-seq. We can also look at the proteome, which is the complete set of protein molecules present in a cell, tissue, or organ.
Again, it's context-dependent, just like the transcriptome. We can analyze the proteome using two-dimensional gel electrophoresis, peptide mass fingerprinting, and two-hybrid analysis to identify interactions between proteins; we can also use peptide and protein microarrays to get at aspects of the proteome. The metabolome is the complete set of metabolites, or low molecular weight intermediates, in a cell, tissue, or organ. Again, this is context-dependent, depending on the cell type or tissue that we're looking at. We can determine the metabolome using NMR, mass spectrometry, or infrared (IR) spectroscopy.

We're focusing on transcriptomics today, gene expression profiling, and the first technology developed for high-throughput analysis of the transcriptome was cDNA microarrays. In this case what we need to do is generate PCR products representing all of the cDNAs, all of the transcripts, of a given cell type or, better, of the whole organism; the full complement of transcripts possible from the genome is the best case. We would amplify each gene's cDNA in a PCR reaction, often in a 96-well or 384-well microtiter plate, and then spot each PCR reaction onto a defined position on a glass slide, so each spot would contain many copies of one specific cDNA. This is preparing the cDNA microarray.

Then, to generate data, we would take RNA samples from, say, the control condition, here bacteria grown at 30 degrees Celsius, and from a treatment, here bacteria grown at 42 degrees Celsius. We would reverse transcribe and label those mRNAs with different fluorophores, here Cy3 and Cy5, so we get some "red" cDNA and some "green" cDNA. We would mix those samples and then do a competitive hybridization to our slide, and if one particular transcript were more abundant in the 42-degree sample, the spot would be red.
If a different transcript were more abundant in the control sample, the spot would be green, and if the two transcripts from a given gene were equally abundant, the spot would come out yellow in colour. This is just the output of a cDNA microarray, showing you this distribution of red, green, and yellow spots.

The next iteration of this technology was oligonucleotide microarrays. In this case, short oligonucleotides, 25-mers, would be generated, and there would actually be two kinds of these oligos: a perfect match oligo, and a mismatch oligo with the mismatch right at the middle position. Sets of these oligos, of these probes, would be chosen, primarily tiled along the 3' end of the transcript, and these oligos would be synthesized on silicon substrates using photoresist and photolithography, similar to the way semiconductors are made. Then what we would do is take RNA from cells, extract the polyA RNA, or at least label the polyA RNA by converting it to cDNA and doing an in vitro transcription reaction to incorporate biotin labels onto the transcripts (called cRNA in this case). We would take those labeled fragments and hybridize them to these oligonucleotide microarrays, Affymetrix microarrays, and then do a wash and stain to detect the biotin on the fragments. We would scan these arrays, and a given spot would fluoresce in proportion to the amount of transcript hybridized at that particular position, representing a particular gene: the stronger the signal, the stronger the expression of that gene.

The most recent iteration of this technology is RNA-seq. In this case, what we do is take our mRNA and either shear it before converting it to cDNA, or convert it to cDNA first and then shear the cDNA.
Then we attach linkers and do next-generation sequencing to sequence and count these small fragments. Through bioinformatic analysis, which we'll talk about in a bit, we can figure out which bit of sequence represents which gene, and the number of tags of a particular sequence can be used to determine the abundance of a particular transcript. If we get lots of tags mapping at a given position, we know there's a lot of that transcript; if we don't get a lot of tags mapping back to a given position in the genome, we know the gene is not highly transcribed.

In the case of the microarrays, there's some image processing, some computational work, necessary to extract information after a hybridization. Our input for this step is scanned images of array fluorescence. We need to apply a grid to the image, after which each position and value on the slide must be associated with the appropriate gene identifier, and the output is a table of gene IDs and their values. There are lots of programs, both public and proprietary, for doing image analysis, such as ScanAnalyze from the Eisen lab... Affymetrix has its own software for reading its GeneChips and generating data files. This is just a visual representation of what I just said: basically, we superimpose a grid over the scanned image, there's a bit of software magic that happens so the spots can be found automatically, the grid can be adjusted, and then we need to determine the intensity of the fluorescence within each spot and subtract from that the intensity of the background fluorescence outside the spot.

In the case of next-generation sequencing, there's some procedure to create a library of small fragments with linkers attached; this is library preparation.
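The spot-minus-background step described above for microarray images can be sketched in a few lines. This is a minimal illustration with made-up pixel values, not what ScanAnalyze or the Affymetrix software actually implements; real packages use more robust per-spot statistics (medians, local background models, and so on):

```python
from statistics import mean

def background_corrected_intensity(spot_pixels, background_pixels):
    """Subtract the mean local background from the mean spot intensity.

    Both arguments are lists of pixel intensities (photomultiplier counts).
    Spots dimmer than their background are floored at zero rather than
    reported as negative.
    """
    corrected = mean(spot_pixels) - mean(background_pixels)
    return max(corrected, 0.0)

# Hypothetical pixel counts for one spot and its surrounding background
spot = [520, 610, 580, 495, 640]
background = [100, 90, 110, 95, 105]
print(background_corrected_intensity(spot, background))  # 469.0
```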
We then affix these small fragments of DNA (cDNA, in the case of mRNA) to the surface of a flow cell, grow clusters on the flow cell, and sequence them through the use of fluorophores. We start with a primer and then, in each cycle, add different nucleotides, and incorporation of a given nucleotide leads to the fluorescence appropriate for that nucleotide. The sequence can be read off the flow cell using what is basically a confocal microscope.

Once we've got all of these reads, we need to do some computational work, and that's what we will be doing today with the Bowtie (now HISAT2) aligner in R, which is based on the Burrows-Wheeler transform for mapping short reads to the genome sequence. We won't go into any detail here, but suffice it to say computer scientists have thought a lot about how best to do this efficiently; there's some indexing that makes this a fairly quick process.

Basically, we can use RNA-seq to identify both transcript abundance and alternative splicing events. Say, on the left-hand side, we've got more transcripts from Gene 1 than from Gene 2. We can convert those transcripts into cDNA, and hopefully that conversion is representative, and then generate short sequence reads from our cDNA population by shearing the cDNA and adding linkers, as in the previous slides. Then we map those short reads to the reference genome and count the number of reads at a given position to determine transcript abundance. So if our reference genome looked like this and we saw this number of reads mapping at a given position, we could use that to determine transcript abundance: in this case, higher for Gene 1 than for Gene 2. And if we saw no reads mapping at a given exon position, we might assume that alternative splicing had occurred.
That is, only exon one and exon two were retained in the transcript. So that's how RNA-seq can be used; we talked about that in Bioinformatic Methods I, and if you want more information on next-generation sequencing you can go to the last lecture of Bioinformatic Methods I.

For all of these data sets, we need to do some background correction and normalization. Background correction is more important for microarrays, whereas normalization is applicable to all three technologies. The reason for background correction and normalization is that the numbers that come out of the scanner are in units of counts per pixel from the photomultiplier tube, or, in the case of RNA-seq, they're read counts, and these units are somewhat arbitrary: they're not absolutely comparable between samples, and they're possibly not a linear multiple of the abundance of what you're trying to detect. Normalization removes any trends that correlate with variables not expected to influence gene expression changes.

It's often informative to look at plots, and we can often scale the data by log transformation, expressing each gene as the log of the ratio of its expression value to a pseudo-control, plotted against the log intensity. That's what we're seeing here: the intensity on this axis, and the log of the ratio on this axis. This is an MA plot. What we're looking for, in the case of arrays, is a nice distribution of points, roughly symmetrical about the line of no change, and this cloud should have sort of an oval shape to it.

In the case of microarray data, we can use RMA or GCOS, or lowess (locally weighted linear regression) to smooth the data; lowess is quite useful for cDNA microarrays. In the case of RNA-seq, we can use something called reads (or fragments) per kilobase per million reads mapped, RPKM or FPKM, to standardize the data to a per-million-read basis.
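The RPKM calculation just mentioned is simple enough to write out. Here's a minimal sketch; the gene names, read counts, transcript lengths, and library size are all hypothetical:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Scales a raw read count by transcript length (in kb) and by library
    size (in millions of mapped reads), so values are comparable across
    genes of different lengths and across libraries of different depths.
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical library with 10 million mapped reads in total
total = 10_000_000
counts = {"gene1": 2000, "gene2": 500}    # raw read counts per gene
lengths = {"gene1": 2000, "gene2": 1000}  # transcript lengths in bp

for gene in counts:
    print(gene, rpkm(counts[gene], lengths[gene], total))
# gene1: 2000 * 1e9 / (2000 * 1e7) = 100.0 RPKM
# gene2:  500 * 1e9 / (1000 * 1e7) =  50.0 RPKM
```

Note that gene2 has a quarter of gene1's reads but, being half as long, only half the normalized expression value.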
You can also use the transcripts per million (TPM) method or the trimmed mean of M-values (TMM) method, and DESeq implements its own normalization for RNA-seq data. TMM is nice because it takes into account cases where the rRNA populations weren't removed as efficiently from one library as from another. However, what we'll be doing today is RPKM normalization.

These are some pictures of microarray data before and after normalization. We are looking at box plots here, which depict the median values before normalization for five microarray samples, with the top and bottom of the boxes representing the 25th and 75th percentiles. We can see that before normalization our medians are sort of all over the place, and when we look at the MA plots (just another way of visualizing the data), we see that these clouds of data points aren't really showing the nice symmetric behavior around the line of no change that is the assumption for RNA profiling experiments. After normalization, on the other hand, our median expression levels are the same across all data sets, our 25th and 75th percentiles are also very comparable to one another, and the MA plots look a lot nicer.

Now, a brief statistical primer for microarray and RNA-seq experiments. Just be aware that in the case of Affymetrix chips there is an approximately 0.2% false positive rate for the same RNA labeled the same way, split into two samples, and hybridized to two different chips. This false positive rate is a lot higher for replicate RNAs, that is, RNA prepared from two biological replicates and hybridized to two chips. The take-home message here is that the biological variation is higher than the chip-to-chip variation. In the case of cDNA microarrays, the chip-to-chip variation is higher, so we may need to do more technical replicates. And for RNA-seq, the response is much more linear and hence the false positive rate is lower.
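One standard way to make the per-sample distributions identical, as in the "after normalization" box plots just described, is quantile normalization (used, for example, inside RMA). This is a minimal sketch with made-up values that ignores tied values; real implementations handle ties by averaging ranks:

```python
from statistics import mean

def quantile_normalize(samples):
    """Quantile normalization: force every sample to share one distribution.

    `samples` is a list of equal-length lists, one per array/sample.
    Each value is replaced by the mean, across samples, of the values
    at the same rank. Ties are broken arbitrarily (a simplification).
    """
    n = len(samples[0])
    # Average across samples at each rank of the sorted data
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [mean(s[i] for s in sorted_samples) for i in range(n)]
    # Put each rank mean back at the position holding that rank's value
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two tiny hypothetical arrays; after normalization their sorted values
# (and hence medians and quartiles) are identical.
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```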
It still holds true, though, that the false positive rate for biological replicates exists, so we need to be aware of the variability, as with any scientific experiment; the biological variability needs to be considered when we're designing our experiments. So, could we do an expression study of three biological factors using four samples? Probably not; first of all, we would be very likely to encounter a large number of false positives. You do need to consider statistical design when you're thinking of putting together a gene expression profiling experiment. We need to consider the sources of variability in profiling experiments: the things we are interested in, such as the types of sample, the treatments, and the time points, and, at the level of individual samples, hybridization effects, dye effects, and library construction bias in the case of RNA-seq. So we really do need replicates, and we need to do proper statistical analysis to get our significantly differentially expressed genes.

How can we select significantly differentially expressed genes? There's a nice article from 2003 in Genome Biology that describes different statistical tests we can do in the case of microarray experiments. One non-test that is commonly used is the fold change: we just take all the genes that are increased two-fold or more under a particular challenge relative to the control. This isn't a statistical test, and in the case of microarrays fold changes are really subject to bias, as low expressers have a higher variance. We could instead do a t-test, or some modified version of a t-test such as the S-test or a regularized t-test, or use a B statistic to select significant genes. For RNA-seq data, people are now typically using tests based on the negative binomial distribution to identify significantly differentially expressed genes. There is a considerable problem with multiple testing when selecting significant genes.
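To make the fold-change "non-test" and the t-test concrete, here's a minimal sketch for a single gene, with three hypothetical biological replicates per condition. The S-test, regularized t-test, and B statistic mentioned above moderate the variance estimate in ways not shown here:

```python
from math import log2, sqrt
from statistics import mean, variance

def log2_fold_change(treated, control):
    """log2 ratio of mean expression, treated over control replicates."""
    return log2(mean(treated) / mean(control))

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances).

    In a real analysis you would also get a p-value for each gene (e.g.
    with scipy.stats.ttest_ind) and then correct for multiple testing.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical log-scale expression values for one gene
treated = [8.1, 7.9, 8.3]
control = [4.0, 4.2, 3.8]
print(log2_fold_change(treated, control))  # ~1.02, i.e. about a 2-fold increase
print(welch_t(treated, control))
```

Note how the t statistic rewards both a large difference in means and low variability across replicates, which is exactly what a bare fold-change cutoff ignores.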
We really would like to use a p-value when we do a significance test; however, we are doing large numbers of tests, and this is the problem of multiple testing. One option is a very strict correction, such as a Bonferroni correction: we adjust our p-value cutoff by taking, say, a typical p of 0.05 and dividing it by the number of tests we're doing, the number of genes we're testing, so the cutoff would become p = 0.0005 for 100 genes. But typically we're testing 10,000 genes, so this p-value cutoff would become very small. What's commonly used nowadays is false discovery rate control. Here the columns of the data are permuted, and we come up with a false discovery rate that we are happy with based on those permutations, asking: once we've permuted the data, how many "significantly differentially expressed" genes do we come up with in the absence of the correct ordering of the data? In the case of more than two conditions, it's appropriate to use an ANOVA with one of the three flavours of the F test.

So now we've got our significantly differentially expressed gene list. It can still be very long, up to thousands of genes, and further data analysis is necessary to make sense of large amounts of information; this is essentially an organizational problem. How can we organize expression data? Well, we can group genes with similar expression profiles, and we can also do some other kinds of analysis that we'll talk about next week. Once the data have been organized, we can leverage those data for hypothesis generation. So, just briefly: organizing expression profiles means grouping similar expression profiles.
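As a concrete sketch of the multiple-testing corrections described above: Bonferroni divides the cutoff by the number of tests, exactly as in the 0.05/100 = 0.0005 example. For the FDR, the lecture describes a permutation-based estimate; the sketch below instead uses the related, and very common, Benjamini-Hochberg step-up adjustment applied to ordinary p-values. The p-values here are made up:

```python
def bonferroni(pvalues, alpha=0.05):
    """A gene is significant if its p-value < alpha / number of tests."""
    m = len(pvalues)
    return [p < alpha / m for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the FDR).

    Sort p-values ascending, multiply each by m/rank, then enforce
    monotonicity from the largest p downward.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # from largest p to smallest
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.80]
print(bonferroni(pvals))          # only 0.001 and 0.008 pass 0.05/5 = 0.01
print(benjamini_hochberg(pvals))
```

At an FDR of 5%, only the first two genes survive here as well, but BH is far less conservative than Bonferroni when thousands of genes are tested.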
We can do hierarchical clustering, and what we're doing with hierarchical clustering is taking unordered data, here a bunch of different samples in this direction and many genes in this direction, where the colouring denotes expression level above or below the control sample. You can see that there's really no pattern in this random ordering of genes, but once we cluster the genes, we can start to see groups of genes behaving in similar ways in some of these samples. And we can start to zoom in on those genes and ask, oh, what's special about those relative to, say, this set down here? So we'll talk about that next week, and we'll touch on the objectives of both of these labs next week also. Hope you enjoy the lab, and see you next week. Bye.