Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. Let's continue our topic: transcriptome study by deep sequencing. In the last unit, we briefly introduced the basic background and experimental measurement techniques of transcriptome study. In this unit, we will explain the specific methods of RNA-Seq, starting from reads mapping.
In the fifth week, we mentioned that reads mapping is usually the first step of deep sequencing data analysis. Its quality and speed directly affect the subsequent analysis. Because RNA-Seq is also based on deep sequencing technology, its reads have properties similar to those of the DNA reads generated by genome re-sequencing in terms of length, quantity, quality, and other aspects. For example, both kinds of reads are short, come in large quantities, are uneven in quality, and have high error rates.
However, RNA-Seq data also has its own characteristics because it comes from RNA transcripts. Specifically, as a gene is transcribed from DNA and processed into mRNA, introns are cut out and exons are ligated together at the splice sites. Reads that span a splice site, also known as junction reads, cannot be mapped accurately to the genome unless they are split in the middle. These junction reads are the direct evidence for determining splice sites, and they are crucial for correctly reconstructing the structure of transcripts. For example, in this figure, the junction reads spanning exon 1 and exon 3 directly support the existence of a transcript in which exon 1 and exon 3 are ligated directly, without exon 2 in between.
Similarly, in the figure below, the two kinds of junction reads support, respectively, the existence of a transcript with exon 1 and exon 3 directly ligated and a transcript with exon 3 and exon 5 directly ligated. Therefore, our mapping algorithm needs to take junction sites and introns into account so that it can deal with these junction reads properly. At present, there are two main strategies for this problem.
One is Join exon. The first step in this strategy is to build a library of all possible junctions based on all the exons in the known transcripts. Note that the junctions in this library are not limited to known ones; the library includes all possible combinations. For example, 4 exons correspond to 6 pairwise combinations. After that, the usual mapping is carried out: non-junction reads are mapped to the genome in an unspliced way, just like DNA reads. The junction reads that cannot be mapped directly are then aligned to the junction library constructed in the first step. In effect, the Join exon strategy serves as a patch for the existing DNA reads mapping algorithms. By constructing a library of all possible junctions, this strategy can discover new splicing isoforms. However, it can do nothing for unknown exons.
We can turn to the split reads strategy to handle this problem. Like the earlier mapping algorithms for DNA reads, the split reads strategy also first maps non-junction reads to the genome in an unspliced way. The junction reads that cannot be mapped directly are sliced into multiple seeds of length k, and the mapping is retried with these seeds, which resembles the BLAST method. In other words, this strategy tries to find junction sites at a finer granularity. Finally, mapped seeds that are close to each other are combined to obtain the final alignment of the whole read. Compared with the Join exon strategy, the split reads strategy is slower because it needs to map seeds, which are shorter than reads.
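To make the two strategies more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how any particular tool implements them: the function names `build_junction_library` and `split_into_seeds`, the `flank` parameter, and the toy exon sequences are all hypothetical. The first function pairs every upstream exon with every downstream exon, as in the Join exon strategy; the second slices a read into seeds of length k, as in the split reads strategy.

```python
from itertools import combinations


def build_junction_library(exons, flank):
    """Build every upstream/downstream exon pairing as a candidate junction.

    exons: list of (name, sequence) tuples in genomic order (toy data here).
    flank: number of bases taken from each side of the junction
           (a hypothetical parameter for this sketch).
    """
    library = {}
    # combinations() preserves list order, so the first exon is always upstream.
    for (up_name, up_seq), (down_name, down_seq) in combinations(exons, 2):
        library[(up_name, down_name)] = up_seq[-flank:] + down_seq[:flank]
    return library


def split_into_seeds(read, k):
    """Slice a read into consecutive, non-overlapping seeds of length k.

    A trailing remainder shorter than k is dropped in this sketch.
    """
    return [read[i:i + k] for i in range(0, len(read) - k + 1, k)]


# Toy exons with made-up sequences.
exons = [
    ("exon1", "AAAACCCC"),
    ("exon2", "GGGGTTTT"),
    ("exon3", "ACGTACGT"),
    ("exon4", "TTGGCCAA"),
]
library = build_junction_library(exons, flank=4)
print(len(library))                          # 6: four exons give six candidate junctions
print(library[("exon1", "exon3")])           # CCCCACGT
print(split_into_seeds("CCCCACGTAC", k=4))   # ['CCCC', 'ACGT']
```

A junction read such as "CCCCACGT" would then match the exon 1 to exon 3 entry of the library, even though it does not appear contiguously in the genome. A real mapper would index the library and the seeds rather than compare strings directly.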
However, this strategy does not depend on prior exon annotations, and it can discover new exons and even new genes. In fact, current common RNA-Seq tools often combine these two strategies to balance sensitivity and speed. For example, the TopHat2 tool, co-developed by Johns Hopkins, Berkeley, and Harvard, first tries to identify known junction sites quickly with the Join exon strategy, and then uses the split reads strategy to discover new junctions. A noteworthy feature of TopHat2 is its use of different indices for different strategies, which can further increase the speed of mapping.
Mapping is only the first step of RNA-Seq data analysis. We still need to assemble these reads into transcripts and estimate their expression levels. After correctly mapping all reads, including junction reads, we can interpret the transcript assembly problem as a traversal problem on a directed graph. We can use path finding algorithms from graph theory to find one or more optimal paths, and their corresponding transcript sequences, under the constraint that different edges are assigned different weights. We will illustrate the basic idea with the commonly used tool Cufflinks.
Cufflinks is a tool for transcript assembly and expression analysis based on RNA-Seq data. Let’s check how Cufflinks works. Assume that we only observe the reads and do not know that there are these three transcript structures. What shall we do? First, Cufflinks will try to find fragments that cannot be present in the same transcript. For example, the yellow and blue fragments here cannot exist in the same transcript. The reason is that if they existed in the same transcript, the yellow one would break at this position of the blue one instead of skipping it.
Likewise, the red, yellow, and blue fragments are all mutually exclusive, while two fragments of the same color are compatible. We can then obtain the overlap graph by treating each fragment as a node and connecting all fragments that are compatible with each other. Guided by the parsimony principle, Cufflinks takes as the optimal solution the “minimum cost path cover”, that is, the smallest number of paths that cover all reads and have no overlaps. The final set of three transcripts is thus obtained.
In principle, the expression level of transcripts can be inferred directly from the expression levels of exons once the transcript assembly is done correctly and the expression levels of these exons have been correctly normalized, as described in the previous unit. For example, assume that we can infer two transcripts, t1 and t2, from the three exons on the genome. Meanwhile, assume that the normalized expression level of each exon can be determined: e1=20, e2=40, and e3=60. Then we can infer the relationship between transcript expression levels and exon expression levels directly from the transcript structure. For example, exon 1 is present only in transcript 1, so all its expression is contributed by transcript 1. Similarly, exon 3 is present in both transcript 1 and transcript 2, so its expression is contributed by both transcripts. Therefore, the expression level of exon 1 is the expression level of transcript 1, while the expression level of exon 3 is the sum of the expression levels of transcript 1 and transcript 2. We can then infer that the expression levels of transcript 1 and transcript 2 are 20 and 40, respectively.
Of course, this problem becomes more complicated in practice, because the transcript assembly algorithm determines how the reads are distributed. For example, in Cufflinks the reads are distributed with respect to other factors as well, such as the length distribution.
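The exon-based arithmetic in the t1 and t2 example can be written out as a short Python sketch. This is only the idealized calculation from the example, not the estimation procedure used by Cufflinks. The function name `infer_transcript_levels` is hypothetical, and the sketch assumes that exon 2 belongs only to transcript 2, which the example implies but does not state.

```python
def infer_transcript_levels(e1, e2, e3):
    """Infer (t1, t2) from normalized exon levels in the two-transcript example.

    Assumed structure (an assumption of this sketch):
      transcript 1 = exon 1 + exon 3
      transcript 2 = exon 2 + exon 3
    """
    t1 = e1        # exon 1 is present only in transcript 1, so e1 = t1
    t2 = e3 - t1   # exon 3 is shared by both transcripts, so e3 = t1 + t2
    # Consistency check under the assumption that exon 2 is only in transcript 2.
    if t2 != e2:
        raise ValueError("exon levels do not fit the assumed transcript structure")
    return t1, t2


print(infer_transcript_levels(20, 40, 60))  # (20, 40)
```

With real data the exon levels are noisy and rarely fit the structure exactly, which is one reason iterative statistical estimation is used instead of direct subtraction.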
In fact, transcript assembly and expression level estimation are often done with EM and other iterative algorithms to estimate the expression levels more accurately.
Good. Starting from RNA-Seq reads mapping, we have given a brief introduction to the methods for transcript assembly and expression level estimation based on RNA-Seq data. You can find more technical details in the Computer Lab video of this unit, presented by Mei Hou and Feng Tian, two students from the Center for Bioinformatics, Peking University. Here are some summary questions. You are encouraged to think about them and discuss them with other students and TAs in the online forum.