This is the the computer lab. Tian Feng and I will talk about two popular tools for RNA-Seq analysis – TopHat and Cufflinks These tools will be introduced in this video. The first four are command line tools on Linux or Mac OS. cummeRbund is an R package. This is the RNA-Seq analysis pipeline. We get the raw data, map reads, assemble transcripts, calculate transcript abundance, analyze differential expression, and visualize the analysis result. This video will tell you about every steps. Let's talk about TopHat first. TopHat is a read mapper for RNA-seq. Let’s see how it works. First, reads are mapped to the genome by Bowtie. Then reads are piled up and assembled to generate consensus. A consensus can be a real exon, or consist of several exons, or something else. We know that canonical junction sites have sequence symbols such as GT and AG. Next, TopHat will search through the consensus to find and get all possible junction sites and their combinations. For unmapped reads, TopHat will build their index and map them to these assembled junction sites. In this way, junction reads are mapped to the reference genome accurately. Let’s talk about some important command-line options of TopHat. First is the insert length option. In RNA-Seq, mRNAs are split into small fragments, filtered by length, and sequenced. If the expected fragment length is 300bp, and it will have 75bp sequenced from each of both ends, then the insert size should be set to 150bp. The option below is for the standard deviation of the insert length. If the length distribution is concentrated on a certain value, you can set a small value, otherwise you should set it larger. The -G option is to provide an annotation file. If your genome has been well annotated, it’s better to provide this file. TopHat will map reads to the transcriptome first, and then map unmapped reads to the genome. Finally it will merge the two results. We know that most transcriptomes are much smaller than their corresponding genomes, and junction reads can be mapped to transcriptome directly. In this way, mapping can be more efficient and accurate. Let’s talk about library type. Standard Illumina platform is strand-inspecific. In other words, we can’t tell which one of paired reads and the transcript are in the same direction, and which one reverse-complements the transcript. For strand-specific data, there are two cases. In the fr-firststrand method, the second read is in the same direction as the transcript and the first read reverse-complements the transcript. The fr-secondstrand is just the opposite. So you should be careful about whether your data is strand-specific or not, and how strand-specific it is. This is a demonstration of TopHat. The dataset used in this video is GSE32038. You can download it from the GEO website. This is a simulated paired-end RNA-Seq dataset. It has two conditions, and both conditions have 3 replicates. Here C represents the condition, and R represents replicates. I will use C1 R1 for demonstration. We need some preparations before running TopHat. First, we need a reference sequence, or the genome sequence, in fasta format. This is the RNA-Seq data for Drosophila melanogaster.We have downloaded its genome in advance, which is this file.Let’s take a look. Since TopHat uses Bowtie2 to map reads, we need to build index files for Bowtie2. We can build it by this command. This process takes some time, so we skip the running and check the result. These files ending with “bt2” are the index files for Bowtie2. We also need the paired-end reads files in fastq format.The two fastq files are ended with “_1” and “_2”, respectively. This is what the files looks like. Every 4 consecutive lines represent a read. The first line is its name.The second line is its sequence.The third line is a plus symbol. The fourth line is a series of symbols describing the quality for each base above. In simulated data they are all “I”. In practice, we need to do for the raw data some very important preprocessing steps, including quality evaluation and filtering. Then we need to prepare this annotation file, though it is not a must. This annotation in GTF or GFF3 format, if annotated well enough, can be downloaded from UCSC. Let’s take a closer look. Let’s check the first line.For example, this is an exon, and it is on this chromosome. It starts from here and ends at here.It resides on the plus strand and belongs to this gene and this transcript, and it is also the first exon. Such annotations can provide for TopHat information of splicing sites. We can run TopHat once these are ready. Here -p, -G, and -o denote the number of threads, the annotation file, and the output directory, respectively. Following the options is the indices of reference sequence, and finally there are the two reads fastq files. Now let’s type this command to run TopHat. Because it really takes some time to finish, we will just skip the running and check the result we have produced before. Let’s take a look at the C1_R1 output directory of TopHat. First let’s check the logs. The tophat.log file describes how TopHat ran in the whole. Now let’s check the align_summary file which describes some statistics of reads mapping. We can see that all reads have been mapped back to the genome, 1.6% of which are mapped to multiple loci. Since the data is made from simulation, the proportion of mapped reads is 100%. This can hardly be true in practice, and it would be very nice to have such proportion reaching 90%.Of course, 60%~70% is also acceptable. Now let’s check the bam file. This file describes how reads are mapped back to the genome. We need to use samtools to view this binary file. Let’s take a closer look. Each line of this bam file describes a mapping of a read to the reference genome. For example, let’s check this line. This number describes various aspects of the mapping.For example, whether this read is mapped or not, to which strand this read is mapped, whether its paired read is also mapped, etc. This is the name of the reference sequence to which the read is mapped. This is the position of the leftmost base of the read on the reference sequence. This is the quality of mapping. The 75M tells us that all the 75 bases of the read have been mapped back to the genome. Here’s another different case. The first 66 bases of the read are mapped to the genome, following by a 76-bp gap and then another consecutive mapping of the last 9 bases of the read. This read has been mapped to a junction site. This is the position of the read paired with the previous read on the reference genome. This is the length of fragment from whose two ends the paired reads are sequenced. That’s all for TopHat.Now let’s talk about Cufflinks. Cufflinks is a toolkit for transcript assembly, expression estimation, and differential expression analysis. Let’s talk about the transcript assembly and expression estimation first. As mentioned before, reads are generated from transcripts. If a gene has multiple transcript structures, such as the red, blue, and yellow transcripts depicted in this figure, then we cannot tell from which transcript each read comes from, except for those reads colored in red, blue, or yellow. Then how can we reveal the truth as much as possible? Cufflinks aims at assembling the most likely transcript structure and estimate its expression level. Now let’s explain some important command-line options of Cufflinks. The -G option provides an annotation file for Cufflinks to assemble transcripts that exist ONLY in this annotation. The -g option also provides an annotation file, but Cufflinks will assemble new transcripts guided by the known transcripts in this annotation. The -u option tells Cufflinks to use a method that is more accurate to process multi-mapped reads. Without the -u option specified, Cufflinks will only simply distribute the reads uniformly. For example, a read that is mapped to 10 positions will make Cufflink assign to each position 0.1 read. With the -u option specified, Cufflinks can distribute the reads more accurately. For example, the reads will be distributed evenly first, and the probability of each read being distributed to each position will be estimated by the expression levels of the 10 positions.the expression levels of the 10 positions. In fact, Cufflinks will use EM to iterate and get the read distribution that is most likely given the current observed data. The “library type” is similar as TopHat’s. Below is a demonstration of Cufflinks. These options have been introduced before, and this bam file is the result of TopHat we have just run. Let’s run this command. This will take some time to finish, so we skip this and check the result directly. Let’s take a look at this output directory by Cufflinks. This file describes the transcripts assembled by Cufflinks. Let’s take a closer look. Last section we have introduced Tophat and Cufflinks, now we will show how to navigate Cuffmerge, Cuffdiff and CummeRbund. When will process several samples using Cufflinks, we need to merge all the transcripts data into a more comprehensive transcript. Cuffmerge is a tool to merge gtf files into a gtf files with more comprehensive gene annotation. As the figure shows, the 6 transcripts were merged into one. At the same time, the output can be more accurate with the genome annotation available. This merged assembly (transcript) provides a uniform basis for calculating gene and transcript expression in each condition. Now let’s make acquaintance with the options of Cuffmerge. -g：An optional "reference" annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output. -p：Use this many threads to align reads. The default is 1. -s：This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. It is a simple example of Cuffmerge These options have been mentioned below. The last item is a list which consists of the file paths of assembled transcripts from Cufflinks. Now let’s run Cuffmerge. First of all we need input “cat” command to create a list, which consists of file paths of all assembled transcripts. Then we run Cuffmerge. All the output files are in merge_asm fold. the merge_asm fold includes a fold named logs and a .gtf file, which is an integrated transcript file. When we get assembled transcripts using Cufflinks, we can calculate the expression lever of different transcripts. The basic principle is that given exon length and sequencing depth, the number of reads is proportional to the number of corresponding transcripts, so we can calculate the expression lever of transcripts by counting reads.Also Cuffdiff can calculate the significant difference of gene expression lever between samples on different conditions.expression lever between samples on different conditions. Now let’s make acquaintance with some important options of Cuffdiff. -u Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome. -L Specify a label for each sample, which will be included in various output files produced by Cuffdiff. It is a simple example of Cutdiff.These options have been mentioned below. The next is gtf file from Cuffmerge. Cuffdiff requires that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. If you have more than one replicate for a sample, supply the BAM/SAM files for the sample as a single comma-separated list. It is not necessary to have the same number of replicates for each sample. Now let’s run Cuffmerge. This is the result of the operation. All the output files are in the diff_out fold, which includes the analysis of some groups, such as the same TSS or the same CDS. We can perform further analysis with them. After performing the analysis, we hope to get visualized consequences. CummeRbund has an excellent capability in drawing as a package aimed to process data. aimed to process data. We can also input these codes in R environment to make it downloaded and installed. Common drawing methods are: density distribution; scatter; volcano diagram and histogram. The inputs of these functions are all processed data from Cuffdiff. These letters are supposed to marker different samples. Now let’s show these functions described previously. First let's enter R environment. We load the installed package "CummerRbund",then load the data from Cuffdiff by the "readCufflink" command Now we are going to plot. This is a density distribution of transcript expression levers. This is a scatter of transcript expression lever on two different conditions. The transcript expression levers whose corresponding spots lie on different sides of y=x line have condition bias. his is a volcano diagram. The horizontal axis represents the logarithm of the ratio which is between the expression levers of specific genes under different conditions. The vertical axis means the negative logarithm of p-value in the T-test. So the genes whose spots are above the cutoff line have differential expression under different conditions. We can also use getgene command to analyze specific gene. This is a histogram of expression lever of a specific gene. This is a histogram of expression lever of all the splicing isoforms from a specific gene. For more details, please pay attention to websites and the manuals. In the end, thank you for watching. Goodbye!