Hi, my name is Yan Kou. I'm a PhD student in Dr. Avi Ma'ayan's lab. In this and the following lectures, I will guide you through the basic analysis of next-generation sequencing data step by step. I will also introduce the basic usage of Linux/Unix commands, as most bioinformatics software tools are built on these open-source platforms.

Let's start with RNA-seq analysis. Before we begin, let's review some simple facts about RNA-seq. Why has it become popular in the research community over the past few years? What is the advantage of RNA-seq over microarrays? What are the limits of RNA-seq? And, the topic of this lecture, what is the standard pipeline to process RNA-seq data?

RNA-seq is a high-throughput, or next-generation, sequencing method to measure the genome-wide transcriptome, that is, the RNA content of a biological sample. The RNA is extracted from the cells, reverse transcribed to cDNA, and chopped into short fragments, followed by massively parallel sequencing. The sequences generated are short reads, which will be mapped to the reference genome if one is available, or assembled into a transcriptome if a reference genome is not available. So transcriptome sequencing, a.k.a. RNA-seq, can evaluate absolute transcript levels of sequenced and unsequenced organisms, detect novel transcripts and isoforms, identify previously unannotated 5' and 3' cDNA ends, map exon/intron boundaries, reveal sequence variations and splice variants, and much more.

During the past five years, next-generation sequencing has been established in almost all DNA-based molecular research fields, and can therefore be regarded as an all-purpose platform. As more data get deposited in public repositories, many analyses become implementable, just as happened when microarray data were shared. However, the usefulness of RNA-seq is limited by both experimental and analytical factors. Non-uniform coverage of the genome introduces bias toward transcriptionally active genes. And although this is being improved, RNA-seq is not good at capturing short transcripts such as microRNAs. The sequencing error rate of next-generation technology is usually low, but as massive amounts of reads are generated, mapping errors start to accumulate; besides, aligners usually fail to map reads to repetitive elements. As for the downstream data analysis, the mapping and assembly algorithms still need to be improved, and the bioinformatics software needs to be simplified for use by biologists. A more practical bottleneck is the cost, which is still not low enough for routine use. Different research groups and companies are currently working on developing the third generation of sequencing, and in the near future sequencing a human genome could, for example, cost less than 1,000 dollars. So this is just a simplified introduction to RNA-seq; if you'd like to know more, I list several reference reviews at the end of the slides.

Now let's get down to data processing. When measuring transcription, the most common task is to identify differentially expressed genes between treatment and control experimental conditions. Sometimes the experiments are done at different time points or different dosages, and we'd like to know the differentially expressed genes across a series of conditions. Therefore, the example data we use in this tutorial is designed as knockout and wild-type samples from three different time points.
We will identify the differentially expressed genes between knockout and wild type at each time point, as well as the differentially expressed genes between neighboring time points, using the pipeline of one of the most popular RNA-seq analysis toolsets, TopHat and Cufflinks.

This is the workflow of the RNA-seq analysis pipeline. Once we get the short reads from the sequencer's output, the first step is to map them to the genome using TopHat. Here we're dealing with the situation where a reference genome for the organism is available and well annotated. If no reference genome is available, we need to assemble the transcriptome using de novo methods; the details of de novo transcriptome assembly can be found in the references. TopHat is a sequence aligner designed specifically for RNA-seq; it is required because regular next-generation sequence aligners fail to map junction reads, the discontinuous sequences that connect two exons. TopHat breaks down these junction reads and searches for the neighboring exons that best explain those sequences.

The next step is to assemble the transcriptome with Cufflinks. It takes the aligned data from TopHat and measures the transcription level by the normalized number of reads mapped to defined regions, such as transcripts, genes, and isoforms. Differentially expressed genes can be identified using the Cuffdiff module in Cufflinks. Once the transcriptome is constructed, we will be able to visualize our RNA-seq data with the R package CummeRbund and proceed to further analysis, including hierarchical clustering, pathway analysis, enrichment analysis, principal component analysis, etc. I will introduce CummeRbund in the following lecture, and these popular data analysis methods will be covered in the rest of this course.

One important thing to bear in mind for data analysis: always check the quality of the data before any further analysis. For example, what is the distribution of quality scores of the sequences? How much of the sequence can be mapped to the genome? Are the expression levels of different samples of the same magnitude?

Unlike with microarrays, the number of replicates in an RNA-seq experiment is usually limited by the amount of biological material and the economic cost. Therefore, the data analysis workflow is slightly different for the two situations. For experiments without replicates, the transcriptome assembly and the identification of differentially expressed genes can be combined into one step using the Cuffdiff functionality in Cufflinks. When replicates are available, it's recommended to first assemble the transcriptome for each sample, and then merge the replicate assemblies into one afterwards; differentially expressed genes can then be identified with Cuffdiff once the transcriptome is constructed for each biological condition. Cufflinks provides convenient functionality to handle both tasks (I sketch the corresponding commands a bit further below).

Using the TopHat and Cufflinks pipeline is challenging for biologists, since it's designed to run on UNIX/Linux systems. Although most of the popular Linux distributions come with a graphical user interface, TopHat and Cufflinks have no such utility, and everything has to be invoked and set up from the command line. Therefore, we will first learn some basic UNIX/Linux commands before we can start. The commands will be demonstrated on a Linux system. For Mac users, simply open the Terminal and the commands will work exactly the same as shown in the lecture.
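Just for orientation, here is a minimal sketch of what the pipeline will look like on the command line for a single knockout-versus-wild-type comparison. The index name mm9, the annotation genes.gtf, and the read file names are placeholders of my own choosing; we will build the real commands step by step in the next lecture.

    # Step 1: map the reads of each sample to the genome with TopHat
    tophat -o tophat_wt mm9 wt_reads.fastq
    tophat -o tophat_ko mm9 ko_reads.fastq

    # Step 2: assemble a transcriptome for each sample with Cufflinks
    cufflinks -o cufflinks_wt tophat_wt/accepted_hits.bam
    cufflinks -o cufflinks_ko tophat_ko/accepted_hits.bam

    # Step 3: merge the per-sample assemblies into one (assemblies.txt
    # lists the paths of the transcripts.gtf files produced in step 2)
    cuffmerge -g genes.gtf assemblies.txt

    # Step 4: identify differentially expressed genes with Cuffdiff
    cuffdiff -o diff_out merged_asm/merged.gtf \
        tophat_wt/accepted_hits.bam tophat_ko/accepted_hits.bam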
For Windows users, since the default command prompt is not Linux based, we need to mimic a Linux environment with the software Cygwin. I've written step-by-step instructions on how to install Cygwin on Windows; you can find them in the supplement for this lecture.

First, let's open Cygwin. The default directory when you open Cygwin is the user's home directory. Cygwin treats the file system differently from Windows: it puts all disks under one folder called cygdrive, as I will show later. To find out the current directory, we can use the command pwd. Next, we want to see what's in the directory, and for that we use the ls command. Right now it's empty, so nothing shows up.

Now let's go back to the root directory and take a look at the file organization. We can use the cd command, short for "change directory", to move around. You can type the path you want to move to, or use ../ to move one level up. Here, I used it twice to move up to the root. List again, and you can see the root folder's content. It's similar to the Linux system, but not exactly the same. I mentioned that all the disks of the Windows system are put under the cygdrive folder, so let's cd in there and take a look. List again, and we can see two disks: C is the default drive and Q is my external hard drive. Move to the C drive and list all files, and now we can see something familiar: Documents and Settings, Users, Program Files, etc. The example I want to show is in the Documents folder, so cd there. Here it is, the tutorial folder. I put a text file in this folder; it's actually a GTF annotation file of the mouse genome, which we'll touch on later.

First, let's do some simple operations. To create a new directory, we can use the mkdir command followed by the name of the new directory, newdir for example. Now list again, and we can see the new directory we just created. Of course it's empty, because we haven't put anything in there. Let's go back and copy the text file to the new folder. The Windows-style copy and paste can be done with one command, cp: simply type cp followed by the name of the file we want to copy and the path to the directory, and it's done. The trailing slash means this is a directory rather than a file. Go to newdir, and the file is already pasted there.

Do you wonder what happened to cut and paste? In Linux, the cut-and-paste operation is combined with renaming a file, meaning that the two share the same command, mv. For example, if we type mv with the file name followed by a new file name, the file will be renamed. But if we type a directory instead of a new file name, the file will be cut and pasted into that directory, and the original name of the file will be kept. Here, we can see the file has been renamed. (These basic commands are recapped in the short sketch that follows.)
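To recap the demo so far, the navigation and file operations look roughly like this. The names genes.gtf, newdir, and mm9_genes.gtf are placeholders standing in for the files I used on screen.

    pwd                         # print the path of the current working directory
    ls                          # list the contents of the current directory
    cd /cygdrive/c              # Cygwin mounts the Windows drives under /cygdrive
    cd ../..                    # ../ moves one level up; here it is used twice
    mkdir newdir                # create a new directory named newdir
    cp genes.gtf newdir/        # copy a file; the trailing slash marks a directory
    mv genes.gtf mm9_genes.gtf  # the same command renames a file...
    mv mm9_genes.gtf newdir/    # ...or cuts and pastes it into a directory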
Next, I want to show you how to view the top or the bottom of a file. This is particularly useful when dealing with genomic files, since they're usually very large, and a regular text editor either won't open them or will dramatically slow down your computer. But genomic files usually contain structured data, so we can view only the first few lines or the last few lines to get a sense of what the data look like without opening the entire file. The commands are head and tail. Using head, followed by how many lines we want to see, followed by the name of the file, we get the first few lines shown in the terminal. For example, here I want to see the first five lines of this mouse genome annotation file. It's a bit hard to tell how many lines there are, since they're wrapped, but if we count we can see it's five. The same goes for tail: here we're showing the bottom five lines of the text file.

One of my favorite Linux commands when dealing with large files is wc, the command that counts how many lines there are in a file. You remember how annoying it is to open a large table in Excel and drag your mouse down to the end just to see how many rows there are. In Linux, you can just type wc -l followed by the file name, and the number of lines shows up instantly. Here we have 30,000-some transcripts in the mouse genome.

The last command I want to show is grep; it's extremely handy for handling plain text files. For example, in this genome annotation file, each line contains a transcript; therefore, for a single gene, multiple lines exist if it has more than one transcript. Now I want to see all transcripts of Brca1, and I can use the grep command to extract all lines that match the string Brca1. It turns out Brca1 has only one transcript. Using grep, we don't need to give the full name of the gene. For example, if I want to see all transcripts of genes that start with Fut, I can type grep Fut, without the numbers, and I'll get the full list of genes beginning with Fut.

As I mentioned, grep has a lot of parameters we can play with. Where do we find them? As a matter of fact, most Linux commands come with a set of parameters, and to use them we need to read the documentation for the command. Conveniently, this can be done in the terminal; it's what Linux users call the man page. Simply type man followed by the name of the command, and the full documentation of its usage will show up in the terminal: name, synopsis, description, etc. You can type q to quit the man page and get back to the terminal. So, to remind you, this is a screenshot of the commands (they are also recapped below). And if you are still cloudy about these user-unfriendly commands, you can get a cheat sheet like this one, with all the essentials printed on it.

The last preparation before we start the pipeline concerns file formats, the bioinformatics formats. As we know, sequencing data are very large, and are therefore stored in specific formats. The most frequently used format is Fastq. For each short read, four lines of information are written in the Fastq file. The first line starts with @; it's the sequence ID, usually generated by the machine. The second line contains the sequence of the short read in ACGT codes. The third line starts with the plus symbol; anything after the plus is optional. The fourth line contains the same number of symbols as the sequence in the second line, and it encodes the quality value of each base; these quality values can be read by software to evaluate the sequencing quality. Each Fastq file aggregates millions of these reads in the same format (an example record is shown below).

TopHat takes the Fastq file as input, aligns each read to the genome, and generates a BAM file that contains the chromosome coordinates of each read. The BAM file is a binary, compressed, and indexable representation of the alignments; therefore it's not human readable, but it is convenient for visualization with a genome browser. Some well-established tools, such as SAMtools or BamTools, can be used to work with BAM files.
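To make these formats concrete, a single Fastq record looks like the following; the ID, bases, and quality string here are made up for illustration.

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

And since a BAM file is binary, you cannot head or grep it directly; SAMtools can print the alignments as text so you can peek at them:

    samtools view accepted_hits.bam | head -n 5   # show the first five alignments

And to remind you once more, here is a plain-text recap of the text-inspection commands from this demo, with genes.gtf again standing in for the annotation file.

    head -n 5 genes.gtf   # show only the first five lines of the file
    tail -n 5 genes.gtf   # show the last five lines instead
    wc -l genes.gtf       # count how many lines (here, transcripts) there are
    grep Brca1 genes.gtf  # print every line that matches the string Brca1
    grep Fut genes.gtf    # partial match: every line containing Fut
    man grep              # open the manual page for grep; press q to quit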
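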
Here is the list of recommended reading. The review articles give a detailed introduction to computational methods for next-generation sequencing analysis, and the other two papers describe how TopHat and Cufflinks work.

In the next lecture, I will walk you through the TopHat and Cufflinks pipeline step by step. Thanks for watching. I'll see you in the next lecture. [MUSIC]