A transcriptomic analysis surveys the state of the transcriptome within a particular cell, and it tries to answer the following questions. First, what are the genes that are expressed in the sample? And for every gene, what are its transcript variants? We call that assembly. Then, what are the expression levels of the genes and of the transcripts? We call that quantification. And lastly, we might want to perform a differential analysis, looking at how expression levels and splicing patterns differ between two particular conditions, say healthy versus diseased tissue.
So let me walk you through a typical workflow for a transcriptomic, or so-called RNA-seq, analysis. It is called RNA-seq because it refers to analyzing next-generation sequencing data of cellular RNA. It starts with the RNA-seq experiment that generates reads, as we described a couple of lectures ago. Then the short reads, which are typically between 50 and 250 base pairs long, are mapped to the genome, or, using another word, aligned to the genome. Then the read alignments on the genome are clustered into gene-oriented groups to form a basic structure for the gene and for its transcripts. In the following step, a number of transcripts are read, or decoded, from the structure and quantified, meaning reads are assigned to them. And lastly, if we have two or more conditions, possibly with multiple replicates, then the analysis continues with a differential analysis at the expression level or at the splicing level between the conditions. What I'm showing you here is two conditions, test versus control, each one of them with three replicates.
A crucial step in this analysis is the part that assembles the transcripts and then quantifies them, so I'm going to spend a little bit more time explaining those. What you see here on the left-hand side is the computational process that creates a representation of a gene and its splice variants.
So we start with an RNA molecule, shown at the top. You see its organization into exons and the exons' locations along the genome, and the portions that are alternatively spliced are shown in red. So you see here that we have an exon skipping event, where exon two is skipped: it is included in some transcripts and excluded from others. And we also have another event that we call intron retention: that particular intron is sometimes included in the exon, sometimes not.
Once the RNA molecule is sequenced, we have a number of reads, which we show underneath. The reads are mapped to the genome. If a read falls entirely inside an exon, then it is going to be aligned as a contiguous fragment. However, if the read spans the boundary between two different exons, its alignment is going to have to be spliced. We talked already about spliced alignments.
Now, this type of information, when seen along the genome, gives us two particular types of clues that can inform transcript assemblers and gene-finding methods as to how they can build the most likely gene structure. The first type of information comes from the bulk of the alignments. Because the reads come from the exons, when viewed along the genome the alignments are going to cluster in columns at the locations of the exons. So the bulk of the alignments tells us roughly where the exons are located. The other type of information comes from the spliced reads, and they give us information about the introns, the introns connecting the exons. So that's what you see in the next two pictures: the levels of the reads along the genome, and the splice junctions that can be obtained from the spliced reads.
The next stage of transcript assembly methods typically creates a graph representation that combines the gene and its transcripts into a compact form. There are a variety of graph representations: overlap graph, exon graph, sub-exon graph, connectivity graph.
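The two clues described above, read levels along the genome and splice junctions from spliced reads, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment format (a list of half-open `(start, end)` blocks per read), the function names and the toy coordinates are all hypothetical, and real assemblers work from SAM/BAM records and are far more careful about noise.

```python
from collections import Counter


def coverage_and_junctions(alignments, genome_length):
    """Summarize alignments as per-base coverage and splice-junction counts.

    Each alignment is a list of (start, end) blocks in half-open genome
    coordinates; a spliced read has more than one block. (Assumed format.)
    """
    coverage = [0] * genome_length
    junctions = Counter()
    for blocks in alignments:
        for start, end in blocks:
            for pos in range(start, end):
                coverage[pos] += 1
        # The gap between consecutive blocks of one read is an intron.
        for (_, intron_start), (intron_end, _) in zip(blocks, blocks[1:]):
            junctions[(intron_start, intron_end)] += 1
    return coverage, junctions


def covered_regions(coverage, min_depth=1):
    """Return maximal (start, end) runs with depth >= min_depth: rough exons."""
    regions, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions


# Hypothetical toy gene with exons at [10,20), [40,50) and [70,90).
TOY_ALIGNMENTS = [
    [(10, 18)],            # contiguous read inside exon 1
    [(12, 20)],
    [(15, 20), (40, 43)],  # spliced read: exon 1 to exon 2
    [(41, 50)],
    [(45, 50), (70, 73)],  # spliced read: exon 2 to exon 3
    [(16, 20), (70, 74)],  # spliced read that skips exon 2
    [(72, 88)],
    [(75, 90)],
]

coverage, junctions = coverage_and_junctions(TOY_ALIGNMENTS, genome_length=100)
print("approximate exons:", covered_regions(coverage))
print("introns and their read support:", sorted(junctions.items()))
```

The covered regions play the role of the exons and the junction counts play the role of the introns, which is exactly the information the graph in the next stage is built from.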
The one shown here is a very simple exon graph. To put it simply, exons are represented as nodes in the graph, and you see them here represented with bars, red or blue. And then we connect two exons, or two nodes, by an edge if there is an intron that connects them. So that's what we see here. Now we can traverse the graph from a node that has no incoming edge all the way to a node that has no outgoing edge, and we can obtain all the possible splice variants, or transcripts, for the gene. If you look at our example, we can have four possible splice variants.
So the splice graph and other graph representations collect and present information about the gene and its splice variants in a compact form. However, the information that they encode might be much more than what we need, because oftentimes the number of splice variants that are encoded in the graph far exceeds the number of true splice variants. So a very challenging question is: how do we identify the most likely splice variants that are encoded in the graph, based on the information that we have from the reads?
What we see here on the right-hand side is how this problem is addressed. We start, let's say, by enumerating all splice variants, and in this case we had four, and then by assigning reads back to them, usually using some sophisticated algorithm; it can be expectation maximization or a linear program. Once we identify which reads belong to each transcript, we can calculate an expression level for the transcript, and then we select the top variants as being the most likely to be expressed in that particular sample. So there it is, the two-stage process: the graph representation, followed by transcript selection and quantification. So that's transcript assembly.
Now let's look at the entire workflow, and I will point out along the way some of the tools, the basic command-line tools, that can be used to perform each stage.
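The two-stage process described above, traversing the exon graph and then assigning reads back to the candidate transcripts, can be sketched as follows. Everything here is a simplified assumption for illustration: the node names (`E1` to `E4`, and `E3r` for the exon with the retained intron), the toy read compatibilities and the function names are hypothetical, and the expectation maximization shown ignores transcript length and read errors, which real tools do account for.

```python
from collections import defaultdict


def enumerate_transcripts(edges):
    """List every path from a node with no incoming edge to a node with no
    outgoing edge. Nodes are exons; each edge is an intron joining two exons."""
    successors = defaultdict(list)
    nodes, has_incoming = set(), set()
    for src, dst in edges:
        successors[src].append(dst)
        nodes.update((src, dst))
        has_incoming.add(dst)

    paths = []

    def walk(node, path):
        next_nodes = successors.get(node, [])
        if not next_nodes:  # no outgoing edge: the transcript ends here
            paths.append(path)
            return
        for nxt in next_nodes:
            walk(nxt, path + [nxt])

    for source in sorted(nodes - has_incoming):
        walk(source, [source])
    return paths


def em_abundances(read_compat, n_transcripts, n_iter=100):
    """Estimate relative transcript abundances with a bare-bones EM.

    read_compat[i] lists the indices of the transcripts read i could come
    from. Transcript lengths are ignored here (a simplifying assumption).
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                # E-step: split the read among its compatible transcripts.
                counts[t] += theta[t] / total
        # M-step: abundances are the normalized fractional read counts.
        theta = [c / len(read_compat) for c in counts]
    return theta


# Hypothetical graph: E2 can be skipped, and E3r stands for exon 3 with the
# following intron retained, so it replaces the pair E3 -> E4.
TOY_EDGES = [("E1", "E2"), ("E1", "E3"), ("E1", "E3r"),
             ("E2", "E3"), ("E2", "E3r"), ("E3", "E4")]
transcripts = enumerate_transcripts(TOY_EDGES)

# Toy reads: 6 fit only transcript 0, 2 fit only transcript 2, 2 fit both.
TOY_READS = [[0]] * 6 + [[2]] * 2 + [[0, 2]] * 2
abundances = em_abundances(TOY_READS, len(transcripts))

for path, share in zip(transcripts, abundances):
    print("-".join(path), round(share, 3))
```

On this toy input the graph yields four candidate variants, but the reads support only two of them, so selecting the top variants by estimated abundance discards the other two. That is the sense in which the graph can encode more variants than are truly expressed.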
So we start by mapping the reads to the genome, and we can do so with tools such as TopHat, STAR or HISAT, for example, but there are others. The next stage is to take these alignments and assemble the read alignments into transcripts using a transcript assembler, such as Cufflinks, CLASS or iReckon. In the third stage, we reconcile transcripts across multiple samples, in case we have multiple samples, and this is necessary because there might not be enough reads in every sample to assemble the transcripts fully. And then we would like to know, for each fragment, what transcript it comes from. So lastly, once we have a unified reference for what transcripts are being expressed in a set of samples, we can quantify the isoform expression, and we can finally perform a differential analysis among the samples. And this can be done with a tool such as Cuffdiff or the more recent Ballgown.
So in the next sections, I will illustrate how we can use representative tools for each of these stages. In particular, I will focus on the Tuxedo suite, shown here with the stack TopHat, Cufflinks, Cuffmerge and Cuffdiff.
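To make the order of the stages concrete, here is a rough Python sketch that only builds the command lines for the Tuxedo stack, without running anything. The sample names, FASTQ paths, index name and directory layout are hypothetical. The output file names (`accepted_hits.bam`, `transcripts.gtf`, `merged.gtf`) are assumed to be the tools' defaults, and only the basic `-o` and `-L` options are used; both should be checked against the manuals for the installed versions.

```python
def tuxedo_commands(samples, index="genome_index", out="out"):
    """Build (but do not run) the commands for the four workflow stages.

    samples maps a condition name to a list of (replicate_name, fastq_path).
    All paths and names are placeholders for illustration.
    """
    commands, gtfs, bams_by_condition = [], [], {}
    for condition, replicates in samples.items():
        bams = []
        for name, fastq in replicates:
            align_dir = f"{out}/{name}_tophat"
            asm_dir = f"{out}/{name}_cufflinks"
            # Stage 1: spliced alignment of the reads to the genome.
            commands.append(["tophat", "-o", align_dir, index, fastq])
            bam = f"{align_dir}/accepted_hits.bam"  # assumed default name
            # Stage 2: per-sample transcript assembly.
            commands.append(["cufflinks", "-o", asm_dir, bam])
            gtfs.append(f"{asm_dir}/transcripts.gtf")  # assumed default name
            bams.append(bam)
        bams_by_condition[condition] = bams

    # Stage 3: reconcile the per-sample assemblies into one reference.
    # The manifest is a text file listing the GTF paths, one per line.
    manifest = f"{out}/assemblies.txt"
    commands.append(["cuffmerge", "-o", f"{out}/merged", manifest])

    # Stage 4: quantification and differential analysis; replicates of one
    # condition are joined with commas, conditions are separate arguments.
    commands.append(
        ["cuffdiff", "-o", f"{out}/diff", "-L", ",".join(bams_by_condition),
         f"{out}/merged/merged.gtf"]
        + [",".join(bams) for bams in bams_by_condition.values()]
    )
    return commands, gtfs


# Two conditions with three replicates each, as in the lecture's example.
TOY_SAMPLES = {
    "control": [(f"control_{i}", f"reads/control_{i}.fastq") for i in (1, 2, 3)],
    "test": [(f"test_{i}", f"reads/test_{i}.fastq") for i in (1, 2, 3)],
}

commands, gtfs = tuxedo_commands(TOY_SAMPLES)
print("assemblies.txt would list:")
print("\n".join(gtfs))
for command in commands:
    print(" ".join(command))
```

Building the commands as lists keeps the sketch free of side effects; to actually run the pipeline, the manifest would be written to disk first and each list passed to something like `subprocess.run`.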