Welcome to Command Line Tools for Genomic Data Science. Today, we'll be talking about Tools for Transcriptomics as part of lecture number four. We'll start with a brief overview of the molecular biology concepts, primarily genes and transcripts, then present the standard workflow for a transcript [INAUDIBLE] analysis. And lastly, we will do a hands-on analysis of transcript [INAUDIBLE] data using basic command line tools. So let's get started.
Genes occupy a specific location in the genome, with an initiation point and an ending point, on one of the two strands of DNA, and they have a very specific organization. They are formed of informative blocks, which we call exons, interspersed with uninformative blocks, which we call introns. That is where genes are located, where they are housed in the genome. But genes also take a body of their own, and they do so during the process of gene expression; we say that the gene is expressed. The process of gene expression is complex, and it starts with transcription, during which an RNA polymerase creates a copy of the gene, nucleotide by nucleotide from the beginning to the end, as a single-stranded RNA molecule. We call this pre-mRNA. It contains the exons interspersed with the introns. This molecule, still located in the nucleus, is highly unstable, and it undergoes a number of modifications that stabilize it before it gets to the next step. First there is capping at the five prime end, then cleavage of the three prime end and addition of a long poly(A) tail. But perhaps the most important for our work today is the process of splicing. During splicing, exons are joined together and introns are spliced out, and the result is a so-called messenger RNA molecule. Messenger RNA then travels from the nucleus into the cytoplasm, where it is later translated into a protein. So overall, this is called gene expression.
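The transcription and splicing steps just described can be pictured with a small Python sketch. It is only a toy illustration, not a model of the real biochemistry: the sequence, the exon coordinates, and the function names `transcribe` and `splice` are all made up for this example, the input is assumed to be the coding (sense) strand so that the RNA is the same sequence with U in place of T, and capping and the poly(A) tail are left out.

```python
# Toy illustration only: the sequence and exon coordinates are invented.

def transcribe(gene_dna: str) -> str:
    """Copy coding-strand DNA into pre-mRNA (same sequence, U instead of T)."""
    return gene_dna.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon intervals (0-based, end-exclusive) and drop the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Three exons (upper case) separated by two introns (lower case).
gene = "ATGGCC" + "gtaagtttcag" + "GATTAC" + "gtgagtccag" + "TTGTAA"
exons = [(0, 6), (17, 23), (33, 39)]

pre_mrna = transcribe(gene)     # exons and introns together
mrna = splice(pre_mrna, exons)  # exons only

print(pre_mrna)
print(mrna)
```

The pre-mRNA keeps every nucleotide of the gene, while the spliced mRNA keeps only the three exon blocks joined end to end.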
As you can see here, we have one gene and one mRNA with three exons, and these form one protein. However, under different circumstances, for instance in different tissues or under different conditions, such as diseased versus healthy tissue, different combinations of the gene's exons might be used to form the messenger RNA. We also call the messenger RNAs transcripts. So in this particular case at the bottom, you'll see a messenger RNA that only contains exons one and three; exon two is skipped. We call this exon skipping, and we call the process by which a gene can express different combinations of exons alternative splicing. So we have genes, and each gene can produce multiple splice variants, multiple transcripts, or so-called isoforms. We will be using these terms, isoforms, splice variants and transcripts, interchangeably throughout today's presentation. The example that I've shown you is a very simple one. It's the case in which one exon can be present in or excluded from the messenger RNA. What I'm showing you here is a much more complicated example, the BRCA1 gene. You see a view from the UCSC browser, and on each line you see a representation of a different transcript of this gene. The vertical bars represent exons, whereas the horizontal lines that connect them are introns. Obviously, introns are much larger than exons. What you see here is that every transcript represents a different combination of exons. For instance, you will see that this exon, exon number eight, appears in only one transcript, the third one from the bottom. You also see that the transcript at the very bottom skips a large number of exons: exons two, three, four and so on, all the way through to the last one. So there is a large variety of transcripts, and to see how daunting alternative splicing is, consider that 90% of human genes are alternatively spliced to form two or more isoforms, up to potentially thousands.
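To get a feel for how quickly the number of possible isoforms can grow, here is a minimal Python sketch of exon skipping. It rests on a strong simplifying assumption that is not from the lecture: the first and last exons are always kept, and every internal exon can be included or skipped independently. Real genes use only some of these combinations and also show other kinds of splicing events. The function name `cassette_isoforms` is hypothetical.

```python
from itertools import combinations


def cassette_isoforms(n_exons: int) -> list[tuple[int, ...]]:
    """List exon combinations that keep the first and last exon and any
    subset of the internal exons, always in genomic order."""
    if n_exons < 2:
        raise ValueError("this sketch needs at least two exons")
    internal = range(2, n_exons)  # exon numbers 2 .. n_exons - 1
    isoforms = []
    # Start with the full-length transcript, then skip more and more exons.
    for n_kept in range(len(internal), -1, -1):
        for kept in combinations(internal, n_kept):
            isoforms.append((1, *kept, n_exons))
    return isoforms


# The simple three-exon gene: exon two is either included or skipped.
print(cassette_isoforms(3))

# Under the same assumption, a gene with 12 exons has 2**10 combinations.
print(len(cassette_isoforms(12)))
```

For the three-exon gene this gives the two transcripts from the simple example, one with exons one, two and three, and one with exons one and three only.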
Many of the aberrant isoforms have been associated with diseases. So determining the repertoire of genes and the repertoire of isoforms, for a species in general or under a particular cellular condition, is very important. Let's look at a very high level at the state of the cell and the content of the cell. At any given time, there is one copy of the genome, and then, both in the nucleus and in the cytoplasm, we have multiple genes being expressed, each gene with potentially multiple copies of mRNA and with multiple possible transcript variants. So in any biological or clinical analysis, we try to interrogate the status of the cell, in particular the status of the transcriptome, by asking the following questions. What genes are expressed in the sample? What transcripts are expressed for every gene? Then for each gene, how many copies of the gene do we have? And how many copies of each type of transcript? And lastly, assuming that the analysis is trying to compare diseased versus healthy tissue, or any other two conditions, how do the expression levels and splicing patterns differ between the two conditions? These are the questions tackled by a typical RNA-seq analysis, and we're going to be looking at that next.
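These questions can be made concrete with a small Python sketch. Everything in it is hypothetical: the gene and transcript names, the copy numbers, and the helper `summarize` are invented for illustration, and a real analysis would estimate such numbers from sequencing reads rather than start from them.

```python
from collections import defaultdict

# Hypothetical transcript copy numbers in two conditions (made-up values).
counts = {
    "healthy": {("geneA", "A.1"): 30, ("geneA", "A.2"): 10, ("geneB", "B.1"): 5},
    "disease": {("geneA", "A.1"): 10, ("geneA", "A.2"): 30, ("geneB", "B.1"): 0},
}


def summarize(sample):
    """Return gene-level totals for expressed genes, and each transcript's
    share of its gene's total."""
    gene_total = defaultdict(int)
    for (gene, _transcript), n in sample.items():
        gene_total[gene] += n
    expressed = {gene: n for gene, n in gene_total.items() if n > 0}
    fractions = {
        (gene, transcript): n / expressed[gene]
        for (gene, transcript), n in sample.items()
        if gene in expressed
    }
    return expressed, fractions


for condition, sample in counts.items():
    expressed, fractions = summarize(sample)
    print(condition, expressed, fractions)
```

In this made-up data, geneA has the same total in both conditions but its dominant transcript switches, which is a difference in splicing pattern, while geneB is expressed only in the healthy sample, which is a difference in expression level.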