Welcome to Chapter 3: Annotating the SARS-CoV-2 Genome. In this chapter, we will discuss how we can interpret the SARS-CoV-2 genome sequence. Up until this point, we have assembled a viral genome, but it's still a bunch of uninterpretable As, Cs, Gs and Ts. How can we actually understand the virus from this? For this, we shall explore the concept of genome annotation.
But before we even discuss genome annotation, we need to discuss the Central Dogma of Molecular Biology, which describes how genetic information flows from DNA to RNA to protein. DNA can be thought of as a book that contains all the instructions for creating all the proteins of a given organism. It is broken into chapters, which we call genes, and each gene contains the instructions for creating a specific protein. As an example, let us imagine we want to build a bicycle. A single gene is transcribed into a special type of RNA called messenger RNA, or mRNA for short, which you can think of as essentially a photocopy of a single chapter in our book. The mRNA contains a copy of the instructions required for creating a specific protein. In our bicycle example, we have photocopied, or transcribed, the chapter of our DNA instruction book that describes how to build a bicycle. Finally, the RNA is translated into a protein, which is the functional unit of molecular biology. In our bicycle example, we have used the copy of the instructions, in other words the RNA, to actually build a bicycle, which is the protein.
Of course, as with all of biology, reality is much more complicated than a simplified summary. But for the purpose of this course, we will only consider the simplified model. As mentioned, DNA is transcribed into RNA, and RNA is translated into proteins. For the purpose of this course, you can think of a DNA sequence as a string over the alphabet A, C, G and T, and an RNA sequence as a string over the alphabet A, C, G and U.
The process of transcription is then simply replacing all the Ts in the DNA with Us, as shown below. For example, if I wanted to transcribe this DNA sequence here, I would just replace every T with U to get the corresponding RNA sequence.
As mentioned, a special type of RNA called mRNA is translated into proteins. Recall that RNA can be thought of as a string over the alphabet A, C, G and U. Similarly, a protein can be thought of as a string over the alphabet of amino acids, which has around 20 letters. Each triplet of RNA letters, or codon, encodes one specific amino acid, as summarized in the codon table shown below. Note that translation usually starts with the START codon, AUG, which encodes the amino acid methionine, or M. There are also three codons that designate when we should stop translating the RNA, known as STOP codons: UGA, UAA and UAG are the typical STOP codons.
The simplified algorithm behind translation, for the purpose of this course, is as follows. Start with a START codon early in the RNA sequence, which is not necessarily the first one, and translate each codon one by one until a STOP codon is reached. AUG, which encodes methionine, is a common START codon, but many bacteria also use various other START codons.
As an example, let us say I want to translate this RNA sequence. I would start by translating an early START codon, in this case AUG, which encodes M. This is the first letter in my protein sequence, as shown here. Then I take the next triplet, GCU, and translate that, which gives us A. Then I take my next triplet, ACU, and translate that, which gives us T. Continuing in the same way, the following triplets give us T, H and I. Then we have GCC, which translates to A, and S for AGU. And finally we reach UGA. UGA is a STOP codon, and so I stop translation here. So we have successfully translated the RNA sequence into the protein shown below.
>> So in our previous example, we were looking at an RNA sequence instead of a DNA sequence.
And we knew exactly where to begin translation. But if we have just assembled the genome of an unknown organism, all we'll have is the DNA sequence reported by the assembly. Now, remember that DNA is double stranded, so the DNA sequence reported by the assembly is just one of those two strands. Therefore, without any additional knowledge, there are six possible reading frames that we need to consider: three in our DNA sequence, starting at nucleotide positions one, two and three, and three in the reverse complement, also starting at nucleotide positions one, two and three. Genome annotation tools will consider all six reading frames when predicting which parts of the DNA sequence encode proteins.
So here we have an example of a translated open reading frame. Looking at this, two questions come to mind. First, does this open reading frame contain a gene? And second, if so, where is this gene located? We can answer these questions by parsing our open reading frame and looking for substrings that begin with a START codon and end with a STOP codon. Then, similar to how we used BLAST to search nucleotide sequences, we can query our potential protein sequences against massive databases of annotated protein sequences to find identical or highly similar proteins. We can then use these matches to predict properties and/or functions of the proteins we found.
For our analysis, we will use Prokka as our genome annotation tool of choice. Prokka was developed by Torsten Seemann and leverages numerous bioinformatics tools and probabilistic models trained on previously annotated genomes. You can follow the instructions in our course to learn how to run Prokka on our SARS-CoV-2 assembly. But in this course we will only use Prokka and not discuss the algorithms it uses under the hood. If you're interested in learning about those algorithms, you can read the related chapters in the textbook Bioinformatics Algorithms: An Active Learning Approach.
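To make the simplified model from this chapter concrete, here is a minimal Python sketch of transcription, translation, and the six-reading-frame search. It is only an illustration under the assumptions of our simplified model: it is not how Prokka works, the function names and the short DNA string are made up for this example, and it only treats AUG as a START codon.

```python
from itertools import product

# Standard genetic code, built from the usual U, C, A, G ordering.
# '*' marks the three STOP codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Simplified transcription: replace every T with U."""
    return dna.upper().replace("T", "U")


def reverse_complement(dna):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_frame(rna):
    """Translate every complete codon of an RNA string; '*' marks a STOP."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def six_frames(dna):
    """Translate all six reading frames.

    Keys are (strand, start_position), with '+' for the given strand,
    '-' for the reverse complement, and start positions 1, 2 and 3.
    """
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        rna = transcribe(seq)
        for offset in range(3):
            frames[(strand, offset + 1)] = translate_frame(rna[offset:])
    return frames


def candidate_proteins(frame_translation):
    """Find substrings that begin with M (AUG) and are followed by a STOP."""
    proteins = []
    # The piece after the last '*' has no STOP codon after it, so skip it.
    for segment in frame_translation.split("*")[:-1]:
        start = segment.find("M")
        if start != -1:
            proteins.append(segment[start:])
    return proteins


# Hypothetical toy sequence, not taken from the SARS-CoV-2 genome.
toy_dna = "CCATGGCTACTTGAGG"
for frame, protein in six_frames(toy_dna).items():
    print(frame, protein, candidate_proteins(protein))
```

In this toy sequence, only the frame starting at position three of the given strand contains a START codon followed by a STOP codon, so it is the only frame that yields a candidate protein. A real annotation tool would go much further, for example by scoring candidates with probabilistic models and comparing them against databases of known proteins.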
Now, genome annotation tools like Prokka try to predict features of a genome using probabilistic models. As the famous quote goes, all models are wrong, but some are useful. And as such, while Prokka's annotation predictions are useful, they are certainly imperfect. Fortunately for us, by the time you are taking this course, virology researchers have thoroughly curated the SARS-CoV-2 genome, and a high-quality annotation is available for us to compare against. Even in the early stages of the pandemic, Prokka was just one of multiple potential genome annotation tools, and researchers would want to be able to use multiple tools and compare the resulting annotations. For this comparison, we will be using the Integrative Genomics Viewer, or IGV, to visually compare our predicted annotation against the high-confidence SARS-CoV-2 genome annotation. Essentially, IGV will let us load multiple annotations and display them in a way that makes them easy to compare directly. IGV also lets us interact with each annotation to look at its details a bit more closely.
Now, assuming our predicted genome annotation is accurate, how does it actually help us study this novel coronavirus? It turns out that if we incorporate some existing knowledge from virological research, we can use the predicted annotation to gain some insights into the functional mechanisms behind how SARS-CoV-2 invades our cells, namely through the spike protein. To the left we see a visualization of the coronavirus. And to the right we can see a close-up of a specific protein of interest, the spike protein, which is something our Prokka annotation is able to identify. The spike protein is essential for the virus to enter a cell. It consists of two functional subunits, S1 and S2. S1 forms the globular head that recognizes and binds to specific receptors on the surface of human cells. S2, which is directly embedded in the viral particle surface, mediates the membrane fusion process.
This is the merging of the viral particle with the cell, which eventually allows the virus to invade.