Chapter 2: Sequencing the Virus. In this chapter, we'll discuss how researchers can sequence a viral sample from a patient. But first, let's talk about DNA sequencing in general.
As an analogy, let's imagine that we have a stack of identical copies of the New York Times newspaper. Imagine that we place this stack of newspapers on top of some dynamite and then light the fuse. This is of course a hypothetical scenario, so please don't try this at home. Eventually, the dynamite explodes. All that's left of our newspapers are tiny charred pieces. Our question is the following: can we somehow use those tiny charred pieces to piece together what the original newspaper said? We call this conundrum the newspaper problem. Because we had multiple copies of the newspaper, and because we definitely lost some pieces in the explosion, we can't just fit the small pieces together like a jigsaw puzzle. Instead, we need to use overlapping fragments from different copies of the newspaper to piece things together.
You're likely wondering how in the world the newspaper problem is related to genome sequencing. Well, biologists lack the technology to just read the genome of an organism from start to end like you would read a book. So instead, we start with multiple copies of the exact same genome. And just like how we blew up the newspapers in our newspaper problem, we randomly fragment these copies into small chunks. A DNA sequencer is able to read these small fragments, or reads, as we call them. Much like the newspaper problem, in which a lot of the small fragments fizzled away in the explosion, we don't actually get to observe every single possible fragment. We don't get to see every single read. So, how can we reconstruct the original genome from all these small snippets?
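To make the analogy concrete, here is a small Python sketch of the "explosion": several copies of a toy genome are cut at different random offsets, and only some of the resulting fragments survive as reads. Everything in it (the function name `sample_reads`, the toy genome string, the read length, and the keep probability) is a made-up illustration, not a model of how a real sequencer behaves.

```python
import random

def sample_reads(genome, num_copies, read_length, keep_probability, seed=0):
    """Cut several copies of a genome into fragments and keep only some of them."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reads = []
    for _ in range(num_copies):
        # Each copy is cut starting from a different random offset, so fragment
        # boundaries differ between copies and fragments from different copies overlap.
        start = rng.randrange(read_length)
        for pos in range(start, len(genome) - read_length + 1, read_length):
            # Like the pieces lost in the explosion, some fragments are never observed.
            if rng.random() < keep_probability:
                reads.append(genome[pos:pos + read_length])
    return reads

# Hypothetical toy genome; a real viral genome is tens of thousands of nucleotides long.
genome = "ATGCGTACGTTAGCCGATAGGCTTACGATCGATCGGATCCTAGCTAGGCTA"
reads = sample_reads(genome, num_copies=5, read_length=10, keep_probability=0.7)
for read in reads:
    print(read)
```

Because each copy is cut at a different offset, reads from different copies overlap one another, and those overlaps are what an assembler relies on.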
Given the sequences of our reads, much like in the newspaper problem, our goal is to use overlapping segments to reconstruct, or assemble, the original genome from which those reads came. This is called the genome assembly problem. For this analysis, we will be using SPAdes, a popular genome assembler from the Center for Algorithmic Biotechnology at St. Petersburg State University. You can follow the instructions in our course to see how to run SPAdes on a viral data set. In this course, we will only be teaching you how to use SPAdes. We will not be teaching you the algorithms behind genome assembly tools. If you're interested in learning about the assembly algorithms working behind the scenes, feel free to read the genome assembly chapter of the textbook Bioinformatics Algorithms: An Active Learning Approach.
When we introduced the genome assembly problem to you earlier, we slightly misled you. Specifically, we made it seem as though the output of a genome assembly tool is a single assembled genome. But in reality, the output is a little more complicated. In the newspaper problem, there were numerous potential issues that we could have faced. Some of the pieces of the newspaper might have been ambiguous. Some of the fragments that we obtained might have been unreadable, or perhaps some of those small fragments just flew away in the wind. It turns out that the genome assembly problem faces very similar issues. Our sequencing experiment may have failed to obtain sequences for all the chunks of the genome, and our sequencing technologies also have inherent errors. As a result, as opposed to outputting a single assembled genome sequence, assembly tools generally output contigs and scaffolds. A contig is a contiguous segment of the genome that the assembler is confident it assembled correctly from start to finish. A scaffold, on the other hand, is an ordered sequence of contigs with gaps between them.
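One way to picture the relationship between contigs and scaffolds is a short Python sketch. It assumes a common convention that this course has not stated: a scaffold is written as one sequence in which each gap between contigs is filled with a run of the letter N. The function name `split_scaffold` and the toy scaffold are hypothetical.

```python
import re

def split_scaffold(scaffold):
    """Return (contigs, gap_lengths) for a scaffold whose gaps are runs of 'N'.

    Assumption: every run of N marks a gap between two contigs.
    """
    # The pieces between runs of N are the contigs, in scaffold order.
    contigs = [piece for piece in re.split(r"N+", scaffold) if piece]
    # The length of each internal run of N is the estimated gap size.
    gap_lengths = [len(gap) for gap in re.findall(r"N+", scaffold.strip("N"))]
    return contigs, gap_lengths

# Hypothetical toy scaffold: three contigs separated by gaps of 5 and 3 nucleotides.
scaffold = "ACGTACGGT" + "N" * 5 + "TTGACCA" + "N" * 3 + "GGCATC"
contigs, gaps = split_scaffold(scaffold)
print(contigs)  # ['ACGTACGGT', 'TTGACCA', 'GGCATC']
print(gaps)     # [5, 3]
```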
The order of the contigs in the scaffolds of a correctly assembled genome corresponds to their relative order in the original genome. Existing assemblers are able to estimate the size of the gaps between the contigs of a scaffold.
So we've run our genome assembly tool, and now we have our contigs and our scaffolds. What do we do next? How can we assess the quality of our genome assembly? Let's imagine that these five colored line segments are contigs that we've assembled from some viral genome. Intuitively, a perfect assembly would have just a single contig whose length is the full length of the viral genome. It would cover the viral genome from start to end. Any deviation from this, meaning more and shorter contigs, would be worse. One intuitive metric for assessing assembly quality is the N50 metric. Let's define total contig length of t to be the sum of the lengths of all contigs with length greater than or equal to t. Using this definition, we can define the N50 metric to be the largest contig length t such that total contig length of t is at least 50% of the total length of all contigs. The 50 in N50 corresponds to 50% of the total length.
This probably still sounds complicated, so let's work through an example. Let's imagine that our five contigs have the following lengths: 70, 60, 30, 20, and 10. The total length, or the sum of the lengths of the contigs, would therefore be 190, and 50% of the total length would be 95. Let's first consider the longest contig, whose length is 70. Total contig length of 70 is just 70. This is less than 95, which means we haven't finished yet: the contig of length 70 alone wasn't enough. Let's consider the next contig now. This one has length 60, which means that total contig length of 60 is 60 plus 70, which is 130. This is greater than or equal to 95, which means we've succeeded: we've now reached at least 50% of the total length. Therefore, the N50 of this assembly is 60.
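The worked example above can be checked with a few lines of Python. This is a minimal sketch that follows the definition as we stated it; the function names `total_contig_length` and `n50` are our own and do not come from any assembly tool.

```python
def total_contig_length(contig_lengths, t):
    """Sum of the lengths of all contigs with length greater than or equal to t."""
    return sum(length for length in contig_lengths if length >= t)

def n50(contig_lengths):
    """Largest contig length t whose total contig length is at least half the total."""
    half = 0.5 * sum(contig_lengths)
    # Try candidate lengths from longest to shortest; the first one that works is N50.
    for t in sorted(contig_lengths, reverse=True):
        if total_contig_length(contig_lengths, t) >= half:
            return t
    return 0  # only reached when there are no contigs

contig_lengths = [70, 60, 30, 20, 10]
print(total_contig_length(contig_lengths, 70))  # 70
print(total_contig_length(contig_lengths, 60))  # 130
print(n50(contig_lengths))                      # 60
```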
When I computed N50, I defined the total length of my assembly to be the sum of the lengths of my contigs. What if, however, the true genome length were known? In this example, I've drawn the true genome as this green line. If we have an estimate of the true genome length, we can instead compute the NG50 metric. It's essentially the same as the N50 metric, but instead of comparing against the sum of the contig lengths, we compare against our estimated length of the genome.
Let's work through another example. Here, let's imagine that the true genome length is 270, so 50% of this is 135. The first contig is clearly too short. That was already the case in the N50 example, where the threshold was even smaller than this one. So let's jump straight to the second contig. The second contig has length 60, which means that total contig length of 60 is 60 plus 70, which is 130. This is less than 135, half of my genome length, so I'm not done yet. Let me now consider the next longest contig, which has length 30. Total contig length of 30 is 30 plus 60 plus 70, which is 160. This is greater than or equal to 135, so I'm done. This means that the NG50 of this assembly is 30.
For this analysis, we will be using a tool called QUAST to compute various quality metrics of our assembly. This is one example figure that QUAST can produce. It's called the contig length plot. The vertical axis depicts contig length, and the horizontal axis depicts the proportion of the total sum of the contig lengths. Here I've overlaid the contigs from the example assembly that we've been working with. Again, I've ordered the contigs in decreasing order of length from left to right. The height of each horizontal line segment is the length of the contig, and its width is the proportion of the total contig length that this contig takes up. In general, the larger the area under this plot, the better our assembly is.
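Both ideas from this part of the chapter, NG50 and the contig length plot, can be sketched in Python. The function names `ng50` and `contig_length_steps` are our own, and the second function only reproduces the plot as we described it in words; it is not QUAST's code and makes no claim about how QUAST draws its figures.

```python
def ng50(contig_lengths, genome_length):
    """Like N50, but the threshold is half of the estimated genome length."""
    half = 0.5 * genome_length
    running_total = 0
    # Add contigs from longest to shortest until half the genome is covered.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return None  # the contigs add up to less than half of the genome

def contig_length_steps(contig_lengths):
    """Return (left, right, height) for each horizontal segment of the plot.

    left and right are proportions of the total contig length; height is the
    length of the contig.
    """
    total = sum(contig_lengths)
    steps, left = [], 0.0
    for length in sorted(contig_lengths, reverse=True):
        right = left + length / total
        steps.append((left, right, length))
        left = right
    return steps

contig_lengths = [70, 60, 30, 20, 10]
print(ng50(contig_lengths, genome_length=270))  # 30
for left, right, height in contig_length_steps(contig_lengths):
    print(f"{left:.3f} to {right:.3f}: contig of length {height}")
```

Returning `None` when the contigs cover less than half of the estimated genome is a design choice in this sketch: in that situation no contig length satisfies the definition.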
You can follow the instructions in this course to learn how to run QUAST on our genome assembly.
So now we've assembled our viral genome, and we've assessed the quality of the assembly. Now, what do we do? How do we actually analyze it? How do we learn from it? As a thought experiment, pretend that you're one of the scientists on the first team that sequenced SARS-CoV-2. You've just sequenced the very first genome of this unknown disease-causing virus. A natural first question you would probably have is the following: out of all viruses that are currently known, which are most closely related to this new, unknown virus? To answer this question, you can use a tool called BLAST, which is effectively the Google search of bioinformatics. Given a query sequence, we can BLAST it to find the sequences in a known database that are most closely related to it.
Here's an example of what the results of a BLAST search would look like if we searched for the assembled SARS-CoV-2 genome. In order to better simulate what it would have looked like for the original researchers who first sequenced the SARS-CoV-2 genome, we've omitted any sequences in the nucleotide database that were obtained after December 2019. You'll notice that most of the top matches are to SARS coronaviruses in humans or to SARS-like coronaviruses in bats. This tells us that our newly assembled viral genome is most closely related to SARS.
We've come a long way since Dr. Lee first reported a patient with SARS-like symptoms. We've isolated the virus's genome, produced a high-quality assembly, and determined that it's highly similar to, but not the same as, the deadly virus that caused the SARS outbreak of 2003. What do we do next? How does this newly assembled genome help us analyze the outbreak? Indeed, many questions remain unanswered.
We just found the closest known relative of our virus, and BLAST was able to tell us how similar the two viruses are in terms of nucleotide sequence. But in what ways are they similar from a disease mechanism perspective? Or, perhaps more importantly, how are they different? At this point, we've assembled the genome sequence of SARS-CoV-2, but answering these questions will require us to look much more closely at the roughly 30,000 nucleotides of our assembly. By deciphering the information contained within this genome sequence, we will uncover the secrets behind how this virus invades our cells, and we will design a diagnostic test to aid public health officials in tracking the spread of the virus. We hope you'll join us in the next chapter of our journey.