So, in this section we are going to talk about genome assembly using read pairs. Pavel gave great lectures explaining how to construct de Bruijn graph from reads. And from the obtained de Bruijn graph he explained how to find an Eulerian path in this graph to identify the genome. But he forgot to tell you one important thing. There can be multiple Eulerian paths in the de Bruijn graph. Even in this single de Bruijn graph, there are two Eulerian paths in this graph. Here they are. And these two Eulerian paths correspond to two different strings. Which string are you going to give to biologists, the left one, or the right one, or both? Please note that we sequenced only one genome. Well, this ambiguity forces us to decompose the de Bruijn graph into multiple disconnected components. Each component corresponds to a non-branching path in the de Bruijn graph. Please note that while there are multiple Eulerian paths in the de Bruijn graph, all of them share the same set of non-branching paths shown here. These non-branching paths in their turn correspond to contigs, which are substrings in the genome. Well, ideally, biologists would like to have one contig for each assembly. Or at least, one should significantly reduce the number of contigs. To obtain this goal. they invented the technology called read pair sequencing that I'm going to talk about. Given multiple identical copies of the same genome, we now randomly cut these genomes into large, equally sized fragments of size insertLength. Now for each fragment we generate pair of reads from both ends of each fragment. So now instead of talking about reads individually, we are talking about pairs of reads separated by a distance called insert size. So, given two k-mers, if they are at a fixed distance d apart in the genome, we call them a paired k-mer. So the question is, how to construct the genome, not from the k-mer composition, like previously described, but from paired k-mer composition. But what is paired k-mer composition? Given a string, a paired k-mer composition of a string is a set of all paired k-mers generated from this string. Well, because we don't have space, let's represent a pair of k-mers as a two-line expression like this. Okay? We can order these paired k-mers in this order, which is the order that they were generated from the genome. But please note that the position information is not available. So a correct way to represent the paired k-mer composition should be in lexicographic order, like in a dictionary. So the question now is, can you construct the string given its paired k-mer composition. Or let me describe formally. Input is a collection of paired k-mers. Output is a string text such that the paired k-mer composition of text is equal to the collection of paired k-mers. How would you solve this? Let's ask de Bruijn how would he solve this problem. Well, I think that de Bruijn would came up with the same approach that I am going to describe right now, which is the paired de Bruijn graph. Given a genome and its paired k-mer composition, it's clear that the genome can be represented as a path, where the edges are labeled by paired k-mers, and the nodes can be labeled by pairs of (k-1)-mers. And this path, we call it genome path. But allow me to do one crazy operation with this genome path. Let me glue nodes with identical labels together. In this slide, we have two nodes with identical label, TG|AT. We glue them. We glue, and this graph is called the paired de Bruijn graph from the genome. Now, you may guess that the next question I'm going to ask is, can you construct the paired de Bruijn graph, not from the genome, but from the paired k-mer composition of the genome. Well, the answer is ..., let's see how we solve this problem. Let's represent each paired k-mer by an edge, from the paired prefix to the paired suffix. Now, let's do the same thing as in de Bruijn graphs. We glue nodes with identical labels together. Here we glue, we glue, continue gluing, continue, continue, continue. What is it? This is the genome path, as if the genome was available. But it's not all. We have something more to glue. TG|AT, TG|AT, they are two nodes with identical labels. We have to glue them, we glue. This is the paired de Bruijn graph from the read pairs. Okay, to recap, the paired de Bruijn graph for a collection of paired k-mers can be constructed by two operations. First, represent every paired k-mer as an edge between its paired prefix and paired suffix. Second, glue all nodes with identical labels. In the paired de Bruijn graph, the genome is an Eulerian walk in the paired de Bruijn graph. But what is the point? In the de Bruijn graph, we also have this property. The genome is also an Eulerian walk in the de Bruijn graph. But please note that in the paired de Bruijn graph, nodes are labeled by pairs of (k-1)-mers, while in the de Bruijn graph, nodes are labeled by just individual (k-1)-mers. Both of these graphs can be obtained by gluing operations on the genome path. But because of the difference in the labeling, paired de Bruijn graph is obtained by fewer gluings and that makes it simpler. In this case you can see that the paired de Bruijn graph is simpler than the de Bruijn graph. Indeed there's only one Eulerian path in the paired de Bruijn graph that allows us to uniquely construct the genome. While in the de Bruijn graph we cannot construct the genome uniquely. In practice, paired de Bruijn graph has been applied in assemblers and it has shown significant advantages in the resulting assembly.