We'll now return to the questions that we asked at the very beginning of this lesson. Starting from genomes, how can we transform them into synteny blocks? And the first thing I want you to learn is the notation of genomic dot plot. Let's arrange genome along x coordinate and along y coordinate. And let's represent every k-mer that is shed at position x and y in the genome as a dot, a position with coordinates x and y. In this case, I show you genomic dot plot for k-mer size equal to 3. And it only consists of the diagonal in the genomic dot plot. Of course, diagonal is always present in the genomic dot plot because k-mers at the coordinate x and x are always the same. But here's the genomic dot plot for k=2, and you can see that in this case the number of dots increased as compared to k=3. For example these 4 dots correspond to repeated occurrences of the 2-mer CT in the genome. So far, we talked about a genome consisting of one strand, but in reality of course genome consists of 2 strands, and therefore we should take care of both identical and reverse complementary k-mers while representing genomic dot plots, as shown here. In this case, red points correspond to identical k-mers, and blue points correspond to reverse complementary k-mers. Now this is how a genomic dot plot looks like when we compare genome against itself. But can we also construct genomic dot plots when comparing two different genomes? And here's an example of comparing one genome with another one that is obtained from the first one by a reversal. And in this case, you can see that a reversal corresponds to the -45 degree diagonal formed by blue points in this genomic dot plot. Now, from looking at the small string, let's now construct genomic dot plot for entire E. Coli and Salmonella genomes. And human eye immediately see synteny blocks in this genomic dot plot, they simply correspond to various diagonals. And we only pay attention to sufficiently long diagonal, because small diagonal, or small points, clusters of points may represent noise in this case. So we see five synteny blocks, five diagonal corresponding to a block marked asa a, b, c, d and e, and if we project these blocks on the x axis we will get an arrangement of synteny blocks in E. Coli. And if we project them on the y-axis then we'll get an arrangement of synteny blocks in salmonella. And now we are ready to analyze the 2-break distance between these two genomes. Therefore, construction of synteny blocks essentially amount to constructing diagonals in the genomic dot plot. And here is the finding synteny blocks problem. The input is the set of points in genomic dot plots in 2D, and the output is a set of diagonals in DotPlot representing synteny blocks. Does it make sense? Of course it doesn't make sense. Because I have not defined what diagonals. Look at this genomic dot plot and you can see the diagonals are not perfect because there are many insertions, deletions, mismatches and many other artifacts, and therefore human eye has the ability to somehow process this genomic dot plot into five diagonals, but we really do not know what our brain is doing while making this transformation. Can we come up with an algorithm for constructing these diagonals? Let's try. Let's look at the genomic dot plot between two different genomes that we constructed before. And let's represent every dot in the dot plot as a node in the graph. And we will form edges in this graph by simply connecting closely located points. For example, points that are located at a distance less than maxDistance, where maxDistance is a parameter. What will happen? We will connect certain points by edges, and you can see from here that diagonals actually correspond to connected components in this graph. Moreover, we probably should ignore very small connected components. For example, a component located in the upper-right (upper-left to the viewer) corner and low left corners (lower-right to the viewer), but the three remaining components should be accepted as synteny blocks in our genome. And therefore, I offer you the following algorithm for solving the problem. Let's form a graph whose node set is the set of points in DotPlot, and connect two nodes by an edge if the corresponding nodes are located at close distance from another. The connected components in the resulting graph define synteny block. And afterwards, delete small synteny blocks. Of course this is not the only algorithm that tries to model what the human eye is doing, while clustering points in 2D, and our brain is very good at clustering. I can give you another algorithm, for example this one that I call Amalgamate. And it actually different, but it looks like it's doing almost the same thing. First step, define each point in DotPlot as a separate block and iteratively amalgamate the resulting block. How do we amalgamate the blocks? We amalgamate two blocks and combine them into a single block if they contain two points that are separated by small distance in another genome, and as before delete small synteny blocks. So we have a problem and we have two algorithms for solving it. Which algorithm would you use? Instead of answering this question, you should ask me what problem I actually want you to solve, because I have never defined an algorithmic problem that I want you to solve. You should have refused coming up with an algorithm before actually seeing the problem I want to address. Unfortunately, in bioinformatics, it's not as simple. I have been working on the problem of synteny block generation for 10 years now, and I failed to formulate this problem as a rigorous algorithmic problem, and that's why we are at mercy of our intuition. And some people may like Synteny algorithm better. Other people may like Amalgamate algorithm better. How would we choose? The choice should be done based on the practical results. And if you implement these algorithms and run them, then you see that synteny algorithm gives a reasonable and biologically adequate result. But amalgamate algorithm, also, it looks reasonable, it looks like it models what human eye may be doing, gives completely wrong results. After developing the algorithm for constructing synteny blocks, we are now ready to find out what are the synteny blocks between human and mouse X chromosome. And here's a genomic dot plot for the X chromosome consisting of 25,000 dots. We form a synteny graph on these 25,000 nodes. And at this slide, you can see the resulting connected components. We further process of these connected components and represent them as a perfect +45 degree or -45 degree diagonals in 2-D. Where are the 11 synteny blocks between human and mouse that we have been considering in this lecture? Well if you project these 11 blocks in 2D to x-axis, you will get the arrangement of human synteny blocks. And if you project it to y-axis, then you will get the arrangement of mouse synteny blocks. Moreover, since we are not particularly interested in the scale of these blocks, and even if some blocks appear small, there are maybe hundreds of genes in each of the blocks. It's just an issue of scale. So let us now represent each block by the diagonal of the same size. And once again, if we project these blocks into 2D to the x axis, we will get an arrangement of synteny blocks in Phillip, and if we project it to y-axis, we will get the arrangement of synteny blocks in mouse. And now, after we constructed synteny blocks for two genomes, How about constructing synteny blocks for three genomes? You can think about generalizing how a synteny block construction approach for more than two genomes, but what about constructing synteny blocks for 100 genomes that are being sequenced now? This actually turns out to be a very difficult problem. And nobody in the world today is able to construct synteny blocks for 100 genomes. Maybe after this course you will suggest a reasonable approach for this. But an interesting thing is that if we learn how to construct synteny blocks for many genomes, we will be able to come up with a very accurate picture of chromosomal organization in our distant ancestor. So there is lot of work ahead in developing genome rearrangement and synteny block construction algorithms, but this is already beyond the scope of this course.