There are many different faces of sequence comparison in biology, and we will now learn some of them. We'll ask first the question what is wrong with the naive scoring models that we use for the longest common subsequence problem when we scored all matches by one. We saw that a biologically adequate alignment of adenylation domains constructed by Marahiel's, but this alignment is not The optimal longest common subsequence of this adenylation domains. At the bottom of this slide, there is actually a longest common subsequence. It has long matches, but it is biologically completely wrong. So the question arises, how can we modify it? The scoring for computing alignments essentially that we avoid frivolous matches, that you can see in the bottom alignment on this slide. In the current primitive scoring, we simply compute the score as the number of matches. Let's change it. And let's also take into account penalties for mismatches and insertion and deletions. So we give premium one to every match, and in our alignment game we now give penalty Mu to every mismatch and penalty Sigma to every indel. And as a result the score in our alignment game changes. Before it was 4, now it is -7. How to find an optimal solution of the alignment gain and optimal alignment under this model. In this case we essentially constructed the scoring matrix, which is a five by five matrix, which describes the score for matching every two symbols in the extended alphabet, which consists of nucleotides plus the space symbol. And we can design whatever arbitrary scoring matrices. For example, I designed an arbitrary matrix here and we can use it to play the alignment game. In fact, biologists invest a lot of effort into designing adequate scoring matrices. Particularly scoring matrices for amino acid sequences. And the goal of these scoring matrices is to reflect the mutation propensity of different amino acids. For example, amino acid Y often mutates into F and that's why it gets high score, +7, but rarely mutates in some other amino acids for example, prolyene, and in this case it gets actually penalty, -5. And this is an example of a scoring matrix that biologists use. Now, in the case we work with scoring matrices, how our dynamic programming recurrence change? So for the recurrence shown on this slide, we simply have the following recurrence, S(i,j) equals to four different possibilities depending on whether we are computing score for insertion, deletion, match or mismatch as shown on this slide and the scores of the edges in the alignment graph change accordingly as shown on this slide. Or alternatively, for a very general scoring matrix we simply can write a three term the currency where green, blue, and red alternatives correspond to vertical, horizontal, and diagonal edges. And the global alignment problem that we want to solve is the following one. Given strings v and w and a matrix score, we want to find an alignment of this string whose alignment score as defined by the scoring matrix is maximum among all possible alignments of these strings. The global alignment is a good model for some biological sequence comparison problems, but a bad model for some others. And I'll give you an example of Homeobox genes to illustrate the challenges of biological sequence comparison. Two genes in different species. Maybe similar over short conserved regions and dissimilar over remaining regions. For example, Homeobox genes have short regions called the homeodomain that is highly conserved among species varying from human to fly. But global alignment of homeobox genes would not reveal homeodomain, because it would most likely pass through a completely arbitrary region of the sequences, since homeodomains are short subsegments of homeobox genes. How can we find this important biological similarity that, however, do not expand over the entire length of sequences, and thus in the case of search for the short sequences, the global alignment fails. At this slide, you see two alignments. And the question arises, which alignment is better? The alignment on the top actually has a higher score. But the alignment at the bottom has lower score, but is more biologically relevant, because it shows a very strong match of short sequences. How can we find this alignment despite the fact that global alignment may miss and search for such short segments within sequences that exhibit similarity, is called "local alignment problem". So in this case, there are two possible alignments in the alignment graph. The alignment on the top is biologically correct, but the alignment in the middle is actually a random alignment that however has a higher score from the perspective of global alignment, and therefore it hides from us the biologically relevant alignment. So what I want to do, is to somehow find these short substrings of the entire strings, that exhibit high similarity. How do I do this? There is a very simple way to search for short similar strings, these are longer strings. You can simply try all possible pair of strings from two sequences and each such pair correspond to a rectangle in the alignment graph. Here's one of the rectangles. But there are so many such rectangles that this approach of course becomes impractical, since search for optimal global alignment within each smaller rectangle requires quadratic time, and therefore overall the running time will become very large. What can we do to come up with a practical local alignment algorithm? The first thing we need to do is to formulate the local alignment problem. The input is strings v and w, and a scoring matrix score, and the output is a substring of the entire strings v and w, whose global alignment, as defined by score, is maximal among all global alignments of all sub-strings of v and w. My proposal for solving this problem: let's introduce free taxi rides through the alignment graph. Indeed. If we were able to start from the source and travel freely to the start of the the conserved fragment, and then take another free taxi ride from the end of the conserved fragment. To the destination, final node of the alignment graph, then we will be able to score this interesting segment by taking zero cost of a taxi ride to the beginning of this fragment, then the real cost of the alignment of the fragment, and then plus another zero, which is the cost for another taxi ride. You may ask how is the world we can take taxi rides through the alignment graph. Well, the whole point of introducing this concept of Manhattan-kind cities and travelling in them, is that we are free to build whatever Manhattan-like grids. For solving hard biological problems. And in this case, what is a free taxi ride? It's simply any extra edges of weight zero to our alignment graph. And since we are free to build whatever Manhattan we want, we can, of course, we're at liberty of introducing this taxi ride. So let's see how our graph changes. What do we need to do to implement this free taxi ride? We need to add edges from the source to any other node and it will be roughly quadratic number of edges. We also need to add edges from every node to the sink, once again quadratic number of edges. So the number of edges in the graph remains quadratic, and therefore our algorithm will be fast. And in the end how our dynamic programming recurrence changes for the local alignment. Before we had three possibilities corresponding to three ways to enter a node. By a vertical edge, by a horizontal edge and by a diagonal edge. Now there is one more possibility. We can take a free taxi ride to the node, so now there are four possibilities for entering every node. Which means that we need to add the forth term in this recurrency which is the weight of an edge from (0,0) to (i,j) And the weight of this edge since our taxi rides are free is zero. And that's the only change that we need to implement to make our local alignment algorithm practical and fast. And we now move to the problem of defining adequate insertion and deletion penalties in sequence alignment.