Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. Let's continue our second week's topic: Sequence Alignment. Last time we have used an example to illustrate some basic ideas in pairwise sequence alignment. In this unit, we will discuss how to use algorithms to construct pairwise sequence alignments efficiently. First, let's clarify some definitions. Let me point out that the English verb "align" and the noun "alignment" are translated into the same Chinese word (“比对”). We will stick to this translation most of the time except for special situations when we will emphasize their differences and use "比对过程" and "比对结果" to denote "align" and "alignment", respectively. The input data forpairwise sequence alignment are two sequences S1 and S2. The parameter to specify is a scoring function f that quantifies the quality of an alignment. The output is the optimal alignment between the two sequences one that maximizes the scoring function. Now some of you, especially those whose background is mathematics or some other computational field, may have immediately realized a simple solution: enumeration! There are a finite number of possible alignments between two given sequences. The scoring function can assign a unique score to each alignment. Thus in principle we only need to find the alignment that has the maximum score. But how big is this finite number? Let's take a look at an example. Let's start with two sequences of equal length. As you can see, new alignments can be generated by inserting gaps of different lengths at any positions in the two sequences. The positions of the gaps uniquely specify the alignment. In the extreme case all the residues align to gaps, and the length of alignment is twice as long as that of the sequence Therefore, any alignment of two sequences with equal lengths of n can be generated by inserting gaps into n positions among all possible 2n positions. Thus, the total number of all possible alignments is equal to the number of all possible combinations of choosing n elements from 2n. By definition, this number is equal to the factorial of 2n divided by the square of the factorial of n. So enumeration seems feasible¬タᆭ at least theoretically. Because there are a finite number of possible alignments between two given sequences. As mentioned earlier, the scoring function can assign a unique score to each alignment. So enumeration seems feasible at least theoretically. Assume that both sequences have a length of 300. We fill them into this formula. Then it will be (2*300)!/(300!)^2 . the number of all possible alignments will be about 10^88 (10 raised to the 88th power)! How big is 10^88? It is about a hundred million (10^8) times as big as the total number of all atoms in the universe Therefore enumeration is only theoretically feasible but practically too slow to be feasible. We need to find faster algorithms that are feasible in real practice. So what could be a faster algorithm? From what we have discussed before, we could see that there are only two possibilities for a single residue to be aligned: the residue is aligned to another residue, or a gap. Only three possibilities for a pair of residues to be aligned: the two residues are aligned together, or one of them is aligned to a gap. Also from the previous unit you can see that The score of the final alignment is just the sum of all the alignment score for each residue. These bring us to an important conclusion that the best alignment up to a certain position is the best alignment for all previous residues plus the best alignment for the current position! Students who major in computer sciences might have already realized which algorithm can be applied here Yes, it's dynamic programming! Dynamic programming is a computer algorithm for optimization problems with the optimal substructure property The optimal substructure property means that the global optimal solution is a "combination" of local optimal solutions. In other words, locally best + locally best = globally best. More specifically, the basic idea of dynamic programming is to divide a complex big problem into many sub-problems which are solved one by one. The optimal solutions of the sub-problems are then combined into the optimal solution for the initial complex big problem. Still sounds a little too abstract? Don‘t worry. Next we will show you how dynamic programming can be applied to our sequence alignment problem. First let's divide the problem into sub-problems. As we mentioned earlier there are only three possible alignments for a given pair of residues. Therefore, we can get the local best alignment of a pair of residues simply by comparing the scores of these three alignments. Adding this to the best alignment produced prior to this position will give us the best alignment up to this position. In other words, locally best + locally best = globally best. Let's write down the formula. Let F(i,j) represent the score of the best alignment between the subsequence of the first sequence X from the first base to the i-th base and the subsequence of the second sequence Y from the first base to the j-th base. Then F(i,j) would be this formula. Here, the first line of the "max" formula means that residue Xi is aligned to residue Yj. The second line means that residue Xi is aligned to a gap, and that's why the leading "F" term is F(i-1,j). Because [X]i is aligned to a gap. Similarly, the third line means that Yj is aligned to a gap and the leading "F" term is F(i,j-1). You can see that the previous F is F(i,j-1). To simplify the problem let us assign the same penalty score for opening a gap and extending a gap. To make it easier to understand, let's put the sequences of X and Y on the x- and y-axis of a matrix. just like the matrix at the lower-left corner. The i and j in F(i,j) become the subscripts of the elements in this matrix. The recursion of this formula then becomes filling the blanks one by one from top to bottom and left to right. In fact, this matrix is called the dynamic programming matrix. OK, we have introduced the basic principles. Now let's give it a try! We will use nucleotide sequences for demonstration since the substitution matrix of nucleotides are much smaller and thus the search of a substitution score is much simpler than that of protein sequences. To save time we will use two very short sequences with a length of three and assign a penalty of five to both opening and extending a gap. We can see that following the formula above the score will be -20. OK, now let's put the two sequences into the first row and first column of the dynamic programming matrix. Let's fill the (0,0) cell in the upper-left corner with 0 according to our previous formula. Yes, the first line of the formula, F(0,0)=0. Then we will fill in the rest one by one. First, the leftmost column and the topmost row. Why both of them are filled with -5, -10, -15? Because these cells correspond to gaps with a length of 1, 2, and 3, respectively, and the gap penalty, d, is -5. So -5*1=-5 and -5*2=-10 . OK, now let's continue to fill in the cells at the center. This seems a little more complex. But don't worry, the principles stay the same. First let's look at the F(1,1) cell with the red circle. First as the formula implies, the score for this cell, i.e. F(1,1), is only related to its immediate left, upper, and upper-left cells (F(1,0), F(0,1), and F(0,0), respectively). Therefore, we could just ignore other cells for now. According to the formula, there are only three possible scores, which are: -5 + (-5) = -10 if coming from F(1,0), 0 + 2 (a substitution of A with A) = 2 if coming from F(0,0), and -5 + (-5) = -10 if coming from F(0,1), among which 2 is the largest. Therefore, we assign 2 as the score of F(1,1) cell, and add an arrow pointing to F(0,0) to indicate where the best alignment prior to the current position comes from. Now let's look at F(2,1). Why does it have two arrows? The reason is that the 2 + (-5) from F(1,1) is equal to the (-5) + 2 from F(1,0). Both are equally to -3 and larger than (-10) + (-5) = (-15) from F(2,0). That is why there is only one maximum score, -3, but two "sources"for getting this score. Now let's look at the endpoint of the recursion, F(3,3). Its score, -6, indicates the score of the best alignment for our two input sequences. The score of the best alignment is -6. The recursion is finished. Terrific! Now we only need to trace back from the endpoint (which is F(3,3)). We can get the final alignment by following the arrows one by one. Note that there are two arrows for F(2,1), and we need to trace them separately to get two alignments. The two alignments still differ in the align of the first A in the second sequence. So which is better? Which is the optimal alignment? Actually, both are just as good! They receive the same score (-6) according to the scoring function we have defined, so they are equivalent and both are the optimal alignments. Now we have finished a basic introduction of how to construct pairwise sequence alignment by dynamic programming. Here are some summary questions. Please note that these questions are not part of the assignment but we hope that you can spend some time thinking about them and discussing them with other students and TAs in the online forum. Thank you. Let's meet at the next unit!