So we will now talk about algorithmic thinking and how we apply it to our very important problem in, in biology called sequence alignment. So I am going to illustrate what's sequence alignment is and we will talk ab, a little briefly, about where we are heading with this problem in terms of algorithmic solution. So the issue of, or the problem of, of sequence alignment relates to the notion of evolution and evolution of sequences from a common ancestor. So let's just think about a case where, you know, we have two DNA sequences here that evolved for some time from a sequence that existed at some point in the past. Let's assume that we are looking at DNA sequences, and the sequence in the past was ACTAC. This was the sequence, for example, at the ancestor of human and chimp. Okay? This sequence is evolving. You know after human and chimp split from a common ancestor this sequence started evolving independently along each one of these branches or edges. Now what happens when the sequence is evolving over time, mutations happen. The mutations, there are two, at least two types of mutations at least the ones that we're going to be focusing on. One type of mutation is, is that, is the type of mutation that changes a letter. As this sequence is evolving over time, some letters change. Okay? They go from A to C to D to G or whatever. This is known as base pair substitution. A letter changes. Another type of mutation that happens is called insertion or deletion n ndil. In short, where some parts of the DNA can get deleted or parts of DNA can get inserted. So let me illustrate. Let me illustrate with this, with this example here suppose as this sequence was evolving the first letter changed to G mutated, and the second letter got lost, okay? So what, what happens here is that as this changes to G and this gets lost this sequence here becomes G D A C, okay? The A changed to G. The C got lost, it's a type of mutation. T A C, no changes happened to them, okay? For this sequence as it was evolving here along this branch towards the chimp species, for example, the third letter evolved to A and the fourth letter change to G. This becomes ACAGC. In this case here, no letters got deleted, no letters inserted, but these two letters changed. Okay? The A became a G. Now, as you know, we have access today with, with sequences I can easily get accesses to sequences from human geno like this. I can easily get access to DNA material to Chimp. I don’t have access to this. Okay? This existed millions of years ago, we don't have access to it. So our task in se, in the sequence alignment problem is that, presented with these two sequence and knowing that they came from some common answer here. I want to try and reverse engineer this process and know what our limitations or how these two sequences can have different sequence content, and different length, how they came from a common ancestor. Okay? So again to illustrate now when I have the problem, the data that we are presented basically have this now. These are the only things that we can access or observe data, okay? So now the problem becomes, that I have these two sequences X and Y and the output here I want them. An alignment of X and Y. Okay? Now what do I mean by an alignment? Okay? So an alignment of these two sequences basically means that I want to make them both of the same length. Okay? I want to put them on top of each other, that they are of the same length. Now, how can I make these two sequences of the same length. They are not, this has four letters, this has five letters. So we define one operation where you can insert in either of them or in both of them, dashes. Okay? So one way of aligning these two sequences, for example, is that, if we, let's call this x and this y. One possible alignment is, this is an alignment, because now I made the two sequences of equal length by adding a dash in the beginning. Of course there are many more alignments, right? I can do this. This is another alignment. And I can add more alignments because I can also insert dashes in the second one. I can say. G dash dash T A C. A C A dash. This is another alignment, okay? And, there are many more alignments, okay? This version of this alignment problem where I'm making the two sequences in there entirety of equal length. This is known as the global alignment problem. [SOUND] Okay? So this is, we are globally aligning both sequences. So I want to make, I have two sequences of the same length or of varying lengths, I want to make them of the same length by inserting dashes and then so I get something like this. Now, there are two questions that we need to answer here algorithmically or when part of algorithmic thinking. First of all, what defines an alignment? Okay, this is an alignment, this is an alignment, this is an alignment. Is the following an alignment? Is this an alignment? Of course, I added dashes to both of them. They are of the same length. This has 2, 4, 6, 2, 4, 6, they have the same length. Is this an alignment? And the answer is no, this is not an alignment, okay? So what is it that we didn't like about this and we like about these? The fact that there are two dashes here next to each other is something we don't want, we don't allow in an alignment, okay? So the first question we need to answer is that what defines an alignment and what doesn't define an alignment. And the second question is that once I have set these rules of feasibility, or criteria of feasibility, what defines an alignment, how do I find it? Okay, these are all feasible alignments, based on the set of rules that I will define. Why should I, should I define, should I give the biologist this alignment, or this, or this, or which one of the many more alignments that exist? Okay? So we have to talk about two parts to the problem. One if feasibility. What defines a feasible alignment? For example this is feasible alignment, this is not feasible. The second one is optimality. What defines an optimal alignment or an alignment that is better under some criterion than another one? Okay? So when we talk about, I will talk first about feasibility. What defines an alignment? To define an alignment, global alignment, we have three conditions. First is that for an, for, for, for something to be an alignment the two sequences have to be of exactly the same length. This satisfies it, this satisfies it, this satisfies it. Of course this satisfied it as well, okay? So the first condition is that they have to be of the same length, okay? The second condition is that we don't allow a dash in one to go with a dash in the second. This, this and this, they satisfied that condition, but this one does not satisfy this condition because we put the dash here. In both of them, okay? So the, the first one, they are of the same length. The second one, we don't allow a dash in both of them at the same position. The third one, if I remove the dashes from the sequences in the alignment, I should get back the sequences that I started with. So if you remove the dash from here, you'll get back to GTAC, which is this. If you remove the dashes from here you go back to this. The same thing here, the same thing here, you remove these 2 dashes, you go back to GTAC, remove the dash from here you go back to the original sequence. Okay? So, for, for, for two sequences to be in, X point and Y point to be in alignment with X and Y,. They have to satisfy three conditions. The first one is that they are of the same length. The second one is that we don't have dashes in the same position. The third one is that if I removed the dashes I get back my sequences. Okay? Now that we have defined feasibility, you can see that I've already listed three possibilities here but I can list more and more of these. Okay? So you will see that the problem becomes very hard, is that how do I go through all of them and how do I present them to a biologist and say, here's a list of all feasible alignment, make your choice. It doesn't make sense. So the second thing is that we have to mathematically define a criterion that says, this is better than this, for example, so that when I give a solution to the biologist, I say, under this criteria, under this mathematical definition, this is most often more than that. Okay? To do that, we have to define, what we call, a, a, a scoring scheme, or a scoring matrix. What is a scoring scheme or scoring matrix? Let me illustrate here, we can leave these two alignments here. And in the case of D and A, because we have four letters, notice what happens in an alignment. Either a D and A letter goes with a D and A letter or a D and A letter goes with a dash, right? Dash in the first sequence or dash in the second sequence. So, to define I will try to give a score to each one of these configuration. Either I have a nucleotide with nucleotide or a dash with nucleotide, or nucleotide with a dash, and so on. The easiest way to define this is by defining what we call a scoring matrix. Since we are dealing with DNA, four letters, plus the dash, that's five, it becomes a 5 by 5 matrix. And the 5 by 5 matrix, basically corresponds to the DNA letters and the dash. So at every entry in this scoring matrix that I'm going to call M is going to be the, the score, of matching two letters, okay? For example, if I said ten here it means that every time I match A in one sequence within another sequence is going to be of score ten. If I put, for example, dash with C here, for example minus 4, it means that if I put a dash in the first sequence, with a C in the second this is going to have a score of minus four. And so on. So I can fill up this matrix, for example, like this and, and, and every time I have a dash I'm going to put minus 4, and every time I have a mismatch, a letter goes with another letter but of different type I'm going to give it, for example, score of 4. Okay? So in this case, and suppose that the matrix is symmetric, so, this is going to be 4, and this going to be 4 and so on. In this case, now I can define the score of an alignment, by going through the alignment on position at a time, and for every position I look up the score of that. Configuration from the table. So here I have G goes with an A. What is the score of G going with an A? It's G with A. It's this one. It's four. The second one is a dash with a C. Dash with a C, it's minus 4. So this is minus 4. T with A, it's 4. A with G is 4. And C with C is 10. Okay? So I look at these, the sum of these scores, 10, 14, 18, 14, 18, I would say that the score of this is 18. I can repeat the process with this, here. This is G with A has a score of 4. Dash with C is minus 4. Dash with A is minus 4. D with dash is minus 4. A with G is 4. C with C is 10. So if we have 10 here, we'll say 10, 14 ,10, 6, 2, 6. Right? So the score of this is going to be 6, okay? Now, if, since we are defining a score here, it will make sense to say that I will be looking for or seeking the alignment that has the better score. So in this, from, between these two, I will be basically saying this alignment has a higher score than this alignment, and that this scoring scheme that I defined by this matrix therefore I will be returning, between these two I'll be returning this alignment to the biologist. Okay? So now we defined this criteria that tells us what is a feasible alignment. Under this criteria, we can have many possible feasible alignments. Second thing, we define a scoring scheme that's given by this matrix. So that I can score every matrix, what is, what score it has, every, every alignment, what score it has, then out of all the possible feasible alignments, I will choose the one that has the highest score. Okay? Now of course you will see that it's not possible, it's not feasible in practice to list every possible feasible alignment, score them, and then return the one with the highest score, which would be a brute force algorithm. Instead, we can use an algorithmic technique that we have learned, which is dynamic programming, that will compute the optimal alignment and it's score, in, in a very efficient amount time. Okay?