So now we have a really efficient method of doing multiple pattern matching using the Burrows-Wheeler transform but I want us to recall that finding exact pattern matches is not what our original goal was. We want to find places where your individual genome has variants when compared to the reference genome. So we need to look for inexact matching in order to find these variants. So if we have a single pattern representing a single read, then we want to find, all positions in the genome where this pattern occurs with say at most d mismatches. And this is a problem that we saw at the first part of the course, the approximate pattern matching problem. Now, again, we simply want to generalize this to the case where we have multiple patterns, a collection of reads, and we want to match each of these against the genome to find where they occur with at most d mismatches. There are going to be two methods I'll show you, to show you kind of the highlights of how you could extend the methods that we already have for exact pattern matching towards approximate pattern matching. So the first one is say that we ha, know that, a pattern appears in a genome with exactly one mismatch. So here's a pattern in a hypothetical genome and here's a mismatch. If you divide the pattern into two pieces, then notice that one side is going to have to match the genome exactly, so here the mismatch was on the left side of the partition and to the right side has to match exactly. If the mismatch were on the right side, the left side would have to match exactly. In general if you have a pattern that occurs in the genome with say d mismatches, then you're going to be able to divide that pattern into d plus 1, approximately equal pieces and then you're going to be able to find at least one of those pieces that matches the genome exactly there. So to give you an example, here's a, a string of, I just wrote X's, of length 35, and say that we have six mismatches with the genome. So, I've labelled just random locations where it could mismatch, and if you partition this into seven pieces. then, we have six mismatches, partition these strings into seven pieces of equal length here, it would be length five. And then notice, then, that, you have a mismatch in every one of these windows except for this one, okay. And, so here we're going to have an exact match with the genome. And so this is going to be generalizable. So that whenever you have d mismatches, if you divide this pattern into d plus one pieces, you'll have at least one of these smaller strings, exactly match the genome. So, this sets up a method called seeding. If you're looking for d mismatches, you can divide each of your pattern strings into d plus one, smaller, approximately equal pieces and these pieces are called seeds. And then you look for exact matches of those seeds with your genome. Once you find those exact matches, it narrows down what you have to consider a lot, so you can throw out a lot of your patterns and then, look for inexact matches based off of those seed matches. So, if you say we did find a seed match. Well, let's maybe look at that location and see, and count the number of, of inexact matches or mismatches that we see. So that's really just a very straightforward way of thinking about, pattern matching. Another way would be to integrate a direct Burrows-Wheeler approach. So I'll show you an example of how this might work. Recall that when we, when we looked at pattern matching before with Burrows-Wheeler, we search for ana in Panama bananas. And so we started by saying, let's walk backwards through this pattern. So we started with finding the a's and we walked backwards to the other column of the Burrows-Wheeler matrix. And before we said, okay we could find three n's, so we were walking backward in the pattern. So we had an a and then an n and we would immediately eliminate the m, the p and the b because they didn't match the n that we were looking for. So, if we were allowing, say, one mismatch, or at most one mismatch, then we need to keep these red letters around, all right? We need to hold on to them because it may be the case that we can continue extending, and get, exact matches everywhere else and then we would have an inexact match with one mismatch. So what we'll need to do then is we'll need to keep track of the number of mismatches that we've encountered along the way. So here we've encountered zero mismatches, when we've matched these n's, because we have two matched symbols. Here we've encountered one mismatched set, the m, the p and the b. And then because we haven't, exceeded one mismatch, we'll need to extend all these strings. Okay? So, we'll use the first last property and take all of, find where each of these elements occurs back in the first column. And then we'll say, well, we know that the a is going to be there next. Okay? Now we carry the mismatches down and we see, okay, we have one mismatch here, here, and here. And then we need to walk backwards again towards the next symbol in the pattern string ana. Here we see that we have five occurrences of a, all right? So we're able to extend each of these five, pattern candidates backwards and, and have a match of a. In this case we walk backwards to the last column of the Burrows-Wheeler matrix and we hit the dollar sign, so that says that we've essentially wrapped around the string or you can think of it as a mismatch. But we've wrapped around the string and so now we need to eliminate, this, okay. So, we now have, when we go, use the first last property and go backwards with these five matches we get five inexact matches with, at most, one mismatch in our string. So, we've kept track that we have one mismatch here, one mismatch here, and then we have the three perfect matches that we found before. So then, we say where are they? Well, we use a suffix array or a partial suffix array, we already know how to find where these mismatch, where these, inexact matches occurred, we know that they occurred at position, starting position five. That's, a match, one mismatch. Starting position three we can find ama which has one mismatch and then starting positions one, seven and nine give us the exact, matches that we saw before. So, just being able to solve the exact multiple pattern matching problem is going to be able to give us. We now have this huge arsenal of power using Burrowsâˆ’Wheeler that we're able to easily step forward and deal with inexact matches now.