So Sean talked about this crazy method of string compression called Burrows-Wheeler that's going to let us convert the repeats of a genome into runs, so that we can then apply some other compression like run-length encoding. But say we do this operation and compress the genome, and it's stored as a smaller file somewhere. When we download this genome and want to work with it, we need some way of decompressing it, some way of getting the original genome back. Now, if we apply run-length encoding, for example, it's easy to decompress: if we see 14g, we can easily turn that into 14 occurrences of g. But what about going the other direction? What about taking the Burrows-Wheeler transform of some genome and reconstructing the original genome? Is that even possible to do? You can think about this for a second, maybe pause and come back. My first thought would be no: it doesn't seem obvious how you would go backwards given this Burrows-Wheeler transformed string. Let's see, though, if we can do it. Say that we want to reconstruct the original genome banana. I'll use a much smaller example than panamabananas, and here's the Burrows-Wheeler transform of banana: it's annb$aa. Now, we're assuming that we have this string, but we don't know the original string banana. How do we reconstruct it? Notice that if you know the last column of the Burrows-Wheeler matrix, then you get the first column for free. The reason is that the rows are cyclic rotations of banana that have been sorted lexicographically, so the first column is just the last column sorted. Sort the last column and it gives you the first column: $, three occurrences of a, b, two occurrences of n. So we get another column for free.
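Both decompression steps mentioned so far can be sketched in a few lines of Python. This is my own illustration, not code from the lecture: decoding a run-length pair like 14g, and recovering the first column of the Burrows-Wheeler matrix by sorting the last column.

```python
# Sketch (my own code, not the lecture's): run-length decoding, and
# recovering the first column of the Burrows-Wheeler matrix.

# Run-length decoding: "14g" means 14 occurrences of "g".
count, symbol = 14, "g"
decoded = symbol * count
print(decoded)  # -> "gggggggggggggg"

# The rows of the Burrows-Wheeler matrix are sorted cyclic rotations,
# so the first column is just the last column, sorted.
last_column = "annb$aa"                      # BWT of "banana$"
first_column = "".join(sorted(last_column))  # "$" sorts before letters
print(first_column)  # -> "$aaabnn"
```

Here sorting relies on $ comparing less than every letter, which matches the lecture's convention that $ is lexicographically first.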
The next thing I want you to notice is that if we have the first column and the last column, then, since each of these rows is a cyclic rotation of banana, this a occurs directly before the $ in that cyclic rotation of banana, this n occurs directly before the a in that cyclic rotation, and so on. You can see this from the matrix that's grayed out: in banana$, the symbol after banana is $; in $banana, the symbol after $banan is a. So what this tells us is that we get the 2-mers of the cyclic string banana for free. If we look at the 2-mer composition of these cyclic rotations, here we get a$, here we get na, na, and so on. So we get the 2-mer composition of this circular string. If we then sort this 2-mer composition, we get the strings shown on the right, and these give us the first two columns of the Burrows-Wheeler matrix. The reason is that the rows of the Burrows-Wheeler matrix are sorted lexicographically, so if you sort the 2-mers occurring in banana lexicographically, you should get the first two columns here, and I'll highlight that: the first two columns are $b, a$, and so on, which matches the sorted 2-mer composition perfectly. Now we've got three columns. Can we get a fourth? We apply this exact same idea. We say that a must occur directly before $b in its cyclic rotation, so that gives us a$b; n must occur directly before a$, so we know that one of the 3-mers is na$; and so on. So we get the 3-mer composition of the circular string banana, and we can sort that 3-mer composition to get a sorted collection of 3-mers. Because they're sorted, we know they must correspond to the first three columns of the Burrows-Wheeler matrix. We go again: we know that these 4-mers are a$ba, na$b, and so on. We get the 4-mer composition, we sort that 4-mer composition, and that gives us the first four columns.
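Carried to completion, this prepend-then-sort idea reconstructs the whole matrix. Here's a minimal Python sketch of that naive procedure (my own code, not from the lecture, assuming the string ends in $):

```python
# Naive Burrows-Wheeler inversion: repeatedly prepend the last column to
# the sorted partial rows, then re-sort. Quadratic in time and memory,
# since it materializes the full matrix.

def invert_bwt_naive(last_column):
    n = len(last_column)
    rows = [""] * n
    for _ in range(n):
        # Each last-column symbol directly precedes its row's current prefix.
        rows = sorted(last_column[i] + rows[i] for i in range(n))
    # The top row starts with "$"; the original string follows it.
    top = rows[0]
    return top[1:] + "$"

print(invert_bwt_naive("annb$aa"))  # -> "banana$"
```

After one pass, rows holds the sorted 1-mers (the first column); after two, the sorted 2-mers; and so on, exactly mirroring the walkthrough above.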
We use those four columns together with the final column to get the 5-mer composition of the circular string banana. We sort those 5-mers, which gives us the first five columns, and then we do this one last time to get the first six columns of the Burrows-Wheeler matrix. We take all 6-mers (so a comes before $bana, giving a$bana, and so on), sort those 6-mers, and that gives us the first six columns. And lo and behold, we've managed to reconstruct the entire matrix. Now that we know the entire matrix, we just say: the $ always appears in the top left, because we assume the $ is lexicographically first. Everything after the $ in that row must be our original string, because the $ ends the string. So the top row, when you spell it out, gives us banana, and we're done. We've managed to invert: we started with just the final column and were able to reconstruct the entire matrix from knowing only the last column. The issue, though, and I know what Sean is going to complain about, is that we had to reconstruct the entire matrix. The matrix is the length of the genome in one direction times the length of the genome in the other direction, so it's quadratic in the length of the genome, and building this entire huge matrix just to reconstruct the original genome can get us into memory trouble. So the question is: can we invert Burrows-Wheeler, can we decompress, using less space? Let's make a strange observation. Again, remember that if we start with the last column, we get the first column for free. So assume that we have those two columns, but we haven't reconstructed the internal parts of the matrix. Now, I've highlighted the six occurrences of a in the first column and the six occurrences of a in the last column.
And I've highlighted where those occur in this sort of cyclic representation of panamabananas. If we look at the first a in the first column and the first a in the last column, you'll notice that these two a's correspond to the same position in the string panamabananas: it's the third a in panamabananas. Notice that this is the first a, second a, third a; first a, second a, third a. So this position and this position are the same. The second occurrence of a in the first column and the second occurrence of a in the last column both correspond to the second a in the string: we have pana, pana, and so I've highlighted that a here. We can go through and notice the same thing each time: the third occurrence of a in the first column and the third occurrence of a in the last column both correspond to the same position in the string, and so on. So we get all six positions, and it's the same. We can look at the n's too; it's not something particular to a, it holds for any symbol. If we look at the n's, the first appearance of n in the last column and the first appearance of n in the first column both correspond to this n, and you can verify for the other two that this property appears to hold. So the question is: is this true in general? Remember, what we really want is a faster way of inverting Burrows-Wheeler, but for now I just want to know whether this property holds in general, because it's interesting. So let's go back to our Burrows-Wheeler matrix for panamabananas. Remember that the rows are sorted. So if we look at the six rows that start with an a, they're sorted as well, and I've numbered them off. If we chop off the a, the rows are still sorted: because they all start with a, the remainders of the strings are still sorted.
Now, if we add that a back on the end, those strings should still be sorted, because all we did was add an a to the end of all of them. So we've essentially done a cyclic rotation of our six original strings, where we just took the a and put it on the end, and those strings remain sorted. What that means is that if we look for those cyclic rotations back in the original Burrows-Wheeler matrix, wherever they wind up, they should still maintain the same ordering. And that's what says that if this occurrence of a is here, it has to correspond to the same position in the original string as this occurrence of a. So these correspond one to one, two to two, and so on. How can we state this in general? We say that the k-th occurrence of any symbol in the first column and the k-th occurrence of that symbol in the last column of the Burrows-Wheeler matrix correspond to the exact same position of that symbol in the original genome. So you can index the occurrences of each symbol in the first and last columns: you can index the six a's here and the six a's here, and they correspond to the same positions. The only other interesting symbol here is n, and we mentioned this before, but we can index n1, n2, n3 in the first column, and they correspond to n1, n2, and n3 in the last column. So that's, I think, a cool proof and an interesting fact. But as I said, the question is: what does it have to do with Burrows-Wheeler decompression? Let's see. If we start at the $ in the top left of the Burrows-Wheeler matrix, then walking backwards in the string (assume for a second that we know panamabananas, represented this way) corresponds to walking backwards one position here and winding up in the last column. So we know automatically that the position before the $ in the original string is an s.
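The first-last property is also easy to check empirically. Here's a small Python sketch (my own, not the lecture's code) that verifies it for panamabananas$:

```python
# Check the first-last property: the k-th occurrence of a symbol in the
# first column and the k-th occurrence of that symbol in the last column
# refer to the same position of that symbol in the original text.

text = "panamabananas$"
n = len(text)

# Row r of the Burrows-Wheeler matrix is the rotation starting at starts[r].
starts = sorted(range(n), key=lambda i: text[i:] + text[:i])
first = [text[i] for i in starts]            # first column
last = [text[(i - 1) % n] for i in starts]   # last column (the BWT)

def with_ranks(column):
    """Label each symbol with its occurrence number within the column."""
    seen = {}
    labeled = []
    for c in column:
        seen[c] = seen.get(c, 0) + 1
        labeled.append((c, seen[c]))         # e.g. ('a', 3) = third 'a'
    return labeled

# Map each (symbol, rank) pair to the text position it represents:
# row r's first symbol sits at position starts[r] in the text, and its
# last symbol sits at position starts[r] - 1 (cyclically).
first_pos = {lab: starts[r] for r, lab in enumerate(with_ranks(first))}
last_pos = {lab: (starts[r] - 1) % n for r, lab in enumerate(with_ranks(last))}

assert first_pos == last_pos  # same (symbol, rank) -> same text position
print("first-last property holds for", text)
```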
By this first-last property that we've just shown, we know that that s has the same position as this s. In this case there's only one s, but if there were 190 s's, we would still know that this s and this s have the same position. So this s corresponds to that s as well. Now we can walk backwards from this s. We know that in the original string, if we go backwards in this cyclic rotation of it, we'll hit the last-column element here, and that's an a. We've walked backwards from the s, wound up on the other side, and reconstructed an a before the s. We apply the first-last property again and say that this is the sixth occurrence of a. And here you see why the property is useful: without the first-last property, we wouldn't know which of these a's has the same position as a6 over here, but because of it, we can go straight to this a and say that it has the same position in the original string. Once we know that, we walk backwards one step around the cyclic rotation and say that the previous element in this cyclic string is an n. So we walk through this procedure one step at a time: use the first-last property, then walk back a step; use the first-last property, then walk back a step; and so on. We reconstruct the original genome in reverse, and if you follow this through, we end up with panamabananas. Again, we know that the $ ends the string, so we read it that way. So we've now reconstructed our original string panamabananas using only the first column and the last column, without needing to reconstruct any of the internal elements of the matrix. Since we've used two columns of the matrix, we've used memory approximately equal to two times the length of the genome, so we're now at big-O of the genome length for decompressing the Burrows-Wheeler transform, and this is great. But I know what Sean is going to ask, which is: how does this relate to pattern matching?
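The backward walk just described needs only the two columns. Here's a Python sketch of it (my own code; the lecture doesn't give an implementation), using occurrence ranks to jump from the last column back to the first:

```python
# Linear-space Burrows-Wheeler inversion: walk backwards from the "$" row,
# using the first-last property to jump from the k-th occurrence of a
# symbol in the last column to its k-th occurrence in the first column.

def invert_bwt(last_column):
    n = len(last_column)
    first_column = sorted(last_column)

    # rank[i]: which occurrence (0-based) of last_column[i] row i holds.
    counts, rank = {}, []
    for c in last_column:
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1

    # first_row[(c, k)]: row of the k-th occurrence of c in the first column.
    first_row, seen = {}, {}
    for row, c in enumerate(first_column):
        first_row[(c, seen.get(c, 0))] = row
        seen[c] = seen.get(c, 0) + 1

    # Start at row 0 (the row beginning with "$") and walk backwards.
    row, out = 0, []
    for _ in range(n):
        c = last_column[row]           # symbol preceding the current one
        out.append(c)
        row = first_row[(c, rank[row])]
    rotation = "".join(reversed(out))  # a rotation starting with "$"
    return rotation[1:] + "$"          # put "$" back at the end

print(invert_bwt("annb$aa"))  # -> "banana$"
```

Everything here is linear in the genome length: the two columns, the rank array, and the occurrence index, matching the two-columns-of-memory argument above.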
And I'm curious about it too. The Burrows-Wheeler transform is amazing stuff, and I think from a mathematical perspective it's really cool, but what does it have to do with pattern matching? How could we possibly use it for pattern matching, to try to get the constants down in terms of storing efficient data structures for pattern matching? So hopefully, Sean will be able to tell us about that.