0:00
Finally, we're going to look at suffix arrays: string processing using a data structure that has played a really important role in enabling string-processing applications that would not otherwise be possible.
To get a feeling for the idea, we're going to look at a really old idea that you're actually familiar with, called keyword-in-context search.
The idea is that you're given a huge text, and what you want to do is preprocess it to enable fast substring search. That is, you want a client to be able to give a query string, and then you want to give all occurrences of that query string in context. So, if you look for the word "search", it will give all the occurrences of where the word "search" appears in context, and the same for any other query.
That's a very common operation, one you're certainly familiar with from web searching in your browser, and there are many other applications. This is a pretty old idea that dates back to the late '50s and early '60s; people have always wanted to do this. And there's an easy way to approach it, called suffix sorting. The idea is that you take your input string and form its suffixes. Remember, when I talked at the beginning about Java's string data type, you can get this done in linear time, because each suffix is basically a pointer back into the input string. The i-th suffix, remember, just starts at character i and takes the rest of the string. What this does is give sort keys that contain pieces of the string itself in context, and all we do is run a sort on those suffixes.
And what that sort does, if you do a search, is bring the things that you're searching for close together. Once you've done that suffix sort, you can use binary search to find all occurrences of the query string. That's the basic idea of keyword-in-context search: you suffix-sort the text, then do a binary search to find the query you're looking for, and then scan from wherever the binary search ends up. So these are all the places where the word "search" occurs in the text of A Tale of Two Cities, and you can use this index to print out the context wherever it's needed. It's a fine and effective way of solving this important practical problem.
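The pipeline just described (suffix-sort the text, binary search for the query, then scan) can be sketched in Java roughly as follows. This is an illustrative sketch, not the lecture's slide code; the class and method names (KWIC, rank, occurrences) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of keyword-in-context search: suffix-sort the text, then use
// binary search over the sorted suffixes to locate every occurrence.
class KWIC {
    private final String text;
    private final Integer[] index;  // index[i] = start of the i-th smallest suffix

    KWIC(String text) {
        this.text = text;
        int n = text.length();
        index = new Integer[n];
        for (int i = 0; i < n; i++) index[i] = i;
        // Each suffix is represented by its start index; comparing two
        // entries compares the text from those positions onward.
        Arrays.sort(index, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // Number of sorted suffixes strictly smaller than the query.
    int rank(String query) {
        int lo = 0, hi = index.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (text.substring(index[mid]).compareTo(query) < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return lo;
    }

    // All positions in the text where the query occurs: the suffixes that
    // start with the query form a contiguous block beginning at rank(query).
    List<Integer> occurrences(String query) {
        List<Integer> hits = new ArrayList<>();
        for (int i = rank(query); i < index.length; i++) {
            if (!text.startsWith(query, index[i])) break;  // past the matches
            hits.add(index[i]);
        }
        return hits;
    }
}
```

Note that the sort here makes full substring copies on every compare, so it's fine for illustration but far from the linear-time construction the lecture refers to.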
And that's interesting, but I want to talk about another really important problem that has hugely important applications: the longest repeated substring problem. You're given a string of characters; find the longest repeated substring. In this case, the example is genomic data, and there's the longest repeated substring. Now, in scientific data, a long repeated substring is often something that scientists are looking for, and these strings are huge; there might be billions of characters. So it's really important not only to know that you can find long repeated substrings efficiently, but also that you can do it for huge, huge strings.
Another example is cryptanalysis, where a long repeat in a file that's supposed to be an encrypted, random-looking file indicates that somebody did something wrong; it might be the key to breaking the code. Another example is data compression: when you've got a file with a lot of repeated material in it, you might want to do this operation to take advantage of the long repeats and not keep multiple copies of them. Here's another example, for studying or visualizing repetitions in music. In this example, every time there's a repeat of a passage of notes, an arch is drawn to visualize the repeat. The arch is thick if the repeated passage is long, and it's high if the repeats are far apart. This is an interesting way to analyze repetitions in music.
So, how are we going to solve this problem? It's a very simple problem to state: given a sequence of N characters, find the longest repeated substring. As with many problems, there's an easy brute-force algorithm that at least gives us an idea of what the computation is like, but it's not going to be useful for huge strings. You try all possibilities, all pairs of indices i and j, and then just compute the longest common prefix for each pair. The problem with that is that if N is the length of the string, you've got about N²/2 pairs, and for every one of those pairs you might have to match up to D characters, where D is the length of the longest match. So it's at least quadratic time in the length of the string, and you're not going to be able to use that for strings that are billions of characters long.
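The brute-force scheme just described can be sketched like this; it's an illustrative sketch with made-up names, where lcp computes the longest common prefix of the two suffixes starting at i and j:

```java
// Brute-force longest repeated substring: try every pair of starting
// positions and extend each match as far as it goes. Roughly N^2/2 pairs,
// each possibly matched up to length D, so hopeless for huge strings.
class BruteLRS {
    // Length of the longest common prefix of the suffixes starting at i and j.
    private static int lcp(String s, int i, int j) {
        int k = 0;
        while (i + k < s.length() && j + k < s.length()
                && s.charAt(i + k) == s.charAt(j + k)) k++;
        return k;
    }

    static String lrs(String s) {
        String best = "";
        for (int i = 0; i < s.length(); i++)
            for (int j = i + 1; j < s.length(); j++) {
                int len = lcp(s, i, j);
                if (len > best.length()) best = s.substring(i, i + len);
            }
        return best;
    }
}
```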
Another way to solve the longest repeated substring problem is to use a suffix sort, just as we talked about before. We take our input string, form the suffixes, and then sort the suffixes, and that brings the long repeated substrings together. That's a quite elegant and efficient solution to this problem: build the suffix array (that's a linear-time process), do the sort (we just saw we can do that with N log N character compares), and then go through and find the repeated substrings. It's just one pass through, checking adjacent suffixes to see which pair's common prefix is the longest. And this is very easy to code up.
Here's the Java code for computing the longest repeated substring. We get the length of our string, then build our suffix array; remember, that's linear time and space, because Java's string implementation allows us to do substring in constant time. Then we go ahead and sort the suffixes, and then find the longest common prefix between adjacent suffixes in sorted order, keeping track of the max; that's the longest repeated substring.
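The slide code itself isn't reproduced in this transcript, but the algorithm as described (build the suffixes, sort them, then scan adjacent pairs for the longest common prefix) looks roughly like this sketch. One caveat: the constant-time substring relied on here was true of older Java; since Java 7, substring copies characters, so building the suffixes this way is no longer linear time and space.

```java
import java.util.Arrays;

// Longest repeated substring via suffix sorting: sorting brings suffixes
// with long common prefixes next to each other, so one pass over adjacent
// pairs finds the longest repeat.
class LRS {
    static String lrs(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++)
            suffixes[i] = s.substring(i);  // the i-th suffix
        Arrays.sort(suffixes);
        String best = "";
        for (int i = 0; i < n - 1; i++) {
            int len = lcp(suffixes[i], suffixes[i + 1]);
            if (len > best.length()) best = suffixes[i].substring(0, len);
        }
        return best;
    }

    // Length of the longest common prefix of two strings.
    private static int lcp(String a, String b) {
        int n = Math.min(a.length(), b.length()), k = 0;
        while (k < n && a.charAt(k) == b.charAt(k)) k++;
        return k;
    }
}
```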
So, for example, this sentence, about "such a funny, sporty, gamy, jesty, joky, hoky-poky lad", is the longest repeated substring in the text of Melville's Moby-Dick. And we can run that one to show that we've got a linearithmic algorithm, because there are a lot of characters in that text, and you're not going to find this without a good algorithm. And this is just a humorous take on what we've talked about today: if you have five scientists looking for a long repeated substring in a genome with a billion nucleotides, there are plenty of scientists who are not aware of important algorithms like the ones we are talking about, and the one who uses the good algorithm is definitely more likely to find a cure for cancer. But there is a flaw even in this, one that computer scientists discovered as they got into ever more complex algorithms for problems like this: if the longest repeated substring is long, there's a problem. So here are some experiments on finding the longest repeated substring in various files, from a small program, to the text of Moby-Dick, to a chromosome with 7.1 million characters.
Using the brute-force method, you get stuck very soon; it's too slow. But using a suffix sort, you can get these jobs done, even for huge files like the first ten million digits of pi. By the way, if there were a really long repeated substring in the digits of pi, that would be news; maybe it would help us understand more about this number. So lots of people write programs of this sort. But the big problem is that if you have a really long repeated substring, then this suffix sort is not going to work; our fast algorithm doesn't complete. So what's going on? The explanation is really simple: when the longest repeated substring is very long, many suffixes share very long common prefixes, so each compare in the sort can take time proportional to the length of that repeat.
11:48
And so, I want to describe that briefly. Actually, these two computer scientists, Manber and Myers: one of them went on to become chief scientist of an internet company; the other went on to become chief scientist of a company that won the race to sequence the genome. In both cases, good algorithms for processing long strings are very important. And this is a great algorithm. Now, it's a little too complex to give all the details in a lecture, but I can give a pretty good idea of how it works.
The overview is that it's kind of like an MSD sort, but what it manages to do is double the number of characters that it looks at in each pass through the array. And the idea is that, since you're doubling the number of characters you look at each time, it finishes in a logarithmic number of phases: if you double until you get to N, the size of the suffix array, that's log N phases. And it turns out that what's really ingenious about the algorithm is that you can do each phase in linear time. So, this is an example of how it works.
We're going to do a suffix sort, and then I'll talk about the longest common prefix as well in a minute. To sort on the first character, we just use key-indexed counting or something like that; we know how to do that in linear time. Then the idea is that, given that it's sorted on the first character, we can easily sort on the first two; again, we can use key-indexed counting. So now, what we can do is double the number of characters that we involve each time. The next phase of the Manber-Myers algorithm gets it sorted on four characters. And of course, the more characters we have sorted on at the leading part, the smaller the subfiles in the trailing part; that's certainly a feature. Then, in this case, after we get to eight characters, all our subfiles are of size one, and we're done with the sort.
14:11
Now, the key to the algorithm is the idea that, once it's going, it uses what's called an index sort, which essentially means that it can do compares in constant time in the middle of the algorithm. Let's just take a look at how that works. So let's look at trying to compare suffix zero with suffix nine. We know that we've already got the suffixes sorted on four characters, and we want to sort on eight. Zero and nine are equal on their first four characters; they're in the same subfile. So now we're going to get them sorted on the rest. But the thing is, if we add four to our given indices: we're at zero, and we add four, that gets us to four; we're at nine, and we add four, that gets us to thirteen. We already know that the suffixes are sorted on those characters, and we know that thirteen appears before four in this sorted list; it's already sorted. And we can keep track of that by using an inverse array, which says, for every index, where it appears in the sorted order. So, this says that thirteen appears at position six and four appears at position seven; that is, thirteen appears before four.
So, when we're trying to compare the zeroth suffix with the ninth suffix, we can go in and add four to each, getting four and thirteen, go into the inverse array, and see which one is less; the one that appears first in sorted order is less. So, that's a constant-time comparison: suffix nine is less than suffix zero because the inverse of thirteen is less than the inverse of four, which means that the eight characters starting at nine are less than the eight characters starting at zero. It's a really simple and easy-to-implement operation. So, maintaining the inverse, I can get constant-time string compares. What does that imply for the whole suffix sort? Well, with a constant-time compare, I can get, at a minimum, an N log N sort, just by using quicksort or mergesort, and then I get much faster than quadratic performance no matter what the strings are. That's a big key to the success of this algorithm. And actually, if you use a version of three-way quicksort, it's been proven that it even gets down to linear time for the sort, no matter what the input is.
That's one thing; and then the other thing is, when you need to compute the longest common prefix, to look for the longest match after the array is sorted, you can also use this constant-time string compare and get the job done. So this is a really ingenious algorithm that can get suffix sorting done very fast, even in the presence of a huge repeat. And I think it really underscores the importance of careful algorithmic thinking in addressing new computational challenges.
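The Manber-Myers linear-time-per-phase machinery is more involved than fits here, but the doubling idea with an inverse (rank) array can be sketched in a simpler form: each phase re-sorts with a comparator that uses the previous phase's ranks as constant-time compares on twice as many characters. This simplified formulation, with names of my own choosing, runs in about N log² N rather than the N log N of the full algorithm:

```java
import java.util.Arrays;
import java.util.Comparator;

// Prefix-doubling suffix sort: after each phase the suffixes are sorted on
// twice as many leading characters, and rank[] (the inverse array) records
// each suffix's position in that partial order, giving constant-time compares.
class PrefixDoubling {
    static int[] suffixArray(String s) {
        int n = s.length();
        if (n == 0) return new int[0];
        Integer[] sa = new Integer[n];
        int[] rank = new int[n], next = new int[n];
        for (int i = 0; i < n; i++) { sa[i] = i; rank[i] = s.charAt(i); }
        for (int k = 1; ; k *= 2) {
            final int len = k;
            final int[] r = rank;
            // Order by (rank of first len chars, rank of next len chars);
            // a suffix that runs out of characters sorts first (-1).
            Comparator<Integer> cmp = (a, b) -> {
                if (r[a] != r[b]) return Integer.compare(r[a], r[b]);
                int ra = a + len < n ? r[a + len] : -1;
                int rb = b + len < n ? r[b + len] : -1;
                return Integer.compare(ra, rb);
            };
            Arrays.sort(sa, cmp);
            // Recompute ranks for the doubled prefix length.
            next[sa[0]] = 0;
            for (int i = 1; i < n; i++)
                next[sa[i]] = next[sa[i - 1]] + (cmp.compare(sa[i - 1], sa[i]) < 0 ? 1 : 0);
            System.arraycopy(next, 0, rank, 0, n);
            if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: fully sorted
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = sa[i];
        return result;
    }
}
```

For "banana", this produces the suffix array [5, 3, 1, 0, 4, 2], the starting indices of the suffixes in sorted order.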
So, string sorting, just to summarize, is a really interesting area of inquiry. For one thing, we can develop linear-time sorts for many, many applications. We thought we were doing well when we had N log N sorts, but actually we can do much better for many applications. And in fact, we can even get to sublinear, where we don't even have to examine all the data. In today's world, when we have huge amounts of data, to be able to get it sorted without even looking at all of it is really quite a miracle. Three-way string quicksort: you can't do better than that in terms of the number of characters that you have to examine, and there's a lot of deep mathematical analysis behind that result.
But I think the other lesson to learn is that algorithms like three-way string quicksort and Manber-Myers show that we really still have a lot to learn in string processing. Strings in applications are not really random; in fact, a lot of the time we're looking for non-randomness, and we might want to use specialized algorithms. So, that's our introduction to string processing; we're going to look at many other string-processing examples in the coming lectures.