.finally Finally we're going to look at suffix arrays and string processing using

this data structure that has really played a very important role in, string

processing applications that would not otherwise be possible.

To get a feeling for the idea, we're going to look at a really old idea that you're

actually familiar with called keyword in context search.

And the idea is that you're given a text, a huge X and what you want to do is

pre-process it to enable fast substring search.

That is, you want a client to be able to give a query string and then you want to

give all occurrences of that query string in context.

So, if you look for the word, search, it will give all the occurrences of where the

word search occurs in context. Or better thing is another one.

And that's a, a very common operation. One, you're certainly familiar with it

from web searching, in your browser. And there's many other applications.

This is a pretty old idea that dates back to the late 50's, early 60's people have

always wanted to do this. And there's an easy way to look at it

called suffix sorting. The idea is you take your input string,

and then, form the suffixes. remember when I talked about at the beginning with

JAVA's string data type, you can get this done in linear time because each suffix is

basically a pointer back into the input string.

So the suffix, remember, the I suffix just start the character I and take the rest of

the string. So, what this does is, it gives, sort

keys, that, contain, kind of the, the, Pieces of the string itself in context and

all we do is just, run a sort on that suffix.

And what that sort does is, it brings if you do a search, it brings the things that

you're searching for, close together. And, you can use, once you've done that

suffix sort, you can use a binary search to find all occurrences of the string out

of there. That's the, basic idea of a keyword and

context searching. You suffix sort the text, then do binary

search to find the query that you're looking for.

And then you can, scan for wherever the binary search ends up.

And so, this is all the places where, the word search occurs, in the text of Tale of

Two Cities, And then, you can use this index, to,

print out the context of whatever's needed.

It's a fine and effective way, for solving this important practical problem.

And that's interesting, but I want to talk about another really important problem

that, it has, hugely important applications. This is the longest repeated

substring problem. So you're given a string of characters,

find the longest repeated substring. in this case, this is genomic data, and,

there's the example of the longest, repeated substring. And now, in scientific

data, the long repeated substring is often something that scientists are looking for.

And so, and these strings are, are huge. it might be billions of these, characters.

And so, it's really important, not only to know that you can find long substrings

efficiently, but also, that you can do it for, huge, huge strings.

Another example is, cryptanalysis. Where, a long repeat in a file that's

supposed to be, encrypted random file indicates that somebody did something

wrong. It might be the key to, breaking the code.

Another example is data compression. When you've got a file that's got a lot of

repeated stuff in it, you might, want to do this operation, to take advantage of,

of these long repeats and not keep multiple copies of them.

Here's another example, this for, studying or visualizing repetitions in music.

In this example, every time there's a, a repeat of the notes then, there's, a, an

arch drawn to, visualize the repeat. And it's, the arch is thick if, the number

of repeats is long. And it's high if the repeats are far away.

And this, tells, is an interesting way to analyze, repetitions, in music.

So, how are we going to solve this problem?

Very simple to state problem. Given a sequence of N characters, find the

longest repeated substring. As with many problems, there's, an easy,

brute force algorithm, that, at least gives us an idea of what, what the

computation is like. But it's not going to be useful for huge

strings. And that's you try all possibilities, all

pairs of indices I and J and then just compute the longest common prefix for each

pair. The problem with that is that if N is the

length of the string, you've got, N squared over two pairs.

And for every one of those pairs, you might have to, match them up to length D.

It's definitely quadratic time, more than quadratic time in the length of the

string. And can't be using that for, you're not

going to be able to use that for strings that are billions of characters long.

Another way to look at one way to solve the longest repeated substring, is to use

a suffix sort. In fact, just we talked about before.

So, I would take out our input string. Reform the suffixes.

And then sort the suffixes and that brings the long repeated substrings together.

So, that's a quite elegant and efficient solution to this problem. just build the

suffix array that's a linear time process. Do the sort.

We, we just saw we can do that, with n log n character, character compares.

And then go through and find the repeated the substrings.

So, it's just one passthrough to check for adjacent substrings to see which one's the

longest. And this is very easy to code up.

Here's the Java code for computing the longest repeated substring.

We get the length of our string out. We build our suffix array.

Remember that's linear time and space because of Java string implementation

allows us to do substring and constant time.

Then we go ahead and sort the suffixes, and then find the least common prefix

between the, adjacent suffixes in sorted order.

Just keep track of the max so that's longest, repeated substring.

And if so, for example, this sentence about such a funny, sporty, gamy, jesty,

joky, hoky-pokie, lad, is the longest repeated substring in the text of

Melville's Moby Dick. And now we run that one to, prove that

we've got a linear-rhythmic algorithm because there's a lot of characters in

that. And you're not going to find this, without

a good algorithm. And, and this is just a humorous approach

to what we've, talked about today. If you have, five scientists that are

looking for a long substring, in a genome, that, they might encounter, this problem,

with a billion nucleotides, and there are plenty of scientists that are not aware

of, important algorithms are the ones, like the ones that we are talking about.

And the one that uses the good algorithm is definitely more likely to find a cure

for cancer. But there is a flaw even in this, that

computer scientists discovered as they got into ever more complex algorithms for

problems like this, and that is if the, the longest repeated substring is long,

there's a problem. So this is just some, experiments, for,

finding the longest, repeated substring in various files.

From just a program to, the text of Moby Dick.

Or a chromosome with 7.1 million characters.

And, using, the, brute force method, you get stuck, and too slow, very soon.

But using, a suffix sort, you can get these jobs done, even for huge files like

the first ten million digits of pi. By the way, if there was a really long

repeated substring in the first in the digits of pi, that would be news.

Maybe it would help us, understand more, about this number.

So lots of people, write programs of this sort.

But the big problem is if you have a really long repeated substring, then

suffix sort is not going to work. Our fast algorithm, doesn't complete.

So what's going on? In the explanation is really simple.