0:00

Hi, in this module, Algorithmic Challenges,

you will learn some of the more challenging algorithms on strings.

We will start with the Knuth-Morris-Pratt Algorithm for exact pattern matching.

It allows to find all occurrences of a pattern in the text,

in the time proportional to the sum of the length of the pattern and the text.

0:20

Then, we'll proceed to learn the algorithm for

building suffix array of a string in time n log n.

And after that, you will learn how to build a suffix tree,

given a suffix array of the string in linear time.

Of course, you already know how to build a suffix tree of the string ,but

the algorithm you know is n squared.

And that doesn't allow you to take really long strings of millions or

billions of characters and build suffix array for them in reasonable time.

And the n log n algorithm will allow you to tackle that.

So first, let's recall what exact pattern matching means.

0:58

The problem is very simply formulated, you are given a long text and

the pattern that you need to find in it.

And you need to find all the positions where the pattern starts in the texts,

all the positions, and when number positions from zero in the whole module.

1:16

You learned brute force algorithm for that,

which basically slides the pattern down the text, and the running time of that

algorithm is the product of the length of the text and the length of the pattern.

What we are going to do in this lesson is to improve this time to the sum

of the length of the text and the length of the pattern, but first,

let's recall how brute force algorithm works.

It first aligns the pattern and the text such that the pattern starts from

the zero position in the text, and tries to match the pattern by

comparing it character by character to the corresponding characters of the text.

1:59

And in this example, we find pattern right away in the position number 0.

So we add position number 0 to the output, and

then we slide pattern one position to the right.

And we compare the first symbols and they don't match, so

we slide the pattern again, and again, no match.

And again, and then we just slide the pattern.

We compare the symbols and if they don't match, we slide the pattern to the right.

And then in the last possible position,

we again find the occurrence of the pattern in the text.

And so our output is a list of positions 0 and 7, and

those are all the positions where the pattern occurs in the text.

So the question is now, can we avoid some of the positions where

we tried to align the pattern with the text?

But it actually didn't made sense given what we already know about the previous

comparisons.

2:55

And the answer is yes, we can.

In this particular example, we've already found the pattern in the text starting

from the 0 position, and that means that when we slide the pattern to the right and

try to align it to the first position in the text.

We're going to compare the prefix of the pattern without the last character

with the suffix of the same pattern without the first character.

And they don't match, so

there is no occurrence of the pattern starting from the first position.

But, if we somehow pre-processed the pattern and knew that the prefix

without last character is not equal to the suffix without the first character,

then we could just keep all this alignment with the first position in text at all.

And the same is true about the next position of the pattern.

If we knew that the prefix of the pattern of length two,

is not equal to the suffix of the pattern of the length two.

Then we could skip also positioning the pattern against the second position

on the text, and then, when we slide one position to the right again.

We compare this prefix of length one which is letter a,

with the suffix of length one which is also letter a.

And these are equal, so this position, we cannot skip according to this rule.

But it means that instead of comparing the pattern to the text in positions zero,

one, two and three, we could just safely move the pattern from the position zero

to the position three, skipping positions one and two.

And in the more general case, we could skip even more positions,

depending on what is the life of the pattern, and

how much of its prefixes coincide with their corresponding suffixes.

Another example is when we don't even find the whole pattern in the text.

We still can skip some of the positions.

So in this example, the longest prefix which is common for

the text and the pattern consists of six characters and the pattern is longer.

5:17

But instead we need to do the same thing with the string marked in green.

We need to compare prefixes of this string with suffixes of this string.

And we can notice that the first position where the prefix of a string

can coincide with the corresponding suffix Is position number four.

So we can just move the whole pattern from position zero

to the position four in the text.

And then, try to compare the pattern with the text and we find an occurrence.

We couldn't find an occurrence earlier, because no longer prefix of the string a,

b, c, d, a, b, coincides with the corresponding suffix.

And another example,

again we find the longest common prefix of the pattern in the text.

It has length 6, and is the string a, b, a, b, a, b.

And for this string the longest prefix which coincides with the corresponding

suffix is a, b, a, b of length four.

And that means that we can move the pattern two positions to the right, and

skip the alignment at position one of the text.

6:25

Now we again find the longest common prefix of the pattern, and

the suffix of the text starting in position two.

It again has length six, which means that there is no

occurrence of the whole pattern in the text in position two.

But we need to consider the string a, b, a, b, a, b,

which is the longest common prefix.

And again, compare the prefixes of this string with the suffixes.

And we already know that the longest prefix which coincides with the suffix

is a, b, a, b, of length four.

So we can again move the pattern to the right, so that the prefix and

the corresponding suffix match.

And now we find the occurrence of the pattern in the text.

To make an algorithm from these observations,

we will need the definition of a border.

So border of a string is a prefix of the string

which is equal to the suffix of the string of the same length.

For example, for string arba, a is a border,

because the prefix a is equal to the suffix a.

And ab is a border of a string a, b, c, d, a, b, which we saw in the second example.

And a, b, a, b, is a border of a, b, a, b, a, b.

7:36

And do you notice that the prefix a, b, a, b intersects with the suffix a, b, a, b?

And that is okay and we just mark the fact that they intersect with an orange color.

But actually we notice that not just a, b is a border but

also a longer string of line four which is a, b, a, b is also a border.

However string a, b is not a border of the same string a,

b because we require that the border doesn't coincide with the whole string.

We need only those prefixes and suffixes which are shorter than the initial string.

Now, let's consider shifting all the pattern along the text in

the general station.

So, the first thing we do is we find the longest common prefix of the pattern

with the suffix of the text to which we've just aligned our pattern.

Then we find w, the longest border of u, so

that there is a w in the beginning of u, and there is one in the end of u, and

also mark both w's in the text T.

Now, I suggest to move pattern P to the right in such a way that

the first w in P coincides with the second w in T.

8:50

And that is the way I suggest to

skip some of the positions where we don't need to align pattern with the text.

Now you know that it is possible to avoid some of the comparisons that the brute

force algorithm does.

And I've suggested a specific general way to do that.

But we don't want to miss any of the occurrences of the pattern in the text.

So the question is, is it really safe to move the pattern in the suggested way?

You will learn that in the next video.