0:11

This part is devoted to another classical problem, the edit distance problem.

The Edit Distance between two strings or

two words is the minimum number of single letter modifications

needed in order to get the second string from the tests trained.

By single letter modifications here, we mean insertions,

deletions, and substitutions.

So edit distance is also known as Levenshtein distance,

after Vladimir Levenshtein, who introduced it in 1960s.

This is a natural measure of similarity between two words or two strings.

For example, when a spell checker encounters a misspelling,

it suggests the word which is

close to the misspelled word with respect to typing errors.

And insertions, deletions, and substitutions

are exactly the well known typing errors.

Okay, so edit distance or generalizations of edit distance are also

used in computational biology to align sequences.

Now, let's, let's give an example.

Now, the example we're going to turn the word editing into the word distance.

So for this, we first remove the letter E.

Ok, so we then insert the letter S.

Our next operation is to replace the letter I with the letter A.

1:56

We then replace G by C, and finally we insert E, okay?

So another convenient way of showing

this sequence of operations is the so called alignment.

So what you see here on the slide is an alignment of two

words that you're editing in distance.

In this alignment, any column is either a deletion, or

an insertion, or a substitution or a match.

So a match is a column where we have the same two letters.

An insertion is across a column where we have a gap in the top and

we have a letter at the bottom.

And deletion is a column where we have a letter at the top and a gap at the bottom.

And a substitution is a column where we have two different letters.

So in this case, we define the cost

of each column to be equal to 1 for each column except for matches.

So the cost of matches is equal to 0.

So in this case, the cost of this alignment is equal to 5.

It is not difficult to show that for any sequence of operations,

there is an alignment of the same cost, and vice versa.

For any alignment of some cost, there is a sequence of operations where the number

of operations is equal to the cost of the alignment, of the initial alignment.

So for this reason,

what we're going to be looking for is the least expensive alignment.

It's usual in the design of dynamic programming and algorithms.

Our goal now is to come up with the right notion of the subproblem.

For this, let's analyze the structure of an optimal solution.

So in this case, an optimal solution is an optimal alignment.

In particular, let's focus on it's last column.

So what we know about this last column is that it is either an insertion or

a deletion, or a match or a mismatch.

4:12

Okay, so assume for the time being that it is an insertion.

Then what we know for sure is that is has a last symbol of the last character of B

that is inserted, which means that in this column, we have a gap at the top.

And we have the last element of B at the bottom, okay?

Now let's consider what happens if we cut this last column out of this alignment.

Then what is left must be an alignment of the word A with

the prefix of lengths n- 1 of the word B.

And what is more, this alignment must be optimal.

Why is it?

Well begin by the cut and paste tree.

Because if this alignment, if this sum alignment is not optimal,

meaning that there is some other less expensive alignment.

Then we can cut this initial alignment and

replace it by less expensive alignment here, okay?

So another possibility is that the last column is a deletion.

Then, what we know is that in this case, we delete the last symbol of A.

And what is left if we cut the last column,

must be an optimal alignment of the prefix of A, and the word B.

And the last possibility is that if in the last column, we have two letters,

then what we know is that if we cut this column, then what is left must be

again an optimal alignment of the prefix of lengths n- 1 of the word E.

And the prefix of the lengths m- 1 of the word B, okay?

This suggests that we define our subproblems as follows.

Let's define ED(i,j) as the least expensive alignment or just the cost

of the least expensive alignment of the prefix of lengths i of the word A.

And the prefix of length j of the word B.

Then, our previous analysis,

shows that we can compute ED(i,j) as follows.

So this alignment for prefixes is again,

its last column is either an insertion, a deletion or a match or mismatch.

Recall that we pay 1 for insertions, for deletions, and for mismatches.

And we pay nothing for matches.

Okay, this allows us to write the following recurrence relations for

ED(i,j).

ED(i,j) is equal to minimum of ED(i-1,

j) +1 or ED(i, j-1) +1 or

ED(i-1, j-1) + the difference

between the i symbol of A and the j symbol of B.

And by this difference, we mean the following simple function.

If it is equal to 1, if the corresponding two symbols are different.

And it is equal to 0 if they are the same.

This basically corresponds to the fact that we pay 1 for

different symbols or for mismatches.

And we pay 0 or nothing for matches, okay?

So the base case for this recurrence relation is the following.

When i = 0, ED(i,j) = j.

And similarly, when j = 0, then ED(i,j) = i.

Why is that?

Well in this case we just tried to find an optimal alignment of

an empty prefix of some worth and with some, possibly none empty.

A prefix of some other word.

In this case, obviously, an optimal way of getting one prefix

from the empty prefix or getting the empty prefix from some other

is either the insert all the elements or delete all the elements.

And in this case we need exactly i or j operations, okay.

So once again, when we have a recurrence relation,

it is just a technicality to transform it into recursive algorithm.

So for this, we just introduced, we just initialize the table.

And then each time when we make a recursive call, we first check with

the the solution for the corresponding subproblem (i,j) is already in the table.

And if it is in the table, we immediately return the result.

If it is not in the table, we compute its value by making three recursive calls.

Or if it is just a base case we compute it in a straightforward way.

And when we compute it, we save it through the table, in order to avoid

recomputing the solution for the same subproblem again, okay?

Now our goal is to convert this recursive algorithm into an iterative algorithm.

So recall that the space of all our subproblems is all the values of (i,j),

where i and j range from 0 to n and m respectively.

This suggests actually that we can use just the 2D table for

storing all the solutions for our subproblems.

And this is what we're going to do.

9:56

Now let's also realize the following fact,

ED(i,j) depends on three previous values.

It depends on ED(i-1, j-1),

ED(i-1,j) and ED(i, j-1).

So, our order of solving all the subproblems should respect these facts.

So in other words, when computing ED(i, j),

we need to make sure that all the previous three values are already computed.

And there are actually two natural orders that respect this fact.

We can either fill in the table row by row, or we can do this column by column.

In both cases, it is easy to check that when computing some particular value,

all the previous three values are already computed.

10:57

Okay, and then it is straightforward to implement

the corresponding iterative algorithm.

So we allocate it to the table.

We first initialize its zeroth row and 0 column.

And then we fill in this table row by row following our recurrence relation.

Let's now give an example of how this table looks like for our toy

example of computing the added distance between the words editing and distance.

So we start by filling in the 0s row in the 0's column of the table.

We then start computing the value D of 1 at 1,1.

It is computed as the minimum value of D at 0,1 +1,

D of 1,0 +1, and D 0,0 +1.

In this case, the minimum is attained by 0,

0, 0 +1 and it is equal to 1.

So we put this value in this cell.

And this corresponds to the fact that the edit distance between prefix of 1,

one word and the prefix of 1 in the second word is equal to 1.

Which makes sense, because these are different letters, D and E.

We then go to the one for computing the ED of 1,2.

And it is computed and it is equal to 1.

And again, it makes sense,

because the edit distance between D and ED is equal to 1.

We just need to insert E, right?

So we continue in the same fashion.

And when the whole table is filled in,

we see that at the bottom right corner, the value is equal to 5.

And this basically says that the edit distance between

the whole two words is equal to 5, okay.

This convinces us that the sequence of operations that we considered

in the beginning of this video is in fact, optimal, okay?

As a final remark, let's discuss how to implement a brute force solution for

this for this problem.

So for this, we would extend an optimal alignment element by element.

So at each recursive call,

we will try to append a new column to our current alignment.

And then it would remain to notice that in order to extent our partial alignment,

it is enough to know only the last column of this alignment.

Then again, just by adding the memorization trick for

the corresponding recursive algorithm,

one would get a similar algorithm that we just designed in this part.