Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

Welcome back. This is Selcuk Candan a

professor of Computer Science and Engineering at Arizona State University.

We are learning about sequences and time series,

and this segment is on what we call

the edit distance algorithm to support approximate sequence matching.

So, before we go ahead with the edit distance algorithm,

details of edit distance algorithm,

let's revisit, and remember some of the key concepts that we have learned so far.

So first of all,

a string or a sequence is defined as a sequence of events,

or sequence of symbols from a given dictionary, or alphabet.

We'll again use this simple definition.

In the previous video lectures,

we have learned how to implement what we called prefix search, and subsequence search.

In this video lecture,

we'll focus on sequence similarity.

And our goal will be to compute the similarity,

or distance between a query sequence,

and other sequences that we may have in our database.

For example, the question is in our database we have the sequence cable,

tale and tackle, and we are given the query sequence table.

And the question is,

which of the sequences in our database is more similar to our query sequence table.

Is it cable? Is a tale?

Is it tackle? How can we evaluate them?

And how can we order them relative to their similarity to our query pattern?

This is obviously important because as we are observing events in the real world,

or as we are observing event sequence in the real world,

we want to be able to match them onto event sequences that are stored in the database,

and find those sequences that are similar to what we have just observed.

The question is, how can we do that?

Well, so this is called,

the approximate string match problem.

Again, our goal is giving a query sequence.

We want to compare that into other sequences that we have in our database,

and we want to do this efficiently.

We want to do this efficiently because there may be a lot of sequences in our database,

and if this comparison is very inefficient,

it may take a lot of time to be able to tell the user,

"Oh we've seen something similar to what you are interested in."

So, what is edit distance?

Well, edit distance is a very simple concept.

Let's assume that we have "table",

and "cable" so these are two sequences.

As you can see,

these sequences are very similar to each other.

In fact, I can convert cable into

the sequence table just by replacing the "t" with "c".

That is one replacement operation is sufficient to convert table to cable, or vice versa.

What about "table" and "bale"?

Well," table" and "bale" also similar,

but there are multiple ways I can convert "table" to "bale'.

One way to convert "table" to "bale" would be to delete "t".

But that's not sufficient because when I delete "t",

I have "able" and "bale".

These sequence also similar to each other,

but I still have to do other operations to be able to convert them to each other.

For example, I need to replace the first "e" with a "b",

and I need to replace the second "b" with an "a".

If I do that then I obtain "bale" which is equal to my data sequence.

Note that to convert my query sequence in this case to my data sequence,

I had to do three edit operations. What does this mean?

Well, it means that if I count the number of edit operations that

I do to convert my query string to

my data strings "table" is

more similar to "cable" because I can do it with only one edit operation.

"table" is less similar to "bale" because to be able to

implement my conversions I need to do three edit operations.

Essentially, edit distance measure

counts the number of edit operations that you have to do to

convert the query string to your data string, that is it.

And the common edit operations are as you're seeing in these examples, insertion,

insertion of a symbol, deletion,

deletion of an existing symbol,

or replacement, replacement of one symbol with another.

So, then the question essentially becomes,

given a query sequence, and data sequence,

how can I find the number of edit operations that

I need to do to convert one sequence to the other?

Because if I can do that then I can compute edit distance,

which essentially means that I can then sort

the data that I have relative to how well they match to my query sequence, and I am done.

So, let's formalize this operation.

We are given two strings.

We are also given a set of edit operations.

Usually, we are also given a cost associated to these edit operations.

Because in different applications,

different edit operations may have different costs.

For example, replacement maybe more expensive than an insertion,

or sometimes we love insertions,

but we don't allow deletions,

which means that insertion is cheaper than deletion.

So, this is in general application-specific,

but whatever it is we have a cost associated with edit operations.

And essentially what we are trying to do is we are trying to find

a sequence of edit operations that convert our given string to the other string,

and among all such sequence of edit operations that convert our one string to the other.

We try to find the sequence with the overall minimum cost.

Overall cost meaning, the summation of the cost of the individual edit operations.

So, my way of implementing this would be,

find all possible sequence of edit operations.

For each sequence of edit operation that you can find, compute it's cost,

overall cost, and then among all these sequence of edit operations,

simply select the sequence that gives you the minimum cost.

Obviously, this can be very, very,

very expensive, because there may be many, many,

many, many ways to convert a given string to another one.

So, then the question becomes is there an efficient.

Fast way to find

this sequence of eight operations that gives us the minimum costs,

that will become essentially question that we will answer in this video.

Well, what we will learn now is an algorithm to implement

this edit distance cost problem.

So, in this case,

we are given two strings,

string P and Q.

The string P is of length N,

the string Q is of length M,

and for the simplicity,

we will assume that all edit operations have the same cost,

cost equal to one.

We will basically go over steps of the algorithm that computes

the edit distance cost of converting the string P to string Q.

Let's see how that works.

Let us define, if D[i,j] is the number of

edits from length-i prefix of P to length-j prefix of Q.

So, we are basically considering all prefixe of P and all prefixe of Q,

and we defined D[i,j] as the number of edits that

convert the length-i prefix of P to length-j prefix of Q.

We will see how this is going to be used.

Let's see the first step of our algorithm.

In the first step of our algorithm,

we set D[0,j] to j. What does this mean?

This means that if we are given a P of cost, which is empty,

and we want to convert that into the j-length prefix of string Q,

in order to be able to do that,

we need to do a total of j insertions.

If I do a total of j insertions,

I can convert this into my j-length prefix of my string Q.

So, the cost of this would obviously be j.

Great. What about the i, 0?

So, the i, 0 means,

I'm looking at the i-length prefix of my string P,

and I am basically trying to see if I can convert this into the 0-length prefix of Q,

which means the Q is empty.

How can I do that?

Well, straightforward.

I will need to implement i deletion operations.

If I implement i deletion operations,

I can empty also this.

At that point, P will be equal to the empty prefix of Q, so they will be equal.

So, the cost of this would be i edit operations, i deletion operations.

Excellent. So, these are my basic two boundary conditions.

Now, we will get into a for loop.

In the for loop, I will consider all prefixes of i and I will consider all prefix of j,

and I will have several conditions.

The first condition is what happens if the symbol Pi is equal to symbol Qj.

So, I'm looking at the i-length prefix of Pi,

j-length prefix of Q,

and I am comparing these two symbols.

Let's assume that they are equal.

Well, if they're equal,

I don't need to do any replacement or any deletion or

any insertion to make this symbol equal to this symbol.

What does this mean? If I need to make this entire prefix,

the cost would be simply converting the i minus 1 length prefix

of string P to j minus 1-length prefix of string Q.

So, D[i,j] in this case,

would be simply equal to D,

i minus 1 and j, i minus 1.

Beautiful.

So, now I know if Pi equal to Qj,

how to compute the edit distance using a recursive function.

What if Pi is not equal to Qj? What can I do?

Well, in that case, I have three conditions.

I may do three different things.

The first thing would

be insertion of Qj as the last symbol,

or I can delete the last symbol of Pi,

or I can replace

the Pi symbol with the Qj symbol.

So, I have three different things I can do.

Each one of these have one cost.

The first operation has one insertion cost,

the second possible operation has one deletion cost,

and the third possible operation has one replacement cost.

Since I assume that all edit operations have the same costs,

these all have one cost.

So, the cost of D[i,j] is then 1

plus the minimum of,

in this case, D, i minus 1, Qj, or D,

i to Qj minus 1 match,

or D, i minus 1 and J,

i minus 1 match.

So, this gives me

the recursive function to compute the D[i,j] cost.

Note that once I solve

this recursive operation and once I compute all my distances, finally,

D(N,M) will give me the edit distance

between my length N string P,

and length M string Q.

This is great. So, what is the problem?

The problem is, as I mentioned,

in general, all edit operations may not have the same cost.

In practice, different edit operations may have different costs.

For example, as we said in some applications,

insertions may be cheaper,

deletions maybe more expensive.

Or In general, insertion and deletions

might be cheaper than doing a replacement operation.

Or in some applications,

inserting one particular symbol say A is more expensive than inserting another symbol B.

So, which means that, in practice,

we cannot assume that all edit operations will

always have the same cost. What do we do with that?

Well, changing the edit distance algorithm that we

have just described to that scenario is not very difficult.

So, the only thing that we have to do is instead

of adding one in all of these three situations,

we add a cost

specific to the edit operation that we are doing at all these three situations.

Let's remember, in the first situation,

what we are doing is inserting the symbol Qj,

so the cost will be insertion of Qj.

Whatever that cost is, that's application specific.

In the second situation here,

we are deleting Pi,

so the cost will be deletion of the symbol Pi.

Whatever that cost is, that's application specific.

In the third situation here,

we are replacing Pi with Qj,

for the cost of this will be replacement cost of Pi with Qj.

Whatever that cost is,

that is application specific.

So, if we implement this,

once we go over the full incursion,

we will find D(N,M) which will give us

the edit distance of converting

N length string P to M length string Q with application specific insertion,

deletion, and replacement costs.

So, this means that basically,

whatever my application is,

I can use this algorithm to find my edit distance between two strings.

That's great. What is the problem?

Well, the problem is that, as you can see,

since I am doing this cost computation for all i less than,

equal to N and for all j less than, equal to M,

the cheapest way of executing this will have at least N times M cost.

This means that if I have long strings,

the edit distance computation will be very, very, very expensive.

Simply, let's assume that N is of length thousand,

so I have a thousand length string,

and say my M is also length thousand,

if I have another M string that is of thousand length,

the cost of computing the edit distance will be a million.

It's easy to see that the computation of edit distance become more and more and more

costly as my data and query strings become longer and longer and longer.

So, to recap, edit distance can be

used to assess how similar or different two strings are to each other.

However, the key problem is that edit distance computation can be very,

very, very costly when we are matching long strings.

In the next video,

we will learn about ways to speed up the process by computing

edit distance instead of precisely doing approximate computations.

See you in the next lecture.

Explore our Catalog

Join for free and get personalized recommendations, updates and offers.