If they are aligned,

we say that the match for that hash function is equal to one.

So the idea is very simple.

So given p and q,

consider all possible hash orderings.

There are r of them. For each ordering, compute

the corresponding smallest W-grams,

check whether the first W-grams in that order match or not,

and then count the number of hash orders in which they are aligned,

say r equal to three in this example.

If only two hash orders are aligned,

we say that these two strings match each other with probability two out of three.

If only one is aligned,

then we say that,

"Well, there is less chance that these are similar to each other."

If none of them are aligned,

then we say, "Well,

there is really not much chance that these two strings are similar to each other."

If all of these sequences agree, then we can say, "Well, you know what,

there is good evidence that these two strings might be very similar to each other.

So why don't we go and pay

the more expensive distance cost to

actually evaluate how similar these two things are to each other?"

If, however, the similarity is zero out of three,

then we can say, "Well, there's not much chance that these are aligned.

So there is actually no reason to go and pay the price of expensive edit distance to

compute the precise value of the edit cost between strings p and q."

So, based on the min-hash sampling similarity value that we obtain, we can decide

whether to discard the two strings or whether to consider

them as potential match and go and compute the more expensive edit distance.

So, it can be used as a filtering tool.
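As a rough sketch of this filtering idea (not the lecturer's exact code), the following Python illustrates the min-hash comparison over W-grams; the gram width `w`, the number of orderings `r`, and the threshold in `candidate_pairs` are illustrative choices, not values from the lecture:

```python
import random

def wgrams(s, w):
    """All contiguous substrings of length w (the W-grams of s)."""
    return {s[i:i + w] for i in range(len(s) - w + 1)}

def minhash_similarity(p, q, w=3, r=3, seed=42):
    """Estimate similarity as the fraction of r random hash
    orderings in which p and q share the same smallest W-gram."""
    rng = random.Random(seed)
    grams_p, grams_q = wgrams(p, w), wgrams(q, w)
    matches = 0
    for _ in range(r):
        salt = rng.random()
        order = lambda g: hash((salt, g))  # one random hash ordering
        if min(grams_p, key=order) == min(grams_q, key=order):
            matches += 1
    return matches / r

def candidate_pairs(strings, threshold=0.5, w=3, r=3):
    """Filter: keep only pairs whose estimated similarity passes the
    threshold; only these go on to the expensive edit distance."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if minhash_similarity(strings[i], strings[j], w, r) >= threshold:
                pairs.append((strings[i], strings[j]))
    return pairs
```

Identical strings always share the smallest W-gram under every ordering, so their estimate is 1.0; dissimilar strings tend toward 0, and only the surviving pairs would be passed to the exact edit distance computation.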

So what did we learn?

In the previous lecture and this lecture,

we've learned that edit distance is a way to compare strings.

But we have also learned that edit distance can be very

costly for matching strings that are long.
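For reference, the cost the lecture warns about comes from the standard dynamic-programming formulation of edit distance, which fills an (m+1)-by-(n+1) table for strings of lengths m and n. A minimal Python sketch:

```python
def edit_distance(p, q):
    """Classic dynamic-programming edit (Levenshtein) distance.
    Fills a (len(p)+1) x (len(q)+1) table, hence O(m * n) time."""
    m, n = len(p), len(q)
    # dp[i][j] = edit distance between p[:i] and q[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of p[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of q[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]
```

For long strings, this quadratic table is exactly what makes the exact comparison expensive, which motivates the cheaper approximations and filters discussed here.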

We have learned in this lecture two other distance algorithms,

cross-parsing and compression distance,

that can be used to approximate the edit distance comparison.

So instead of paying the O(m × m) edit distance cost,

we could use some cheaper algorithms to approximate the edit distance.

Then we have learned in this lecture,

the use of W-grams, also known as n-grams,

to filter out unpromising candidates

quickly so that we don't have to pay the price of

more costly distance computations such as

edit distance or cross-parsing and compression distance.

These are cheaper than edit distance,

but they can still be expensive, and

these W-gram-based algorithms essentially enable us to filter

out unpromising scenarios so that we can provide the potential matches to

the users in real time or in almost real time for effective exploration of sequence data.

Thanks a lot for your attention.