Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

So, now, we can basically do perfect search or subsequent search.

But note that in all the examples that we had,

in all the problems that we had,

so far we assumed that we were given the data sequence in advance.

We were given the data sequence in advance,

so we could create a try on it,

or we could create a suffix tree on it,

or we could create a suffix array on it.

But there are situations

in which we are not given the data sequence in advance.

So for example, let's assume that we have a sensor,

say this is our temperature sensor,

and this temperature sensor is

recording the temperature over time or the temperature events over time.

So its generating its course to continue the generating sequence.

What I am interested in is if at

any given point the sequence,

is it evolving, is it growing?

If at any point in time,

say the pattern ABBA occurs,

then I want the system to trigger an alarm.

Let's assume, because that might correspond to a fire.

Or let assume that is a set temperature sensors,

but this is a sensor on a car engine,

and the car engine let's assume it is recording vibration on the engine motor.

What I am interested in at that point is if what I am

recording constantly if it ever matches my query pattern Q,

then I want to trigger an alarm because

maybe the vibration pattern on the engine correspond to an engine problem.

So note that in this application,

I am again given a data sequence.

I am again given a query sequence.

However, in this case my data sequence is not given in advance.

My data sequence is coming to me in a streaming fashion,

but I have my query sequence given in advance.

In fact maybe I'm not given one query sequence or multiple queries sequences,

because maybe there are different ways a car engine can fail,

which means that as my data sequence is evolving,

I need to check constantly for every new sequence element that is arriving.

I need to check constantly If the most recent history matches any of my query patterns.

So this actually we call it triggering.

So we want to see if we can do subsequence matching,

and trigger efficiently, quickly any the patterns of interest to the user,

bring any patterns of interest to the users attached.

This is very important in decision making with real-time data,

because we want to be able to point to the user,

visualize to the user, interesting patterns.

So the question is actually how do we implement that?

Well, given a pattern, given a sequence,

obviously we can implement this very naive way which is for every,

so the only way of implementing this would

be for any given new element in the sequence I can extract

the most recent history four characters sequence history

and compare this four character sequence history to my query pattern.

Now that is actually what I'm doing at that point is that I'm

constantly extracting windows of interest,

and I'm constantly matching the window interest to the query pattern.

When I get a new symbol,

I again extract the new window of

interest and I'm again matching that to the query pattern,

and I'm repeating this again,

and again, and again, and again.

Obviously, this can once again be very inefficient if my pattern is long,

and if I have many such patterns of interest in my database.

So let's answer the question become,

can I do this efficiently?

I want to do this efficiently because I want to bring

such patterns of interest to the user without any delays.

Because this might indicate a fire,

which I need to find very quickly or it might as an indicator of

a car engine failure which once again I need to

take very quickly and I need to highlight to the user very quickly.

Right? So the necessary question become; well,

can we create a data structure

on the pattern instead of creating a data structure on the sequence?

So far we created our data structure on the data sequence.

But now our data sequence is arriving in an incremental way.

I don't have my data sequence in advance.

So then the question becomes;

Can I create a data structure on the pattern rather than on the data set?

Well, the good news is yes you can do that,

and we will see a couple of examples of this.

The most well-known approach to this problem

is called Knuth-Morris-Pratt algorithm or KMP data structures.

We'll see that basically if we use this data structures,

the cost of finding match is essentially going to

be constant time for each new symbol that we are observing.

So how will this work?

The way basically this is going to work is that given a quite a pattern of length M,

we will create a finite state automata

that represents our pattern.

We will use that finite state automata to constantly

check whether the trigger condition occurs or not.

So let's see an example.

So let's assume that we are given a query pattern, ABCADAB.

So we are given a query pattern of length: one,

two, three, four five, six, seven.

So we have seven symbols solid in our query pattern.

So what you will do is we'll basically take

this query pattern and represent it in the form of a finite state automata.

So, the finite state automata have a start position,

and has a successful trigger position,

and for each symbol that we have in our query pattern,

we have a note or a state in our finite state automata.

So, the basic structure is very simple.

So, a, b, c, d, a, b.

If I see that,

I should trigger an alarm.

So, indeed starting from start,

if you see a, b,

c, a, d, a,

b, you reach to the trigger condition.

So, that is basically the- that is a simple sort of straightforward part of the automata.

Let's assume that basically our data sequence evolves as follows.

Let's assumes that we have an a,

we have a b,

we have a c, we have another a,

we have another b, we have another c,

we have another a,

and d, and dot dot dot.

Let's see how this data structure would work.

We are at the beginning,

we're at the start position.

We see an a. That's great.

That takes us to the state a.

We have seen a so far. That's good.

So, we are about to match.

The next symbol is a b. That's great.

Looks like our symbol is- looks like our pattern is matching.

So, we have already seen a and b.

So, we are at the state where we have seen b.

So far so good. The next symbol is a c. That's beautiful.

We are going to next state in our symbol,

in our data structure c. So,

we have already seen three of our seven characters sequence.

Beautiful. The next sequence is an a.

Okay, there's a problem.

Why? Because our sequence- this is not.There's no problem.

Okay. Let's see an example how this data structure can be used to trigger patterns.

Let's see an example, data sequence.

Let's say our example data sequence starts with a d. Continue to c,

then a b, then a,

then b, then c,

then another a, then another b,

then another c, then another a,

d, a and b and dot dot dot dot.

Let's see what will happen. So, at the beginning we see a d,

and we are at the start pole.

d is not an a.

So, it means that basically,

we will basically say at the start position.

The next symbol that we see is a c. It is still not a.

So, we'll stay at the start position.

The next symbol that we see is a b.

That's fine. So, we are not seeing our trigger pattern.

So, we are still at the start position.

The next symbol in the sequence is an a.

This is the first symbol of our trigger pattern.

So, we'll move along this edge and I will arrive to the state a.

At this point we have seen a.

The next sequence is a b. Oh,

looks like there's a chance our trigger pattern maybe matching,

because we've seen not only a but also to b.

The next sequence is a c. C is the next sequence in our pattern.

So, we have not only seen a and b,

but also have seen c. So,

three of the seven event in our sequence have been matched so far.

The next sequence is an a. Oh,

looks like our pattern is indeed evolving,

because from c the next symbol is an a.

So, I arrive to a position.

So, we have seen a, b, c, a.

So, we have seen four of our seven symbol sequence.

The next sequence is a b. What does this mean?

It means that our pattern is not matching yet,

because if our partner was matching,

the next sequence should have been d. But the next sequence in our data is a b.

So, that means that we need to go back.

How how far do we need to get back?

Well, one possible is that we can go all the way back to the start.

However, do we need to go that much back in history?

It turns out that we don't.

Why? Well, it is true that we have seen a b,

but if you look at here,

at this point we have not just seen b,

but we have really seen an a, b.

An a, b is the first two sequences on our target pattern.

Which means that from this position,

instead of jumping all the way to the start,

we need to jump only to this position.

This should've been a b. I apologize.

This should also have been b. I apologize for that.

So, from a I jump back to the state that I have seen ab.

This essentially saved us having to revisit this a and this b again.

So, this is actually where the savings of the Knuth-Morris-Pratt data actual comes.

Let's see what happens next.

We have seen a c, c took us to the states c again.

Dummy c and a, that takes us to the state a,

then we see a d. Well,

that takes us to the next state d. We see an a.

That takes us to the state where we have seen a,

and at that point we see a b and b takes us to our trigger state.

Essentially what happened is that using our finite set automata,

we have triggered our target pattern at this position.

The saving that happened in this data structures for the sequence is here.

So, instead of jumping to the very beginning is a failure.

By jumping to this point,

I essentially saved lookups,

two lookups at this point position.

By doing this two look-up savings,

essentially what I can guarantee is that

the search can be done in a linear time of the data structure.

What about if we are not given a single pattern but we are given multiple patterns.

Can we also have savings across multiple patterns?

Now we know that basically using KMP data structure,

we can leverage redundancies within a single query pattern.

But what about if you are them multiple patterns,

can we also leverage redundancies across multiple patterns to reduce the search time?

It turns out that we can actually do that and

the Aho-Corasick Trie is a data structure that actually enables us to achieve this.

So let's see how the Aho-Corasick Trie works.

In this case, let's say that we are given a query pattern ABCDAB,

and the query pattern CADD,

and let's assume that we have

the data sequence DCABCADAB.

Lets assume that this is our data sequence.

Lets see how our Aho-Corasick Trie would work.

Well, we start with the symbol D. So we are at the start position.

D is neither A nor C,

so we'll state in our start pole position.

The next symbol C. Well,

it looks like we might be triggering the second pattern.

We don't know yet, but there is a chance we might be triggering the second pattern,

so we go to the C position,

the state where we have seen C so far.

The next symbol is an A.

That takes us to the A state.

So we have seen CA of the second pattern.

We might be triggering the second trigger pattern.

Lets see. The next symbol is B.

Note that the next symbol for

the second pattern should have been D. So there's a failure.

So, if I were implementing it naively,

using naive brute force search at this point,

I would jump back all the way to the start.

However, note that, the sequence AB that

we have actually corresponds to the beginning of the first trigger pattern.

Which means that at this point,

instead of jumping going back all the way to the start,

I can actually jump using this fall on BH to the state where we have seen AB.

Great. So, I have short cutted to the AB state of my first pattern.

The next one is a C. I have seen ABC.

The next symbol is A, I have seen ABCA.

The next image is a D, I have seen ABCAD.

The next symbol is an A,

ABCDA, and the final symbol is a B, ABCADA and B.

So what happened is that I managed to trigger at this position the first pattern.

I matched the first pattern,

and I matched the first pattern without having to go

back to start and finding A and B from scratch.

Essentially, what I have done is that I have leveraged the redundancy of

the sequence AB between the second search pattern and the first search pattern.

This jumpier essentially enables me to find

AB and match it to this AB leverage the redundancy.

Not that there can't be other redundancies as well.

For example, if I had come ABCAD,

let's assume that my sequence was ABCAD,

so I have all of this but ABCAD,

and let's assume that at this point I see another D. Now that it's

clear at this point I cannot go this way because

the next symbol is not in A it's a D. However,

what I can see is that here in my sequence I actually have CADD

which is my trigger sequence pattern for my second pattern.

In fact this means that from this position in my first sequence I can jump

directly to the trigger state of my second pattern.

So, at this point,

instead of redundantly going and revisiting CADD patterns,

I can directly jump and trigger my second pattern.

So this is the idea of the Aho-Corasick Trie,

and given any sequence of patterns,

you can essentially create a combined finite automata,

where the redundancies between the different patterns within single pattern

and across patterns can be leveraged to efficiently identify trigger conditions.

Note that the beauty of these two data structures is that,

I never have to go back in my sequence.

I always consume a new sequence,

and which every consumed new sequence,

I always go forward on my lookup,

I always follow an edge and I always follow the next sequence.

That means that basically I will never need to revisit a sequence that

essentially means that the cost of matching

independent of how many pattern I have in my database,

the cost of my matching is O(N) length N is the data sequence length.

It means that for every symbol in my data sequence,

I do only one lookup, which is great,

which is a very efficient data structure which essentially means that

using this approach in general I can efficiently identify trigger conditions,

and I can bring to the attention of the decision makers.

Explore our Catalog

Join for free and get personalized recommendations, updates and offers.