Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

Our first task is prefix search.

We are given a pattern,

in this example tab.

We are given the pattern tab.

We are also given a set of database strings,

and what we are interested in finding

those strings that start with the sequence tab efficiently and effectively.

So, what is the [inaudible] approach for this?

Well, the [inaudible] approach is very simple.

So, take your pattern,

let's call this pattern Q,

and you can basically simply start, okay,

opening each series one at a time in the database, and starting comparing.

So, you start with tab,

we compare the beginning of the sequence here, tab is found.

Great. We have a match.

Then, we open the next series.

The next series let's assume that starts with that cat.

Note that basically, as we look at it we see that basically cat is not matching tab,

so we eliminate this,

and so on and so on.

So, this is a very simple operation to implement.

Of course, the problem is that basically,

if this event sequence is long,

and if we have a lot of strength in our database,

this is going to be very inefficient,

and this is essentially means that it will be very difficult to use

this type of search in a truly exploratory excess.

We want the support exploration,

we want these to be implemented in real time,

and this type of implementation obvious is not going to be real time,

and, obvious we cannot use it for exploration.

So, what we need is fast and efficient implementation of prefix search.

The question, how do we support that?

Well, so the tool that is available for implementing prefix search efficiently,

is called the trie data structure.

So, the trie data structure essentially is a tree.

Here, we are seeing an example, it is a tree,

and the tree has nodes and edges,

like any tree data structure.

In this case, the nodes of the tree,

consists of the alphabet,

or dictionary of events we are interested in.

For example, in this case,

our dictionary of events are labeled from A to Z.

Essentially, basically as you can see here,

each and every node in

my trie data structure essentially has its decision parameters R events.

So, how is the data organized in the trie data structure? Let's see an example.

So, let's assume that in our database we have the sequence car,

and I want to basically support prefix search efficiently for this sequence.

When I index this in my trie data structure,

effectively, what I need to do is,

I need to basically put it in a place in my trie data structure,

such that the string can be

found by following the event sequence.

In this case, for example,

if I start at event C for all of the pointers then,

follow the pointer A, I will find car.

That's great. What about cathy?

This is another sequence that I have in my database.

As you can see here,

cathy can be formed by starting from event C,

following the corresponding pointers,

arriving to symbol or event A,

and following the corresponding pointers,

and then following event or symbol T,

and cathy is located at this point.

So, this is essentially a data structure where if I am trying to find cathy,

I can basically decide the edges that I should be following very efficiently.

So, how do we organize or how to organize the data here?

Note that, why did I basically- So when I'm putting car and so for cathy,

I had to basically create,

I had to follow a path of two edges,

followed by another look up.

Whereas for car, I had to follow only one pointer followed by a database look up.

Now, so what is the difference?

What exactly is the difference between cathy and car in this case?

So, this is really easy and basically I think it is quite easy to see.

So, in our database,

we have car, cathy,

and in this example cattle.

Note that, here in these,

if you take a look at the these three strings,

you'll see that car,

cathy, and cattle share C and A.

So, they basically all have the same prefix.

On the other hand, car has an R as the next symbol,

but as cathy and cattle,

share yet another symbol T. So, what does this mean?

This means that if I am basically trying to find,

in this case car,

cathy or cattle, all of them will have the same prefix CA,

which means that basically, they all need to follow the symbols C followed by a symbol A.

On the other hand, if I need to differentiate cathy and cattle,

they share the prefix CAT and the next symbol H and L,

is a differentiating symbol.

For car, R is a differentiating symbol,

for cathy and cattle,

H and L are the differentiating symbols.

This means that, if I am creating a data structure for this,

where the prefix are shared,

I can put car at the point where the prefix is CA,

and R, is the differentiating symbol.

Whereas, when I am basically replacing cathy and cattle,

I place them at cat, the prefix cat,

with L and T. I

apologize cat and H being the differentiating symbol.

So, this way I'm actually able to basically search,

when I search for car,

I can simply find car by just simply

two lookups and two node visits into my trie data structure.

When I'm searching for cathy,

I need to visit C A T and I can find cathy.

If my search is cattle,

I actually have to do a little bit more work.

My data structure has one extra node in order for me to locate cattle in my database.

Why is that the case? Well, the reason for that is that,

in my database, I have another string, catapult.

Let's basically write these down.

For cattle and catapult.

As you can see here,

cattle and catapult have the same three symbols as their prefix shared.

The differentiating symbol for cattle is an L,

and the differentiating symbol for catapult is an A.

Which means that, if I am searching for cattle or if I am searching for catapult,

I need to fall off first to C then A then T, CAT.

That basically becomes my prefix that's shared by

both of these symbols and the differentiating symbol,

L and A is going to be located in the following node in my database.

As you can see here,

catapult is accessible to C A T L. In contrast,

cattle is accessible through following node C A T and L, catapult and cattle.

So, what about the three strings that we had in our example.

Remember, in our example,

we were basically looking for the sequence stop,

and in our database we had had table, tabular, and tablet.

How would these be accessible?

Well, let's basically try to figure it out together.

So, in our database,

we have table, tabular, and tablet.

Let's write these down.

Table, tabular, and tablet.

Note that, all three of these have T as the starting symbol.

So which means that, basically when we are searching for all these three symbols,

we should be following the pointer T. Next will be A,

and they all share the symbol A.

Which means that basically,

the next symbol here that they share will be A.

So basically, I will be following pointer T and then I'll be following the pointer A.

Finally, they all share this symbol B as the prefix.

Which means that, basically for all three of them,

I will be following the pointers associated with the symbol B, T A B.

Then I see that basically,

they have from that point on different sequences,

which essentially means that basically the table, tabular,

and tablets will all be accessible under this part of the trie data structure.

Note that essentially, the target is searcher gives me

an efficient way to find all sequences in my database that have a given prefix.

If my prefix is T A B,

I start from the root, I look at T,

then I look at the next symbol in my sequence A,

then I look at the next sequence in my symbol B,

T A B and I know that I will find

all sequences that have this prefix under this part of the data structure.

So, this is called a trie data structure.

As you can see here,

the course of the search,

the number of nodes that I have to visit to be able to

locate the strings that I'm interested in,

essentially is the number of symbols I have in my query pattern.

In this case, in my query pattern I have three symbols,

and in order to be able to locate all the strings that starts with that pattern,

I had to follow three pointers.

So, this is the trie data structure,

and this is basically what they usually use to support a prefix search efficiently.

When I am using the trie data structure,

given a pattern, for example T A B,

I can efficiently prune the rest of the database so I

would not need to consider those strings when I am searching.

So, that is the advantage of the trie data structure.

Explore our Catalog

Join for free and get personalized recommendations, updates and offers.