0:03

Today we're going to talk about tries, which is a data structure for searching

Â with string keys that is remarkable effective in enabling us to deal with huge

Â amounts of data that we have to process nowadays.

Â Just as a motivation, when we left off talking about searching algorithms, or the

Â symbol table implementations here's where we were.

Â The best algorithms that we examined were balanced red, black, binary search trees,

Â in hash tables. And With respect to the basic search,

Â insert and delete operations. Red black BSTs we're able to guarantee

Â that we can get those done in logorithmic time, actually, pretty efficiently.

Â And for hashing, we'd get'em done in constant time under certain assumptions.

Â That we could have a uniform, hash function.

Â And also, for hashing, what we need to compute a hash, code for binary search

Â trees, we use, a compare to function. And another difference is, binary search

Â trees, we can support more operations. Than we can support with hashing ordered

Â operations. For example, getting the rank of a key in

Â the tree, and other things. And those are terrific algorithms.

Â And we looked at plenty of go, good applications of those algorithms.

Â But still there's the question could we do better?

Â And the answer is that yes, we can do better.

Â As with string sorting, we can avoid examining the entire key.

Â Which is going to give us algorithms that can compete and even beat these classic

Â algorithms. So to get started we are going to

Â articulate in API for symbol tables that specialize for the case when the keys are

Â strings. So simply, we just add the modifier string

Â to our standard symbol table EPI. And we can take a generic value just put

Â the keys or strings. And then we take all the methods that we

Â articulated before. And for key type, instead of generic, we

Â use string. And this is going to allow us to develop

Â efficient algorithms that take advantage of properties of strings.

Â Our goal is to get, some algorithms that are even faster than hashing.

Â And maybe, more flexible than BSTs. And we're going to be able to achieve both

Â those goals. So, this again is the summary of the

Â running times for when, in the case that the keys are strings for red black BSTs

Â and for hashing. And let's take a look at those so if L is

Â the length of the string. And n is the number of strings and then

Â also the rate x is r is going to be a factor later on but its not going to be

Â for these two. Then for hashing for every operation we

Â have to compute a hash function, which basically means looking at every character

Â in the string. Actually in Java string hash functions are

Â cached, so sometimes there's efficiencies cuz you only have to comp, and they're

Â mutable, so you only have to compute the hash function once.

Â And then this is roughly the amount of space that takes into account the table

Â and the string size for red black BSTs, the analysis is a bit more complicated.

Â For a search hit, you have to look at the entire string.

Â Given a entire key to check that every character's the same and then the

Â comparison's depending on the nature of the keys for a typical case though its

Â something like log squared N. we had log N comparison's but usually you have to look

Â at something like a log N characters in the key in order to get down the tree.

Â For a search miss you might be able to find out before you get to the end of the

Â tree and sing a for an insert so the running times are something like this.

Â Then here's some experimental evidence that, that we, we can use to test out

Â these analytic hypothesis. And we'll just look at two examples one is

Â the entire text of Melville's Moby Dick which we've used before and that's about a

Â million characters. And maybe 200,000 strings.

Â And add to this 200,000 strings, about 32,000 are distinct.

Â And then this, There's another file of actors from the

Â Internet movie database that's much bigger maybe tourism magnitude bigger.

Â And it's got maybe about a million different words.

Â So we'll want our algorithms to do better than, or at least as well as red black

Â BSTs in hashing on these data sets. I think, you know, our test is to dedupe

Â these data sets. So that's our challenge.

Â Efficient performance for string keys and try to come up with algorithms that can

Â compete with the best classic algorithms. Now in order to do that we're going to

Â look at R way tries. That's a particular data structure, that

Â was, invented actually really quite a while ago.

Â This data structure, dates back, to the 60s.

Â And, the first thing to know about it is that, the name is based a little bit, on a

Â pun. It actually is the middle letters of the

Â word retrieval, but we don't pronounce it tree because then we couldn't distinguish

Â it from binary search trees, so we pronounce it try.

Â And this is a confusing fact of life that we've been living with for many decades

Â now with this data structure. And let's look at what a try is and we'll

Â look at some, examples before we get to the code.

Â So, for now, we're going to think of a try as some nodes.

Â A tree structure with nodes where we store characters in the nodes,

Â Not keys. And the other thing is that each node has

Â R children, where R is the rate, it's the number of possible characters.

Â And there's g-, the possibility of a child for each possible character.

Â Now in the standard implementation, this is going to involve on all links.

Â We have to have a placeholder for each possible character.

Â So there's a lot of null links and tries. And we'll get back to that in a minute.

Â But to get the concept, we'll use this graphical representation, where we have

Â characters and nodes and we don't show any null length.

Â And the other thing is, remember a symbol table is associating a key with a value,

Â so this try is built for this set of key value pairs.

Â And what we do is we. Store values in nodes that correspond to

Â the last characters in keys. And we'll look, that's what these numbers

Â here are. And we'll look at a little more in detail,

Â on how this try is built in just a second. So that's the basic idea.

Â We're going to have string keys. We're going to associate values, and we're

Â going to use this tree-like, tree data structure but we're not going to draw the

Â imply null links for now. So, for example, in this trie so the root

Â re-, represents all strings. And, coming out of the root is one length

Â for each possible letter. In particular in this try, the middle link

Â is the link to the, to a sub try that contains all the keys that start with S.

Â Then we go further down. Each time we go by a node, we pick off a

Â letter so that this length, for example, goes to the try for all keys that start

Â with s followed by h followed by e. And so forth.

Â So now all, out of all the keys that start with SHE, actually one of them, the one

Â that just has the letters in then ends, has a value associated with it.

Â So when the node corresponding to the last letter, the E, we put the value zero.

Â And so that's, how the try is going to look.

Â So just given that definition, then let's look at, how to do search, in a try.

Â All we do is use the characters and the key to guide our search down the data

Â structure. So, let's say we're looking for the key

Â shells. So we start at the root, and we go down

Â the S link, since we started with an S. Now our second letter is H, so we look to

Â see if there's a non-null link that, corresponding to H, and in this case there

Â is. Now our next letter is E, so we look for

Â an E. Now.

Â No need to examine the value here because well we haven't looked at all the letters

Â in our key yet. So now we look for L now we look for

Â another L and then finally we find the S and when we find the S we look to see if

Â there's a value there. In this case there is a value and so

Â that's what we return the value associated with the last key character.

Â So that's a typical search hit in a tri. Another way is...

Â So, that node was down at the bottom but if you had.

Â We're doing a search for SHE, you follow the same path.

Â But now when you get to the E, that's the last character of that key and so we just

Â know how to return that, value associated with that node.

Â That is, the search doesn't have to go all the way to the bottom.

Â It might terminate at an intermediate node.

Â And we just return the value associated with the node corresponding to the last

Â character in the key. What about a search miss?

Â Well, there's two cases. It's for a search miss.

Â So one of them is we've followed down the tree picking off the letters in the key

Â one letter at a time as, as before. And then when we get to the end of the key

Â we take a look to see if there is a value. In this case there's no value associated

Â with the last key character that mean there's no key in the data structure

Â that's been associated with a value. So we just return no.

Â Or, the other thing that can happen is, we can, reach a null link.

Â So, our key now is S, starts with S, and then H, and then E, and then L, and now

Â our next letter is T, so we look to see if there's a non null link coming from this

Â node associated with T. And in this case, there's no such link, so

Â we return null. That's evidence that, that string, there's

Â no key associated with that string in our data structure.

Â So that's a search miss. Iii.

Â Now, what about insertion? Well insertion is also pretty simple.

Â So we follow two, two rules. One is again, we go down links

Â corresponding each character in the key. If we come out on a link, we create a new

Â node that's associated with the character that we're on.

Â And just keep doing that until we get to the last character of the key, in which

Â case, we put the value. So if in this try, we're supposed to

Â associate the value seven with [inaudible].

Â We follow, our path from the root to s, 'cause our first letter is s, and then

Â h.'Cause our next letter is h. And then a O, we had an, [inaudible] kinda

Â ran into a null link. So we have to put, a node there, with the

Â character, associated with the character O.

Â So in later searches for this key, we'll be able to find it by passing through that

Â node. And now we move to the next letter, and

Â there's no node again. In the next letter, there's still no node.

Â Iii. When we get to the end, then that's the

Â last character in the key, and we put our value seven.

Â And now we've modified the data structure, adding the nodes that are necessary for a

Â later search to go through, character by character and find the value associated

Â with that key. So that's an insertion into a try.

Â Just to cement all these ideas let's do a demo of how that tree was constructed from

Â scratch. So we have a null try.

Â So we're going to start by putting the, associating the value zero with the string

Â SHE. So we start at the root node, which, and

Â I'll try, it just has that one node. Create a new node for S, create a new node

Â for H, create a new node for E. Associate zero.

Â So the key is the sequence of characters from the root to the value, and the value

Â is in the node corresponding to the last character.

Â This try represents the symbol table that contains just on string SHE, and

Â associated with zero. Every other string, if you search for any

Â other string in this try you'll either end in a node that doesn't have a value or

Â you'll end at a null link. All right, so now, let's say we put in the

Â key S, E, L, L, S and you can kinda tell where it's going, we made room for it in

Â the try. So we go for S, and now the second letter,

Â E, corresponds to a null link, so we create a new node.

Â And similarly we go through and create new nodes for each of the remaining characters

Â in the key and then associate the value one at the end.

Â So now we have a try that has two keys in it, SHE and SELLS.

Â Now our next is to insert SEA. So now we have S, we already have a node

Â corresponding to E. And so now we just have to create one new

Â node, the one corresponding to A. And put our value two there.

Â And now, we're going to put SHELLS in. So we already have the S, already have the

Â H, already have the E, so now we have to add nodes for the last three letters in

Â that string. And associate the value three with it.

Â Now we're going to put finally a key that doesn't start with S.

Â So that means we create a new node corresponding to the first letter of that

Â string, I, N to the other letter two. And then associate the value four.

Â And here's another one that doesn't start with s.

Â So we create new nodes corresponding to each one of its characters, and associate

Â the value five with the last one. Now here's SEA.

Â So this is, remember, our symbol table API is associative, which means that if we get

Â a new value for a key that's already in the table, we overwrite.

Â The old value with the new value. That's the way, the convention that we've

Â chosen for symbol tables. So that is easily done with tries as well.

Â Now here's one more key, S, H. And now we have to add a new node because

Â there's no node that's a child of H that corresponds letter O.

Â So we have to create new nodes for O, R, and E, and associate the value seven with

Â it. So that's a tri that has precisely eight

Â keys. If you look for any one of those eight

Â keys you'll get the value. If you look for any other string you'll

Â either come to a node that has a null value or go out on a null link and so the

Â search would be unsuccessful. That's is a demo of tri construction.

Â 16:20

Now that you have a basic idea of how, what tris are and how they work, let's

Â take a look at what's needed to implement them in Java.

Â The key idea in the tri representation for implementation in Java or any computer

Â language is that. Instead of representing a node as a character in the node, what we

Â do is represent the links as an indexed array with one entry for each possible

Â character. And the idea is the characters are

Â actually. Implicitly defined by link indeces.

Â So just for example, we have this small try that we built over here.

Â In this case, the root has only one node below it.

Â That's for all the keys that start with S. The way that's represented in real

Â representation, and this is for efficiency.

Â And it's the way that tries get their efficiency.

Â Is we have one link corresponding to each possible letter.

Â And the only one that is un-null is S. So the character S is defined implicitly

Â by the fact that, that link that [inaudible] that, that corresponds to S is

Â not null. Over here there's from e to a there's two

Â links coming out of E. And the only way that we represent A is by

Â having the first link be not null. If the link corresponding to a letter is

Â not null, then it corresponds to having that letter in the node that it points to.

Â So in this case we have E connected to A and L.

Â So we have the entries corresponding to A and L filled with non null values.

Â So you can see immediately the correspondence between this tri the way

Â we've been drawing it and the job representation of it.

Â Where each node contains 200, or r links if there's r different letters.

Â And letters are implicitly represented by non [inaudible] links.

Â Now, you just go in the node. So for example, I, in this try, S, E, A,

Â but which is easy to follow down through the try.

Â We're looking for an S. And S is the twentieth letter in the

Â alphabet. Or so.

Â We index into the twentieth position and that takes us right to S.

Â And E's the fifth letter, we take the fifth link.

Â Then A's the first letter, we take the first link.

Â So we can use the character as index into the array of links.

Â To quickly travel down the tree. Then when we get to the last character, we

Â can check the value at that node that's stored in the node.

Â One slight complication in the implementation that we encountered before

Â with hashing algorithms and other things. We're going to need arrays of nodes.

Â That's what our lengths are. So we can't have any generics with nodes,

Â even though I would like to. So we have to declare the value to be a

Â type object, and then we'll have to cast it back to whatever type it should be,

Â when we return it. And we'll see that in code.

Â Other than that little complication. This is quite straightforward

Â representation of tries. And it leads to a very easy

Â implementation. So the keys and the characters are not

Â explicitly stored. They're, they're implicitly, because of

Â the links. And that's a, a very important point to

Â think about when it comes to, implementing using tries.

Â 20:22

Given that representation, this code for implementing try, simple table in Java

Â almost writes itself. So it's a [inaudible] implementation in

Â this slide has the implementation of put using the same recursive scheme that was

Â used many other times in building trees the what are the instance variables?

Â Well we declare artery 256 as usual in our string implementations where working with

Â extended asking and then we have one instance variable that matters and that is

Â the root of the [inaudible]. So tries begin with, start, all tries

Â start with a null node. So first thing we do is create a new node

Â and put a reference to that node in root. So our empty trie consists of a node

Â that's got a. Remember, when you create a new node we

Â Build an array of R links to nodes. And at the beginning those'll all be empty

Â links. Or all null links.

Â So the root points to a node that has 256 null lengths.

Â Now to put. Or to associate a key with the value in a

Â trie. We use this instance method that calls

Â recursive method. Again, the same way that we've done for

Â other tree construction schemes before. So we take the recursive method, takes a

Â node as argument and returns a node. So it takes a reference to a trie and

Â returns to, a reference to the trie that it constructs.

Â After a associating the key with a value. Just to get started suppose that our first

Â key has just one character. So in that case, we would call.

Â Another recursive put for the root, with our one character.

Â And so our one character key. And call the recursive routine.

Â Now our node is not null. It's the root node.

Â So and our key is length one. And we, called it starting at zero.

Â So the first thing that we do is pick out the [inaudible] character of our, Yeah key

Â so zero of character which is our one character and I guess its an index

Â whatever letter might happen to B. Say its B then C would be two that's sort

Â of thing and then all we do is recursively are follow that link in ser.

Â Our, associate our key value in the try pointed to by that link.

Â And then take the link that we get and put it back into that link out of the root

Â node. So the next call there's nothing in the

Â word dot next it's a, it's null so the next call we get null so we create a new

Â node. And then that new node this point we've

Â called with D plus one moving to the next character.

Â So now we have D equals one which is equal to our key dot length.

Â So we associate the value in that node with our node and we return it.

Â So that return one level up will set the length of the new node in the appropriate

Â entry corresponding to the root. Again once you've learned our normal

Â recursive way of structuring building tree structures this code is very natural.

Â So for a longer key I'm again thinking recursively we find the if we're suppose

Â to insert the key associate the key with a value, and we're working on the

Â [inaudible] character. We pick out the, The character at the deep

Â position. We use that to index and to be in link

Â array of the current node and then that's the link that we set when we do a put of

Â the new node. So when we start with a longer string and

Â a perfectly empty tree we create new nodes all the way down and then put their links

Â to their parents all the way up. In this recursive structure.

Â And it's a very simple and natural recursive code.

Â So that's the put. Now that takes a little study but not so

Â much once you're use to our standard recursive way of producing tree

Â structures. And get is even simpler.

Â So contains and gets. So our scenario's standard convention is

Â to return null if we have a key that's not there or that contains and the get

Â function is a again recursive. Procedure that will return in reference to

Â the node, and if that's null, we return null.

Â Otherwise, we can get the value out of the node return by the recursive procedure.

Â And then remember we had the omega value in nodes of type objects.

Â So we have to cast that back to the type value.

Â And the recursive get is very simple. To find the node that contains the value

Â associated with a given key. And we're working on the deep character.

Â And we're currently on, node x. We just, return null.

Â Effects happens to be null. That means it's not there.

Â If we're, at the last character in the key, we return our current node.

Â Otherwise, we get that [inaudible] character.

Â And we use that to index into the next array of the current, node.

Â And recursively called the get routine, for, that node moving, one down the tree.

Â And incrementing our key pointer by one. Extremely simple, recursive code to do the

Â tri implementation. So what about the performance?

Â Well in a cert chip we simply go down the tried examining all L characters just

Â using each one to index into an x-array at some note.

Â And then go down to the end to, to see if there's a value there.

Â Search miss we might have to go all the way down to the last character.

Â But we could also just have a miss match on the first character and find a null

Â link right away. And actually.

Â Typically, in a try. In typical applications, we only examine a

Â few characters. So right away, you can see, by thinking

Â about a search miss, that try algorithms can be sublinear.

Â We can find out that a key's not in the tree, by only examining a few characters.

Â If we don't have any. Strings in our try that begin with the

Â same few characters as our search key, then, it's not there.

Â That's a typical case. Now the downside of tri performance in

Â lots of applications is the amount of space used.

Â There is the problem that every node's got R lengths in it.

Â And particularly, down at the bottom, the leaf nodes that correspond to the last

Â characters and keys and, the prefix in the key in the try that's going to be null

Â lengths. Now it is possible in a huge try with lots

Â of strings that are short and sharing common prefixes to actually much less

Â space then this but be ardenal links at each leaf is a real problem in some

Â applications so we're going to take a look on how to deal with that.

Â So the bottom line is for symbol tables with relatively small numbers of string

Â keys where the amount of space used by the [UNKNOWN] is not a problem.

Â Then we get very fast, search hit. And even faster search miss but we burn up

Â some space. That's the bottom line about try

Â performance. And, just a typical application.

Â Maybe you get a job interview question about, what data structure you use for

Â spell checking And then, so, and this is just an example, to show, how effective

Â tries can be for such an application. Where all the words are three letters.

Â And the solution is, build a 26 way try. So spell checking, usually, there'll be

Â pre-processing to strip out everything but the lowercase letters that make up the

Â word. So you can build a 26 way try that, will

Â immediately tell you whether you have a misspelled word or not.

Â Another interesting thing about prize is that, deletion is, very easy, what do we

Â do to delete a key value pair, well you find, the node corresponding to key and,

Â