0:14

So again, this is the basis for understanding the performance of lots of large-scale and

important applications that are part of our infrastructure right now.

One of the big questions is, how much space does a trie take?

Well, it's the total number of external nodes.

And then, if you subtract off the number of strings you're representing,

you get the number of extra nodes.

That number of void nodes is really the extra space that you're using.

What about the expected search cost?

Well, like with BSTs, that's the external path length.

That's the total distance from the root to all the external nodes; divide by N and you get the average distance.

So the average external path length of this trie is 3.92.

That's the search cost.

Now you notice that even if half your nodes are void,

that's only going to add half to this extra cost.

So that's part of the reason that tries are effective.

You can have plenty of void nodes.

It's not going to affect the search costs all that much.

And then leader election, that's the length of the rightmost path,

as I talked about before.

It turns out we can analyze all of these.

I'm going to talk about external path length, and the others will be exercises.

And again, the easiest model, the usual model,

is to think of the trie as built from inserting N infinite random bitstrings.

Eventually they're going to differ.

And where they differ, that's where you're going to get a non-void node.

That might represent an infinite tail, but we don't care.

The bitstrings distinguish themselves eventually,

and the trie represents them

by distinguishing them on their leading bits.

And from an analysis standpoint, it makes for a reasonable model.

2:17

Because we can assume at every node that we randomly go to the left or

to the right.

And this model fits real data perfectly well when it's been studied.
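As a quick illustration of this model (a sketch of mine, not from the lecture): since every bit of every string is random, the split at each node is Binomial(n, 1/2), so we can simulate the external path length of a random trie without storing any actual strings.

```python
import random

def external_path_length(n):
    # Simulate the trie built on n infinite random bitstrings:
    # at each node the strings split Binomial(n, 1/2) between the
    # left (leading 0) and right (leading 1) subtries.
    if n <= 1:
        return 0
    left = sum(random.random() < 0.5 for _ in range(n))
    return n + external_path_length(left) + external_path_length(n - left)

random.seed(1)
n, trials = 1000, 20
avg = sum(external_path_length(n) for _ in range(trials)) / trials
print(avg / n)  # roughly lg(1000) + 1.33, i.e., about 11.3
```

The per-node cost printed at the end previews the asymptotic result derived later in the lecture.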

The starting point, if you think back to the analysis that we did for

binary search trees, is very similar.

It's just that we have a trie of size N.

There's k on the left and N- k on the right for some value of k.

The difference is, well, there are two differences.

One is the probability that there are k nodes on the left,

3:00

which for binary search trees is just 1/N.

And the other difference is that you might have no nodes on the left and

N nodes on the right.

That's just a technical thing.

That definitely comes directly from the definition of tries, but

it makes the recurrence just a tiny bit trickier.

3:21

So that's a recurrence down at the bottom.

When you have N random strings,

what's the probability that k of them start with a zero bit?

It's just 1 / 2 to the N times N choose k.

Then for the external path length of a trie, you look at the root.

That adds N to the external path length, because there are N external nodes down below it.

And then there's the probability that there are k on the left,

which is N choose k / 2 to the N.

And then you add the external path lengths C_k and C_(N-k),

with the appropriate starting conditions. In symbols, C_N = N + 2^(-N) sum over k of (N choose k)(C_k + C_(N-k)) for N >= 2, with C_0 = C_1 = 0.

And again, it's just a little tricky to realize that there's a C_N on

the right-hand side when k = 0 (and when k = N).

So for BSTs, the probability that there were k on the left is 1/N.

We looked at random Catalan trees,

where that probability is a ratio of Catalan numbers.

And we had the Catalan distribution.

And for tries, it's a binomial distribution.

So that's the recurrence, and

we can use that to calculate small values or the distribution.

But that's the one that we want to solve.
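To calculate small values, we can use the recurrence directly; this is a sketch (my code, not the lecture's): since the k = 0 and k = N terms put C_N on the right-hand side, we move them over and solve for C_N.

```python
from math import comb

def trie_epl(n_max):
    # C_0 = C_1 = 0; for n >= 2 the recurrence
    #   C_n = n + 2^(1-n) * sum_k comb(n, k) * C_k
    # has C_n on both sides (the k = n term), so solve for C_n.
    C = [0.0, 0.0]
    for n in range(2, n_max + 1):
        s = sum(comb(n, k) * C[k] for k in range(n))  # k < n terms only
        C.append((n + s / 2 ** (n - 1)) / (1 - 2 ** (1 - n)))
    return C

print(trie_epl(5))  # [0.0, 0.0, 4.0, 8.0, ...]
```

For example, two random strings differ first at bit i with probability 2^(-i), giving external path length 2i, and the expectation sums to 4, matching C_2 above.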

And again, we can compare to the distributions that we looked at for

random Catalan trees and for binary search trees.

For binary search trees, the distribution is totally flat:

it's equally likely for any node to be at the root.

For Catalan trees, remember, we found that they were very likely

to have only a small number of nodes on one side or the other.

And so the distribution was very skewed, favoring very skinny trees.

For AVL trees, we're trying to make them more balanced.

5:15

And the distribution shows this

amazing pattern that we don't yet have characterized analytically.

And for tries, the probability that the root split is k is binomial.

And that's really the characteristic that we're looking for,

the one that tells us these trees are going to be pretty well balanced.

Because k = N/2 is the most likely split,

and within about square root of N of N/2 is where the divisions are mostly

going to fall, which means that the trees are going to be pretty well balanced.
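A quick numerical sketch of that concentration claim (my example, with assumed N = 1000): almost all of the binomial probability mass sits within a few multiples of sqrt(N) of N/2.

```python
from math import comb, isqrt

N = 1000
# P(k of the N random strings go left) = comb(N, k) / 2^N
p = [comb(N, k) / 2 ** N for k in range(N + 1)]
window = 3 * isqrt(N)  # a few multiples of sqrt(N) around N/2
mass = sum(p[N // 2 - window : N // 2 + window + 1])
print(mass)  # nearly all of the probability
```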

But let's look at that analytically.

Now, I'm not going to cover every step in great detail,

but I think you'll see how we go from one step to the next.

And you can check details offline.

So this is the basic recurrence.

We're going to use an exponential generating function for

this, and the reason is the N choose k.

If you divide this equation by N factorial,

then you get C_N over N factorial on the left.

Then multiply by z to the N and sum, and you get C(z),

the exponential generating function.

And then the N choose k part of it is a convolution

6:41

of C_k over k factorial with 1 over (N - k) factorial.

And without very much calculation at all, you can verify this formula for C(z):

divide by N factorial, multiply by z to the N,

and sum on both sides.

It pretty quickly comes down to this formula.

And you can also actually get this directly through the symbolic method

with an appropriate operation, having to do with the way the tree is defined.

So very simple formula on the generating function.

It's e to the z over 2 times C of z over 2, a convolution.

That's the generating function equation.

It's not the simplest kind of generating

function equation, one that we'd be able to solve

explicitly, because we have that z over 2 in the argument.

But one way to deal with it is to just iterate.

So we apply the same equation to C of z over 2.

Just plug in the same equation.

Then the z e to the z term produces z e to the z minus z e to the z over 2,

and we're left with 4 e to the 3z over 4 times C of z over 4.

We iterate, and the argument gets smaller,

and smaller, and we simplify.

So just collecting terms, but we still have an equation that we can iterate.

And in another step, you can see the pattern:

what we're going to wind up with is z e to the z in every case,

minus terms with the exponents z/2, 3z/4, 7z/8, and so forth.

So we get a sum of terms of the form z times (e to the z minus e to the z/2),

then with 3z/4, 7z/8, and so on.

Now, you do have to prove that the remainder term really goes away as you iterate to infinity.

But anyway, that's the iteration that we get.

We get an explicit formula for the generating function,

which gives us the average external path length in a trie.
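As a sketch of that iteration (my reconstruction, consistent with the z/2, 3z/4, 7z/8 exponents mentioned in the lecture), starting from the functional equation:

```latex
\begin{align*}
C(z) &= z\,(e^{z}-1) + 2\,e^{z/2}\,C(z/2) \\
     &= z\,(e^{z}-1) + z\,\bigl(e^{z}-e^{z/2}\bigr) + 4\,e^{3z/4}\,C(z/4) \\
     &= \cdots \;=\; \sum_{j\ge 0} z\,\bigl(e^{z}-e^{(1-2^{-j})z}\bigr).
\end{align*}
```

Each substitution halves the argument of C and leaves behind one more term of the sum.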

9:04

Okay, so we're looking for the coefficient of z to the N in that,

well, N factorial times the coefficient of z to the N.

Go ahead and expand e to the z,

multiply by N factorial, and take the coefficient of z to the N.

Then we have this difference, 1 - (1 - 1/2 to the j) to the (N - 1), summed over j and multiplied by N.

Again that's a fairly simple formula.

Explicit formula for the average external path length in a trie.

It still does have the infinite sum, and

we're going to have to work with it a little bit.
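We can check the explicit formula against the recurrence numerically; this is a sketch of mine, with the recurrence solved for C_N as before and the infinite sum truncated at an assumed 200 terms.

```python
from math import comb

def trie_epl(n_max):
    # exact values from the recurrence; C_n appears on both sides
    # (the k = n term), so solve for it
    C = [0.0, 0.0]
    for n in range(2, n_max + 1):
        s = sum(comb(n, k) * C[k] for k in range(n))
        C.append((n + s / 2 ** (n - 1)) / (1 - 2 ** (1 - n)))
    return C

def explicit(n, jmax=200):
    # C_N = N * sum_{j >= 0} (1 - (1 - 2^{-j})^{N-1}); the tail
    # beyond jmax is exponentially small
    return n * sum(1 - (1 - 2.0 ** -j) ** (n - 1) for j in range(jmax))

C = trie_epl(20)
for n in range(2, 21):
    assert abs(C[n] - explicit(n)) < 1e-6
print("recurrence and explicit formula agree")
```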

I'm working through this with elementary analysis

just to make sure that people see what underlies this solution.

Because we get to a kind of surprising result, and

it's good to see what's underlying, what the structure is.

So that's what we're going to do.

To characterize this fully, analytically, requires complex analytic methods that

we'll talk about in part two of the course.

10:15

So what we're going to see is, on the next slide,

we're going to use the exp-log approximation.

This term becomes 1 - e to the -N over 2 to the j.

And that's not a difficult calculation using exp-log.

In the next slide, we'll show that that's pretty close to N log base 2 of N.

10:43

So that's the external path length in a trie.

This step is not difficult.

I didn't mean to skip through it so quickly, but

it's not difficult at all to show that with exp-log.

But now, what I'm interested in is looking at this sum.

Sum on j greater than or equal to 0, 1 minus e to the minus N over 2 to the j.
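Numerically, the exp-log step costs almost nothing; this sketch (my code, with an assumed N = 10^4 and 200 terms) compares the exact sum with its exp-log approximation.

```python
from math import exp

N = 10 ** 4
J = 200  # the tail beyond J is exponentially small
exact = sum(1 - (1 - 2.0 ** -j) ** (N - 1) for j in range(J))
explog = sum(1 - exp(-N / 2 ** j) for j in range(J))
print(exact, explog, exact - explog)  # the difference is tiny
```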

11:06

How we're going to characterize that.

And that's a really interesting function to take a look at.

What I'm going to try to do with this analysis is try to

isolate the periodic terms.

There's an oscillation in here and

I want to try to isolate the part that oscillates.

The way that's going to happen is, we're going to take the function,

log base 2 of N, and we're going to work with the integer part of log base 2 of N.

And then there's a fluctuation every time log base 2 of N passes an integer:

as you go from one integer to the next, there's fluctuation in the function.

You're working with the real function log x, but

you're only picking off the integer part of it.

So let's see how it works.

So all we're going to do is take that infinite sum and

we're going to break it into two parts.

The part where j is less than floor of log N, and

the part where it's greater than or equal to floor of log N.

So just split the sum into two parts at that one point.

And the key thing to notice about this is when j gets bigger than log base 2 of N,

then 2 to the j is going to be bigger than N.

And so we're going to have -N over something huge.

We're going to get a number very close to 1.

It's going to go away.

When j is small, like say j is 2 or something,

we have e to the -N, which is tiny.

So that term, 1 minus something tiny, is just about 1.

So basically, we're going to have log N terms that are very close to 1 and

all the rest of them are going to be very close to 0,

with just a little bit left in the center.

So let's look at how that goes.

In this case, we just split off the log N so we get floor of log N.

And then we have the second one, e to the -N / 2 to the j,

and that's just splitting the first sum into two parts.

And then this one is the same, so no change there.

13:13

So now, what we're going to do is

let this j go as far negative as it wants.

It doesn't matter, because if that j goes far negative,

this term is exponentially small.

So it doesn't matter.

We're off by an exponentially small amount.

That's going to allow us to combine the two sums in that part.

So change j to j plus log N in this sum.

13:49

And same way, change j to j- log N in that sum.

And now we have some floor of log Ns coming into these sums.

But then, there's N over 2 to the floor of log N.

Well, that's like taking the function log N and subtracting off the integer part.

All that's left is the fractional part.

So what comes into the sum is,

what's floor of log N?

That's the biggest integer not exceeding log N.

So we get the function log N - floor of log N.

As N increases, this function, the fractional part of log N,

goes from 0 up toward 1 and then drops back to 0, over and over.

So it's the fractional part of log N.

It's an oscillating function.

But these other functions, also, are just functions of the fractional part of log N.

Again, I skipped through just a tiny bit of subtracting log N from floor log N,

but you'll see that that works out.

So now I have these three parts that are all

functions of the fractional part of log N.

That is, they oscillate.

So this is my external path length over N.

So right here, this is a proof that it's asymptotic to N log base 2 of N,

because the other terms are all between 0 and 1.

But what's really interesting about this is,

if we look at these functions, this is the proof that I just went through.

If you have that recurrence, then that leads to this result.

But let's look at it in just a little more detail, these three terms.

15:42

Now the second one is this sum of e to the minus 2 to the (fractional part of log N minus j).

And that one also oscillates as N increases.

It oscillates, and that's the approximate magnitude of it.

And then this third one, it also oscillates.

So we have these three terms that oscillate that we're adding together to

give us the difference between log N in our function.

And what's amazing is, if you add these all together,

you get a smooth oscillating curve that's very tiny in magnitude.

10 to the -6 in magnitude.

16:27

Not something you would notice if you didn't do the math.

And we're going to see in part two,

we'll look into a little bit how to do the math.

But certainly, it's quite a surprise to see these functions that are so

different in character adding up to this continuous function.

16:52

So this is a proof that C_N / N - log N is this strange oscillating function

that you wouldn't see unless you looked out to six decimal places.
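We can see that numerically; in this sketch of mine, C_N / N - lg N, computed from the explicit formula, settles near a constant (about 1.3327, which is gamma over ln 2 plus one half), and the periodic part is far too small to show up at this scale.

```python
from math import log2

def epl_per_node(n, jmax=300):
    # C_N / N via the explicit formula; the tail beyond jmax is negligible
    return sum(1 - (1 - 2.0 ** -j) ** (n - 1) for j in range(jmax))

# C_N/N - lg N at several large N: nearly constant, with an
# oscillation of magnitude around 10^-6 hiding in the digits
vals = [epl_per_node(n) - log2(n) for n in (2 ** 16, 2 ** 17, 3 * 2 ** 15, 10 ** 5)]
print(min(vals), max(vals))
```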

And the question is: is there a reason why a recurrence like that,

which seems such a natural recurrence involving an integer cost,

should have this strange periodic behavior?

And the answer is yes.

We're going to see,

when we get to complex analytic techniques and the Mellin transform,

that there is a perfectly valid explanation for such periodicity.

And not only that, it's bound to appear in the analysis of lots of algorithms

that are based on properties of bitstrings like tries.

17:37

So applying that result and taking a look at the distribution that we get with

tries, you can see that it's even more tightly concentrated around a particular shape,

basically divided at the center,

even more so than BSTs.

And so, just to quote the results that we get from this kind of analysis,

the extra space is about 44%: the trie has about 1.44N external nodes, so about .44N of them are void.
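That 44% figure is easy to check by simulation; this is a sketch of mine, again using the fact that the split at each node is Binomial(n, 1/2).

```python
import random

def external_nodes(n):
    # Count external nodes in the trie on n random bitstrings;
    # subtries holding 0 strings (void) or 1 string are external.
    if n <= 1:
        return 1
    left = sum(random.random() < 0.5 for _ in range(n))
    return external_nodes(left) + external_nodes(n - left)

random.seed(7)
n, trials = 1000, 50
# every non-void external node holds exactly one string,
# so void nodes = external nodes - n
avg_void = sum(external_nodes(n) - n for _ in range(trials)) / trials
print(avg_void / n)  # close to .44 void nodes per string
```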

18:07

The expected search cost is about log base 2 of N.

So that's the number of bits you have to examine.

And that's a very acceptable and competitive search cost.

And for leader election, well, that'll be an exercise.

That's a discussion of trie parameters.