0:04

I just want to briefly discuss hash tables because they're quite closely related to words and the birthday problem, but we won't spend a lot of time in lecture on this; there's more information in the book.

So, we talked just briefly about the balls-and-urns model, and of course this is very heavily studied in classical combinatorics. And so we talked about, well, what's the probability that no urn has more than one ball? What's the probability that no urn is empty? You might want to know how many empty urns there are, how many urns have k balls, and so forth. These so-called occupancy problems are well studied, dating back many, many decades.

0:59

The probability, and we looked at this briefly when we talked about asymptotics, the probability that a given value occurs k times in a random word of length N is (N choose k)(1/M)^k (1 - 1/M)^(N-k). That is, you get your value k times, each with probability 1/M, and the other values are not yours, with probability (1 - 1/M)^(N-k). So, given N balls and M urns, that's the probability that a given urn has k balls.
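As a quick sanity check, the occupancy probability just described is easy to evaluate directly. A minimal sketch (the function name is mine, not from the lecture):

```python
from math import comb

def occupancy_prob(N, M, k):
    """P(a given urn gets exactly k balls when N balls go into M urns)."""
    return comb(N, k) * (1 / M) ** k * (1 - 1 / M) ** (N - k)

# N = 10,000 balls into M = 1,000 urns: the most likely count is near N/M = 10.
print(occupancy_prob(10_000, 1_000, 10))
```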

1:32

Now, we looked at the Poisson approximation, which works for the situation where N/M = alpha is fixed and k is a small constant. That's the so-called Poisson approximation, e^(-alpha) alpha^k / k!, and we showed how to come up with that approximation when we talked about asymptotics.
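The two distributions can be compared numerically. A sketch, assuming the Poisson form e^(-alpha) alpha^k / k! (the function names are mine):

```python
from math import comb, exp, factorial

def binomial_pmf(N, M, k):
    # exact occupancy probability: (N choose k)(1/M)^k (1 - 1/M)^(N-k)
    return comb(N, k) * (1 / M) ** k * (1 - 1 / M) ** (N - k)

def poisson_pmf(alpha, k):
    # the Poisson approximation, for fixed alpha = N/M and small k
    return exp(-alpha) * alpha ** k / factorial(k)

N, M = 10_000, 1_000   # alpha = 10
for k in (5, 10, 15):
    print(k, binomial_pmf(N, M, k), poisson_pmf(N / M, k))
```

The two columns agree to several decimal places at this scale, which is the sense in which the approximation is "very good."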

2:01

So, and this is a very good approximation. For example, this plot is for alpha = 10, so N = 10,000 and M = 1,000, and it's a plot of the binomial distribution.

2:44

So, an application is hashing algorithms. People in computer science are very familiar with these algorithms; I'll just briefly describe them for people in math who maybe haven't seen them. They're very well covered in books on algorithms, like our book. The idea is that we want an efficient way to put key-value pairs in a symbol table (we talked about this for binary search trees), and to search the table for the pair corresponding to a given key. So the hashing strategy is to come up with a function that maps each key onto a random value between 0 and M - 1, and then to develop a collision strategy to figure out what to do when two keys hash to the same value.

3:35

And the basic algorithms that we'll talk about: one's called separate chaining, where we essentially represent each urn with a linked list. Another one is called linear probing, where we just use an array and never let two keys occupy the same place; on a collision, we scan to the right for an empty spot. And we'll talk briefly about both of those algorithms. And the model that we use is what's called the Uniform Hashing Assumption: we assume the hash function maps each key to a random value between 0 and M - 1. And it's been shown to be not so difficult, for typical types of keys, to get hash functions that reasonably approximate this uniform assumption.

4:28

So, that's the setup. Hashing with separate chaining is really easy to implement, and this is just a diagram from our algorithms book that shows a hash table of size 5. The keys are letters, and the hash values are supposed to be random values between 0 and 4, so the sequence of hash values is just a random word. And then we store the letters in the lists indexed by the hash values, so those are the urns, and the keys and associated values are the balls.
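A minimal sketch of separate chaining in this spirit (my own toy version, not the book's Java implementation; Python's built-in hash stands in for a uniform hash function):

```python
class SeparateChainingHashTable:
    """A tiny separate-chaining symbol table: one list ("urn") per hash value."""

    def __init__(self, m=5):
        self.m = m                           # number of lists (urns)
        self.chains = [[] for _ in range(m)]

    def _hash(self, key):
        return hash(key) % self.m            # stands in for a uniform hash

    def put(self, key, value):
        chain = self.chains[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                     # key already present: replace value
                chain[i] = (key, value)
                return
        chain.append((key, value))           # collision or not, append to the list

    def get(self, key):
        for k, v in self.chains[self._hash(key)]:
            if k == key:
                return v
        return None                          # key absent

table = SeparateChainingHashTable(5)
for i, letter in enumerate("SEARCHEXAMPLE"):
    table.put(letter, i)
print(table.get("S"), table.get("E"))
```

Search cost is the length of the one list the key hashes to, which is exactly why the list lengths are what we analyze.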

5:08

So, that's easy to implement and, of course, we're going to be interested in how long these lists are. So again, this is just the balls-into-urns model for hashing with separate chaining.

5:25

And I'm probably spending too much time on animations, but [SOUND]. So, what we want to know is: when we search for an item, we're going to have to go through all the balls in its urn, so we want to know the average number of balls in each urn.

5:50

Well, that's obvious: there are N balls and M urns, so the average is N/M balls in each urn. And actually, that doesn't depend on anything, so it's not very helpful at all.

6:03

If all your balls fall in one urn, your average number of balls per urn is still N/M. It doesn't even have to be random. So that observation is not much help.

6:16

So, we know that the probability that a given urn has k balls is the occupancy distribution. Generally, we're going to try to keep the number of urns large enough that the number of balls per urn is a constant, like alpha. So we know that. But what the programmer wants to know is something about the chance that they're distributed evenly, so we have to do a little more math to provide that answer. So here's what a programmer might say: fine, make sure that M is big enough that N/M is less than some constant alpha. And it turns out to be not difficult to do that.

7:01

Then the average number of probes for searches is going to be less than alpha, I know that. But what's the chance, under the uniform hashing assumption, that some search will use way too many probes, say, 5 alpha probes?

7:30

And the trick is to use Stirling's formula to bound k factorial. And if you do that, you get something that converges very rapidly, something like e^(-alpha) (e/5)^(5 alpha). And for alpha = 10, that's a bunch of zeros, about 10^(-15). So you can say to the programmer that if you take alpha = 10, this is extremely, extremely unlikely to happen. And that's the kind of information the programmer needs.
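The tail probability behind that claim can be checked numerically. A sketch, computed in log space to avoid huge intermediates; the closed form e^(-alpha)(e/5)^(5 alpha) is my reading of the bound quoted in the lecture:

```python
from math import lgamma, log, log1p, exp

def log_binom_pmf(N, p, k):
    # log of (N choose k) p^k (1-p)^(N-k), via lgamma to avoid huge integers
    return (lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
            + k * log(p) + (N - k) * log1p(-p))

def tail_prob(N, M, cutoff, terms=200):
    # P(a given urn gets at least `cutoff` of the N balls); terms far past
    # the cutoff are negligible at this scale, so 200 of them is plenty
    p = 1 / M
    return sum(exp(log_binom_pmf(N, p, k)) for k in range(cutoff, cutoff + terms))

N, M, alpha = 10_000, 1_000, 10
exact = tail_prob(N, M, 5 * alpha)                 # chance a given list has >= 50 keys
bound = exp(-alpha) * (exp(1) / 5) ** (5 * alpha)  # e^(-alpha) (e/5)^(5 alpha)
print(exact, bound)
```

Both numbers come out far below 10^(-15) for alpha = 10, which is the guarantee the programmer wants.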

So that's hashing with separate chaining; that's a typical calculation using the classical occupancy distribution and some of the asymptotics that we did in lecture four.

8:36

And everything's fine until we get a collision, where two balls land in the same urn, and we know from the birthday problem that that's going to happen relatively soon. When it happens, we just skip to the right until we find an empty urn.

8:57

You don't want to do this as the table starts to get full, as we will see; still, we're going to want an analysis that tells us the average number of probes when we use this algorithm. And it's a relatively easy algorithm to implement as well, found in standard algorithms books.
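A minimal sketch of linear probing in the same spirit (again a toy version, not the book's implementation; a real one resizes the array to keep it from filling up):

```python
class LinearProbingHashTable:
    """Keys in a flat array; on a collision, scan right (wrapping) to an empty slot."""

    def __init__(self, m=16):
        self.m = m
        self.keys = [None] * m
        self.vals = [None] * m
        self.n = 0

    def _hash(self, key):
        return hash(key) % self.m            # stands in for a uniform hash

    def put(self, key, value):
        assert self.n < self.m - 1, "table nearly full; a real version resizes"
        i = self._hash(key)
        while self.keys[i] is not None:      # probe to the right on collisions
            if self.keys[i] == key:
                self.vals[i] = value
                return
            i = (i + 1) % self.m
        self.keys[i], self.vals[i] = key, value
        self.n += 1

    def get(self, key):
        i = self._hash(key)
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) % self.m
        return None                          # hit an empty slot: key absent

table = LinearProbingHashTable(16)
for i, letter in enumerate("SEARCHEXAMPLE"):
    table.put(letter, i)
print(table.get("S"), table.get("Z"))
```

The number of slots inspected in `put` and `get` is the probe count whose average the analysis is after.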

So that is the key question: what is the average number of probes to find one of the keys, if you use this algorithm, when you run your program?

9:43

Knuth's result is the sum, for k ≥ 0, of (N/M)((N-1)/M)...((N-k+1)/M). Which is really, well, the proof is given in ten pages in the textbook. It really was a landmark result; people were thinking this was too difficult a problem to really address, and Knuth was able to show this result. And if you let the table get full, if you let N get all the way to M, it's about the Ramanujan Q-function, which grows like the square root of pi M/2. And if you don't let the table get full, which we don't in practice, it's going to be about 1/(1 - alpha). And I say don't let it get more than half full; then it's about two probes.
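That sum, as I read it from the lecture (the product (N/M)((N-1)/M)...((N-k+1)/M), summed over k ≥ 0), is easy to evaluate and check against both regimes just mentioned; treat this as a sketch:

```python
from math import pi, sqrt

def Q0(M, N):
    # sum over k >= 0 of (N/M)((N-1)/M)...((N-k+1)/M)
    total, term, k = 1.0, 1.0, 0
    while term > 1e-15 and k < N:
        term *= (N - k) / M
        total += term
        k += 1
    return total

M = 1_000_000
print(Q0(M, M // 2), 1 / (1 - 0.5))   # half full: about 1/(1 - alpha) = 2 probes
print(Q0(M, M), sqrt(pi * M / 2))     # full table: about sqrt(pi M / 2)
```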

11:00

And Knuth taught that good writers don't use footnotes in technical writing. But he said he couldn't resist putting in this one footnote, which said that when he formulated this derivation, it was the first nontrivial algorithm he had ever analyzed satisfactorily, and that it had a strong influence on the structure of these books. And it really is where the analysis of algorithms began, when Knuth solved this problem.

11:45

It seemed reasonable that we should be able to analyze linear probing with the symbolic method. So, we couldn't resist putting in one footnote in our book, either. And that one said that we don't know the answer to this exercise: we couldn't figure out how to use the symbolic method to analyze linear probing. It seemed like the answer was so simple, and so similar to the birthday and coupon collector problems, that we should be able to get it. And really, we put this in as a challenge to students and researchers: really, you know, we should be able to do this one. And it took some years, but eventually, and you can read about this in the second edition of the book, it turned out this problem has very deep connections to properties of combinatorial objects like random graphs, the gambler's ruin problem, path lengths in trees, and many other classic problems. And it's explained by a distribution called the Airy law, and there's a fascinating amount of mathematics to explain this. Knuth was one of the people who solved it, and Philippe and some co-authors solved it as well. Although, actually, we still don't quite exactly, precisely know the answer to this exercise, we know quite a bit. So there's actually still a footnote left in the second edition. But linear probing is worth studying to see both the potential and the limitations of what we know about the analysis of algorithms at this point. One of the foundations mentioned here is properties of mappings, and that's what we're going to talk about next.
