
In the last video, we discussed the performance of hash tables that are implemented using chaining, with one linked list per bucket. In fact, we proved mathematically that if you use a hash function chosen uniformly at random from a universal family, and if you keep the number of buckets comparable to the size of your data set, then you're guaranteed constant-time expected performance. But recall that chaining is not the only implementation of hash tables. There's a second paradigm, also very important, called open addressing. This is where you're only allowed to store one object in each slot, and you keep searching for an empty slot when you need to insert a new object into your hash table. Now, it turns out to be much harder to mathematically analyze hash tables implemented using open addressing, but I do want to say a few words about it, to give you the gist of what kind of performance you might hope to expect from those sorts of hash tables.

So recall how open addressing works. We're only permitted to store one object in each slot. This is unlike the case with chaining, where we can have an arbitrarily long list in a given bucket of the hash table. With at most one object per slot, open addressing obviously only makes sense when the load factor alpha is less than one, that is, when the number of objects you're storing in your table is less than the number of slots available. Because of this requirement that we have at most one object per slot, we need to demand more of our hash function. Our hash function might ask us to put a given object, say one with some IP address, into bucket number 17, but bucket 17 might already be populated. In that case, we go back to our hash function and ask it where to look for an empty slot next. Maybe it tells us to look next in bucket 41. If 41 is full, it tells us to look in bucket 7, and so on. Two specific strategies for producing such a probe sequence, which we mentioned earlier, are double hashing and linear probing. Double hashing is where you use two different hash functions, h1 and h2: h1 tells you which slot to search first, and then every time you find a full slot, you add an increment specified by the second hash function, h2. Linear probing is even simpler: you have just one hash function that tells you where to search first, and then you add one to the slot number until you find an empty slot.

As I mentioned at the beginning, it is quite nontrivial to mathematically analyze the performance of hash tables using these various open addressing strategies. It's not impossible; there is some quite beautiful and quite informative theoretical work that does tell us how these hash tables perform, but that's well outside the scope of this course. So instead, what I want to do is give you a quick and dirty calculation that suggests, at least in an idealized world, what kind of performance we should expect from a well-implemented hash table with open addressing, as a function of the load factor alpha. Precisely, I'm going to introduce a heuristic assumption. It's certainly not true, but we'll adopt it just for this quick and dirty calculation: we're using a hash function for which each of the n-factorial possible probe sequences is equally likely. Now, no hash function you're ever going to use will actually satisfy this assumption, and if you think about it for a little bit, you'll realize that with double hashing or linear probing you're certainly not satisfying it. So this gives us a kind of best-case scenario against which you can compare the performance of your own hash table implementations. If you implement a hash table and you're seeing performance as good as what's suggested by this idealized analysis, then you're home free; you know your hash table is performing great.

So what is the line in the sand that gets drawn under this heuristic assumption? What is this idealized hash function performance, as a function of the load alpha? Well, here it is. What I'm going to argue next is that under this heuristic assumption, the expected time to insert a new object into the hash table is essentially one over one minus alpha, where alpha is the load. Remember, the load is the number of objects in the hash table divided by the number of available slots. So if the hash table is half full, alpha is 0.5; if it's 75 percent full, alpha is three-fourths. What this means is that in this idealized scenario, if you keep the load under control, say at 50 percent, then the insertion time is going to be great: if alpha is 0.5, then 1/(1 - alpha) = 2, so you expect just two probes before the successful insertion of the new object. And of course, lookup is going to be at least as good as insertion: if you're lucky, a lookup might terminate early when you find what you're looking for, and in the worst case, an unsuccessful search, you go all the way until an empty slot, which is the same as an insertion. So if alpha is small, bounded away from one, you're getting constant-time performance. On the other hand, as the hash table gets full, as alpha gets close to one, this operation time blows up; it goes to infinity as alpha approaches one. So if you have a 90 percent full hash table with open addressing, you're going to start seeing ten probes per operation. So with hash tables that use open addressing, you really want to keep the load under control, certainly no more than about 0.7, and maybe even less than that. To refresh your memory, with chaining, hash tables are perfectly well defined even with load factors bigger than one, and what we derived is that under universal hashing, a weaker assumption, we had an operation time of one plus alpha, for a load of alpha. So with chaining, you just have to keep alpha at most some reasonably small constant; with open addressing, you really have to keep it well bounded below one.

So next, let's understand why this observation is true. Why, under the assumption that every probe sequence is equally likely, do we expect a one over one minus alpha running time for hash tables with open addressing?
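Before the derivation, here's a quick empirical sanity check of that claim. This is a sketch I've added, not code from the course; the function name and parameters are my own. It fills a table to load factor alpha and measures how many probes one more insertion takes, under exactly the heuristic assumption above: each key's probe sequence is a uniformly random ordering of the slots.

```python
import random

def probes_for_one_insertion(num_slots, alpha, trials=10000, seed=1):
    """Average number of probes for a single insertion into a table at
    load factor alpha, under the heuristic assumption that each key's
    probe sequence is a uniformly random ordering of the slots."""
    rng = random.Random(seed)
    num_full = int(alpha * num_slots)
    total_probes = 0
    for _ in range(trials):
        # Under the assumption, only *how many* slots are full matters,
        # so just mark a random set of num_full slots as occupied.
        occupied = set(rng.sample(range(num_slots), num_full))
        # Probe the slots in a fresh uniformly random order.
        for probes, slot in enumerate(rng.sample(range(num_slots), num_slots), 1):
            if slot not in occupied:
                total_probes += probes   # found an empty slot; done
                break
    return total_probes / trials

for alpha in (0.5, 0.75, 0.9):
    print(alpha, probes_for_one_insertion(200, alpha), 1 / (1 - alpha))
```

The observed averages come out slightly below 1/(1 - alpha), which is consistent with the argument that follows: the coin-flipping analysis is a valid upper bound, because probing slots without repetition can only help.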

So, the reason is pretty simple, and we can derive it by analogy with a simple coin-flipping experiment. To motivate the experiment, think just about the very first probe that we do. We get some new object, some new IP address, that we want to insert into our hash table. Let's say our hash table is currently 70 percent full: there are 100 slots, and 70 are already taken by objects. Well, when we look at this first probe, by assumption it's equally likely to be any one of the 100 slots, 70 of which are full and 30 of which are empty. So with probability one minus alpha, or in this case 30 percent, our first probe will luckily find an empty slot, and we'll be done; we'll just insert the new object into that slot. If we get unlucky, with probability 70 percent, we find a slot that's already occupied, and then we have to try again. So we try a new slot, drawn at random, and we again check: is it full, or is it not? Again, with essentially 30 percent probability it's going to be empty and we can stop, and if it's already full, we try yet again. So doing random probes, looking for an empty slot, is tantamount to flipping a coin whose probability of heads is 1 - alpha, or in this example 30 percent, and the number of probes you need until you successfully insert is just the number of times you flip this biased coin until you see heads.

In fact, this biased coin-flipping experiment slightly overestimates the expected insertion time under the heuristic assumption, and that's because during an insertion we're never going to try the same slot twice: we're going to try all n buckets in some order, with each of the n-factorial orderings equally likely. So back to our example, where we have a hash table with 100 slots, 70 of which are full. On the first probe, we indeed have a 30-in-100 chance of getting an empty slot. If that one fails, we're not going to try the same slot again, so there are only 99 residual possibilities, again 30 of which are empty, since the one we checked last time was full. So we actually have a 30-out-of-99 chance of getting an empty slot on the second try, then 30 out of 98 on the third try if the second one fails, and so on. But a valid upper bound is just to assume a 30 percent success probability on every single probe, and that's precisely what this coin-flipping experiment gives us.

So the next quiz asks you to actually compute the expected value of capital N, the number of coin flips needed to get heads, when the probability of heads is one minus alpha. As a hint, we analyzed this exact same coin-flipping experiment, with alpha equal to one half, back when we discussed the expected running time of randomized linear-time selection.

Alright, so the correct answer is the first one: one over one minus alpha. To see why, let's return to our derivation, where we reduced analyzing the expected insertion time to this random variable, the expected number of coin flips until we see heads. I'm going to solve this exactly the same way we did back when we analyzed the randomized selection algorithm. It's quite a sneaky way, but very effective: we're going to express the expected value of capital N in terms of itself, and then solve. So how do we do that? Well, on the left-hand side, let's write the expected number of coin flips, the expected value of capital N. Then let's just notice that there are two different cases: either the first coin flip is heads, or it's not. In either case, you're certainly going to use one coin flip, so let's separate that out and count it separately. With probability alpha, the first coin flip comes up tails, and then you start all over again. And because it's a memoryless process, the expected number of further coin flips required, given that the first flip was tails, is just the same as the expected number of coin flips in the first place. In symbols, E[N] = 1 + alpha * E[N]. So now it's a simple matter to solve this one linear equation for the expected value of N, and we find that it is indeed one over one minus alpha, as claimed.

Summarizing: under our idealized heuristic assumption, that every single probe sequence is equally likely, the expected insertion time is upper bounded by the expected number of coin flips, which by this argument is at most one over one minus alpha. So as long as your load alpha is well bounded below one, you're good; at least in this idealized analysis, your hash table will work extremely quickly.

Now, I hope you're regarding this idealized analysis with a bit of skepticism. Right, from a false hypothesis you can derive literally anything you want, and we started with this assumption, that all probe sequences are equally likely, which is not satisfied by hash functions you're actually going to use in practice. So should you expect this one over one minus alpha bound to hold in practice or not? Well, that depends, to some extent: on which open addressing strategy you're using, on how good a hash function you're using, and on whether or not the data is pathological. Just to give coarse rules of thumb: if you're using double hashing and you have non-pathological data, I would go ahead and look for this 1/(1 - alpha) bound in practice. So implement your hash table, check its performance as a function of the load factor alpha, and shoot for the 1/(1 - alpha) curve; that's really what you'd like to see. With linear probing, on the other hand, you should not expect to see this performance guarantee of 1/(1 - alpha), even in a totally idealized scenario.
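To see why linear probing is different, it helps to look at the two probe sequences side by side. Here's a minimal sketch of my own (not code from the course); the `start` and `step` arguments stand in for whatever hash values h(key), h1(key), and h2(key) you'd actually compute:

```python
def linear_probe_sequence(start, num_slots):
    """Linear probing: begin at the initial probe and scan one slot at a time."""
    return [(start + i) % num_slots for i in range(num_slots)]

def double_hash_sequence(start, step, num_slots):
    """Double hashing: begin at h1(key) = start, then keep adding h2(key) = step.
    For the sequence to visit every slot, step must be relatively prime to
    num_slots; keeping num_slots prime guarantees this for any nonzero step."""
    return [(start + i * step) % num_slots for i in range(num_slots)]

# Example with 13 slots: initial probe 4, double-hashing increment 5.
print(linear_probe_sequence(4, 13))
print(double_hash_sequence(4, 5, 13))
```

Both sequences are fully determined by one or two hash values, which is exactly why neither satisfies the heuristic assumption; linear probing is just the more extreme case, since a single value pins down everything.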

Remember, linear probing is the strategy where the hash function tells you where to probe first, and then you just scan linearly through the hash table until you find what you're looking for: an empty slot, or the object you're looking up, or whatever. So linear probing, even in a best-case scenario, is going to be subject to clumping: you're going to have contiguous groups of slots which are all full, and that's because of the linear probing strategy.
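Here's one such experiment, a hypothetical sketch I've added (the names and parameters are mine, not the lecture's). It compares the longest contiguous run of full slots under linear probing, assuming uniformly random initial probes, against dropping the same number of objects into uniformly random slots:

```python
import random

def longest_run(table):
    """Longest contiguous run of occupied slots, treating the table as circular."""
    best = cur = 0
    for full in table + table:          # doubling the list handles wrap-around runs
        cur = cur + 1 if full else 0
        best = max(best, cur)
    return min(best, len(table))

def fill_linear_probing(num_slots, num_keys, rng):
    """Insert keys via linear probing, with uniformly random initial probes."""
    table = [False] * num_slots
    for _ in range(num_keys):
        slot = rng.randrange(num_slots)
        while table[slot]:
            slot = (slot + 1) % num_slots
        table[slot] = True
    return table

def fill_random_placement(num_slots, num_keys, rng):
    """Idealized comparison: occupy num_keys distinct uniformly random slots."""
    table = [False] * num_slots
    for slot in rng.sample(range(num_slots), num_keys):
        table[slot] = True
    return table

rng = random.Random(7)
n, k, trials = 1000, 700, 20            # load factor alpha = 0.7
lin = sum(longest_run(fill_linear_probing(n, k, rng)) for _ in range(trials)) / trials
rnd = sum(longest_run(fill_random_placement(n, k, rng)) for _ in range(trials)) / trials
print(lin, rnd)
```

In runs like this, linear probing's longest clump comes out several times longer than random placement's, even though both tables hold exactly the same number of objects; that's the clumping effect.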

Now, I encourage you to do some experiments with implementations to see this for yourself. So, because of clumping, with linear probing you're not going to see one over one minus alpha, even in the idealized scenario. You are going to see something worse, but, at least in idealized situations, still quite reasonable, and that's the last thing I want to tell you about in this video.

Now, needless to say, with linear probing the heuristic assumption is badly false. The heuristic assumption is pretty much always false, no matter what hashing strategy you're using, but with linear probing it's quote-unquote really false. To see that, recall that the heuristic assumption says that all n-factorial probe sequences are equally likely, so your next probe is uniform at random among everything you haven't probed so far. But with linear probing, it's totally the opposite: once you know the first slot you're looking in, say bucket 17, you know the rest of the sequence, because it's just a linear scan through the table. So it's pretty much the opposite of each successive probe being independent of the previous ones, except for not exploring things twice.

So, to state a conjectured, or idealized, performance guarantee for hash tables with linear probing, we're going to replace the blatantly false heuristic assumption by a still false, but more reasonable, heuristic assumption. So what do we know? We know that the initial probe with linear probing determines the rest of the sequence. So let's assume that these initial probes are uniform at random, and independent for different keys. Of course, once you have the initial probe, you know everything else, but let's assume independence and uniformity among the initial probes. Now, this is a strong assumption, way stronger than assuming you have a universal family of hash functions, and it is not satisfied in practice. But performance guarantees we can derive under this assumption are typically matched in practice by well-implemented hash tables that use linear probing, so the assumption is still useful for deriving the correct idealized performance of this type of hash table.

So what is that performance? Well, this is an utterly classic result, from exactly 50 years ago, from 1962, and it's a result by my colleague, the living legend, Don Knuth, author of The Art of Computer Programming. What Knuth proved was that, under this weaker heuristic assumption suitable for linear probing, the expected time to insert an object into a hash table with load factor alpha, when you're using linear probing, is one over one minus alpha, squared. That's worse than one over one minus alpha, but it is still a function of the load alpha only, and not a function of the number of objects in the hash table. That is, with linear probing you will not get as good a performance guarantee, but it is still the case that if you keep the load factor bounded away from one, if you make sure the hash table doesn't get too full, you will enjoy constant-time operations on average. So, for example, if with linear probing your hash table is 50 percent full, then you're going to get an expected insertion time of roughly four probes. Note, however, that this quantity blows up pretty rapidly as the hash table grows full. If it's 90 percent full, this is already going to be something like a hundred probes on average. So you really don't want to let hash tables get too full when you're using linear probing.

You might well wonder if it's ever worth implementing linear probing, given that it has a worse performance curve, one over one minus alpha squared, than the performance curve you'd hope for from something like double hashing, one over one minus alpha. It's a tricky cost-benefit analysis between linear probing and more complicated but better-performing strategies, and it really depends on the application. There are reasons you do want to use linear probing sometimes; it is actually quite common in practice. For example, it often interacts very well with memory hierarchies. So again, as with all of this hashing discussion, there are very subtle trade-offs between the different approaches. If you have mission-critical code that's using a hash table and you really want to optimize it, try a bunch of prototypes and just test; figure out which one is best for your particular application.

Let me conclude the video with a quote from Knuth himself, where he talks about the rapture of proving this one over one minus alpha squared theorem, and how it was life-changing. He says: "I first formulated the following derivation", meaning the proof of that last theorem, "in 1962. Ever since that day, the analysis of algorithms has, in fact, been one of the major themes in my life."
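To put the two idealized curves from this video side by side, here's a trivial calculation that just tabulates the formulas quoted above (the function names are my own labels):

```python
def open_addressing_estimate(alpha):
    """Idealized expected probes per insertion, 1 / (1 - alpha),
    e.g. what you'd shoot for with double hashing."""
    return 1 / (1 - alpha)

def linear_probing_estimate(alpha):
    """The idealized expected probes per insertion quoted for linear
    probing, 1 / (1 - alpha)^2."""
    return 1 / (1 - alpha) ** 2

for alpha in (0.5, 0.7, 0.9):
    print(f"load {alpha}: {open_addressing_estimate(alpha):5.1f} probes "
          f"vs {linear_probing_estimate(alpha):6.1f} with linear probing")
```

At 50 percent load that's 2 versus roughly 4 probes; at 90 percent it's 10 versus roughly 100, which is exactly the blow-up the video warns about.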
