
This sequence of videos is going to take our discussion of hash tables to the next level and understand more deeply the conditions under which they perform well, amazingly well in fact: as you know, we get constant-time performance for all of their operations. The main point of this first video is to explain the sense in which every hash function has its own kryptonite, a pathological data set for it, which then motivates the need to tread carefully with the mathematics in the subsequent videos.

So, a quick review. Remember that the whole purpose of a hash table is to enable extremely fast lookups, ideally constant-time lookups.

Now, of course, to have anything to look up, you have to also allow insertions, so all hash tables are going to export those two operations. And then, sometimes, a hash table also allows you to delete elements from it; that depends a little bit on the underlying implementation. When you implement it using chaining, that is, with one linked list per bucket, it's very easy to implement deletion. With open addressing, deletion is tricky enough that you often just wind up punting on it.

When we first started talking about hash tables, I encouraged you to think about them logically, much the way you think about an array, except that instead of being indexed by the positions of an array, a hash table is indexed by the keys that you're storing. So just as an array supports constant-time lookup via random access, so does a hash table. There was some fine print, however, with hash tables. Remember there were two caveats. The first is that the hash table had better be properly implemented, and this means a couple of things. One thing it means is that the number of buckets had better be commensurate with the number of things you're storing in the hash table; we'll talk more about that in a second. The second thing it means is that you had better be using a decent hash function. We discussed the perils of bad hash functions in a previous video, and we'll be even more stringent in our demands on hash functions in the videos to come. The second caveat, which I'll try to demystify in a few minutes, is that you had better have non-pathological data. In some sense, for every hash table there's kryptonite: a pathological data set that will render its performance quite poor.

In the video on implementation details we also discussed how hash tables inevitably have to deal with collisions. You start seeing collisions way before your hash table starts filling up, so you need some method for addressing two different keys that map to exactly the same bucket. Which one do you put there? Do you put both there, or what? There are two popular approaches; let me remind you what they are. The first is called chaining. This is a very natural idea, where you keep all of the elements that hash to a common bucket in that bucket. How do you keep track of all of them? You just use a linked list. So, in the seventeenth bucket, you will find all of the elements that ever hashed to bucket number 17.
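To make chaining concrete, here is a minimal sketch of a chained hash table in Python. The bucket count and the use of Python's built-in hash are placeholder choices for illustration, not anything prescribed in the lecture, and plain Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Hash table with chaining: one list (the "chain") per bucket."""

    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Placeholder hash function: Python's built-in hash, reduced mod n.
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        chain = self.buckets[self._bucket(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key already present: overwrite
                return
        chain.append((key, value))

    def lookup(self, key):
        # Exhaustive search through the chain in the key's bucket.
        for k, v in self.buckets[self._bucket(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # Deletion is easy with chaining: just drop the key from its chain.
        b = self._bucket(key)
        self.buckets[b] = [(k, v) for k, v in self.buckets[b] if k != key]
```

Note how `lookup` really does scan its whole chain, which is why chain lengths (and hence the load) govern lookup time.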

The second approach, which also has plenty of applications in practice, is open addressing. Here the constraint is that you're only going to store one item, one key, in each bucket. So if two things map to bucket number 17, you've got to find a separate place for one of them. The way you handle that is to demand from your hash function not merely one bucket but rather a whole probe sequence, a sequence of buckets, so that if you try to hash something into bucket number 17 and 17 is already occupied, you go on to the next bucket in the probe sequence and try to insert it there. If you fail again, you go to the third bucket in the sequence, try to insert it there, and so on.

We mentioned briefly two ways you can specify probe sequences. One is called linear probing. This is where, if you fail in bucket 17, you move on to 18, then 19, then 20, then 21, and you stop once you find an empty one; that's where you insert the new element. Another is double hashing, where you use a combination of two hash functions: the first hash function specifies the initial bucket that you probe, and the second specifies the offset for each subsequent probe. So for example, if you have a given element, say the name Alice, and the two hash functions give you the numbers 17 and 23, then the corresponding probe sequence is going to be initially 17; failing that, we'll try 40; still failing that, we'll try 63; failing that, we'll try 86; and so on.

In a course on the design and analysis of algorithms like this one, you typically talk a little more about chaining than about open addressing. That's not to imply that chaining is somehow the more important one; both of these are important. But chaining is a little easier to talk about mathematically, so we'll talk about it a bit more, because I'll be able to give you complete proofs for chaining, whereas complete proofs for open addressing would be outside the scope of this course. I'll mention it in passing, but the details will be more about chaining, just for mathematical ease.

Now, there's one very important parameter that plays a big role in governing the performance of a hash table, and that's known as the load factor, or simply the load, of a hash table. It's a very simple definition.
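Stepping back to double hashing for a moment, the probe sequence just described can be sketched as follows. The two component hash functions below are toy placeholders rigged to reproduce the Alice example from the lecture (17 and 23); they are assumptions for illustration only.

```python
from itertools import islice

def probe_sequence(key, num_buckets, h1, h2):
    """Double-hashing probe sequence for `key`:
    h1(key), h1(key)+h2(key), h1(key)+2*h2(key), ... all taken mod n."""
    bucket, step = h1(key), h2(key)
    while True:
        yield bucket
        bucket = (bucket + step) % num_buckets

# Toy hash functions rigged so "Alice" gets 17 and 23, as in the lecture.
h1 = lambda key: 17
h2 = lambda key: 23

print(list(islice(probe_sequence("Alice", 100, h1, h2), 4)))  # [17, 40, 63, 86]
```

On an insertion, open addressing walks this sequence until it finds an empty bucket.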

It just measures how populated a typical bucket of the hash table is. It's often denoted by alpha: in the numerator is the number of things that have been inserted (and not subsequently deleted) into the hash table, divided by the number of buckets in the hash table. So, as you would expect, as you insert more and more things into the hash table, the load grows; keeping the number of items in the hash table fixed as you scale up the number of buckets, the load drops.
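As a quick illustration of the definition (the particular numbers here are made up for the example):

```python
def load_factor(num_items, num_buckets):
    """alpha = (objects inserted and not deleted) / (number of buckets)."""
    return num_items / num_buckets

print(load_factor(2000, 1000))  # 2.0: meaningful for chaining only
print(load_factor(500, 1000))   # 0.5: fine for either implementation
```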

So, just to make sure that the notion of the load is clear, and also that you're clear on the different strategies for resolving collisions, the next quiz will ask you about the range of relevant alphas for the chaining and open addressing implementations of a hash table.

All right, so the correct answer to this quiz question is the third answer: load factors bigger than one do make sense (they may not be optimal, but they at least make sense) for hash tables implemented with chaining, but they don't make sense for hash tables with open addressing. The reason is simple. Remember, in open addressing you're required to store only one object per bucket, so as soon as the number of objects exceeds the number of buckets, there's nowhere to put the remaining objects. The hash table will simply crash if the load factor is bigger than one. On the other hand, in a hash table with chaining there's no obvious problem with a load factor bigger than one. You can imagine a load factor equal to two: say you insert 2,000 objects into a hash table with 1,000 buckets. That means, hopefully, at least in the best case, each bucket is just going to have a linked list with two objects in it. So there's no big deal with load factors bigger than one in hash tables with chaining.

All right, so let's make a quite easy but also very important observation about a necessary condition for hash tables to have good performance. This goes back to the first caveat, that you'd better properly implement the hash table if you expect good performance. The first point is that you're only going to have constant-time lookups if you keep the load constant. For a hash table with open addressing this is really obvious, because you need alpha not merely O(1) but less than one, less than 100 percent full; otherwise the hash table is just going to crash, because you don't have enough room for all of the items. But even for hash tables implemented using chaining, where load factors bigger than one at least make sense, you'd better keep the load not too much bigger than one if you want constant-time operations. For instance, if you have a hash table with n buckets and you hash in n log n objects, then the average number of objects in a given bucket is going to be logarithmic. And remember, when you do a lookup, after you hash to the bucket, you have to do an exhaustive search through the linked list in that bucket. So with n log n objects in n buckets, you're expecting more like logarithmic lookup time, not constant lookup time. And then, as we discussed, with open addressing you need not just alpha = O(1) but alpha less than one, and in fact alpha had better be well below one. You don't want to let an open addressing table get to a 90 percent load or something like that, so I'm going to write "need alpha << 1". That just means you don't want to let the load grow too close to 100%, or you'll see performance degrade. So again, I hope the point of this slide is clear: if you want good hash table performance, one of the things you're responsible for is keeping the load factor under control.
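A quick experiment makes the point about chaining. This is only a sketch: keys are placed in buckets uniformly at random, standing in for a well-behaved hash function on non-pathological data.

```python
import random

def average_chain_length(num_items, num_buckets, seed=0):
    """Scatter items uniformly at random over buckets and report the
    mean length of the nonempty chains a lookup would have to scan."""
    rng = random.Random(seed)
    chain_sizes = [0] * num_buckets
    for _ in range(num_items):
        chain_sizes[rng.randrange(num_buckets)] += 1
    occupied = [c for c in chain_sizes if c > 0]
    return sum(occupied) / len(occupied)

# With load factor alpha = 10, a typical chain holds about 10 items,
# so each lookup scans about 10 entries rather than O(1).
print(average_chain_length(10_000, 1_000))
```

Even in this best-case random model, lookup cost tracks alpha, which is exactly why the load must be kept at a small constant.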

Keep it at most a small constant with chaining, and keep it well below 100% with open addressing.

Now, you might wonder what I mean by controlling the load. After all, you're writing this hash table with no idea what some client is going to do with it; they can insert or delete whatever they want. So how do you control alpha? Well, what you can control, under the hood of your hash table implementation, is the number of buckets, the denominator of alpha. So if the numerator starts growing, at some point the denominator should grow as well. What actual implementations of hash tables do is keep track of the population of the hash table, how many objects are being stored, and as this numerator grows, as more and more stuff gets inserted, the implementation ensures that the denominator grows at the same rate, so that the number of buckets also increases. So if alpha exceeds some target, which could be, say, 0.75, or maybe 0.5, then what you can do is, for example, double the number of buckets in your hash table: you define a new hash table, you use a new hash function with double the range, and, having doubled the denominator, the load has dropped by a factor of two. That's how you can keep it under control. Optionally, if space is at a real premium, you can also shrink the hash table if there are a bunch of deletions, say, in a chaining implementation. So that's the first takeaway about what has to happen correctly under the hood in order to get the desired guarantees for hash table performance: you've got to control the load. You have to have a hash table whose size is roughly the same as the number of objects you're storing.

The second thing you've got to get right, and this is something we touched on in the implementation videos, is that you'd better use a good enough hash function. And what's a good hash function? It's one that spreads the data out evenly amongst the buckets. What would really be awesome would be a hash function that works well independent of the data. That's really been the theme of this whole course so far: algorithmic solutions that work independent of any domain assumptions. No matter what the input is, the algorithm is guaranteed to, for example, run blazingly fast. And I can appreciate that this is exactly the sort of thing you'd want to learn from a course like this, right? Take a class on the design and analysis of algorithms, and you learn the secret hash function which always works well.

Unfortunately, I'm not going to tell you such a hash function, and the reason is not because I didn't prepare this lecture, nor because people just haven't been clever enough to discover such a function. The problem is much more fundamental: such a function cannot exist. That is, every hash function has its own kryptonite; there is a pathological data set under which the performance of that hash function will be as bad as that of the most miserable constant hash function you've ever seen. And the reason is quite simple: it's really an inevitable consequence of the compression that hash functions are effectively implementing, from some massive universe down to some relatively modest number of buckets. Let me elaborate.

Fix any hash function, as clever as you can possibly imagine. This hash function h maps some universe U to the buckets indexed from 0 to n - 1. Remember, in all of the interesting applications of hash functions the universe is huge, so the cardinality of U is much, much bigger than n; that's what I'm going to assume here. For example, maybe you're remembering people's names, so the universe is strings of, say, at most 30 characters, and n, I assure you, in any application is going to be much, much smaller than 26 raised to the 30th power.

So now let's use a variant of the pigeonhole principle and claim that at least one of these n buckets must receive at least a 1/n fraction of the keys in the universe. That is, there exists a bucket i, somewhere between 0 and n - 1, such that at least |U|/n keys get mapped to i under this hash function h. The way to see this is just to remember the picture of a hash function mapping, in principle, any key from the universe, all the keys of the universe, to one of these buckets. The hash function has to put each key somewhere in one of the n buckets, so one of the buckets has to receive at least a 1/n fraction of all of the possible keys. A more concrete way of thinking about it is to imagine a hash table implemented with chaining, and imagine, just in your mind, that you hash every single key of the universe into the hash table. This hash table is going to be insanely over-populated; you'll never be able to store it on a computer, because it will have the full cardinality of U objects in it. But it has |U| objects and only n buckets, so one of those buckets has to contain at least a 1/n fraction of the population, that is, at least |U|/n keys.

So the point is that no matter what the hash function is, no matter how cleverly you build it, there's going to be some bucket, say bucket number 31, that gets at least its fair share of the universe mapped to it. Having identified this bucket, to construct our pathological data set all we do is pick from amongst the elements that get mapped to bucket number 31. We can make this data set basically as large as we want, because |U|/n is unimaginably big (because U itself is unimaginably big). In this data set, everything collides: the hash function maps every single element to bucket number 31, and that's going to lead to terrible hash table performance, performance really no better than the naive linked list solution. For example, in a hash table with chaining, in bucket 31 you'll just find a linked list containing every single thing that's ever been inserted into the hash table. For open addressing it's maybe a little harder to see what's going on, but again, if everything collides, you're basically going to wind up with linear-time performance, a far cry from constant-time performance.

Now, for those of you to whom this seems like pointless abstract mathematics, I want to make two points. The first is that, at the very least, these pathological data sets indicate that we will have to discuss hash functions in a different way than we've been discussing algorithms so far. When we talked about MergeSort, we just said it runs in n log n time, no matter what the input is. Dijkstra's algorithm runs in m log n time, no matter what the input is. Breadth-first and depth-first search run in linear time, no matter what the input is. We're going to have to say something different about hash functions: we will not be able to say that a hash table has good performance no matter what the input is. This slide shows that's false.
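The pigeonhole argument can be made concrete with a brute-force sketch. The hash function below (summing character codes mod n) is a deliberately simplistic toy, not any particular system's function, and the tiny "universe" is likewise an illustration stand-in.

```python
from collections import defaultdict

def toy_hash(key, num_buckets):
    # Deliberately simplistic: sum of character codes, taken mod n.
    return sum(ord(c) for c in key) % num_buckets

def pathological_data_set(universe, num_buckets, size):
    """Group the universe by bucket; by the pigeonhole principle some
    bucket receives at least a 1/n fraction of the keys, and any subset
    of that bucket's keys makes every single insertion collide."""
    by_bucket = defaultdict(list)
    for key in universe:
        by_bucket[toy_hash(key, num_buckets)].append(key)
    biggest = max(by_bucket.values(), key=len)
    return biggest[:size]

# A tiny stand-in universe: all two-letter strings over 'a'..'j'.
universe = [a + b for a in "abcdefghij" for b in "abcdefghij"]
bad_keys = pathological_data_set(universe, num_buckets=10, size=5)

# Every key in bad_keys lands in the same bucket, so on this data set a
# chained table degenerates to one long linked list.
print(bad_keys, {toy_hash(k, 10) for k in bad_keys})
```

With a real universe (say, 30-character strings) the colliding set is astronomically large, so such a data set always exists even when it is infeasible to enumerate.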

The second point I want to make is that, while these pathological data sets are of course not likely to arise just by random chance, sometimes you should be concerned about somebody constructing a pathological data set for your hash function, for example in a denial-of-service attack. There's a very clever illustration of exactly this point in a research paper from 2003 by Crosby and Wallach. Their main point was that there are a number of real-world systems, perhaps the most interesting being a network intrusion detection system, that you could bring to their knees by exploiting badly designed hash functions. These were all applications that made crucial use of hash tables, and the feasibility of these systems depended completely on getting constant-time performance from those hash tables. So if you could exhibit a pathological data set for these hash tables and make the performance devolve to linear, devolve to that of a simple linked list solution, the systems would be broken: they would just crash, or they'd fail to function.

Now, we saw on the last slide that every hash table does have its own kryptonite, a pathological data set. But the question is: how can you come up with such a pathological data set if you're trying to mount a denial-of-service attack on one of these systems? The systems that Crosby and Wallach looked at generally shared two properties. First of all, they were open source: you could inspect the code and see which hash function they were using. Second, the hash function was often very simplistic, engineered for speed more than anything else, and as a result it was easy, just by inspecting the code, to reverse-engineer a data set that really did break the hash table, that devolved its performance to linear. For example, in the network intrusion detection application, there was a hash table remembering the IP addresses of packets that were going through, because the system was looking for patterns of data packets that seemed to indicate some sort of intrusion. Crosby and Wallach showed that sending a bunch of data packets with cleverly chosen sender IPs really did crash the system, because the hash table performance blew up to an unacceptable degree.

So how should we address this fact of life, that every hash function has a pathological data set? That question is meaningful both from a practical perspective (what hash functions should we use if we're concerned about someone constructing pathological data sets, for example to implement a denial-of-service attack?) and mathematically (if we can't give the kinds of guarantees we've given so far, data-independent guarantees, how can we mathematically say that hash functions have good performance?). So let me mention two solutions. The first is meant more for the practical question: what hash function should you implement if you're concerned about someone creating pathological data sets?

So there are these things called cryptographic hash functions. One example, really a family of hash functions for different numbers of buckets, is SHA-2. These are really outside the scope of this course; you'd learn more about them in a course on cryptography, and obviously, using these keywords, you can look them up on the web and read more. The one point I want to make is that cryptographic hash functions like SHA-2 do themselves have pathological data sets; they have their own version of kryptonite. The reason they work well in practice is that it's infeasible to figure out what the pathological data set is. So, unlike the very simplistic hash functions that Crosby and Wallach found in the source code of their applications, where it was easy to reverse-engineer bad data sets, for something like SHA-2 nobody knows how to reverse-engineer a bad data set for it. And when I say infeasible, I mean it in the usual cryptographic sense, similar to how one would say it's infeasible to break the RSA cryptosystem if you implement it properly, or infeasible to factor large numbers except in very special cases, and so on. That's all I'm going to say about cryptographic hash functions.

I also want to mention a second solution, which is reasonable both in practice and as something we can say a number of things about mathematically, and that is to use randomization. More specifically, we're not going to design a single clever hash function, because, again, a single hash function, as we know, must have a pathological data set. Instead we're going to design a very clever family of hash functions, and then, at run time, we're going to pick one of these hash functions at random. The kind of guarantee we're going to be able to prove about such a family is very much in the spirit of QuickSort.
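The run-time random choice just described can be sketched as follows. The particular family shown, h(x) = ((a*x + b) mod p) mod n with a and b drawn at random, is one classical construction used here purely for illustration; the properties a family needs are defined formally in the upcoming videos.

```python
import random

def random_hash_function(num_buckets, p=2_147_483_647, rng=random):
    """Pick one function at random, at run time, from the family
    h_{a,b}(x) = ((a*x + b) mod p) mod n, where p is a prime larger
    than any key. An attacker who reads this source code still cannot
    predict which member of the family will actually run."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % num_buckets

h = random_hash_function(num_buckets=1000)
bucket = h(42)
assert h(42) == bucket  # deterministic once chosen, random across runs
```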

So recall that in the QuickSort algorithm, for any fixed pivot sequence there is a pathological input on which QuickSort devolves to quadratic running time. Our solution was randomized QuickSort: rather than committing up front to any particular method of choosing pivots, we pick pivots randomly at run time. And what did we prove about QuickSort? We proved that for any possible input, for any possible array, the average running time of QuickSort is O(n log n), where the average is over the run-time random choices of the algorithm. Here we're going to do the same thing. We'll now be able to say: for any data set, on average over our run-time choice of a hash function, the hash function performs well, in the sense that it spreads the data of that data set out evenly.

So we've flipped the quantifiers from the previous slide. There, we said that if we pre-commit to a single hash function, if we fix one hash function, then there's a data set that breaks that hash function. Here we're flipping it: we're saying that for each fixed data set, a random choice of hash function does well on average on that data set, just like in QuickSort. Notice that this doesn't mean we can't make our program open source. We can still publish code that says "here is our family of hash functions," with the code making a random choice from this set of hash functions. The point is that, by inspecting the code, you'll have no idea which run-time random choices were made by the algorithm. You'll know nothing about what the actual hash function is, so you won't be able to reverse-engineer a pathological data set for the run-time choice of hash function.

The next couple of videos are going to elaborate on this second solution, of using a run-time random choice of hash function as a way of doing well on every data set, at least on average. So let me give you a road map of where we're going from here. I'm going to break the discussion of the details of the randomized solution into three parts, spread over two videos. In the next video we'll begin with the definition of what I mean by a family of hash functions such that, if you pick one at random, you're likely to do pretty well; that's a definition called a universal family of hash functions. Now, a mathematical definition by itself is worth approximately nothing. For it to be valuable, it has to satisfy two properties. First of all, there had better be interesting and useful examples that meet the definition, that is, useful-looking hash functions that meet this definition of a universal family; so the second part will be to show you that such families do indeed exist. And the other thing a mathematical definition needs is applications, so that if you can meet the definition, then good things happen. So that'll be part