
So now that we understand why we can't have a single hash function that always does well on every data set — that is, every hash function is subject to pathological data sets — we'll discuss the randomized solution: how, if we have a family of hash functions and make a real-time decision about which hash function to use, we're guaranteed to do well on average, no matter what the data is.

Let me remind you of the three-part plan I have for this part of the material; in this video we'll cover the first two. Part one, which we'll accomplish on the next slide, is to propose a mathematical definition of a good random hash function — formally, we're going to define a universal family of hash functions. Now, what makes this definition useful? Well, two things. So in part two, we'll show that there are examples of simple, easy-to-compute hash functions that meet this definition — that are universal in the sense described on the next slide. That's important. And the third part, which we'll do in the next video, is the mathematical analysis of the performance of hashing, specifically hashing with chaining, when you use universal hashing. We'll show that if you pick a random function from a universal family, then the expected performance of every operation is constant — assuming, of course, that the number of buckets is comparable to the number of objects in the hash table, which we saw earlier is a necessary condition for good performance.

So let's go ahead and get started, and say what we mean by a good random hash function. For this definition, we'll assume that the universe is fixed. Maybe it's IP addresses, maybe it's our friends' names, maybe it's configurations of a chessboard — whatever. There's some fixed universe U, and we'll also assume we've decided on the number of buckets, n. We call the set H universal if and only if it meets the following condition. In English, the condition says that for each pair of distinct elements, the probability that they collide should be no larger than with the gold standard of perfectly uniform random hashing. So for all distinct keys from the universe — call them x and y — what we want is that if we choose a random hash function h from the set script-H, the probability that x and y collide (and just to be clear, that means x and y hash to exactly the same bucket under this hash function h) is no more than 1/n, where, don't forget, n is the number of buckets.

Again, to interpret this: where does 1/n come from? We said earlier that an impractical but, in some sense, gold-standard hash function would be to assign each key a bucket independently and uniformly at random, with different keys assigned independently. Remember, the reason this is not a practical hash function is that you'd have to remember where everybody went, and that would basically require maintaining a list, which devolves to the list solution — you don't want that. You want hash functions where you have to store almost nothing and can evaluate them in constant time. But if we throw out those requirements of small space and small time, then a random function should spread stuff out pretty evenly, right? That's what it's doing: throwing darts completely at random at these n buckets. So what would be the collision probability of two given keys — say, of Alice and of Bob — if you're doing everything independently and uniformly at random? Well, first Alice shows up and goes to some totally random bucket, say bucket number seventeen. Now Bob shows up. What's the probability that he collides with Alice? Well, there are n buckets Bob could go to, each equally likely, and there's a collision between Alice and Bob if and only if Bob goes to bucket seventeen. Since each bucket is equally likely, that's only a 1-in-n probability. So really, what this condition says is that for each pair of elements, the collision probability should be as small as — as good as — with the holy grail of perfectly random hashing.

This is a pretty subtle definition, perhaps the most subtle one we'll see in this entire course. So, to help you get some facility with it, and to force you to think about it a little more deeply, the next quiz — which is probably harder than a typical in-class quiz — asks you to compare this definition of universal hashing with another definition, and to figure out to what extent they're the same definition.

So the correct answer to this quiz question is the third one: there are hash function families H that satisfy the condition on this slide but are not universal, and on the other hand, there are hash function families H which satisfy this property and are universal. So I'm going to give you an example of each; I'd encourage you to think carefully, offline, about why the example and the non-example work. An easy example showing that sometimes the answer is yes — that you can have universal hash function families H which also satisfy the property on the slide — is to just take H to be the set of all functions mapping the universe to the buckets. That's an awful lot of functions, a huge set, but it's a set nonetheless. And by the symmetry of having all of the functions, it satisfies the property on this slide: it is indeed true that exactly a 1/n fraction of all functions map an arbitrary key k to an arbitrary bucket i. And by the same reasoning — the same symmetry properties — this family is universal. Really, if you think about it, choosing a function at random from this H is just choosing a completely random function, so it's exactly what we've been calling perfectly random hashing. And as we discussed on the last slide, that would indeed have a collision probability of exactly 1/n for each pair of distinct keys. So this shows sometimes you can have both this property and be universal.
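The all-functions example can be checked by brute force on a toy instance. Here is a minimal sketch (the three-element universe and n = 3 are arbitrary small values chosen for illustration, not from the lecture):

```python
from itertools import product

# H = the set of ALL functions from a tiny universe to n buckets.
# Represent each function as a tuple: h[i] is the bucket of universe[i].
universe = ["x", "y", "z"]
n = 3
H = list(product(range(n), repeat=len(universe)))  # n^|U| = 27 functions

def collision_prob(k1, k2):
    """Fraction of functions in H under which keys k1 and k2 collide."""
    i, j = universe.index(k1), universe.index(k2)
    return sum(1 for h in H if h[i] == h[j]) / len(H)

# For every pair of distinct keys the collision probability is exactly 1/n.
print(collision_prob("x", "y") == 1 / n)  # True
```

The same harness, run on a family of n constant functions like the one discussed next in the lecture, reports a collision probability of 1 for every pair of distinct keys — which is exactly why that family is not universal.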

An example where you have the property on this slide but are not universal would be to take H to be a quite small family of exactly n functions, each of which is a constant function. So there's one function which always maps everything to bucket zero — that's a totally stupid hash function. There's another hash function which always maps everything to bucket number one — a different but also totally stupid hash function — and so on, until the nth function, which is the constant function that always maps everything to bucket n-1. And if you think about it, this very silly set H does indeed satisfy the very reasonable-looking property on this slide. Fix any key and fix any bucket, say bucket number 31. What's the probability that you pick a hash function that maps this key to bucket number 31? Well, independent of what the key is, it's the probability that you pick the constant hash function whose output is always 31. Since there are n different constant functions, that's a 1-in-n probability. Yet this family is far from universal: under any constant function, every pair of distinct keys lands in the same bucket, so the collision probability is 1, not 1/n. So that's an example showing that, in some sense, this is not as useful a property as the property of universal hashing. This is really not what you want; it's not strong enough. Universal hashing — that's what you want for strong guarantees.

So now that we've spent some time trying to assimilate probably the subtlest definition we've seen so far in this class, let me let you in on a little secret about the role of definitions in mathematics. On the one hand, I think mathematical definitions often get short shrift, especially in, you know, popular discussions of mathematical research. That said, it's easy to come up with one reason why that's true, which is that any schmo can come along and write down a mathematical definition — nobody's stopping you. So what you really need to do is prove that a mathematical definition is useful. And how do you indicate the usefulness of a definition? Well, you've got to do two things. First of all, you have to show that the definition is satisfied by objects of interest. For us right now, the objects of interest are hash functions we might imagine implementing — so they should be easy to store and easy to evaluate.

So there had better be such hash functions meeting that complicated universal hash function definition. The second thing is that something good had better happen if you meet the definition. And in the context of hashing, what good thing do we want to happen? We want good performance. So those are the two things I owe you in these lectures. First, a construction of practical hash functions that meet the definition — that's what we'll start on right now. Second, why meeting that definition is a sufficient condition for good hash table performance — that will be the next video.

In this example, I'm going to focus on IP addresses, although the hash function construction is general, as I hope will be reasonably clear. As many of you know, an IP address is a 32-bit integer consisting of four different 8-bit parts. So let's just go ahead and think of an IP address as a 4-tuple, the way you often see it written. And since each of the four parts is eight bits, each is a number between zero and 255.
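As a quick aside, this 4-tuple view is easy to compute; a minimal sketch (the sample address is an arbitrary made-up value):

```python
def ip_to_tuple(ip):
    """Split a dotted-quad IPv4 address string into its four 8-bit parts."""
    parts = tuple(int(p) for p in ip.split("."))
    if len(parts) != 4 or not all(0 <= p <= 255 for p in parts):
        raise ValueError("not a valid IPv4 address: " + ip)
    return parts

print(ip_to_tuple("171.64.78.9"))  # (171, 64, 78, 9)
```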

The hash function we're going to construct really isn't so different from the quick-and-dirty hash functions we talked about in the last video, although in this case we'll be able to prove that the hash function family is, in fact, universal. We're again going to use the same compression function: we take the modulus with respect to a prime number of buckets. The only difference is that we're going to multiply the x_i's by a random set of coefficients — we're going to take a random linear combination of x1, x2, x3, and x4.

Let me be a little more precise. We choose a number of buckets, n, and, as we say over and over, the number of buckets should be chosen to be in the same ballpark as the number of objects you're storing — say, as an initial rule of thumb, n should be roughly double the number of objects you're storing. So, for example, maybe we want to maintain something in the ballpark of 500 IP addresses, and we choose n to be a prime like 997.

So here's the construction. Remember, we want to produce not just one hash function — the definition is about a universal family of hash functions — so we need a whole set of hash functions from which we'll ultimately choose one member at random. How do we construct a whole bunch of hash functions in a simple way? Here's how. We define one hash function, which I'm going to denote h sub a, where a is a 4-tuple whose components I'll call a1, a2, a3, and a4. All of the components of a are integers between zero and n-1, so they're exactly in correspondence with the indices of the buckets: if we have 997 buckets, then each of these a_i's is an integer between zero and 996. So it's clear that this defines a whole bunch of functions — for each of the four coefficients you have n options, and those are four independent choices, giving n to the fourth different functions in all.

But what is any given function — how do you actually evaluate one of them? Just remember what a hash function is supposed to do and how it type-checks: it takes as input something from the universe, in this case an IP address, and outputs a bucket number. Here's how we evaluate the hash function h sub a. Remember, a is a 4-tuple, and an IP address is also a 4-tuple; each component of the IP address is between zero and 255, and each component of a is between zero and n-1, so, for example, between zero and 996. What we do is take the dot product — the inner product — of the vector a and the vector x, and then take the modulus with respect to the number of buckets. That is, we compute a1·x1 + a2·x2 + a3·x3 + a4·x4. Now, of course, the x's lie between zero and 255 and the a_i's lie between zero and n-1 — say, zero and 996 — so after these four multiplications you get a pretty big number, and you might well overshoot the number of buckets n. To get back into the range in which the buckets are actually indexed, at the end we take the modulus with respect to the number of buckets, so in the end we do output a number between zero and n-1, as desired.

So that's a set of a whole bunch of hash functions — n to the fourth of them. And each one meets the criteria of being a good hash function from an implementation perspective, right? Remember, we don't want to have to store much to evaluate a function, and for a given hash function in this family, all we have to remember are the four coefficients a1, a2, a3, and a4 — just those four numbers. And to evaluate the hash function on an IP address, we clearly do a constant amount of work: the four multiplications, the three additions, and then taking the modulus by the number of buckets n. So it's constant time to evaluate and constant space to store. And what's cool is that these very simple hash functions, constant time to evaluate and constant space to store, are already enough to meet the definition of a universal family of hash functions. So this fulfills the first promise I owed you after subjecting you to that definition of universal hashing — that there are simple, useful examples that meet the definition — and then, of course, I still owe you why.
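Here is a sketch of the construction just described, in Python (n = 997 as in the lecture; the sample address is an arbitrary made-up value):

```python
import random

N = 997  # number of buckets: a prime, larger than the max component value 255

def draw_hash_function():
    """Pick a random member h_a of the family: each a_i uniform in {0,...,N-1}."""
    a = [random.randrange(N) for _ in range(4)]  # constant space: four numbers
    def h_a(x):
        # x is an IP address as a 4-tuple of integers in [0, 255].
        # Constant time: four multiplications, three additions, one modulus.
        return (a[0] * x[0] + a[1] * x[1] + a[2] * x[2] + a[3] * x[3]) % N
    return h_a

h = draw_hash_function()
print(0 <= h((171, 64, 78, 9)) < N)  # True
```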

Why does meeting this definition lead to good performance? That will be the next video. But I want to conclude this video by actually proving the theorem — arguing that this is, in fact, a universal family of hash functions. This will be a mostly complete proof, and it will certainly have all of the conceptual ingredients of why the proof works. There will be one spot where I'm a little hand-wavy, because we need a little number theory, and I don't want to take a big detour into number theory. And if you think about it, you shouldn't be surprised that basic number theory plays at least some role: as I said, we should choose the number of buckets to be prime, so at some point in the proof you should expect us to use the assumption that n is prime, and pretty much any use of that assumption will involve at least elementary number theory, okay? But I'll be clear about where I'm being hand-wavy.

So what do we have to prove? Let's quickly review the definition of a universal family of hash functions. We have our set H, and we know exactly what it is. What does it mean that it's universal? It means that for each pair of distinct keys — in our context, for each pair of distinct IP addresses — the probability that a random hash function from our family script-H causes a collision, that is, maps these two IP addresses to the same bucket, should be no worse than with perfectly random hashing: no worse than 1/n, where n is the number of buckets, say 997.

So, the definition we need to meet is a condition for every pair of distinct keys; let's just start by fixing two distinct keys. I'm going to assume for this proof that these two IP addresses differ in their fourth component — that is, that x4 is different from y4. I hope it's intuitively clear that it shouldn't matter which set of eight bits I'm looking at: they're different IP addresses, so they differ somewhere. If I really wanted to, I could write four cases that are totally identical, depending on whether they differ in the first eight bits, the next eight bits, the next eight bits, or the last eight bits. I'm going to show you one case, because the other three are the same; so let's just think of the last eight bits as being different.

And now, remember what the definition asks us to prove: that the probability that these two IP addresses collide is at most 1/n. So we need an upper bound on the collision probability with respect to a random hash function from our set of n to the fourth hash functions. I want to be clear on the quantifiers. We're thinking about two fixed IP addresses — for example, the IP address of the New York Times website and the IP address of the CNN website. We're asking, for these two fixed IP addresses, what fraction of our hash functions cause them to collide, right? We'll have some hash functions which map the New York Times and CNN IP addresses to the same bucket, and we'll have other hash functions which do not map those two IP addresses to the same bucket.

And we're trying to say that the overwhelming majority send them to different buckets — that at most a 1/n fraction send them to the same bucket. So we're asking about the probability, over the choice of a random hash function from our set H, that the function maps the two IP addresses to the same place. The next step is just algebra. I'm going to take the equation which says exactly when the two IP addresses collide under a hash function, expand the definition of the hash function — remember, it's just this inner product modulo the number of buckets n — and rewrite the condition in a more convenient way. All right, so after the algebra, once the dust has settled, we're left with an equation that is equivalent to the two IP addresses colliding. Again, we're interested in the fraction of choices of a1, a2, a3, and a4 such that this condition holds, right? It will hold for some choices of the a_i's and won't hold for others, and we're going to show that it almost never holds — okay, that it fails for all but a 1/n fraction of the choices of the a_i's.
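The slide's equation isn't reproduced in the transcript; reconstructed from the definitions above, the collision condition h_a(x) = h_a(y) unwinds as

    a1·x1 + a2·x2 + a3·x3 + a4·x4  ≡  a1·y1 + a2·y2 + a3·y3 + a4·y4   (mod n),

which, moving terms across, is equivalent to

    a4·(x4 − y4)  ≡  a1·(y1 − x1) + a2·(y2 − x2) + a3·(y3 − x3)   (mod n).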

So next we're going to do something a little sneaky. This trick is sometimes called the Principle of Deferred Decisions, and the idea is that when you have a bunch of random coin flips, it's sometimes convenient to flip some but not all of them: fixing part of the randomness clarifies the role that the remaining randomness is going to play. That's what's going to happen here. So let's go ahead and flip the coins which tell us the random choices of a1, a2, and a3. Again, remember, in the definition of a universal family of hash functions, you analyze the collision probability under a random choice of hash function. What does it mean for us to choose a random hash function? It means a random choice of a1, and a2, and a3, and a4 — four random choices. And what I'm saying is, let's condition on the outcomes of the first three. Suppose we knew that a1 came up 173, a2 came up 122, and a3 came up 723, but we don't know what a4 is; a4 is still equally likely to be any of 0, 1, 2, all the way up to n-1.

So remember that what we want to prove is that at most a 1/n fraction of the choices of a1, a2, a3, and a4 cause the underlined equation to be true — cause a collision. What we're going to show is that for each fixed choice of a1, a2, and a3, at most a 1/n fraction of the choices of a4 cause the equation to hold. And if we can show that for every single choice of a1, a2, and a3 — no matter how those random coin flips come out — at most a 1/n fraction of the remaining outcomes satisfy the equation, then we're done: that means that at most a 1/n fraction of the overall outcomes can cause the equation to be true. If you haven't seen the Principle of Deferred Decisions before, you might want to think about this a little bit offline, but it's easily justified by, say, two lines of algebra.

Okay, so we're done with the setup, and we're ready for the meat of the argument. What we've done is identify an equation, now shown in green, which holds if and only if we have a collision between the two IP addresses.

And the question we need to ask is: for fixed choices of a1, a2, and a3, how frequently will the choice of a4 cause this equation to be satisfied — cause a collision? Now, here is why we played this trick of the Principle of Deferred Decisions. By fixing a1, a2, and a3, the right-hand side of this equation is now just some fixed number between zero and n-1 — maybe it's 773, right? The x_i's were fixed up front, the y_i's were fixed up front, we fixed a1, a2, a3 at the end of the last slide, and those are the only quantities involved in the right-hand side. So say the right-hand side is 773. Over on the left-hand side, x4 is fixed and y4 is fixed, but a4 is still random — an integer equally likely to be any value between zero and n-1.

Now here's the key claim: the left-hand side of this green equation is equally likely to be any number between zero and n-1. And I'll tell you the reasons why this key claim is true, although this is the point where we need a little bit of number theory, so I'll be kind of hand-wavy about it. There are three things we have going for us. The first is that x4 and y4 are different. Remember, our assumption at the beginning of the proof was that the IP addresses differ somewhere, so why not just assume that they differ in the last eight bits. Again, this isn't important — if you really wanted to be pedantic, you could write three other cases depending on which other bits the IP addresses might differ in. But anyway, because x4 and y4 are different, x4 − y4 is not zero.

And in fact, now that I write this, it jogs my memory of something I should have told you earlier and forgot, which is that the number of buckets n should be at least as large as the maximum component value. So, for example, we definitely want the number of buckets n in this equation to be bigger than x4 and bigger than y4. The reason is that otherwise x4 and y4 could be different from each other, but their difference could still wind up being zero mod n. For example, suppose n were 4, x4 were 6, and y4 were 10. Then x4 − y4 would be −4, and that's actually zero modulo 4. So that's not what you want.

You want to make sure that if x4 and y4 are different, then their difference is non-zero modulo n, and the way you ensure that is by making n bigger than each of them. So you should choose a number of buckets bigger than the maximum component value. In our IP address example, remember, the components don't get bigger than 255, and I was suggesting a number of buckets equal to the prime 997. Now, in general, this is never a big deal in practice. If you only wanted to use, say, 100 buckets — you didn't want to use 1000, you wanted 100 — well, then you could just use smaller components: instead of breaking the IP address into 8-bit chunks, you could break it into 6-bit chunks, or 4-bit chunks, and that would keep the component size smaller than the number of buckets, okay? So you can choose the number of buckets first, and then choose how many bits to chunk your data into, and that's how you make sure this condition is satisfied.

So, back to the three things we have going for us in trying to prove this key claim. First, x4 and y4 are different, so their difference is non-zero modulo n. Second, n is prime — that was part of the definition, part of the construction. And third, a4, the final coefficient, is equally likely to take on any value between zero and n-1. So, just as a plausibility argument, let me give you a proof by example. Again, I don't want to detour into elementary number theory — although it's beautiful stuff, so I encourage those of you who are interested to go learn some and figure out exactly how you prove it.

You really only need the barest elementary number theory to give a formal proof of this. But just to show you that it's true in some simple examples, let's think about a very small prime. Say there are seven buckets, and suppose the difference between x4 and y4 is two. So, having chosen the parameters — I've set n = 7 and set the difference equal to two — what I want to do is step through the seven possible choices of a4 and look at what we get in the blue-circled quantity on the left-hand side of the green equation. We want to say the left-hand side is equally likely to be any of the seven numbers between zero and six, so that means that as we try our seven different choices for a4, we'd better get the seven different possible numbers as output. So, for example, if we set a4 = 0, then the blue-circled quantity is certainly itself zero. If we set it equal to one, then it's 1·2, so we hit two. For two, we get 2·2, which is four. For three, we get 3·2, which is six. Now, when we set a4 = 4, we get 4·2, which is eight — modulo seven, that's one. 5·2 is ten, which modulo seven is three. 6·2 is twelve, which modulo seven is five. So as we cycle a4 through zero to six, we get the values zero, two, four, six, one, three, five — indeed, we cycle through the seven possible outcomes one by one. So if a4 is chosen uniformly at random, then this blue-circled quantity will also be uniformly random.

Just to give another example, consider a different difference between x4 and y4. Again, we have no idea what it is, other than that it's non-zero — so maybe instead of two, it's three. Now again, let's step through the seven choices of a4 and see what we get. This time we get zero, then three, then six, then two, then five, then one, then four. So again, stepping through the seven choices of a4, we get all seven different possibilities on the left-hand side. And it's not an accident of these choices of parameters: as long as n is prime and x4 and y4 are different, as a4 ranges over all possibilities, so will the value on the left-hand side. So by choosing a4 uniformly at random, the left-hand side is indeed equally likely to be any of its possible values: zero, one, two, up to n-1. And so, what does that mean?
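The two worked examples (and the general claim for prime n) can be checked mechanically; a small sketch:

```python
# For prime n, the map a4 -> (a4 * d) % n hits every residue exactly once
# whenever d is non-zero mod n -- here n = 7 and d = x4 - y4 as in the lecture.
n = 7
for d in (2, 3):  # the two example differences
    values = [(a4 * d) % n for a4 in range(n)]
    print(d, values, sorted(values) == list(range(n)))
# 2 [0, 2, 4, 6, 1, 3, 5] True
# 3 [0, 3, 6, 2, 5, 1, 4] True
```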

Well, basically, it means that we're done with our proof. Because remember, the right-hand side, circled in pink, is fixed: we fixed a1, a2, and a3, and the x's and y's have been fixed all along, so it's just some number, like 773. And so we know there's exactly one choice of a4 that will cause the left-hand side to also be equal to 773. Now, a4 has n different possible values, and it's equally likely to take on each one, so there's only a 1-in-n chance that we get the unlucky choice of a4 that causes the left-hand side to be equal to 773. And of course, there's nothing special about 773 — no matter how the right-hand side comes out, we have only a 1-in-n chance of being unlucky and having a collision. And that is exactly the condition we were trying to prove, and it establishes the universality of this family of n^4 very simple, very easy-to-evaluate hash functions.