
Let's begin by building up some intuition about what we would want from a hash function, now that we know how hash functions are usually implemented. So let's start with a hash table implemented with chaining. What's going to be the running time of our lookup, insert, and delete operations in a hash table with chaining? Well, the happy operation in a hash table with chaining is insertion. Insertion, we can just say without any qualifications, is constant time. This requires the sort of obvious optimization that when you do an insert, you insert the new object at the front of the list in its bucket. There's no reason to insert at the end; that would be silly. The plot thickens when we think about the other two operations, deletion and lookup. So let's just think about lookup.

Deletion's basically the same. So how do we implement lookup? Well, remember, when we get some key x, we invoke the hash function: we compute h(x). That tells us which bucket to look in. So if it tells us 17, we know that x may or may not be in the hash table, but at this point we know that if it's in the hash table, it's got to be in the linked list in the seventeenth bucket. So now we descend into this bucket, we find ourselves a linked list, and we have to resort to an exhaustive search through the list in the seventeenth bucket to see whether or not x is there. We know how long it takes to search a regular list for some element: it's linear in the list length.

And now we're starting to see why the hash function might matter. Suppose we insert 100 objects into a hash table with 100 buckets. If we have a super lucky hash function, then perhaps each bucket gets its own object: there's one object in each of the lists, in each of the buckets. Then theta of the list length is just theta of one, and we're doing great; constant-length lists mean constant-time lookup and delete. A really stupid hash function, on the other hand, would map every single object to bucket number zero. Then if you insert 100 objects, they're all in bucket number zero, the other 99 buckets are empty, and every lookup or delete just resorts to the naive linked-list solution, with running time linear in the number of objects in the hash table. So the length of the longest list can vary anywhere from m/n, where m is the number of objects and n is the number of buckets (that's when you have equal-length linked lists), up to m, when you use this ridiculous constant hash function and all the objects wind up in the same list.

So the main point I'm trying to make here is that, at least with chaining, where the running time is governed by the list length, the running time depends crucially on the hash function: on how well the hash function distributes the data across the different buckets.

Something analogous is true for hash tables that use open addressing. Here there aren't any linked lists to keep track of, so the running time is governed by the length of the probe sequence: how many buckets do you have to look at before you either find the thing you're looking for or, if you're doing an insertion, find an empty bucket in which to insert the new object? And again, the probe sequence depends on the hash function. For really good hash functions, ones that spread data out evenly, you expect probe sequences to be not too bad, at least intuitively. And for, say, the constant hash function, you'd expect the probe sequences to grow linearly with the number of objects you insert into the table. So again the point remains true: the performance of a hash table, in either implementation, really depends on what hash function you use.

So, having built up this intuition, we can now say what it is we want from a hash function. First, we want it to lead to good performance. Using the chaining implementation as a guiding example: if the size of the hash table, n, is comparable to the number of objects, m, it would be really cool if all of the lists had basically constant length, giving us our constant-time operations. Equal-length lists are way better than unequal-length lists in a hash table with chaining. So what we want the hash function to do is spread the data out as equally as possible amongst the different buckets. Something similar is true with open addressing: you want the hash function to spread the data as uniformly as possible across the possible places you might probe. And in hashing, the usual gold standard for spreading data out is the performance of a completely random function. You can imagine that for each object that shows up, you flip some coins and, with each of the n buckets equally likely, put the object in one of the n buckets, flipping independent coins for every different object. Because you're just throwing darts at the buckets independently, you'd expect this to spread the data out quite well.
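This "darts at buckets" thought experiment is easy to simulate. The sketch below (illustrative Python, not code from the lecture) throws m objects into n uniformly random buckets and reports the fullest bucket's load:

```python
import random
from collections import Counter

def max_bucket_load(m, n, seed=0):
    """Throw m objects into n buckets, choosing a bucket for each
    object uniformly and independently at random: the 'completely
    random function' gold standard."""
    rng = random.Random(seed)
    loads = Counter(rng.randrange(n) for _ in range(m))
    return max(loads.values())

# With 100 objects and 100 buckets, the fullest bucket stays small,
# nothing like the worst case of 100 that the constant hash function
# (every object to bucket zero) would produce.
print(max_bucket_load(100, 100))
```

Running this for 100 objects and 100 buckets typically gives a maximum load of only a handful of objects, which is why completely random hashing is the benchmark to aim for.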

But, of course, it's not good enough just to spread data out. It's also important that we don't have to work too hard to remember what our hash function is and to evaluate it. Remember, every time we do any of these operations, an insert or a delete or a lookup, we're going to be applying our hash function to some key x.
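That per-operation hash evaluation is visible even in a minimal sketch of the chaining scheme described above (illustrative Python, not code from the lecture; the hash function h is pluggable so you can experiment with good and bad choices):

```python
from collections import deque

class ChainedHashTable:
    """Minimal hash table with chaining (illustrative sketch)."""

    def __init__(self, num_buckets, h):
        self.n = num_buckets
        self.h = h                       # evaluated once per operation
        self.buckets = [deque() for _ in range(num_buckets)]

    def insert(self, x):
        # The 'obvious optimization': add at the FRONT of the bucket's
        # list, which takes constant time.
        self.buckets[self.h(x) % self.n].appendleft(x)

    def lookup(self, x):
        # Exhaustive scan of one bucket: linear in that list's length.
        return x in self.buckets[self.h(x) % self.n]

    def delete(self, x):
        bucket = self.buckets[self.h(x) % self.n]
        if x in bucket:
            bucket.remove(x)
```

Notice that every one of the three operations calls h exactly once before touching a bucket.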

So every operation includes a hash function evaluation, and if we want our operations to run in constant time, evaluating the hash function had better also run in constant time. And this second property is why we can't actually implement completely random hashing. There's no way we can actually, every time we want to insert, say, Alice's phone number, flip a new batch of random coins.

Suppose we did. Suppose we flipped some random coins and they told us to put Alice's phone number into the 39th bucket. Well, later on we might do a lookup for Alice's phone number, and we'd better remember the fact that we're supposed to look in the 39th bucket for it. But what does that mean? It means we have to explicitly remember the choice we made; we have to write down somewhere the fact that Alice is in bucket number 39. And if every single insertion is done with fresh coin flips, you have to remember all of the different random choices independently. This really just devolves back to the naive list-based solution that we discussed before: evaluating the hash function is going to take us linear time, and that defeats the purpose of a hash table. So again we want the best of both worlds: a hash function which we can store in, ideally, constant space and evaluate in constant time, but which spreads the data out just as well as the gold standard of completely random hashing.

So I want to touch briefly on the topic of how you might design hash functions, and in particular good hash functions that have the two properties we identified on the previous slide. But I have to warn you: if you ask ten different serious, hardcore programmers about their approach to designing hash functions, you're likely to get ten somewhat different answers. The design of hash functions is a tricky topic, and it's as much art as science at this point, despite the fact that there's a ton of science, actually a very beautiful theory, about what makes good hash functions. We'll touch on a little bit of that in a different optional video. And if you only remember one thing from this video, or from these next couple of slides, the thing to remember is the following.

Remember that it's really easy to inadvertently design bad hash functions, and bad hash functions lead to poor hash table performance, much poorer than you would expect given the discussion we've had in this video. So if you have to design your own hash function, do your homework: get some examples, learn what other experts are doing, and use your best judgment. If you just do something without thinking about it, it's quite possible you'll get quite poor performance, much poorer than you were expecting.

To drive home this point, suppose your keys are phone numbers. I'm going to be very United States centric here and just focus on the ten-digit phone numbers inside the US. So the universe size is 10^10, which is quite big, and probably not something you really want to deal with explicitly. And let's consider an application where you're only keeping track of at most a few hundred or a thousand phone numbers. So we need to choose a number of buckets; say we're expecting no more than 500 phone numbers or so, we double that, and we take the number of buckets to be 1,000. Now I have to come up with a hash function. And remember, all a hash function does, by definition, is map anything in the universe to a bucket number. So it has to take as input a ten-digit phone number and spit out as output some number between 0 and 999. Beyond that, we have complete flexibility in how to define this mapping.

Now, when you're dealing with keys that have all these digits, it's very tempting to just project onto a subset of the digits. And if you want a really terrible hash function, just use the most significant digits of the phone number to define the mapping from phone numbers to buckets. I hope it's clear why this is a terrible choice of hash function. Maybe you're a company based in the San Francisco Bay Area, and you're storing phone numbers of customers in your area. The area code for San Francisco is 415, so maybe 20, 30, 40 percent of your numbers are going to have area code 415, and all of those hash to exactly the same bucket, bucket number 415. So you're going to get an overwhelming percentage of the data mapping to this one bucket. Meanwhile, not all 1,000 possibilities for these three digits are even legitimate area codes; not all three-digit numbers are area codes in the United States. So there will be buckets of your hash table that are guaranteed to be empty. You waste a ton of space in your hash table, you have a huge list in the bucket corresponding to 415, a huge list in the bucket corresponding to 650 (the area code at Stanford), you're going to have very slow lookups for everything that hashes to those two buckets, and there's going to be a lot of stuff that hashes to those two buckets. So: terrible idea.

A better but still mediocre hash function would be the same trick using the last three digits instead of the first three. This is better than our terrible hash function because there aren't ridiculous clumps of phone numbers with exactly the same last three digits. But still, using this hash function is tantamount to assuming that the last three digits of phone numbers are uniformly distributed among all 1,000 possibilities, and there's really no evidence that that's true. There are going to be patterns in phone numbers, maybe a little subtle to see with the naked eye, which will be exposed if you use a mediocre hash function like this.

So let's look at another example. Perhaps you're keeping track of objects by where they're laid out in memory; in other words, the key for an object is just its memory location. If the objects are aligned on, say, four-byte boundaries, then you're guaranteed that every memory location is a multiple of four. So for our second example, the universe of possible keys is the set of possible memory locations, and the hash function is responsible for taking as input the memory location of some object and spitting out a bucket number. Now, generally, because of the byte-oriented structure of memory, memory locations are going to be multiples of some power of two; in particular, memory locations are going to be even. And that sets up a bad choice of hash function. Remember, the hash function takes as input the memory location, which is some possibly really big number, and we want to compress it down to a bucket number. Let's again think of a hash table where we choose N equal to 10^3, or 1,000 buckets.

So then the question is how this hash function is going to take this big number, the memory location, and squeeze it down to a small number, one of the buckets. Let's just use the same idea as in the mediocre hash function and look at the least significant digits, which we can express using the mod operator: we pick the hash function h(x), where x is the memory location, to be x mod 1,000. Again, the meaning of the 1,000 is that it's the number of buckets we've chosen for our hash table, because we're going to remember roughly at most 500 different objects. And don't forget what the mod operation means: you essentially subtract multiples of 1,000 until you get down to a number less than 1,000. In this case, it means that if you write out x base ten, you just take the last three digits; so in that sense, this is the same hash function as our mediocre one from the phone number example. Now, we discussed how the keys here are all going to be memory locations, and in particular even numbers, and here we're taking them modulo an even number. What does that mean? It means every single output of this hash function will itself be an even number: you take an even number, you subtract a bunch of multiples of 1,000, and you're still going to have an even number.
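This parity argument is easy to check directly (illustrative Python; the four-byte-aligned addresses are hypothetical data in the spirit of the example):

```python
# Keys: memory addresses of objects aligned on 4-byte boundaries,
# so every key is a multiple of 4 and, in particular, even.
# Number of buckets: the even number 1000.
keys = [4 * k for k in range(1, 251)]          # 250 hypothetical addresses
used_buckets = {x % 1000 for x in keys}

# Every bucket index that actually occurs is even...
assert all(b % 2 == 0 for b in used_buckets)
# ...so the 500 odd-numbered buckets can never be used,
# no matter which even keys show up.
```

Changing the bucket count to an odd number (say 999) makes the assertion about evenness fail, which is exactly the point: the problem is the shared factor, not the keys themselves.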

So this hash function is incapable of outputting an odd number. What does that mean? It means at least half of the buckets in the hash table will be completely empty, guaranteed, no matter what keys you're hashing. And that's ridiculous: it's ridiculous to have a hash table 50 percent of which is guaranteed to be empty. Again, what I want you to remember, hopefully long after this class is completed, is not so much these specific examples, but the general point I'm making: it's really easy to design bad hash functions, and bad hash functions lead to hash table performance much poorer than what you're probably counting on.

Now that we're equipped with examples of bad hash functions, it's natural to ask: what are some good hash functions?

Well, it's actually quite tricky to answer that question, and I'm not really going to answer it on this slide. I don't promise that the hash functions I'm about to tell you about are good in any strong sense of the word. I will say they're not obviously bad hash functions; let's call them somewhat better hash functions. In particular, if you just need a hash function, a quick and dirty one, and you don't want to spend too much time on it, the method I'm going to talk about on this slide is a common way of doing it. On the other hand, if you're designing a hash function for some really mission-critical code, you should learn more than what I'm going to tell you on this slide: do more research about the best hash functions and the state of the art. But if you just need a quick one, what we say on this slide will do in most situations.

The design of a hash function can be thought of as two separate parts. Remember, by definition a hash function takes as input something from the universe (an IP address, a name, whatever) and spits out a bucket number. It can be useful to factor that into two steps. First, you take an object which is not intrinsically numeric, something like a string or something more abstract, and you somehow turn it into a number, possibly a very big number. Then you take that possibly big number and map it to a much smaller number, namely the index of a bucket. These two steps are sometimes given names: the first step is computing the hash code of an object, and the second step is applying a compression function.

In some cases you can skip the first step. For example, if your keys are Social Security numbers, they're already integers; if they're phone numbers, they're already integers. Of course, there are applications where the objects are not numeric. For example, maybe they're strings, maybe you're remembering names. Then producing the hash code basically boils down to writing a subroutine that takes as input a string and outputs some possibly very big number. There are standard methods for doing that, and it's easy to find resources with example code for converting strings to integers, so I'll just say a sentence or two about it. Each character in a string is easy to regard as a number in various ways, say via its ASCII code. Then you just have to aggregate all of the different numbers, one number per character, into some overall number. One thing you can do is iterate over the characters one at a time, keeping a running sum: with each character, you multiply the running sum by some constant and then add the new character's code, taking a modulus if you need to prevent overflow. The point of this one-to-two-sentence description of the subroutine is just to give you a flavor of what these are like, and to make sure you're not scared of them at all; they're very simple programs. But again, I do encourage you to look up actual examples on the Web or in a programming textbook.

So that leaves the question of how to design the compression function. You take as input a huge integer; maybe your keys were already numeric, like Social Security numbers or IP addresses, or maybe you used a subroutine to convert a string, like your friend's name, into some big number. The point is, you have a number in the millions or the billions, and you need to map it to one of the buckets, of which there are, again, maybe a thousand or so. The easiest way to do that is something we already saw on a previous slide: just take the modulus with respect to the number of buckets. Certainly one positive thing you can say about this compression function is that it's super simple, both in the sense that it's simple to code and in the sense that it's simple to evaluate. Remember, that was our second goal for a hash function: it should be simple to store (here there's actually nothing to store) and quick to evaluate, and this certainly fits the bill. Now the problem is the first property we wanted: the hash function should spread the data out equally. What we saw on the previous slide is that if you choose the number of buckets N poorly, you can fail to have that first property, and in that respect fail to be a good hash function. For example, if N is even and all of your objects are even, it's a disaster: all of the odd buckets stay completely empty. And honestly, this is a pretty simplistic method. Like I said, this is a quick and dirty hash function.
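Putting the two steps together, a quick-and-dirty hash along these lines might look like the following sketch (illustrative Python; the multiplier 31, the large prime cap, and the bucket count 997, a prime near 1,000, are my choices for the example, not constants prescribed in the lecture):

```python
def string_hash_code(s: str) -> int:
    """Hash code step: aggregate per-character codes into one big
    number. Keep a running sum, multiply by a constant at each step,
    add the next character's code, and take a modulus to prevent
    the number from growing without bound."""
    code = 0
    for ch in s:
        code = (code * 31 + ord(ch)) % (2**61 - 1)
    return code

def bucket(s: str, num_buckets: int = 997) -> int:
    """Compression step: reduce the big hash code to a bucket index
    by taking the modulus with respect to the number of buckets."""
    return string_hash_code(s) % num_buckets
```

The hash code step handles non-numeric keys; the compression step is the same modulus trick used for phone numbers and memory locations.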

Â So, no matter how you choose the number of buckets N, it's not gonna be a perfect

Â hash function in any sense of the word. That said, there are some rules of thumb

Â for how to pick the number of buckets, how to pick this modulus, so that you don't

Â get the problems that we saw on the previous slide. So, I'll conclude this

Â video just with some standard rules of thumb. You know, if you just need a quick

Â and dirty hash function, you're gonna use the, the modulus compression function, how

Â do you choose N? Well, the first thing is we definitely don't want to have the

Â problem we had in the last slide, where we're guaranteed to have these empty

Â buckets no matter what the data is. So what went wrong in the previous slide?

Well, the problem is that all of the data elements were divisible by two, and the hash function's modulus, the number of buckets, was also divisible by two. Because they shared a common factor, namely two, all of the odd buckets were guaranteed to remain empty. And this is a problem, more generally, whenever the data shares any common factor with N, the number of buckets. In other words, if all of your data elements are multiples of three, and the number of buckets is also a multiple of three, you've got a big problem: everything hashes into bucket numbers that are multiples of three, and two-thirds of your hash table goes unfilled. So the upshot is, you really want the number of buckets not to share any factors with the data that you're hashing. And how do you reduce the number of common factors? Well, you just make sure the number of buckets has very few factors, which means you should choose N to be a prime number, a number with no nontrivial factors. And let me remind you, the number of buckets should also be comparable to the size of the set you're planning to store. At no point did we need N to be very closely connected to the number of elements you're storing, just within, say, some small constant factor, and you can always find a prime within a small constant factor of a target number of elements. If the number of buckets in your hash table isn't too big, if it's just in, say, the thousands or maybe the tens of thousands, you can just look up a list of known primes and pick out a prime of about the magnitude you're looking for. If you're going to use a really huge number of buckets, in the millions or more, there are primality-testing algorithms you can use that will help you find a prime in about the range you're looking for.

So that's the first-order rule of thumb you should remember if you're using the modulus compression function: set the number of buckets equal to a prime, so you're guaranteed not to have a nontrivial common factor between the modulus and all of your data. There are also a couple of second-order optimizations that people often mention. You also want the prime to be not too close to patterns in your data. So what does that mean, patterns in your data?

Well, in the phone number example, we saw that patterns emerged in the data when we expressed it base ten. For example, there were crazy amounts of clumping in the first three digits of a phone number written base ten, because those corresponded to the area code. And with memory locations, when we expressed them base two, there were crazy correlations in the low-order bits. These are the two most common examples: there are digits in the base-ten representation, or bits in the base-two representation, that exhibit patterns, that is, non-uniformity. That suggests that the prime N you choose, all else being equal, shouldn't be too close to a power of two, and shouldn't be too close to a power of ten. The thinking is that such a prime will spread out more evenly those data sets that do have patterns in their base-two or base-ten representations.

So in closing, this is a recipe I recommend for coding up hash functions if what you're looking to do is minimize programmer time, subject to not coming up with a hash function which is completely broken. But I want to reiterate: this is not the state of the art in hash function design. There are hash functions which are, in some sense, better than the ones explained on this slide. If you're responsible for some really mission-critical code that involves a hash function, you should study the subject more deeply than we've been able to do here. We'll touch on some of the issues in a different optional video, but really you should do additional homework: find out about the state of the art in hash function design, and also look into implementations of open addressing and its probing strategies.

And above all, you really should consider coding up multiple prototypes and seeing which one works best. There's no silver bullet, no panacea, in the design of hash tables. I've given you some high-level guidance about the different approaches, but ultimately it's going to be up to you to find the optimal implementation for your own application.
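The rules of thumb above (a prime number of buckets, roughly proportional to the number of objects you expect, not too close to a power of two or a power of ten) can be collected into a small helper. This is an illustrative sketch: the doubling factor and the ten-percent "closeness" tolerance are my assumptions, not prescriptions from the video.

```python
def is_prime(n: int) -> bool:
    # Trial division: plenty fast for bucket counts in the thousands.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def near_power(n: int, base: int, tolerance: float = 0.1) -> bool:
    # True if n is within ~10% of some power of `base`.
    p = 1
    while p <= n * 2:
        if abs(n - p) <= tolerance * p:
            return True
        p *= base
    return False

def choose_num_buckets(expected_objects: int) -> int:
    """Smallest prime >= 2 * expected_objects that is not too close
    to a power of two or a power of ten."""
    n = 2 * expected_objects
    while not (is_prime(n)
               and not near_power(n, 2)
               and not near_power(n, 10)):
        n += 1
    return n
```

For the running example of about 500 phone numbers, this rejects candidates near 1,000 and 1,024 and lands on a prime a bit above them.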
