0:03

So let's look at a little bit of the context of hashing in practical applications. As I mentioned, it's very widely used. Here's an example right from Java. In the first implementation, Java 1.1, the designers found that the cost of computing the hash function for strings seemed to be excessive, particularly for long strings. And that was one of the main uses of hashing: being able to search with string keys. So what they decided in the first implementation was to just look at every eighth or ninth character, so that they wouldn't have to spend a lot of time computing the hash function. They had a hash function pretty much like the one that we use.
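That hash function, which is what Java's String.hashCode() computes to this day, is the base-31 polynomial s[0]·31^(n-1) + s[1]·31^(n-2) + ... + s[n-1], evaluated by Horner's method. A minimal sketch:

```java
public class FullStringHash {
    // Horner's-method evaluation of the base-31 polynomial hash,
    // equivalent to java.lang.String.hashCode().
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);   // every character contributes
        return h;
    }

    public static void main(String[] args) {
        String key = "http://www.example.com/index.html";
        // agrees with the built-in implementation
        System.out.println(hash(key) == key.hashCode());   // true
    }
}
```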

Except that it computed a skip, which meant that it only looked at about every eighth character, so it wouldn't have to do quite so much work computing the hash function. And that's definitely one thing to consider when using hashing: the cost of computing the hash function for a complicated key might exceed the cost of searching in a simpler structure like a binary search tree. Anyway, for Java 1.1, what happened was that there was huge potential for really bad collision patterns on typical data. So here's an example of typical data, which is a URL.
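Here's a rough sketch of the shortcut; this is a hypothetical reconstruction of the 1.1-era idea (the exact skip rule is an assumption), not the actual JDK source:

```java
public class SkipHash {
    // Hypothetical reconstruction of the shortcut: sample only about
    // eight characters of the string instead of hashing every one.
    static int skipHash(String s) {
        int skip = Math.max(1, s.length() / 8);
        int h = 0;
        for (int i = 0; i < s.length(); i += skip)
            h = 31 * h + s.charAt(i);   // skipped positions never contribute
        return h;
    }

    public static void main(String[] args) {
        // Two different URLs that agree on the sampled positions collide,
        // even though the full base-31 hash would separate them.
        String u1 = "http://www.example.com/page-0001";
        String u2 = "http://www.example.com/page-0091";
        System.out.println(skipHash(u1) == skipHash(u2));    // true
        System.out.println(u1.hashCode() == u2.hashCode());  // false
    }
}
```

Only the sampled positions matter, so any two keys that agree there collide, no matter how much they differ elsewhere.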

All of these keys, which are totally different, would wind up hashing to the same value. And so client programs and system programs on the Java system were getting terrible performance from their symbol tables because of the shortcut in the hash function. So this illustrates well that you need to use all of the data in the hash function. And sometimes, when we do a closer analysis, the cost of computing the hash function can mean that something like red-black trees will outperform hashing, even for just search and insert. There's another thing about the uniform hashing assumption: it is an assumption. If you are writing code that has to have guaranteed performance, like when your aircraft is landing, or you are controlling a nuclear reactor or somebody's pacemaker, and that assumption doesn't hold and you get bad performance, you're going to have disastrous consequences. So that's another reason to think about maybe paying a little extra and using the guarantees that you get with red-black search trees.

2:47

Instead of hashing. And there's another surprising situation that happens in today's world. For example, Java publishes its hash function. So if you're trying to provide a service over the web, an adversary can learn your hash function and send you data that causes huge performance problems, just by making all of that data hash to one particular value. That's definitely something to worry about. In the real world, you can nowadays find on the web particular sequences of keys that will cause particular services to crash. And again, that's a little harder to do with something like a red-black tree, where we have performance guarantees. When you make an assumption, and you're depending on that assumption, you'd better be sure that it holds somehow. This is different than, for example, quicksort, where our assumption was that we were going to create randomness and depend on that randomness. In this case, we're just hoping for randomness, and maybe that doesn't always hold. So that's certainly something to be aware of when using hashing in practice. So here's a simple example of hashing in Java. It's

pretty easy to find a family of strings that all have the same hash code. With just a little fooling around, or nowadays by just looking it up on the web, you can see that certain two-character keys have the same hash code: when you do the math in the base-31 hash code, it tells you that answer. What that means is that, just like working in binary, you can combine those keys in all possible ways and get 2^N strings, for any N, of length 2N, that all hash to the same value. So if somebody's implemented a service in Java that uses a symbol table with string keys, you can cause it to crash in this way. That's a little bit scary for some systems designers, and at least reason for pause in using hashing. Now, hashing also has an extremely important application in today's Internet commerce, and that's the concept of

so-called one-way hash functions, which we use to try to have secure fingerprints for use on the web. There's been a lot of research done to develop functions that take keys as input and produce values that look random, in such a way that it's hard for someone else to find another key that collides with a given value. This technology is useful for storing passwords and digital fingerprints and things like that, but it's too expensive for use in a symbol table. So, the bottom line on separate chaining versus linear probing

collision-resolution methods is that there are a number of considerations to take into account. Separate chaining is really easy to implement, both insert and delete, and it degrades gracefully; clustering is maybe less of a problem if you have a bad hash function. Linear probing tends to make better use of space, and it'll also perform better for huge tables, where caching is involved. And among the classic algorithms problems in computer science for people to think about are what to do about delete in these two situations, and exactly how to resize. Those are all at the level of exercises in the

context of the kinds of algorithms that we've seen. And as I mentioned, there have been many, many improved versions of hashing studied. I mentioned the two-probe and double-hashing variations. One way to use two hash functions is to hash to two positions and put the key in the shorter of the two chains. In that case, the expected length of the longest chain will be lg lg N, which is quite an improvement: you get constant time expected and lg lg N worst case. Double hashing is the variant of linear probing where you skip a variable amount each time, not one. That pretty much wipes out clustering, but it is more difficult to implement delete for that one. A relatively new method called cuckoo hashing is another variant along the same lines, where we hash each key to two positions and insert it into either one; if both are occupied, we reinsert the displaced key into its alternative position. That one actually gives constant worst-case time for search. It's another variation on the theme. All of these methods allow us to make better use of memory, allowing the table to become nearly full. It would have been very exciting to be a researcher in the 1950s, when people cared so much about memory; nowadays, a little extra memory is not something that people care about so much, and most people just go with the easy algorithm, except for really performance-critical applications. What about hash

tables versus balanced search trees? Well, hash tables are really simple to code, at least if you don't have to implement the hash function yourself. And if you don't have any order in the keys at all, then there's no compareTo() with which to implement a balanced search tree, so you have to use hashing if you don't have comparisons. Hashing will also probably be faster for simple keys: a few arithmetic operations to compute the hash, versus lg N compares for the balanced tree. And there's some better system support in Java for strings: the cached hash code means that you don't even have to recompute the hash if your string symbol table is in an inner loop. On the other hand, balanced search trees have a much stronger performance guarantee: you don't have to assume anything, and it's going to be less than lg N compares. They also have support for all of the ordered symbol-table operations, and compareTo() is a pretty easy and natural function to implement. So they're more flexible and more broadly useful. And actually, the Java system and other systems include both, so that programmers can make use of either one in different situations. That's our context for hashing algorithms.
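A footnote on the colliding family of strings discussed earlier: the construction is easy to check for yourself. One such two-character pair is "Aa" and "BB", since 'A'·31 + 'a' = 65·31 + 97 = 2112 = 66·31 + 66 = 'B'·31 + 'B', and gluing equal-length colliding blocks together preserves the collision:

```java
import java.util.ArrayList;
import java.util.List;

public class CollisionFamily {
    public static void main(String[] args) {
        // "Aa" and "BB" collide under the base-31 hash:
        //   'A'*31 + 'a' = 65*31 + 97 = 2112
        //   'B'*31 + 'B' = 66*31 + 66 = 2112
        System.out.println("Aa".hashCode() == "BB".hashCode());  // true

        // Combining the two blocks in all possible ways gives
        // 2^N strings of length 2N, all with the same hash code.
        int n = 4;
        List<String> family = new ArrayList<>();
        family.add("");
        for (int i = 0; i < n; i++) {
            List<String> next = new ArrayList<>();
            for (String s : family) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            family = next;
        }
        boolean allCollide = true;
        for (String s : family)
            if (s.hashCode() != family.get(0).hashCode())
                allCollide = false;
        // 16 strings, "AaAaAaAa" through "BBBBBBBB", one hash value
        System.out.println(family.size() + " " + allCollide);  // 16 true
    }
}
```

Feeding a family like this to a separate-chaining table turns every operation into a scan of one long chain, which is exactly the performance attack described above.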
