Hi. In the previous video, we introduced the polynomial hash family for strings, and you learned how hashCode is implemented in Java. In this video, you'll learn why it is implemented that way and, more generally, why the polynomial family of hash functions is so good for hashing strings.

The lemma states that for any two different strings s_1 and s_2 of length at most L + 1, if you choose a hash function from the polynomial family P with prime p at random, by selecting a random integer x from 1 to p - 1, then the probability of a collision on these two strings is at most L/p. The proof follows from another fact: for any prime p, the equation a_0 + a_1 x + a_2 x^2 + ... + a_L x^L = 0 (mod p) has at most L different solutions x, where x is a number between 0 and p - 1. Why does the lemma follow from this? For two different strings s_1 and s_2, their hash values are two polynomials of degree at most L in the variable x, so the difference of their hash values is also a polynomial of degree at most L, and it is not identically zero because the strings differ. For a collision to happen, this difference must be 0 modulo p. Since x is a random number between 1 and p - 1, and there are at most L values of x that solve this equation, we select such an x with probability at most L/p.

It follows from this lemma that by choosing a big enough prime p, we can make the collision probability L/p as small as we like. However, this doesn't yet give us a hash function that can be used in a hash table with chaining.
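As a minimal sketch, the polynomial hash from this family can be computed with Horner's rule; the prime p and the sample string below are illustrative choices, not fixed by the lecture:

```java
import java.util.concurrent.ThreadLocalRandom;

public class PolyHash {
    // Computes (s[0] + s[1]*x + s[2]*x^2 + ... + s[L]*x^L) mod p
    // with Horner's rule, processing characters from last to first.
    static long polyHash(String s, long x, long p) {
        long hash = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            hash = (hash * x + s.charAt(i)) % p;
        }
        return hash;
    }

    public static void main(String[] args) {
        long p = 1_000_000_007L;                               // a big prime (assumed choice)
        long x = ThreadLocalRandom.current().nextLong(1, p);   // random x in 1..p-1
        System.out.println(polyHash("hello", x, p));
    }
}
```

With p around 10^9, each intermediate product stays below 2^63, so plain long arithmetic suffices.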
If we choose a hash function from the polynomial family, its cardinality is the huge prime p, so the hash table size would also need to equal p. But we want to choose the table size according to the number of keys we store, not according to some very big prime number; we don't want to waste that much memory. So we need to fix the cardinality of the resulting hash function, and to do that, we apply one more hash function on top. We take our string and first apply a random function h_x from the polynomial family, where h_x is generated by choosing a random integer x between 1 and p - 1. The resulting value is an integer between 0 and p - 1, because it is a remainder modulo p. We know from the previous videos that there is a universal family for integers between 0 and p - 1, because p is prime. So we hash the resulting value again with a random function from that universal family, and we denote the resulting hash function by h_m, because its cardinality is m.

To compute h_m(s), we first compute h_x(s), a random hash function from the polynomial family, which takes time proportional to the length of the string s. Then we take the resulting integer between 0 and p - 1 and apply to it a random function h_ab from the universal family. To choose this random function, we just pick two parameters: an integer a between 1 and p - 1 and an integer b between 0 and p - 1. This h_ab can be computed in constant time on any integer between 0 and p - 1. The resulting function h_m has cardinality m, and I claim it is a good hash function.
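The composition h_m(s) = h_ab(h_x(s)) can be sketched as follows; the class name and the concrete p and m are illustrative assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;

public class StringHasher {
    final long p;      // prime; should exceed m*L for the collision bound below
    final int m;       // number of buckets (cardinality of the final hash)
    final long x;      // parameter of the polynomial hash, 1 <= x <= p-1
    final long a, b;   // parameters of the universal family for integers mod p

    StringHasher(long p, int m) {
        this.p = p;
        this.m = m;
        ThreadLocalRandom r = ThreadLocalRandom.current();
        this.x = r.nextLong(1, p);   // random x in 1..p-1
        this.a = r.nextLong(1, p);   // random a in 1..p-1
        this.b = r.nextLong(0, p);   // random b in 0..p-1
    }

    // h_x: string -> integer in 0..p-1 (polynomial hash, Horner's rule);
    // runs in time proportional to the length of s.
    long polyHash(String s) {
        long hash = 0;
        for (int i = s.length() - 1; i >= 0; i--)
            hash = (hash * x + s.charAt(i)) % p;
        return hash;
    }

    // h_m(s) = ((a * h_x(s) + b) mod p) mod m -- constant extra time.
    int hash(String s) {
        return (int) (((a * polyHash(s) + b) % p) % m);
    }

    public static void main(String[] args) {
        StringHasher h = new StringHasher(1_000_000_007L, 1000);
        System.out.println(h.hash("alice"));   // some bucket in 0..999
    }
}
```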
The lemma states that for any two different strings s_1 and s_2 of length at most L + 1 and any cardinality m, the probability of a collision between those two strings under the hash function h_m, generated the way I've just described, is at most 1/m + L/p. Why is that? If h_m collides on s_1 and s_2, then either the polynomial hash h_x already collides on the strings, which happens with probability at most L/p, as we already know, or h_x does not collide, so the resulting values between 0 and p - 1 are different for s_1 and s_2, and then the collision must happen for the random function h_ab from the universal family, which we know happens with probability at most 1/m. At least one of these two events must happen, so the probability of a collision is at most the sum of their probabilities, 1/m + L/p.

Now we have a corollary: if p is big enough, more exactly if p is bigger than m times L, then for any two different strings s_1 and s_2 of length at most L + 1, the probability of a collision under h_m is O(1/m). To prove this, note that 1/m + L/p is strictly less than 1/m + L/(mL), because p is bigger than mL, and L/(mL) is just the same as 1/m, so we get 2/m, which is indeed O(1/m). From this it follows that for a prime p bigger than mL, the length of the longest chain in our hash table is on the order of 1 + n/m, which is again O(1 + alpha): although we no longer have the inequality that the collision probability is at most 1/m, it is still at most a constant over m, and this doesn't change the asymptotics of the running time significantly. Computing the polynomial hash of a string s takes time proportional to the length of s.
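To make the corollary concrete, here is the bound with some illustrative numbers (m, L, and the prime p below are example choices, not values from the lecture):

```java
public class CollisionBound {
    public static void main(String[] args) {
        int m = 1000;                 // buckets in the hash table
        int L = 50;                   // names of length at most L
        long p = 100_003;             // a prime greater than m*L = 50000

        // The lemma's bound: 1/m + L/p.
        double bound = 1.0 / m + (double) L / p;
        System.out.println(bound);    // about 0.0015, below 2/m = 0.002
    }
}
```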
If the lengths of the names in the phone book are bounded by a constant L, then computing the hash function of any name takes O(L), which is constant time. So we can now map names to phone numbers using constant time to compute the hash function and constant time on average to insert, remove, or change a contact.

In conclusion, you learned how to hash both integers and strings, and the phone book can be implemented as two hash tables. One maps phone numbers to names using a universal family of hash functions for integers. The other maps names to phone numbers using first the polynomial family of hash functions for strings and then, on top of it, again a universal family of hash functions for integers. Search and modification in both of these hash tables run in constant time on average, and the memory consumption is approximately linear in the number of contacts you are going to store. This is very good: you cannot do asymptotically better in terms of memory consumption, and you cannot do asymptotically better in terms of running time, at least on average. Of course, in the worst case hash tables can be slow, but on average they are very fast, and there are theorems stating that the probability that they will work really slowly is very small. That is why Java's hashCode uses a fixed x equal to 31. Of course, we could choose a random x every time, but then hashCode would not be deterministic, so we cannot do that; instead, 31 is fixed as one particular choice of x, and it turns out that the probability of getting an input that is particularly bad for this specific implementation of hashCode is still very small. In practice, everything runs fast and smoothly. That's why hashCode in Java is implemented as a polynomial hash function.
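The same polynomial scheme with the fixed choice x = 31 reproduces Java's String.hashCode, which uses wraparound 32-bit arithmetic instead of a modulus p and gives the first character the highest power of 31:

```java
public class JavaHashCode {
    // Computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
    // with int overflow wrapping modulo 2^32, as in java.lang.String.
    static int hashCode31(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(hashCode31(s) == s.hashCode());   // prints true
    }
}
```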
In the next lecture, we are going to study a very different side of hash functions and of hashing: we will use it to quickly find substrings in strings.