Hash functions are a cornerstone of many important cryptographic processes. In this lesson we'll look at the fundamental concepts behind hash functions and some of the terms that we need to understand in order to have meaningful discussions about them. We'll also see that not all hash functions are created equal. They're used for many different things, and the very properties that make a specific hash function suitable for some things may make it unsuitable for others. Of course, we want functions that are useful for cryptography. But before we can describe in detail the properties that are most important for that use, we need to understand how we use these functions and how our adversaries attack them. These are all topics of upcoming lessons.

In this lesson, we'll prepare the stage by looking at one particular idealized but ultimately unrealizable hash function called the random oracle. This is used for designing and analyzing cryptographic systems. We need to be familiar with this model and how it behaves because we often first design the overall system to be secure under the assumption that we are using a random oracle before turning to the vulnerabilities presented by the actual hash function we plan to use.

A hash function is a function that takes an input, which in cryptography we usually refer to as the message, and produces an output that we usually call the hash, or the digest, of the message. In theory, the message can be any length, from a single bit to an unlimited number of bytes. In practice, many hash functions place limits on the message length so that the length of the message can be incorporated into the hash. But these limits are quite generous. For instance, SHA-1, which has been around since the mid-1990s, allows messages ranging from a single bit to more than two million terabytes. The message digest, on the other hand, is almost always fairly small and of a fixed width. Cryptographic digests are typically between 128 and 512 bits.
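To make that fixed-width property concrete, here is a quick sketch (not part of the lesson's own examples) using Python's standard hashlib library: no matter how long the message is, the digest always comes out the same size.

```python
# Whatever the message length, a given hash function's digest width is fixed.
import hashlib

for message in [b"a", b"hello, world", b"x" * 10_000]:
    digest = hashlib.sha256(message).hexdigest()
    # SHA-256 always yields 64 hexadecimal digits (256 bits).
    print(len(message), len(digest))
```

Here SHA-256 merely stands in for any modern cryptographic hash; the same point holds for SHA-1's 160-bit digests.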
Both smaller and larger cryptographic hash functions do exist, though. For perspective, a 256-bit digest can be written as a string of 64 hexadecimal digits, which can be comfortably printed on a single line of text. But even so, the number of possible digests is literally astronomical in size. If we could represent each bit using just a single atom, we would require half the mass of the entire known universe to represent all of the possible 256-bit digests.

We also use some terms from mathematical set theory when talking about hash functions. The hash function's output is called an image of the input. One trait that nearly all hash functions have is that they are many-to-one mappings, meaning that there are many messages that produce the same hash value. When two messages produce the same digest, it is known as a hash collision. Again from set theory, the set of all inputs that yield the same output, or the set of all messages that collide with each other, is collectively referred to as the pre-image. Even when the maximum message size is limited, the size of each pre-image is effectively infinite. For instance, if we had a hash function that limited the message length to just 264 bits, or about half a line of text, there would be more possible messages than there are atoms in the known universe. And each additional bit doubles the number of possible messages.

As already mentioned, hash functions are used in many different fields. We'll briefly look at a few, including cryptography, to provide a mental framework within which to compare and contrast some of their properties. Probably the most prolific use of hash functions is the checksums and cyclic redundancy check algorithms that are used to guard against accidental corruption due to noise in data networks.
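As a small illustration of that idea (a sketch using Python's standard zlib.crc32 function, not an actual network implementation), the checksum travels alongside the data and lets a receiver detect accidental corruption by recomputing it:

```python
# A checksum guarding a payload: the sender computes it, the receiver
# recomputes it, and a mismatch means the data was corrupted in transit.
import zlib

payload = b"some packet payload"
crc = zlib.crc32(payload)

# Simulate an accidental single-bit flip caused by line noise.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

print(zlib.crc32(payload) == crc)    # intact data passes the check
print(zlib.crc32(corrupted) == crc)  # corrupted data fails it
```

CRC-32 reliably catches small accidental errors like this, but, as the lesson notes, it offers no protection against an attacker who can simply recompute the checksum for a tampered payload.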
Every packet of data on an Ethernet or Wi-Fi network, and on nearly all other networks, has some type of hash value appended to it so that every receiving device, such as a router, a wireless access point, or your computer, can detect the vast majority of errors that might have occurred. In some wireless networks in high-noise environments, a significant fraction of all data packets may have to be rejected because they fail to pass these checks. A typical high-speed router in your home has to be capable of performing millions of these computations every second. So these functions have to be very simple and very fast. And as a consequence, they provide virtually no protection against a malicious attacker.

Another extremely common use of hash functions, particularly in this age of big data, is in data storage and retrieval systems. When you have an extremely large database, think Amazon or eBay, you need to be able to find a specific item in the database extremely quickly. So what you do is take the description by which you will look for the item in the future and run it through a hash function. You then use the output from the hash function as the index into an array, and you store the item at that location. This is called a hash table. Later, when you want to retrieve the data, you again run the description through the hash function in order to get the array index where it is stored. Thus, you can go directly to that item largely independent of how many items are in your table. To work well, we need the hash values to be fairly uniformly distributed so that our entries are spread out across the hash table. This requires a hash function that is considerably more complicated than a checksum or a CRC function. But we don't care whether the hash of an item leaks information about the item that was hashed. Heck, we might even find that to be a useful feature. But this would be extremely bad for a cryptographic hash function.
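The scheme just described can be sketched in a few lines. The names and table size here are hypothetical, and a real hash table also needs a strategy for when two different descriptions land in the same slot, which is omitted for brevity:

```python
# Hash the description, reduce it to an array index, store the item there.
import hashlib

TABLE_SIZE = 1024
table = [None] * TABLE_SIZE

def slot_for(description: str) -> int:
    # Uniformly distributed digests spread entries evenly across the table.
    digest = hashlib.sha256(description.encode()).digest()
    return int.from_bytes(digest[:4], "big") % TABLE_SIZE

def store(description: str, item) -> None:
    table[slot_for(description)] = (description, item)

def lookup(description: str):
    # Re-hash the description to jump straight to the item's slot.
    entry = table[slot_for(description)]
    return entry[1] if entry and entry[0] == description else None

store("blue ceramic mug", {"price": 12.99})
print(lookup("blue ceramic mug"))
```

A cryptographic hash is overkill for this job; production hash tables use much faster non-cryptographic functions, since leaking information about the keys is not a concern here.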
Although not nearly as demanding as the functions used in high-speed communication links, speed is still extremely important in huge, highly active databases. We want to be able to take the hash of a message very quickly, and we are probably willing to sacrifice performance in other areas to get it. This is usually the opposite of our priorities in cryptography, in which we typically have to pay a significant penalty to get acceptable performance in the areas we care about.

Another type of hash function is the kind used in similarity testing, referred to as locality-sensitive hash functions. These functions produce hash values that are relatively close to each other if the messages are very similar. One application for something like this is in acoustic fingerprinting of music files, in which we want to be able to take a fingerprint of someone humming a tune and then compare it to nearby hashes of music files to try to find a match. Other applications in this area include chemical fingerprint databases and genetic databases. The very features that make these hash functions useful for this purpose make them singularly unacceptable for cryptography, since we want there to be no discernible pattern in the hashes of any two messages regardless of how similar they might be, unless of course they are identical. There are many other uses, but we will leave it at this.

So what about these cryptographic hash functions? Let's not forget the goal of cryptography: secret writing. We need a hash function that leaks as little information about the message as possible. Thus, given a digest, we don't want an adversary to be able to say anything about the message that produced it. We want it to be non-invertible, or so-called one-way. Later, we will see that we need it to be very difficult for an attacker to find another message that has the same digest as the original, or even to find any two messages that share the same digest.
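That contrast with locality-sensitive hashing is easy to check directly (a sketch using Python's standard hashlib, not from the lesson itself): flip a single bit of a message and roughly half the bits of a cryptographic digest change, so even nearly identical messages get thoroughly dissimilar hashes.

```python
# Compare the digests of two messages that differ in exactly one bit.
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    # Count the bit positions where the two byte strings differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

m1 = b"the quick brown fox"
m2 = m1[:-1] + bytes([m1[-1] ^ 1])  # same message with one bit flipped

d1 = hashlib.sha256(m1).digest()
d2 = hashlib.sha256(m2).digest()
print(bit_difference(m1, m2), "message bit(s) differ")
print(bit_difference(d1, d2), "of 256 digest bits differ")
```

A locality-sensitive hash would do the opposite, keeping the two fingerprints close together; for cryptography, that kind of pattern is exactly what we must avoid.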
Another way of saying that last requirement is that we want it to be very hard to find hash collisions. So, without concerning ourselves with whether it can actually be built, how would an ideal cryptographic hash function behave?

Imagine a black box that has an input where we feed in any message we like. The box also has an output that spits out the digest of the complete message. Now imagine that the contents of the box behave in a way that is completely indistinguishable from a mythical being known as an oracle who, in response to each message, produces a truly random digest, subject only to the constraint that any time the same message is input, the same digest is output. To visualize how this could be achieved, at least in principle, the oracle simply maintains a list of every message that has ever been input and the digest associated with it. Then, every time a message is entered, the oracle looks through its list for that exact message. And if it finds it, it merely echoes out the associated digest. But if that exact message is not already in the list, it generates a truly random digest, perhaps by flipping a fair coin for each bit, and then adds the new message-digest pair to its list before outputting the digest. Because the value of the digest associated with a particular message does not depend on that message in any way, the messages and their digests are uncorrelated, meaning that there is no pattern that would allow an attacker to take just a digest and infer anything about the message that might have produced it. It also means that if they want to find a hash collision, the only thing they can possibly do is carry out a brute-force attack and keep trying different messages until they find one that happens to result in the desired digest.

That's it for this lesson. So let's briefly review what we've learned and what's still to come. A hash function takes an arbitrary-length message and produces a fixed-length hash.
Different hash functions have different needs, and cryptographic hash functions are among the most demanding. A useful model for the ideal cryptographic hash function is the Random Oracle. In future lessons, we'll look at how we use hash functions to achieve message integrity and authenticity, how an adversary can attack a hash function, and the primary properties that a good cryptographic hash function needs to have.
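As a final aside, the random oracle's bookkeeping described in this lesson is simple enough to sketch directly. This is purely a toy model for intuition (the class name and the 256-bit digest width are arbitrary choices), not something that could ever be deployed, since the oracle's list grows without bound:

```python
# A toy random oracle: a table of past queries plus fresh random bits
# (the "fair coin flips") for any message it has never seen before.
import secrets

class RandomOracle:
    def __init__(self, digest_bytes: int = 32):
        self.digest_bytes = digest_bytes
        self.table = {}  # the oracle's ever-growing message -> digest list

    def query(self, message: bytes) -> bytes:
        if message not in self.table:
            # A truly random digest, chosen with no dependence whatsoever
            # on the content of the message.
            self.table[message] = secrets.token_bytes(self.digest_bytes)
        return self.table[message]

oracle = RandomOracle()
# The same message always echoes back the same digest...
assert oracle.query(b"hello") == oracle.query(b"hello")
# ...but the digest itself reveals nothing about the message.
print(oracle.query(b"hello").hex())
```

The call to secrets.token_bytes plays the role of the coin flips; because each digest is independent of its message, an attacker holding only a digest can do no better than brute-force guessing, which is exactly the behavior the random oracle model captures.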