0:00

In this module, we're gonna talk about a new concept called collision resistance,

Â which plays an important role in providing message integrity. Our end goal is to

Â describe a very popular MAC algorithm called HMAC, that's widely used in

Â internet protocols. HMAC itself, is built from collision resistant hash functions.

Â Before we do that, let's do a quick recap of where we are. In the last module we

Â talked about message integrity where we said that the MAC system is secure if it's

Â existentially unforgeable under a chosen message attack. This means that even an

Â attacker who is given the tag on arbitrary messages of his choice cannot construct a

Â tag for some new message. And then we showed that in fact any secure PRF

Â immediately gives us a secure MAC. And so then we turned around and said well can we

Â build secure PRFs that take large messages as inputs? And so we looked at four

Â constructions. The first construction was based on CBC, we call it when we looked at

Â two variants of it, one called encrypted CBC and one called CMAC. And we said that

Â these are commonly used with AES. In fact, in the 802.11i standard, a CBC-MAC is used

Â for message integrity. In particular, with the AES algorithm. We looked at another

Â mode called NMAC, which also converts a PRF for short inputs into a PRF that's

Â capable of taking very large messages as inputs. And these two were both sequential

Â MAC-s. We then looked at the parallelizable MAC called PMAC which again was able to

Â convert a PRF for small inputs into a PRF for very large inputs. But it did so in a

Â parallel fashion so a multiprocessor system would be more efficient with PMAC

Â than, say, with CBC-MAC. All three of these built a MAC for large messages by

Â constructing a PRF for large messages. And then we looked at the Carter-Wegman MAC

Â which is actually not a PRF. It's a randomized MAC so a single message could

Â actually have many, many different valid tags and therefore a Carter-Wegman MAC is

Â actually not a PRF. And if you remember, the Carter-Wegman MAC was built by

Â first of all, taking the bulk message and hashing it down to a small tag using a

Â fast one-time MAC and then encrypting that tag using a PRF. The benefit of the

Â Carter-Wegman MAC was that, as we said, the hashing of the bulk message is done using

Â a fast one time MAC. And then in this module, we're gonna construct MAC-s from

Â this new concept called collision resistance. And so the first thing we're

Â gonna do is construct collision resistant hash functions. So let's first of all

Â start by defining what does it mean for a hash function to be collision resistant.

Â So think of a hash function from some message space to a tag space T, and you

Â should be thinking of the message space as much, much bigger than the tag space. So

Â the messages could be gigabytes long, but the tags would only be like 160 bits. Now

Â a collision for the function H is a pair of messages m0, m1, that happen to be

Â distinct. However, when you apply the function H to them, you end up with the

Â same output. So the image you should have in your mind is essentially there are two

Â inputs, m0 and m1, and they belong to this gigantic message space. However, when we

Â apply the hash function to them, they happen to collide and they both map to the

Â same output in the tag space. Now we say that the function is collision resistant

Â if it's hard to find collisions for this function. Now this should seem a little

Â bit counterintuitive because. We know that the output space is tiny compared to the

Â input space. So, by the pigeonhole principle, there must be lots and lots and

Â lots of messages that map to the same output. Just because there isn't enough

Â space in the output space to accommodate all the messages without collisions. And

Â so, we know that there are lots of collisions, and the question is, is there

Â an efficient algorithm that finds any such collisions explicitly. So we say the, the

Â function is collision resistant, if, for all explicit efficient algorithms A. And

Â these algorithms are not able to print the collision for the function H, okay? And as

Â usual, we'll define the advantage as the probability that the algorithm A is able

Â to output a collision. Now I'm not gonna formalize a term explicit here. All I'll

Â say is that it's not enough to just say that an algorithm exists, and algorithm

Â that simply prints a collision. Cause we know many collisions exist. What we

Â actually want is an explicit algorithm that we can actually write down and run on

Â a computer to generate these collisions. There are ways to formalize that, but I'm

Â not gonna do that here. Now a classic example of a collision resistant hash

Â function is SHA-256 which happens to output at 256 bits but can take

Â arbitrary large input. For example, it can take gigabytes and gigabytes of data and

Â it will map it all to 256 bits. And yet nobody knows how to find collisions for

Â this particular function. So just to show you that this concept of collision

Â resistance is very useful, let's see a very quick application for it. In

Â particular, let's see how we can trivially build a MAC given a collision resistant

Â hash function. So, suppose we have a MAC for short messages. So you should be

Â thinking something like AES, which can MAC sixteen byte messages. And then, suppose

Â we have a hash function, a collision resistant hash function from a large message

Â space, that contains gigabyte messages into our small message space. Say, into

Â sixteen byte outputs. Then, basically, we can define a new MAC, let's call it Ibig,

Â which happens to be MAC-ing large messages. And we'll define it simply by

Â applying the small MAC to the output of the hash function. And how do we verify a

Â MAC? Well, basically, given a tag we verify it by rehashing the given message

Â and then checking that small MAC actually verifies under the given tag. Okay, so

Â this is a very simple way to show you how collision resistance can take a primitive

Â that's built for small inputs and expand it into a primitive that's built for very

Â large inputs. And in fact we're going to see this not just for MAC-s. Later on when

Â we talk about digital signatures, we're going to do the same thing. We're going to

Â build a digital signature scheme for small inputs and then we're going to use

Â collision resistance to expand the input space and make it as large as we want. So

Â the security theorem basically isn't something that's trivial here. Basically

Â it says if the underlying MAC is secure and H is collision resistant, then the

Â combination which can actually MAC large messages, is also a secure MAC. And as

Â a quick example, let's apply this to AES. So let's use the one example that we

Â mentioned, SHA-256. So in this case SHA-256 outputs 256 bit outputs,

Â which is 32 bytes. So we have to build a MAC that can MAC 32 byte messages. And the

Â way we could do that is basically by applying the sixteen byte AES, plugging it

Â into a two block CBC. A two block CBC would expand AES from a PRF on sixteen

Â bytes to a PRF on 32 bytes. And then take the output of SHA-256 and plug it into

Â this two block CBC based on AES. And then we get a very, very simple, MAC which is

Â secure assuming AES is a PRF and SHA-256 is collision resistant. So one thing I

Â wanted to point out is that in fact collision resistance is necessary for this

Â construction to work. So in fact, collision resistance is not just a made-up

Â term. Collision resistance really is kind of the essence of why this combined MAC is

Â secure. And so let's just assume for a minute that the function H, the hash

Â function we're using, is not collision resistant. In other words, there is an

Â algorithm that can find two distinct messages that happen to map to the same

Â output. In this case, the combined MAC. Is not going to be secure because what the

Â adversary can do is simply use a chosen message attack to get a tag for m0. And

Â then output m1 comma that tag as a forgery and indeed T is a valid MAC for m1 because

Â H(m1) happens to be equal to H(m0). And so in doing so just with a one chosen

Â message attack the attacker was able to break this combined MAC simply because the

Â hash function was not collision resistant. So it should be, again I want to emphasize

Â that if someone advertised even one collision for SHA-256, you know two

Â messages, just one pair of messages that happen to have the same output under

Â SHA-256 that would already make this construction insecure cause an attacker

Â could then ask for the tag on one message and in doing so he would obtain the tag on

Â the other message as well, and that would count as an existential forgery. Okay, so

Â already, we see the collision resistance is a very useful primitive, in that it

Â lets us expand the input space of cryptographic primitives. I wanna show you

Â one more application where a collision resistance is directly used for message

Â integrity. Imagine again, we have files that we wanna protect. Let's imagine these files

Â are actually software packages that, we wanna install on our system. So here are

Â three different software packages. You know, maybe one is GCC, on is Emacs, and

Â another one is, I don't know, VI. Now the user wants to download the software

Â package and he wants to make sure that he really did get a version of the package

Â that he downloaded and it's not some version that the attacker tampered with

Â and modified its contents. So what he could do is basically refer to a read-only

Â public space that's relatively small. All it has to do is hold small hashes of these

Â software packages. So there isn't a lot of space needed here. The only requirement is

Â that this space is read-only. In other words, the attacker cannot modify the

Â hashes stored in this space. And then, once he consults this public space, he can

Â very easily compute the hash of a package that he downloaded. Compare it to the

Â value in the public space. And if the two match. Then he knows that the version of

Â the package he downloaded is, in fact, a correct one. Why is that true? Well,

Â because the function H is collision resistant. The attacker cannot find an F1

Â path, they would have the same hash as F1. And as a result, the attacker cannot

Â modify F1 without being detected because there's no way that the hash of his F1

Â [inaudible] would map to the value that's stored in the public space. So, the reason

Â I brought up this example is, I wanted to contrast this with the MAC example. We

Â kinda saw a similar situation with MAC-s. In the MAC example, we needed a secret key

Â to verify the individual file tags. But we didn't need a resource that was a read

Â only public space. With collision-resistant hashes, we kind of get

Â the exact compliment where in fact we don't need a key to verify. Anyone can

Â verify. You don't need a secret key for it. However, now all of a sudden we need

Â this extra resource which is some space that the attacker cannot modify. And, in

Â fact, later on, we're gonna see that with digital signatures, we can kind get to the

Â best of both worlds, where we have both public verifiability, and we don't need a

Â read only space. But so far, with either MAC-s or collision resistance, we can have

Â one, but not the other. So, I'll tell you that, in fact, this kind of scheme is very

Â popular. In fact, Linux distributions often use public space where they

Â advertise hashes of their software packages. And anyone can make sure that

Â they downloaded the right software package before installing it on their computer. So

Â this is, in fact, something that's used quite extensively in the real world. Okay,

Â so in the next segment we'll talk about generic attack on collision resistance and

Â then we'll go ahead and build collision resistant hash functions.

Â