
So now we're gonna look at a very general paradigm called the Merkle-Damgard paradigm that's used for constructing collision-resistant hash functions. Before we do that, let me just remind you what our goals are. As usual, we say that H is a hash function that maps some large message space into a small tag space. And we say that a collision for a hash function is basically a pair of distinct messages that happen to map to the same value under the hash function. Our goal is to build collision-resistant hash functions, namely functions where it's hard to find even a single collision: even though we know many collisions exist, no efficient algorithm can output even a single one. So we're gonna construct these hash functions, these collision-resistant hash functions, in two steps.
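Before moving on, a quick concrete look at what "hard to find a collision" rules out. This is my own illustrative sketch, not part of the lecture: the function `weak_hash` below is SHA-256 deliberately truncated to 16 bits, and a simple birthday search finds a collision for it almost immediately. Collision resistance asks that no efficient algorithm can do the analogous thing for a full-size output.

```python
import hashlib

def weak_hash(msg: bytes) -> bytes:
    # Toy hash: SHA-256 truncated to 2 bytes (16 bits).
    # With only 2**16 possible outputs, collisions are easy to find.
    return hashlib.sha256(msg).digest()[:2]

def find_collision():
    # Birthday search: remember every output seen so far; after
    # roughly 2**8 tries we expect two inputs to share a tag.
    seen = {}
    i = 0
    while True:
        m = str(i).encode()
        t = weak_hash(m)
        if t in seen and seen[t] != m:
            return seen[t], m  # two distinct messages, same tag
        seen[t] = m
        i += 1

m1, m2 = find_collision()
assert m1 != m2 and weak_hash(m1) == weak_hash(m2)
print("collision:", m1, m2)
```

With a 16-bit tag there are only 2^16 possible outputs, so by the birthday bound a collision shows up after roughly 2^8 random inputs; that square-root behavior is exactly why real hash outputs are made large.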

The first step, which we're gonna do in this segment, is to show that if you give me a collision-resistant hash function for short messages, we can extend it and build a collision-resistant hash function for much, much longer messages. In the next segment, we'll actually build collision-resistant hash functions for short messages. Okay, so let's look at the construction. The construction is actually very general, and in fact all the standard collision-resistant hash functions follow this paradigm. It's actually a very elegant paradigm, and let me show you how
it works. So here we have our function h, which we're gonna assume is a collision-resistant hash function for small, fixed-size inputs. This little function h is sometimes called a compression function. The way we're gonna use it is: we take a big message M and break the message into blocks. Then we use a fixed value called the IV. In this case the IV is fixed forever; it's basically embedded in the code and in the standards. It's just a fixed value that's part of the definition of the function. And then what we do is apply the little compression function h to the first message block along with this IV. What comes out of that is what's called a chaining variable, which gets fed into the next compression function; that one compresses the next block along with the previous chaining variable, and out comes the next chaining variable, the next message block is compressed, and so on and so forth, until we reach the final message block. For the final message block, the one special thing that we do is append this padding block, this PB, which stands for padding block (and I'll explain what the padding block is in just a second). After we append the padding block, we again compress the last chaining variable with the last message block, and the output of that is the actual output of the hash function. So
just to summarize, in symbols: we have this compression function h that takes elements in T (this T is the output space of the hash function) plus message blocks (this X corresponds to message blocks), and outputs the next chaining variable. So, as I said, this is what compression functions do. And then from this compression function we're able to build a hash function for large inputs, namely inputs in the space X to some power L, that is, up to L blocks of X. And of course these are variable-length inputs, so we could have different-length messages being given as input to this big function H, and what comes out is basically a tag in the tag space. So the
only thing left to define is the padding block. The padding block is actually very important, as we're gonna see in just a minute. What it is, is basically a one followed by zeros, 100...0, which denotes the end of the actual message. And then the most important part of the padding block is that we encode the message length in it. The message-length field is basically fixed to be 64 bits. So in all the standard hash functions, the maximum message length is two to the sixty-four minus one, so in fact the message length fits into a 64-bit field. An upper bound of two to the sixty-four minus one bits on the message length is actually sufficiently long to handle all of the messages we're gonna throw at it. Okay, so that's the padding block, and
of course the question is: what do we do if the message length happens to be an exact multiple of the compression function's block length? Where are we gonna fit the padding block? And the answer, as usual, is that if there's no space for the padding block in the last block of the message, then we're gonna have to add another dummy block and stick the padding block in there, and of course put the one-zero-zero-zero marker in the right place. Okay, so the point is that it's very, very important that the padding block contains the message length, as we'll see in just a minute. So this is the
Merkle-Damgard construction. The last piece of terminology I'll mention is that we have these chaining variables: H0 is the initial value, then H1, H2, H3, up to H4, which is the actual final output of this hash function. So, as I said, all standard hash functions follow this paradigm for constructing a collision-resistant hash function from a compression function. The reason this paradigm is so popular is the following theorem, which says, basically, that if the little compression function is collision resistant, then the big Merkle-Damgard hash function is also collision resistant. So, in other words, if we want to build collision-resistant functions for large inputs, all we have to do is build compression functions that are collision resistant. So let's prove this theorem. It's an elegant
proof, and it's not too difficult. The way we're gonna prove it is by the contrapositive: if you can find me a collision on the big hash function, then we're gonna deduce a collision on the little compression function. Therefore, if little h is collision resistant, so is big H. So suppose you give me a collision on the big hash function, namely two distinct messages M and M' that happen to hash to the same output; we're going to use M and M' to build a collision on the little compression function. So the first thing is to remember how the Merkle-Damgard construction works and, in particular, let's assign names to the chaining variables when we hash M versus when we hash M'.
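Here's a minimal sketch of the chaining-variable computation we'll be reasoning about, so the names H0, H1, ..., H t+1 have something concrete attached to them. This is my own toy model, not a standardized design: h is an 8-byte stand-in compression function built from truncated SHA-256, and the padding packs the 1000...0 marker plus the message length into a 64-bit field, as described above.

```python
import hashlib
import struct

BLOCK = 8  # toy block size in bytes

def h(chain: bytes, block: bytes) -> bytes:
    # Toy compression function h : T x X -> T
    # (truncated SHA-256 over chaining variable || block).
    return hashlib.sha256(chain + block).digest()[:8]

def pad(msg: bytes) -> bytes:
    # Append the 1000...0 marker, then the message length
    # in bits as a 64-bit big-endian field.
    p = msg + b"\x80"
    while (len(p) + 8) % BLOCK != 0:
        p += b"\x00"
    return p + struct.pack(">Q", len(msg) * 8)

def md_chain(msg: bytes, iv: bytes = b"\x00" * 8):
    # Returns all chaining variables H0, H1, ..., H t+1;
    # the last one is the hash output H(msg).
    data = pad(msg)
    chains = [iv]
    for i in range(0, len(data), BLOCK):
        chains.append(h(chains[-1], data[i:i + BLOCK]))
    return chains

H = md_chain(b"the quick brown fox")
print("output tag:", H[-1].hex())
```

Note the IV is hard-coded, mirroring the lecture's point that it's a fixed value baked into the definition of the function.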

So here are the chaining variables that come up when we hash the message M: H0 is the fixed IV that starts the whole process, H1 is the result of compressing the first message block with the IV, and so on and so forth, until H t+1, the final chaining variable, which is the final output of the Merkle-Damgard chain. And then similarly for M': let's define H0' to be the first chaining variable, H1' to be the result after compressing the first message block of M', and so on and so forth, up until we get to H' r+1, which is the final hash of the message M'. Now, you notice the lengths of the messages M and M' don't have to be the same; in particular, M has length t blocks, whereas M' has length r blocks. Now, because the collision occurred, we know that these two values here are the same: H(M) is equal to H(M'). In other words, the final chaining variables are actually equal to one another. So now let's look carefully at how H t+1 and H' r+1 were calculated. Well, H t+1 is calculated as the compression function applied to the previous chaining variable along with the last message block of M, including the padding block. You'll notice here I'm assuming that the padding block fits in the last message block of M; even if the padding block sits in its own block, it's not going to make any difference, so for simplicity let's assume that the padding block fits in the last message block of M. So, by compressing the last message block with the padding block and the last chaining variable, we obtain H t+1, and similarly the same thing happens with M': by compressing the last message block and the last chaining variable, we obtain H' r+1. And we know that, since these two values are equal, all
of a sudden we have a candidate collision for the compression function. Here, let me circle the candidate collision: this is one part of it, and this is the other part of it. Okay, so we have two applications of the compression function, on two sets of arguments, that happen to produce the same value. The only way we would not get a collision is if the arguments happen to be the same. In other words, what we're seeing here is: if the arguments to the compression function are distinct, then we get a collision for little h. So let's write it out more precisely: if H t is not equal to H' r, or M t is not equal to M' r, or the padding blocks are distinct, then we have a collision for the compression function h, and we're done; we can stop. So the only case where we need to continue is if this three-way disjunction doesn't hold. Well, what does it mean that this disjunction doesn't hold? The only reason we'd need to continue is if the second-to-last chaining variables are equal, the last blocks of the messages are equal, and the two padding blocks are equal. So what does it mean that the two padding blocks are equal? Remember that the message lengths were encoded in the padding blocks. So, in particular, this means that the length of M and the length of M' are the same, namely that t is equal to r. So from now on I can assume t is equal to r. Whenever I
wrote r, I'm just gonna write t. But now we can apply exactly the same argument to the second-to-last chaining variables. In other words, how was H t computed? Well, H t is computed by compressing the previous chaining variable, namely H t-1, with the second-to-last message block. And similarly, H' t was computed (you'll notice I replaced r by t) by compressing the previous chaining variable along with the second-to-last message block of M'. And because these two values are equal, we now get another candidate collision for the compression function. In other words, if H t-1 is not equal to H' t-1, or M t-1 is not equal to M' t-1, then basically we have a collision for h, and we can stop; we're done. So now, the only case where we need to continue is if this condition over here doesn't hold, namely: suppose that H t-1 is equal to H' t-1 and, as we already know, M t is equal to M' t and M t-1 is equal to M' t-1. If these conditions hold, well, you can see that we can continue to iterate. In other words, we can now apply exactly the same argument to H t-1 and H' t-1, and so, iterating this again and again, we can iterate all the way to the beginning of the message. Once we reach the beginning of the message, one of two things must hold: either somewhere along the way we find a collision for the compression function, or all the message blocks and chaining variables of M and M' are equal, which would mean M is equal to M', contradicting the fact that M and M' are distinct.
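The backward walk in this proof translates almost line by line into code. The sketch below is my own toy, not from the lecture: it uses a deliberately weak one-byte compression function so that a collision for the big chain can actually be found by search, then extracts a collision for h by walking from the last block backwards until the compression inputs differ. Padding is omitted here because both colliding messages have the same number of blocks, which is exactly the case the encoded length field reduces the proof to.

```python
import hashlib

def h(chain: bytes, block: bytes) -> bytes:
    # Deliberately weak 1-byte compression function, so collisions
    # for the big hash are easy to find by brute-force search.
    return hashlib.sha256(chain + block).digest()[:1]

def chains(blocks, iv=b"\x00"):
    # Chaining variables H0, H1, ..., Ht for a list of blocks.
    cs = [iv]
    for b in blocks:
        cs.append(h(cs[-1], b))
    return cs

def extract_collision(blocks1, blocks2):
    # Given two distinct equal-length block sequences that collide
    # under the big hash, walk backwards until the compression
    # inputs differ: that pair of inputs is a collision for h.
    c1, c2 = chains(blocks1), chains(blocks2)
    assert c1[-1] == c2[-1]
    for i in range(len(blocks1) - 1, -1, -1):
        if (c1[i], blocks1[i]) != (c2[i], blocks2[i]):
            return (c1[i], blocks1[i]), (c2[i], blocks2[i])
    return None  # identical sequences: M = M', no collision

def find_big_collision():
    # Birthday search over 2-block messages for a big-hash collision.
    seen = {}
    i = 0
    while True:
        blocks = [i.to_bytes(4, "big"), b"tail"]
        tag = chains(blocks)[-1]
        if tag in seen and seen[tag] != blocks:
            return seen[tag], blocks
        seen[tag] = blocks
        i += 1

b1, b2 = find_big_collision()
(x1, m1), (x2, m2) = extract_collision(b1, b2)
assert (x1, m1) != (x2, m2) and h(x1, m1) == h(x2, m2)
```

At each step of the backward walk, the outputs of the two compression calls are known to be equal (that's what the previous step established), so the first index where the inputs differ immediately yields a collision for h, just as in the proof.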