So now we're gonna look at a very general paradigm called the Merkle-Damgard

paradigm, that's used for constructing collision-resistant hash functions. Before

we do that, let me just remind you what our goals are. So as usual we say that H

is a hash function that maps some large message space into a small tag space. And

we say that a collision for a hash function is basically a pair of distinct

messages that happen to map to the same value under this hash function. And our

goal is to build collision-resistant hash functions namely functions where it's hard

to find even a single collision. Even though we know many collisions exist. No

efficient algorithm can even output a single collision. So we're gonna construct

these hash functions, these collision resistant hash functions, in two steps.

The first step, which we're gonna do in this segment, is to show that if you give

me a collision resistant hash function for short messages, we can extend it and build

a collision resistant hash function for much, much, much longer messages. In the

next segment, we'll actually build collision-resistant HASH functions for

short messages. Okay so let's look at the construction. The construction is actually

very general and in fact all collision-resistant hash functions follow

this paradigm. It's actually a very elegant paradigm and let me show you how

it works. So here we have our function H which we're gonna assume is a

collision-resistant hash function for small sized inputs. So the way we're gonna

use this little function h, this h is sometimes called a compression function,

is we're gonna take a big message M and break this message in to blocks. And then

we use a fixed value called the IV. Here is the case the IV is fixed forever. And

it's basically embedded in the code and in the standards, it's just a fixed ID that's

part of the fin-nation of the function. And then what we do is we apply the small

compression function H to the first message block along with this ID. What

comes out of that is what's called a chaining variable that's gonna be fed into

the next compression function which compresses the next block along with the

previous chaining variable and out comes the next chaining variable, and the next

message block is compresses, and so on and so forth until we which the final message

block And then the final message block, the one special thing that we do, is that

we must append this padding block, this PB, which stands for padding block (and

I'll explain what the padding block is in just a second). But after we append the

padding block we again compress the last [inaudible] variable with the last message

block, and the output of that is the actual output of the hash function. So

just to summarize, in symbols, we have this compression function that elements in

T. This is the opposite of the hash function. Plus message blocks, this x

corresponds to message blocks, and outputs the next chaining variables. So as I said,

this is what the compression functions do. And then from this compression function

we're able to build a hash function for large inputs, namely inputs is the space x

to some power of l namely up to l blocks of x. And of course these are variable

lengths so we could have different length messages that are being given input to

this function h and what comes out of it is basically a tag in tag space. So the

only thing left to define is the padding block. So the padding block is actually

very important as we're gonna see in just a minute. What it is, is basically, well

it's a sequence of 1000 that denotes the end of the actual message block. And then

the most important part of the message block is that we encode the message length

In this padding block. And the message length field is basically fixed to be 64

bits. So in all the [inaudible] hash functions, so in all the [inaudible] hash

functions the maximum message length is two to the sixty four minus one so in fact

the message length fits into a 64 bit block. An upper bound of two to the sixty

four minus one bit on the message length is actually sufficiently long to handle

all of the messages we're gonna throw at it. Okay, so that's the padding block, and

of course the question is: what do we do if the last block really is a multiple of

the compression function of block length? Where are we gonna fit the padding block?

And the answer, as usual, is basically if there's no space for the padding block in

the last block of the message, then we're gonna have to ass another dummy block and

stick the padding block in there. And of course put the one zero, zero, zero in the

right place. Okay so the point is that it's very, very important that the padding

block contains the message length as we'll see in just a minute. So this is the

Merkle-Damgard construction, the last piece of terminology I'll say is that we

have these chaining variables. So H0 is the initial value. H1, H2, H3, up to H4

which is the actual final output of this hash function. So as I said, all standard

hash functions follow this paradigm for constructing a collision resistant hash

function from a compression function. The reason that this paradigm is so popular is

because of the following theorem, which says basically that if the little

compression function is collision resistant, then the big Merkle-Damgard

hash function is also collision resistant. So, in other words, if we're going to build

collision resistant functions for large inputs, all we have to do is just build compression

functions that are collision resistant. So let's prove this theorem. It's a elegant

proof and it's not too difficult. So the way we're gonna prove it is using the

contrapositive, that is, if you can find me a collision on the big hash function

then we're gonna deduce a collision on the little compression function. Therefore, if

little h is a collision resistant, so will be the big H. So suppose you give me a

collision on the big compression function, namely two distinct messages M and M',

that happen to hash to the same output, we're going to use M and M' to build

a collision on the little compression function. So the first thing is we

have to remember how the Merkle-Damgard construction works and, in particular,

let's assign names to the chain variables when we hash M versus when we hash M'.

So here are the chain variables that come up when we hash the message M, so H0 is the

fixed IV that starts the whole process, H1 would be the result of hashing the first

message block with the IV, and so on and so forth, until H sub t+1 is the

final chain variable, which is the final output of the Merkle-Damgard chain. And

then similarly for M', let's define H0' to be the first chaining variable, H1'

to be the result after compressing the first message block of M', and so

on and so forth, up until we get to H' of r+1, which is the final hash of

the message M'. Now you notice the length of the messages M and M'

don't have to be the same. In particular, M has length t, whereas M' has length r.

Now because the collision occurred, we know that these two values here are the

same. H(M) is equal to H(M'). In other words, the last chaining variables

are actually equal to one another. So now let's look carefully how H t+1 and

H' r+1 were calculated. Well H t+1 is calculated as the

compression function applied to the previous chaining variable along with the

last message block of M, including the padding block. You'll notice here I'm

assuming that the padding block fits with the last message block of M even if the

last padding block is in its own block, it's not going to make any difference. So

for simplicity, let's assume that the padding block fits with the last message

block with capital M. So, by hashing the last message block with a padding block

and the last chaining variable, we obtain H t+1 and, similarly, the same thing

happens with M'. By hashing the last message block, the last chaining variable,

we obtain H' r+1. And we know that, since these two values are equal, all

of a sudden we have a candidate collision for the compression function. Here, let me

circle the candidate collision, this is one part of it and this is the other part of

it. Okay, so we have an equality between two arguments given to the

compression function that happen to produce the same value. The only way we

would not get a collision is if the arguments happen to be the same. In other

words, what we're seeing here is if the arguments to the compression function are

distinct, then we get a collision for little h. So let's write it out more

precisely: so if H t is not equal to H' r, or Mt is not equal to M' r,

or the padding blocks are distinct, then we have a collision for the compression

function h and we're done, we're done and we can stop. So, the only way we need to continue is if

this three-way disjunction doesn't hold. So what does it mean that this disjunction

doesn't hold? well, so the only reason we need to continue is if the second-to-last

chaining variables are equal, the last blocks of the messages are equal and the

two padding blocks are equal. So what does it mean that the two padding blocks are

equal? Remember that the message lengths were encoded in the padding block. So, in

particular, this means that the length of M and the length of M' is the same,

namely the t is equal to r. So from now on I can assume t is equal to r. Whenever I

wrote r, I'm just gonna write t. But now we can apply exactly the same argument to the

second to last chaining variables. In other words, how was H t computed? Well

H t is computed by hashing the previous chaining variable, namely H t-1,

with the second to last message block. And similarly, H' t was computed, you

notice I replaced r by t, so H' t was computed by hashing the previous chaining

variable along with the second to last message block of M'. And because

these two are equal, now we get another candidate collision for the compression

function. In other words, if H t+1 is not equal to H' t-1, or M t-1 is not equal to

M' t-1, then basically we have a collision for h, and we can stop,

we're done. So now, the only case when we need to continue is if this condition over

here doesn't hold, namely, so let's suppose that H t-1 is equal to H' t-1

and we already know that, let's see, so M t is equal to M' t and

M t-1 is equal to M' t-1. Suppose it so happens that these two

conditions hold, well, you can see that now we can continue to iterate. In other

words, we can now apply exactly the same argument to H t-1 and H' t-1

and so iterating this again and again, we can iterate all the way to the

beginning of the message. Iterate to the beginning of the message, and one of two

things must hold, either we find the collision for the compression function