POV-Ray : Newsgroups : povray.off-topic : The nebulous question of probability
From: Invisible
Subject: The nebulous question of probability
Date: 15 Nov 2007 06:30:58
Message: <473c2df2@news.povray.org>
Quoting RFC #1321, Section 1:

"It is conjectured that it is computationally infeasible to produce
two messages having the same message digest, or to produce any
message having a given prespecified target message digest."

This conjecture has now been determined to be false. In fact, a single 
laptop can perform both these tasks in a few minutes using a suitable 
algorithm.

However, one might also conjecture that the probability of any two 
arbitrary messages having the same [MD5] hash code would be 2^(-128).

Does the existence of a collision-generation algorithm for MD5 
contradict this second conjecture?

(In other words, it is possible to *maliciously* alter a message without 
affecting the MD5 hash, but what is the probability of an *accidental* 
modification going undetected? Is it 2^(-128)? Or is it some higher 
probability?)
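
(For reference, under the idealised assumption that MD5 behaves like a 
perfect 128-bit hash, the single-pair figure is easy to write down; a 
quick Haskell sketch, which assumes nothing about MD5 itself:

   -- Probability that one particular random alteration produces the same
   -- digest, for an *ideal* 128-bit hash. This is the idealised figure,
   -- not a measured property of MD5.
   idealMatch :: Double
   idealMatch = 2 ** (-128)   -- roughly 2.9e-39
)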



From: Invisible
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 09:06:13
Message: <473c5255$1@news.povray.org>
Invisible wrote:

> However, one might also conjecture that the probability of any two 
> arbitrary messages having the same [MD5] hash code would be 2^(-128).

Actually, according to the Birthday Paradox, the probability in this 
case is much lower.

All I *really* want to know is the probability of a random alteration to 
a file going undetected by an MD5 hash comparison. But this depends on 
just how "random" MD5 really is. Apparently it's not something anybody 
has looked at much. The existence of a collision algorithm seems to 
suggest that it's not all that random, but I don't know...
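
For the many-files case, here is the usual birthday-bound approximation 
as a Haskell sketch, again assuming an ideal 128-bit hash rather than 
anything specific about MD5:

   -- Approximate probability that at least two of n random files share a
   -- digest, for an ideal 128-bit hash (the approximation is fine while
   -- the result is tiny):
   --   p  ~  n*(n-1) / (2 * 2^128)
   birthdayCollision :: Double -> Double
   birthdayCollision n = n * (n - 1) / (2 * 2 ** 128)

With n = 2 this reduces to 2^(-128); with a billion files it is still 
only about 1.5e-21.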



From: Warp
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 10:41:09
Message: <473c6895@news.povray.org>
Invisible <voi### [at] devnull> wrote:
> All I *really* want to know is the probability of a random alteration to 
> a file going undetected by an MD5 hash comparison.

  I think it's quite unlikely.

  I wonder how much more unlikely it would be if you calculated two MD5
codes: The first one is calculated in the regular way, and the second one
is calculated starting from the second byte in the file forwards, and taking
the first byte as if it was the last (iow. effectively as if you had removed
the first byte of the file and appended it to the end of the file).

  A random change which would keep one of the MD5 sums the same would
quite unlikely keep the other one the same too. I have no idea about the
probabilities, though.
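
Roughly this, as a Haskell sketch (the actual digest function is left as 
a parameter here rather than assuming any particular MD5 library):

   import qualified Data.ByteString.Lazy as BL

   -- Move the first byte of the message to the end, as described above.
   rotateFirstByte :: BL.ByteString -> BL.ByteString
   rotateFirstByte bs = case BL.uncons bs of
                          Nothing        -> bs
                          Just (b, rest) -> rest `BL.snoc` b

   -- Compute both digests; a random change would have to survive both
   -- comparisons to go undetected.
   dualDigest :: (BL.ByteString -> digest) -> BL.ByteString -> (digest, digest)
   dualDigest h msg = (h msg, h (rotateFirstByte msg))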

-- 
                                                          - Warp



From: Invisible
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 11:30:09
Message: <473c7411$1@news.povray.org>
Warp wrote:
> Invisible <voi### [at] devnull> wrote:
>> All I *really* want to know is the probability of a random alteration to 
>> a file going undetected by an MD5 hash comparison.
> 
>   I think it's quite unlikely.

I think it's astronomically unlikely. I'd just like to be able to put a 
hard number on that though. ;-)

>   I wonder how much more unlikely it would be if you calculated two MD5
> codes: The first one is calculated in the regular way, and the second one
> is calculated starting from the second byte in the file forwards, and taking
> the first byte as if it was the last (iow. effectively as if you had removed
> the first byte of the file and appended it to the end of the file).
> 
>   A random change which would keep one of the MD5 sums the same would
> quite unlikely keep the other one the same too. I have no idea about the
> probabilities, though.

Alternatively, take the MD5 *and* the SHA-1 hash as well. Or better yet, 
take the MD4, MD5, SHA-1, SHA-256, WHIRLPOOL and RIPEMD hashes.

But let's face it, "md5sums" is a pseudo-standard file format that is 
quite widely recognised.

The real reason we're interested in this is for checking that CDs are 
readable. Currently, we "check" the CD by looking at the directory 
listing and confirming that every file is present and has the correct 
size in bytes.

In other words, we check that the Table Of Contents on the CD has been 
burned correctly. :-P

In a sense, the important point about MD5 is that checking a CD this way 
ensures that every single bit of data is physically read from the 
surface of the disk. If you can do that, the data is probably correct 
anyway...

(Let us not forget that the CD itself puts the data through two separate 
layers of Reed-Solomon codes anyway, not to mention filesystem metadata 
probably has CRC32 checksums in there.)

If a CD is defective, you're far more likely to get read/write errors 
from the operating system than actual garbage data, so perhaps the 
actual hash value is irrelevant.




By the way, do you know how MD5 works? It's kind of interesting. I'm 
currently trying to implement it in Haskell - but I'm being thwarted 
because it uses that brain-deadness known as "little-endian integers". 
(These really shouldn't exist...)

Basically, you take your file and split it into 512-bit blocks, padding 
the final block to make it exactly 512 bits. (Actually it's slightly 
more complicated than that.) Then, starting from the hash value

   0123456789ABCDEFFEDCBA9876543210

you process each block using the MD5 "compression function", which 
updates the hash value. Add the new hash value to the previous hash 
value to get a final output, and then repeat for the next block. When 
all blocks have been processed, whatever the final hash value is, *that* 
is the MD5 hash value.
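
As a skeleton in Haskell (only the outer structure; the padding and the 
per-block compression function are left undefined here, since those are 
the fiddly parts):

   import Data.Word (Word32)
   import qualified Data.ByteString.Lazy as BL

   -- The running hash value: four 32-bit words (A, B, C, D).
   type MD5State = (Word32, Word32, Word32, Word32)

   -- MD5's fixed starting value; written out byte by byte this is the
   -- 0123456789ABCDEFFEDCBA9876543210 string above.
   initState :: MD5State
   initState = (0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476)

   compress :: MD5State -> BL.ByteString -> MD5State
   compress = undefined   -- the per-block compression function

   pad :: BL.ByteString -> BL.ByteString
   pad = undefined        -- append 0x80, zero bytes, then the 64-bit length

   -- Split the padded message into 64-byte (512-bit) blocks, push each
   -- block through the compression function, and add the result word-wise
   -- (mod 2^32, which Word32 arithmetic gives for free) to the previous
   -- hash value.
   md5Skeleton :: BL.ByteString -> MD5State
   md5Skeleton msg = foldl step initState (blocks (pad msg))
     where
       step st@(a, b, c, d) blk =
         let (a', b', c', d') = compress st blk
         in  (a + a', b + b', c + c', d + d')
       blocks bs
         | BL.null bs = []
         | otherwise  = let (h, t) = BL.splitAt 64 bs in h : blocks t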

In particular, if you change one particular message block in such a way 
that the compression function maps it to the same value, then the MD5 
hash of the whole file will be unchanged. So you really want to 
scrutinise that compression function...

The compression function itself consists of a loop repeated 16 times, 
each time with different parameters. At each loop, you rotate the hash 
round, and mutate one of the 32-bit words within it. You do this by 
adding a mash value to it; this mash value is derived from

- A table of constants (derived from the sin(x) function)
- A bit-wise combination of the other 32-bit words in the hash. (The 
exact combination varies throughout the loop.)
- A 32-bit word from the input block. (The exact one is determined by 
the loop counter, but with a nontrivial relationship.)

The mash value is also bit-rotated by a loop-specific amount first.
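
For concreteness, a single round step might look like this in Haskell 
(shown with the first-round combiner F(b,c,d) = (b AND c) OR (NOT b AND d); 
the sine-table constant, message word and rotation amount are parameters 
because they change from round to round):

   import Data.Bits (complement, rotateL, (.&.), (.|.))
   import Data.Word (Word32)

   -- One round step: build the "mash" value from the other three words,
   -- the sine-table constant and the selected message word, rotate it,
   -- add it into one word, then rotate the whole state one position.
   roundStep :: (Word32, Word32, Word32, Word32)   -- current (a, b, c, d)
             -> Word32                             -- message word for this round
             -> Word32                             -- sine-table constant
             -> Int                                -- per-round rotation amount
             -> (Word32, Word32, Word32, Word32)
   roundStep (a, b, c, d) m k s =
     let f    = (b .&. c) .|. (complement b .&. d)   -- first-round combiner
         mash = a + f + m + k
     in  (d, b + rotateL mash s, b, c)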

And that, my friends, is basically the MD5 algorithm... It certainly 
*looks* like it should give your data a good mashing, but on the other 
hand, there are loads of places where you can say "hmm, if somehow the 
same numbers got here, the results would be identical".



From: Invisible
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 11:33:10
Message: <473c74c6$1@news.povray.org>
Invisible wrote:

> The compression function itself consists of a loop repeated 16 times, 
> each time with different parameters.

Gah! 64 times... >_<

Maybe I should just stop typing now? :-S



From: Nicolas Alvarez
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 11:40:26
Message: <473c767a$1@news.povray.org>

> By the way, do you know how MD5 works?

I would assume Warp knows... :)



From: Darren New
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 16:39:29
Message: <473cbc91$1@news.povray.org>
Invisible wrote:
> Quoting RFC #1321, Section 1:
> 
> "It is conjectured that it is computationally infeasible to produce
> two messages having the same message digest, or to produce any
> message having a given prespecified target message digest."
> 
> This conjecture has now been determined to be false. In fact, a single 
> laptop can perform both these tasks in a few minutes using a suitable 
> algorithm.

Do you have a cite for this? I'd heard of the first problem being 
broken, but not the second one.

> However, one might also conjecture that the probability of any two 
> arbitrary messages having the same [MD5] hash code would be 2^(-128).
> 
> Does the existence of a collision-generation algorithm for MD5 
> contradict this second conjecture?

Only if you do it on purpose. Obviously, two *arbitrary* messages will 
have a low probability of a hit.

> (In other words, it is possible to *maliciously* alter a message without 
> affecting the MD5 hash, but what is the probability of an *accidental* 
> modification going undetected? Is it 2^(-128)? Or is it some higher 
> probability?)

It's still a low probability. You have to alter the message in a way 
that takes specifically into account how MD5 works. That's why the same 
technique doesn't work on other hashes.

> Actually, according to the Birthday Paradox, the probability in this
> case is much lower.

Not with only two messages.

> But this depends on just how "random" MD5 really is. Apparently it's
> not something anybody has looked at much.

People looked for any collision for a decade or more before the folks 
who professionally analyze cryptographic systems figured out how to 
cause a collision. A random change in a file is sufficiently unlikely to 
cause you any grief that you don't need to worry about it.

The reason people worry about it is they fear an active, intelligent 
adversary changing that which has been signed.

> In a sense, the important point about MD5 is that checking a CD this
> way ensures that every single bit of data is physically read from the
> surface of the disk. If you can do that, the data is probably correct
> anyway...

Since the TOC is at the start of the CD, I find this much more likely to 
be broken. Checksum the TOC, the first file, and the last file, and 
*then* you don't have to read the whole CD. I'd say 1/3rd of my CDs that 
coaster have a good TOC and no actual file data on them.

> there are loads of places where you can say "hmm, if somehow the same
> numbers got here, the results would be identical".

And that's exactly how they went about cracking it. :-)

-- 
   Darren New / San Diego, CA, USA (PST)
     Remember the good old days, when we
     used to complain about cryptography
     being export-restricted?



From: Orchid XP v7
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 16:57:25
Message: <473cc0c5$1@news.povray.org>
Darren New wrote:
> Invisible wrote:
>> Quoting RFC #1321, Section 1:
>>
>> "It is conjectured that it is computationally infeasible to produce
>> two messages having the same message digest, or to produce any
>> message having a given prespecified target message digest."
>>
>> This conjecture has now been determined to be false. In fact, a single 
>> laptop can perform both these tasks in a few minutes using a suitable 
>> algorithm.
> 
> Do you have a cite for this? I'd heard of the first problem being 
> broken, but not the second one.

Wikipedia. (So... utterly trustworthy then. Mind you, I believe there's 
a reference at the bottom.)

You know how it is - somebody posts a near-break, somebody else extends 
it, somebody else makes it go faster, etc.

>> However, one might also conjecture that the probability of any two 
>> arbitrary messages having the same [MD5] hash code would be 2^(-128).
>>
>> Does the existence of a collision-generation algorithm for MD5 
>> contradict this second conjecture?
> 
> Only if you do it on purpose. Obviously, two *arbitrary* messages will 
> have a low probability of a hit.

Well... the fact that collisions can be constructed proves that they 
exist, which might suggest the function is "less random" than it should 
be...

>> (In other words, it is possible to *maliciously* alter a message 
>> without affecting the MD5 hash, but what is the probability of an 
>> *accidental* modification going undetected? Is it 2^(-128)? Or is it 
>> some higher probability?)
> 
> It's still a low probability. You have to alter the message in a way 
> that takes specifically into account how MD5 works. That's why the same 
> technique doesn't work on other hashes.

True. I'd just like to put a number to it, that's all. ;-)

>> Actually, according to the Birthday Paradox, the probability in this 
>> case is much lower.
> 
> Not with only two messages.

Probability is defined as "if you did this X times, Y times the event 
would occur". So even with 2 messages, it matters.

>> But this depends on just how "random" MD5 really is. Apparently it's 
>> not something anybody has looked at much. 
> 
> People looked for any collision for a decade or more before the folks 
> who professionally analyze cryptographic systems figured out how to 
> cause a collision. It's sufficiently unlikely a random change in a file 
> will cause you any grief that you need to worry about it.

Well, if people searched for decades without finding any collisions, 
they must be reasonably rare then. (At least, for the kinds of messages 
people tried hashing.) That suggests that at least the function doesn't 
collide *massively* more often than it should...

> The reason people worry about it is they fear an active, intelligent 
> adversary changing that which has been signed.

Ah yeah, sure, MD5 isn't cryptographically secure. I get that. I'm just 
wondering how useful it is for detecting random alterations.

>> In a sense, the important point about MD5 is that checking a CD this 
>> way ensures that every single bit of data is physically read from the 
>> surface of the disk. If you can do that, the data is probably correct 
>> anyway... 
> 
> I'd say 1/3rd of my CDs that 
> coaster have a good TOC and no actual file data on them.

...which is why just checking that you can "see" the filenames is no 
good at all. Clearly the person who wrote the official procedure is a 
legal expert not a computer expert...

>> there are loads of places where you can say "hmm, if somehow the same 
>> numbers got here, the results would be identical".
> 
> And that's exactly how they went about cracking it. :-)

The worrying thing being, how likely is it that such a thing might just 
happen by accident? :-}



From: Nicolas Alvarez
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 17:12:35
Message: <473cc453$1@news.povray.org>

> Well... the fact that collisions can be constructed prove that they 
> exist, which might suggest the function is "less random" than it should 
> be...

The fact that the digest is smaller than the input data proves that 
collisions exist. If the input is just 1 byte longer than the digest, 
then on average 256 different inputs map to each digest value.

The question is how easily you can "construct" a collision (if the only 
way were testing *all* combinations of bits, the hash function would be 
"ideal").



From: Tim Attwood
Subject: Re: The nebulous question of probability
Date: 15 Nov 2007 20:22:07
Message: <473cf0bf$1@news.povray.org>
> By the way, do you know how MD5 works? It's kind of interesting. I'm 
> currently trying to implement it in Haskell - but I'm being thwarted 
> because it uses that brain-deadness known as "little-endian integers". 
> (These really shouldn't exist...)

Someone at Oxford has done it, but I think it uses the FFI.
http://web.comlab.ox.ac.uk/oucl/work/ian.lynagh/md5/haskell-md5-0.1.0/



