POV-Ray: Newsgroups: povray.off-topic: The nebulous question of probability: Re: The nebulous question of probability

POV-Ray : Newsgroups : povray.off-topic : The nebulous question of probability : Re: The nebulous question of probability		Server Time 5 Jul 2025 22:07:36 EDT (-0400)
From: Invisible
Date: 15 Nov 2007 11:30:09
Message: <473c7411$1@news.povray.org>
Warp wrote:
> Invisible <voi### [at] devnull> wrote:
>> All I *really* want to know is the probability of a random alteration to 
>> a file going undetected by an MD5 hash comparison.
> 
>   I think it's quite unlikely.

I think it's astronomically unlikely. I'd just like to be able to put a 
hard number on that though. ;-)

>   I wonder how much more unlikely it would be if you calculated two MD5
> codes: The first one is calculated in the regular way, and the second one
> is calculated starting from the second byte in the file forwards, and taking
> the first byte as if it was the last (iow. effectively as if you had removed
> the first byte of the file and appended it to the end of the file).
> 
>   A random change which would keep one of the MD5 sums the same would
> quite unlikely keep the other one the same too. I have no idea about the
> probabilities, though.

Alternatively, take the MD5 *and* the SHA-1 hash as well. Or better yet, 
take the MD4, MD5, SHA-1, SHA-256, WIRLPOOL and RIPEMD hashes.

But let's face it, "md5sums" is a psuedo-standard file format that is 
quite widely recognised.

The real reason we're interested in this is for checking that CDs are 
readable. Currently, we "check" the CD by looking at the directory 
listing and confirming that every file is present and has the correct 
size in bytes.

In other words, we check that the Table Of Contents on the CD has been 
burned correctly. :-P

In a sense, the important point about MD5 is that checking a CD this way 
ensures that every single bit of data is physically read from the 
surface of the disk. If you can do that, the data is probably correct 
anyway...

(Let us not forget that the CD itself puts the data through two seperate 
layers of Reed-Solomon codes anyway, not to mention filesystem metadata 
probably has CRC32 checksums in there.)

If a CD is defective, you're far more likely to get read/write errors 
from the operating system than actual garbage data, so perhaps the 
actual hash value is irrelevant.




By the way, do you know how MD5 works? It's kind of interesting. I'm 
currently trying to implement it in Haskell - but I'm being thwarted 
because it uses that brain-deadness known as "little-endian integers". 
(These really shouldn't exist...)

Basically, you take your file and split it into 512-bit blocks, padding 
the final block to make it exactly 512 bits. (Actually it's slightly 
more complicated than that.) Then, starting from the hash value

   0123456789ABCDEFFEDCBA9876543210

you process each block using the MD5 "compression function", which 
updates the hash value. Add the new hash value to the previous hash 
value to get a final output, and then repeat for the next block. When 
all blocks have been processed, whatever the final hash value is, *that* 
is the MD5 hash value.

In particular, if you change one particular message block in such a way 
that the compression function maps it to the same value, then the MD5 
hash of the whole file will be unchanged. So you really want to 
scrutinise that compression function...

The compression function itself consists of a loop repeated 16 times, 
each time with different parameters. At each loop, you rotate the hash 
round, and mutate one of the 32-bit words within it. You do this by 
adding a mash value to it; this mash value is derived from

- A table of constants (derived from the sin(x) function)
- A bit-wise combination of the other 32-bit words in the hash. (The 
exact combination varies throughout the loop.)
- A 32-bit word from the input block. (The exact one is determined by 
the loop counter, but with a nontrivial relationship.)

The mash value is also bit-rotated by a loop-specific amount first.

And that, my friends, is basically the MD5 algorithm... It certainly 
*looks* like it should give your data a good mashing, but on the other 
hand, there are loads of places where you can say "hmm, if somehow the 
same numbers got here, the results would be identical".
Post a reply to this message