Warp wrote:
> Invisible <voi### [at] devnull> wrote:
>> All I *really* want to know is the probability of a random alteration to
>> a file going undetected by an MD5 hash comparison.
>
> I think it's quite unlikely.
I think it's astronomically unlikely. I'd just like to be able to put a
hard number on that though. ;-)
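(For what it's worth: if you idealise MD5 as a random function over its
128-bit output - which is an assumption about MD5, not something the
design guarantees - then the chance of a single random alteration
slipping past is about one in 2^128:

    -- Back-of-envelope figure only: treats MD5 as an ideal
    -- random function over its 128-bit output.
    undetectedChance :: Double
    undetectedChance = 2 ** (-128)   -- roughly 2.9e-39

So "astronomically unlikely" seems fair, under that idealisation.)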
> I wonder how much more unlikely it would be if you calculated two MD5
> codes: The first one is calculated in the regular way, and the second one
> is calculated starting from the second byte in the file forwards, and taking
> the first byte as if it was the last (iow. effectively as if you had removed
> the first byte of the file and appended it to the end of the file).
>
> A random change which would keep one of the MD5 sums the same would
> quite unlikely keep the other one the same too. I have no idea about the
> probabilities, though.
Alternatively, take the MD5 *and* the SHA-1 hash as well. Or better yet,
take the MD4, MD5, SHA-1, SHA-256, Whirlpool and RIPEMD hashes.
But let's face it, "md5sums" is a pseudo-standard file format that is
quite widely recognised.
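Just to sketch what the belt-and-braces version might look like in
Haskell (this assumes the cryptonite package's Crypto.Hash module,
which is my choice of library, not anything the md5sums format
specifies):

    import           Crypto.Hash     (Digest, MD5 (..), SHA1 (..), hashWith)
    import qualified Data.ByteString as BS

    -- Warp's idea: hash the data as-is, then hash it again with the
    -- first byte moved to the end.  A change that happens to preserve
    -- one digest is very unlikely to preserve the other.
    rotatedDigests :: BS.ByteString -> (Digest MD5, Digest MD5)
    rotatedDigests bs = (hashWith MD5 bs, hashWith MD5 rotated)
      where
        rotated = case BS.uncons bs of
                    Nothing        -> bs
                    Just (b, rest) -> rest `BS.snoc` b

    -- Or simply compute two digests with unrelated algorithms.
    bothDigests :: BS.ByteString -> (Digest MD5, Digest SHA1)
    bothDigests bs = (hashWith MD5 bs, hashWith SHA1 bs)

Either way the extra checking is cheap; the awkward part is that no
widely recognised file format records the combination.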
The real reason we're interested in this is for checking that CDs are
readable. Currently, we "check" the CD by looking at the directory
listing and confirming that every file is present and has the correct
size in bytes.
In other words, we check that the Table Of Contents on the CD has been
burned correctly. :-P
In a sense, the important point about MD5 is that checking a CD this way
ensures that every single bit of data is physically read from the
surface of the disk. If you can do that, the data is probably correct
anyway...
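And computing an MD5 sum per file gives you exactly that guarantee. A
minimal sketch, again assuming cryptonite (hashlazy consumes the
lazily-read file chunk by chunk, so every byte of the file has to be
read):

    import           Crypto.Hash          (Digest, MD5, hashlazy)
    import qualified Data.ByteString.Lazy as BL

    -- The digest cannot be produced without reading the whole file,
    -- which is really the property we care about when verifying a CD.
    fileMD5 :: FilePath -> IO (Digest MD5)
    fileMD5 path = hashlazy <$> BL.readFile path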
(Let us not forget that the CD itself puts the data through two separate
layers of Reed-Solomon codes anyway, not to mention filesystem metadata
probably has CRC32 checksums in there.)
If a CD is defective, you're far more likely to get read errors from
the operating system than actual garbage data, so perhaps the fine
details of the hash function hardly matter anyway.
By the way, do you know how MD5 works? It's kind of interesting. I'm
currently trying to implement it in Haskell - but I'm being thwarted
because it uses that brain-deadness known as "little-endian integers".
(These really shouldn't exist...)
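The endianness itself isn't hard to deal with, just irritating: the
first byte is the least significant. A hand-rolled helper looks
something like this (Data.Binary.Get's getWord32le would do the same
job):

    import Data.Bits (shiftL, (.|.))
    import Data.Word (Word32, Word8)

    -- Assemble a 32-bit word from four bytes in little-endian order.
    word32LE :: Word8 -> Word8 -> Word8 -> Word8 -> Word32
    word32LE b0 b1 b2 b3 =
            fromIntegral b0
        .|. fromIntegral b1 `shiftL` 8
        .|. fromIntegral b2 `shiftL` 16
        .|. fromIntegral b3 `shiftL` 24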
Basically, you take your file and split it into 512-bit blocks, padding
the final block to make it exactly 512 bits. (Actually it's slightly
more complicated than that - the padding also encodes the message
length.) Then, starting from the fixed initial hash value
0123456789ABCDEFFEDCBA9876543210
you process each block with the MD5 "compression function". Its output
is added, word by word modulo 2^32, to the previous hash value, and that
sum becomes the hash value fed to the next block. When all the blocks
have been processed, whatever the final hash value is, *that* is the
MD5 hash.
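The chaining structure looks roughly like this - a sketch only, with the
compression function left as a stub, and with that hex string above read
as four little-endian 32-bit words:

    import Data.List (foldl')
    import Data.Word (Word32)

    -- Hash state: four 32-bit words (A, B, C, D).
    type HashState = (Word32, Word32, Word32, Word32)
    type Block     = [Word32]   -- one 512-bit block as sixteen 32-bit words

    -- The initial value: the bytes 01 23 45 67 ... 32 10 read as
    -- four little-endian words.
    initialState :: HashState
    initialState = (0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476)

    -- Word-wise addition modulo 2^32.
    addState :: HashState -> HashState -> HashState
    addState (a, b, c, d) (a', b', c', d') = (a + a', b + b', c + c', d + d')

    -- Run the compression function on each block, add its output to the
    -- incoming state, and feed the sum to the next block.  'compress' is
    -- where all the real work happens.
    md5Chain :: (HashState -> Block -> HashState) -> [Block] -> HashState
    md5Chain compress = foldl' step initialState
      where
        step h block = h `addState` compress h block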
In particular, if you change a message block in such a way that the
compression function maps it to the same output, then the MD5 hash of
the whole file is unchanged. So you really want to scrutinise that
compression function...
The compression function itself consists of a loop repeated 64 times
(four rounds of 16 steps), each time with different parameters. At each
step, you rotate the four words of the hash state round and mutate one
of them. You do this by adding a "mash" value to it; this mash value is
derived from
- A table of constants (derived from the sin(x) function)
- A bit-wise combination of the other 32-bit words in the hash. (The
exact combination varies throughout the loop.)
- A 32-bit word from the input block. (The exact one is determined by
the loop counter, but with a nontrivial relationship.)
The mash value is also bit-rotated by a loop-specific amount first.
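Put together, one round-1 step looks something like this - a sketch of
a single step only, with the message word, the sin-derived constant and
the shift amount passed in rather than looked up from the schedule
tables, and with the round-2/3/4 mixing functions omitted:

    import Data.Bits (complement, rotateL, (.&.), (.|.))
    import Data.Word (Word32)

    -- One step of round 1 (there are four rounds of 16 steps).  The
    -- "mash" value is the parenthesised sum: the word being replaced,
    -- a bitwise mix of the other three words, one word of the message
    -- block, and a sin-derived constant - rotated left by a
    -- step-specific amount and then added to B.
    md5Step1 :: (Word32, Word32, Word32, Word32)  -- state (A, B, C, D)
             -> Word32                            -- message word for this step
             -> Word32                            -- sin-derived constant
             -> Int                               -- left-rotation amount
             -> (Word32, Word32, Word32, Word32)
    md5Step1 (a, b, c, d) m t s = (d, updated, b, c)  -- state rotates round too
      where
        f x y z = (x .&. y) .|. (complement x .&. z)  -- round-1 mixing function
        updated = b + rotateL (a + f b c d + m + t) s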
And that, my friends, is basically the MD5 algorithm... It certainly
*looks* like it should give your data a good mashing, but on the other
hand, there are loads of places where you can say "hmm, if somehow the
same numbers got here, the results would be identical".