Quoting RFC #1321, Section 1:
"It is conjectured that it is computationally infeasible to produce
two messages having the same message digest, or to produce any
message having a given prespecified target message digest."
This conjecture has now been determined to be false. In fact, a single
laptop can perform both these tasks in a few minutes using a suitable
algorithm.
However, one might also conjecture that the probability of any two
arbitrary messages having the same [MD5] hash code would be 2^(-128).
Does the existence of a collision-generation algorithm for MD5
contradict this second conjecture?
(In other words, it is possible to *maliciously* alter a message without
affecting the MD5 hash, but what is the probability of an *accidental*
modification going undetected? Is it 2^(-128)? Or is it some higher
probability?)
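For scale: under the idealized assumption that MD5 behaves like a random
function onto 128 bits (which is exactly the property in question), the
chance of a single accidental alteration going undetected is 2^(-128). A
quick back-of-envelope sketch puts that in perspective:

```python
from fractions import Fraction

# Idealized model: one specific accidental alteration escapes an MD5
# comparison with probability exactly 2^-128.
p_miss = Fraction(1, 2 ** 128)

# Even across a billion file transfers, the expected number of
# undetected accidental corruptions is vanishingly small.
expected_misses = float(10 ** 9 * p_miss)
```

Whether real MD5 actually meets this ideal is the open question here.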
Invisible wrote:
> However, one might also conjecture that the probability of any two
> arbitrary messages having the same [MD5] hash code would be 2^(-128).
Actually, according to the Birthday Paradox, the probability in this
case is much lower.
All I *really* want to know is the probability of a random alteration to
a file going undetected by an MD5 hash comparison. But this depends on
just how "random" MD5 really is. Apparently it's not something anybody
has looked at much. The existence of a collision algorithm seems to
suggest that it's not all that random, but I don't know...
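The two probabilities in play here can be separated with a short
calculation. This is a sketch under the idealized assumption that MD5
acts as a random function onto 128 bits:

```python
from fractions import Fraction

def birthday_collision_prob(n_messages: int, bits: int = 128) -> float:
    """Chance that at least two of n random messages share a digest,
    assuming an ideal hash onto `bits` bits."""
    space = 2 ** bits
    p_distinct = Fraction(1)
    for i in range(n_messages):
        p_distinct *= Fraction(space - i, space)
    return float(1 - p_distinct)

# For exactly two messages this is just 2^-128; the birthday effect
# only becomes significant as the number of messages approaches 2^64.
```

So the birthday bound changes nothing for a single pair of files; it
only bites across large collections of them.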
Invisible <voi### [at] devnull> wrote:
> All I *really* want to know is the probability of a random alteration to
> a file going undetected by an MD5 hash comparison.
I think it's quite unlikely.
I wonder how much more unlikely it would be if you calculated two MD5
codes: The first one is calculated in the regular way, and the second one
is calculated starting from the second byte in the file forwards, and taking
the first byte as if it was the last (iow. effectively as if you had removed
the first byte of the file and appended it to the end of the file).
A random change which would keep one of the MD5 sums the same would
quite unlikely keep the other one the same too. I have no idea about the
probabilities, though.
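Warp's two-digest idea is easy to prototype with an existing MD5
library. This sketch uses Python's hashlib rather than working out any
probabilities:

```python
import hashlib

def double_md5(data: bytes) -> tuple[str, str]:
    """Hash the data twice: once as-is, and once as if the first byte
    had been removed and appended to the end (the rotation trick)."""
    normal = hashlib.md5(data).hexdigest()
    rotated = hashlib.md5(data[1:] + data[:1]).hexdigest()
    return normal, rotated
```

A change now has to slip past two digests computed over differently
aligned block boundaries, not just one.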
--
- Warp
Warp wrote:
> Invisible <voi### [at] devnull> wrote:
>> All I *really* want to know is the probability of a random alteration to
>> a file going undetected by an MD5 hash comparison.
>
> I think it's quite unlikely.
I think it's astronomically unlikely. I'd just like to be able to put a
hard number on that though. ;-)
> I wonder how much more unlikely it would be if you calculated two MD5
> codes: The first one is calculated in the regular way, and the second one
> is calculated starting from the second byte in the file forwards, and taking
> the first byte as if it was the last (iow. effectively as if you had removed
> the first byte of the file and appended it to the end of the file).
>
> A random change which would keep one of the MD5 sums the same would
> quite unlikely keep the other one the same too. I have no idea about the
> probabilities, though.
Alternatively, take the MD5 *and* the SHA-1 hash as well. Or better yet,
take the MD4, MD5, SHA-1, SHA-256, WHIRLPOOL and RIPEMD hashes.
But let's face it, "md5sums" is a pseudo-standard file format that is
quite widely recognised.
The real reason we're interested in this is for checking that CDs are
readable. Currently, we "check" the CD by looking at the directory
listing and confirming that every file is present and has the correct
size in bytes.
In other words, we check that the Table Of Contents on the CD has been
burned correctly. :-P
In a sense, the important point about MD5 is that checking a CD this way
ensures that every single bit of data is physically read from the
surface of the disk. If you can do that, the data is probably correct
anyway...
(Let us not forget that the CD itself puts the data through two separate
layers of Reed-Solomon codes anyway, not to mention filesystem metadata
probably has CRC32 checksums in there.)
If a CD is defective, you're far more likely to get read/write errors
from the operating system than actual garbage data, so perhaps the
actual hash value is irrelevant.
By the way, do you know how MD5 works? It's kind of interesting. I'm
currently trying to implement it in Haskell - but I'm being thwarted
because it uses that brain-deadness known as "little-endian integers".
(These really shouldn't exist...)
Basically, you take your file and split it into 512-bit blocks, padding
the final block to make it exactly 512 bits. (Actually it's slightly
more complicated than that.) Then, starting from the hash value
0123456789ABCDEFFEDCBA9876543210
you process each block using the MD5 "compression function", which
updates the hash value. Add the new hash value to the previous hash
value to get a final output, and then repeat for the next block. When
all blocks have been processed, whatever the final hash value is, *that*
is the MD5 hash value.
In particular, if you change one particular message block in such a way
that the compression function maps it to the same value, then the MD5
hash of the whole file will be unchanged. So you really want to
scrutinise that compression function...
The compression function itself consists of a loop repeated 16 times,
each time with different parameters. At each loop, you rotate the hash
round, and mutate one of the 32-bit words within it. You do this by
adding a mash value to it; this mash value is derived from
- A table of constants (derived from the sin(x) function)
- A bit-wise combination of the other 32-bit words in the hash. (The
exact combination varies throughout the loop.)
- A 32-bit word from the input block. (The exact one is determined by
the loop counter, but with a nontrivial relationship.)
The mash value is also bit-rotated by a loop-specific amount first.
And that, my friends, is basically the MD5 algorithm... It certainly
*looks* like it should give your data a good mashing, but on the other
hand, there are loads of places where you can say "hmm, if somehow the
same numbers got here, the results would be identical".
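The walk-through above (padding to 512-bit blocks, the little-endian
initial value, the 64-step compression function with its sine-derived
constant table, and adding the compressed state back into the chaining
value) can be sketched compactly. This is an illustrative Python sketch
of RFC 1321, not the Haskell attempt; in practice you'd just call
hashlib.md5:

```python
import math
import struct

# Per-step left-rotate amounts and the sine-derived constants (RFC 1321).
S = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
     [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)
K = [int(abs(math.sin(i + 1)) * 2 ** 32) & 0xFFFFFFFF for i in range(64)]

def _rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def md5(message: bytes) -> str:
    # Pad: a 1 bit, zeros to 56 mod 64 bytes, then the 64-bit LE bit length.
    msg = message + b'\x80'
    msg += b'\x00' * ((56 - len(msg) % 64) % 64)
    msg += struct.pack('<Q', (len(message) * 8) & 0xFFFFFFFFFFFFFFFF)

    # Initial value: the byte sequence 0123456789ABCDEFFEDCBA9876543210,
    # read as four little-endian 32-bit words.
    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476

    for off in range(0, len(msg), 64):
        M = struct.unpack('<16I', msg[off:off + 64])
        A, B, C, D = a0, b0, c0, d0
        for i in range(64):  # 64 steps, in four rounds of 16
            if i < 16:
                F, g = (B & C) | (~B & D), i
            elif i < 32:
                F, g = (D & B) | (~D & C), (5 * i + 1) % 16
            elif i < 48:
                F, g = B ^ C ^ D, (3 * i + 5) % 16
            else:
                F, g = C ^ (B | ~D), (7 * i) % 16
            # The "mash value": constant + word-mix + input word, rotated.
            F = (F + A + K[i] + M[g]) & 0xFFFFFFFF
            A, D, C = D, C, B
            B = (B + _rotl(F, S[i])) & 0xFFFFFFFF
        # Add the compressed state back into the chaining value.
        a0, b0 = (a0 + A) & 0xFFFFFFFF, (b0 + B) & 0xFFFFFFFF
        c0, d0 = (c0 + C) & 0xFFFFFFFF, (d0 + D) & 0xFFFFFFFF

    return struct.pack('<4I', a0, b0, c0, d0).hex()
```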
Invisible wrote:
> The compression function itself consists of a loop repeated 16 times,
> each time with different parameters.
Gah! 64 times... >_<
Maybe I should just stop typing now? :-S
> By the way, do you know how MD5 works?
I would assume Warp knows... :)
Invisible wrote:
> Quoting RFC #1321, Section 1:
>
> "It is conjectured that it is computationally infeasible to produce
> two messages having the same message digest, or to produce any
> message having a given prespecified target message digest."
>
> This conjecture has now been determined to be false. In fact, a single
> laptop can perform both these tasks in a few minutes using a suitable
> algorithm.
Do you have a cite for this? I'd heard of the first problem being
broken, but not the second one.
> However, one might also conjecture that the probability of any two
> arbitrary messages having the same [MD5] hash code would be 2^(-128).
>
> Does the existence of a collision-generation algorithm for MD5
> contradict this second conjecture?
Only if you do it on purpose. Obviously, two *arbitrary* messages will
have a low probability of a hit.
> (In othe words, it is possible to *maliciously* alter a message without
> affecting the MD5 hash, but what is the probability of an *accidental*
> modification going undetected? Is it 2^(-128)? Or is it some higher
> probability?)
It's still a low probability. You have to alter the message in a way
that takes specifically into account how MD5 works. That's why the same
technique doesn't work on other hashes.
> Actually, according to the Birthday Paradox, the probability in this
> case is much lower.
Not with only two messages.
> But this depends on just how "random" MD5 really is. Apparently it's
> not something anybody has looked at much.
People looked for any collision for a decade or more before the folks
who professionally analyze cryptographic systems figured out how to
cause a collision. It's sufficiently unlikely that a random change in a
file will cause you any grief that you don't need to worry about it.
The reason people worry about it is they fear an active, intelligent
adversary changing that which has been signed.
> In a sense, the important point about MD5 is that checking a CD this
> way ensures that every single bit of data is physically read from the
> surface of the disk. If you can do that, the data is probably correct
> anyway...
Since the TOC is at the start of the CD, I find this much more likely to
be broken. Checksum the TOC, the first file, and the last file, and
*then* you don't have to read the whole CD. I'd say a third of my CDs
that turn into coasters have a good TOC and no actual file data on them.
> there are loads of places where you can say "hmm, if somehow the same
> numbers got here, the results would be identical".
And that's exactly how they went about cracking it. :-)
--
Darren New / San Diego, CA, USA (PST)
Remember the good old days, when we
used to complain about cryptography
being export-restricted?
Darren New wrote:
> Invisible wrote:
>> Quoting RFC #1321, Section 1:
>>
>> "It is conjectured that it is computationally infeasible to produce
>> two messages having the same message digest, or to produce any
>> message having a given prespecified target message digest."
>>
>> This conjecture has now been determined to be false. In fact, a single
>> laptop can perform both these tasks in a few minutes using a suitable
>> algorithm.
>
> Do you have a cite for this? I'd heard of the first problem being
> broken, but not the second one.
Wikipedia. (So... utterly trustworthy then. Mind you, I believe there's
a reference at the bottom.)
You know how it is - somebody posts a near-break, somebody else extends
it, somebody else makes it go faster, etc.
>> However, one might also conjecture that the probability of any two
>> arbitrary messages having the same [MD5] hash code would be 2^(-128).
>>
>> Does the existence of a collision-generation algorithm for MD5
>> contradict this second conjecture?
>
> Only if you do it on purpose. Obviously, two *arbitrary* messages will
> have a low probability of a hit.
Well... the fact that collisions can be constructed proves that they
exist, which might suggest the function is "less random" than it should
be...
>> (In other words, it is possible to *maliciously* alter a message
>> without affecting the MD5 hash, but what is the probability of an
>> *accidental* modification going undetected? Is it 2^(-128)? Or is it
>> some higher probability?)
>
> It's still a low probability. You have to alter the message in a way
> that takes specifically into account how MD5 works. That's why the same
> technique doesn't work on other hashes.
True. I'd just like to put a number to it, that's all. ;-)
>> Actually, according to the Birthday Paradox, the probability in this
>> case is much lower.
>
> Not with only two messages.
Probability is defined as "if you did this X times, Y times the event
would occur". So even with 2 messages, it matters.
>> But this depends on just how "random" MD5 really is. Apparently it's
>> not something anybody has looked at much.
>
> People looked for any collision for a decade or more before the folks
> who professionally analyze cryptographic systems figured out how to
> cause a collision. It's sufficiently unlikely a random change in a file
> will cause you any grief that you need to worry about it.
Well, if people searched for decades without finding any collisions,
they must be reasonably rare then. (At least, for the kinds of messages
people tried hashing.) That suggests that at least the function doesn't
collide *massively* more often than it should...
> The reason people worry about it is they fear an active, intelligent
> adversary changing that which has been signed.
Ah yeah, sure, MD5 isn't cryptographically secure. I get that. I'm just
wondering how useful it is for detecting random alterations.
>> In a sense, the important point about MD5 is that checking a CD this
>> way ensures that every single bit of data is physically read from the
>> surface of the disk. If you can do that, the data is probably correct
>> anyway...
>
> I'd say a third of my CDs that turn into coasters have a good TOC and
> no actual file data on them.
...which is why just checking that you can "see" the filenames is no
good at all. Clearly the person who wrote the official procedure is a
legal expert not a computer expert...
>> there are loads of places where you can say "hmm, if somehow the same
>> numbers got here, the results would be identical".
>
> And that's exactly how they went about cracking it. :-)
The worrying thing being, how likely is it that such a thing might just
happen by accident? :-}
> Well... the fact that collisions can be constructed prove that they
> exist, which might suggest the function is "less random" than it should
> be...
The fact that the digest is smaller than the input data proves that
collisions exist. If the digest is 1 byte smaller than the input data,
then on average 256 distinct inputs map to each digest value.
The question is how easily you can "construct" a collision (if the only
way is testing *all* combinations of bits, the hash function would be
"ideal").
> By the way, do you know how MD5 works? It's kind of interesting. I'm
> currently trying to implement it in Haskell - but I'm being thwarted
> because it uses that brain-deadness known as "little-endian integers".
> (These really shouldn't exist...)
Someone at Oxford has done it, but I think it uses the FFI.
http://web.comlab.ox.ac.uk/oucl/work/ian.lynagh/md5/haskell-md5-0.1.0/