> RW> MD5 (or any other hashing algorithm) is a lot more expensive
> RW> than a comparison and especially so if MD5 needs to process 2G
> RW> of data while the comparison would only need 8K.
> You make several unfounded assumptions here. [...]
> Two, that the number of comparisons is small. The more comparisons you
> have, the more the advantage goes to the hashing algorithm. If you have
> 2 files, it is best to read the first 8K of each and compare them,
> since, as you note, odds are that any differences will appear early on.
> If you have 1000 files, reading the first 8K of each file for
> comparison purposes means a great deal of seeking and reading;
It's about the same amount of seeking and a lot less reading than
computing a hash of each of the 1000 files, at least as long as the
files are much larger than 8k: the comparison reads at most 8k per
file (about 8 MB for all 1000 files), while hashing has to read every
byte of every file.
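
Something like this (Python, untested, just to sketch the idea; it
assumes you already have the list of paths with the same size, and the
names are mine) reads at most 8k per file:

from collections import defaultdict

PREFIX = 8 * 1024   # 8k, as above

def group_by_prefix(same_size_paths):
    # Read only the first 8k of each file and bucket the paths by
    # that prefix. Total I/O is bounded by 8k per file, no matter
    # how big the files are.
    groups = defaultdict(list)
    for path in same_size_paths:
        with open(path, "rb") as f:
            groups[f.read(PREFIX)].append(path)
    # Only buckets with more than one member can still hold duplicates.
    return [g for g in groups.values() if len(g) > 1]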
> and then you either store the first 8K, leading to a large working set
> (and the first time you swap, you've lost anything you won by avoiding
> calculating hashes),
8k * 1000 is 8 MB. That's negligible. And you only have to store this if
there are actually 1000 files of the same size.
There is also a hybrid approach:
For each group of files of the same size, you could initially read only
the first 8k (or some other size large enough to find the first
difference with high probability, but small enough to be dwarfed by the
overhead of open(2)), and if those prefixes are identical, switch to
computing a hash (and as Ben said, you can use something like SHA512,
where a collision is IMHO less likely than a false positive due to a
hardware or software error).
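
A rough sketch of that hybrid in Python (untested, names are mine,
and again assuming the paths have already been grouped by size, e.g.
via stat):

import hashlib
from collections import defaultdict

PREFIX = 8 * 1024   # large enough to catch most differences early

def find_duplicates(same_size_paths):
    # Stage 1: group by the first 8k (cheap, bounded read per file).
    by_prefix = defaultdict(list)
    for path in same_size_paths:
        with open(path, "rb") as f:
            by_prefix[f.read(PREFIX)].append(path)

    # Stage 2: only for groups whose prefixes collide, hash the full
    # contents with SHA512 and group by digest.
    duplicates = []
    for group in by_prefix.values():
        if len(group) < 2:
            continue            # unique prefix, can't be a duplicate
        by_digest = defaultdict(list)
        for path in group:
            h = hashlib.sha512()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.digest()].append(path)
        duplicates.extend(g for g in by_digest.values() if len(g) > 1)
    return duplicates

The full read (and the hash) only happens for files that already agree
in size and in their first 8k, i.e. for the likely duplicates.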
hp