Christos TZOTZIOY Georgiou
More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
use a strong hash).
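Spelled out in code, that scheme would be something along these lines (a bare sketch only; the function names and the 8K prefix used as the "cheap hash" are placeholders, not taken from any program posted in the thread):

    import os
    from collections import defaultdict

    CHEAP_HASH_BYTES = 8 * 1024   # arbitrary prefix size for the "cheap hash"

    def cheap_hash(path):
        """Weak, cheap hash: just the first few K of the file."""
        with open(path, 'rb') as f:
            return hash(f.read(CHEAP_HASH_BYTES))

    def same_contents(path_a, path_b, blocksize=64 * 1024):
        """Byte-by-byte comparison of two files of equal size."""
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            while True:
                block_a, block_b = fa.read(blocksize), fb.read(blocksize)
                if block_a != block_b:
                    return False
                if not block_a:        # both exhausted: files are identical
                    return True

    def find_duplicates(paths):
        # 1. group files by (file size, cheap hash)
        groups = defaultdict(list)
        for path in paths:
            groups[(os.path.getsize(path), cheap_hash(path))].append(path)
        # 2. within each group, compare each file against one representative
        #    of every distinct content already seen among earlier files
        duplicate_sets = []
        for group in groups.values():
            distinct = []              # one list of paths per distinct content
            for path in group:
                for paths_alike in distinct:
                    if same_contents(path, paths_alike[0]):
                        paths_alike.append(path)
                        break
                else:
                    distinct.append([path])
            duplicate_sets.extend(d for d in distinct if len(d) > 1)
        return duplicate_sets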
The code I posted in another thread (and linked to in this one) does
exactly that: a quick hash of the first few K before calculating the whole
file's md5 sum. However, Patrick's code is faster, reading only what's
necessary (he does what I intended to do, but I was too lazy -- I actually
rewrote from scratch one of the first programs I wrote in Python, which
was obviously too amateurish for me to publish).
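For anyone who doesn't want to dig up the link, the general shape of that
approach is roughly this (not my actual posted code; hashlib and the 4K
prefix are just stand-ins):

    import hashlib
    import os
    from collections import defaultdict

    def quick_hash(path, nbytes=4096):
        """Hash of just the first few K -- a cheap filter before the full md5."""
        with open(path, 'rb') as f:
            return hashlib.md5(f.read(nbytes)).hexdigest()

    def full_md5(path, blocksize=64 * 1024):
        """md5 of the whole file, read block by block."""
        digest = hashlib.md5()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(blocksize), b''):
                digest.update(block)
        return digest.hexdigest()

    def find_duplicates(paths):
        # group by size first, then by quick hash, and only md5 whole files
        # that survive both filters
        by_size = defaultdict(list)
        for path in paths:
            by_size[os.path.getsize(path)].append(path)

        result = defaultdict(list)
        for size, candidates in by_size.items():
            if len(candidates) < 2:
                continue
            by_quick = defaultdict(list)
            for path in candidates:
                by_quick[quick_hash(path)].append(path)
            for survivors in by_quick.values():
                if len(survivors) < 2:
                    continue
                for path in survivors:
                    result[(size, full_md5(path))].append(path)
        return [group for group in result.values() if len(group) > 1]

Patrick's approach still wins because it never reads more of a file than it
needs for the comparison, whereas this one md5s every surviving candidate
end to end.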
It seems your objections are related to Xah Lee's specifications; I have no
objections to your objections (other than that we are just trying to produce
something of practical value out of an otherwise doomed thread...).