[perl-python] a program to delete duplicate files


Christos TZOTZIOY Georgiou

More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
use a strong hash).
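
For concreteness, a minimal sketch of that scheme in Python (illustrative only, not the code anyone posted in the thread; the 8 KB prefix size and the helper names are invented here):

import hashlib
import os
from collections import defaultdict

CHEAP_HASH_BYTES = 8 * 1024   # assumed size of the "first few K"

def cheap_key(path):
    # group key: (file size, cheap hash of the first few KB)
    with open(path, 'rb') as f:
        head = f.read(CHEAP_HASH_BYTES)
    return os.path.getsize(path), hashlib.md5(head).hexdigest()

def same_content(a, b, blocksize=64 * 1024):
    # full byte-by-byte comparison of two files
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            ba, bb = fa.read(blocksize), fb.read(blocksize)
            if ba != bb:
                return False
            if not ba:
                return True

def find_duplicates(paths):
    groups = defaultdict(list)
    for p in paths:
        groups[cheap_key(p)].append(p)
    clusters = []
    for members in groups.values():
        reps = []   # one cluster per distinct content seen so far in this group
        for p in members:
            for cluster in reps:
                # compare only against a representative of each distinct content
                if same_content(cluster[0], p):
                    cluster.append(p)
                    break
            else:
                reps.append([p])
        clusters.extend(c for c in reps if len(c) > 1)
    return clusters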

The code I posted in another thread (and provided a link in this one) does
exactly that (a quick hash of the first few K before calculating the whole
file's md5 sum). However, Patrick's code is faster, reading only what's
necessary (he does what I intended to do, but I was too lazy-- I actually
rewrote from scratch one of the first programs I wrote in Python, which
obviously was too amateurish code for me to publish :)

It seems your objections are related to Xah Lee's specifications; I have no
objections to your objections (-:) other than that we are just trying to produce
something of practical value out of an otherwise doomed thread...
 

Bengt Richter

That's exactly what my program does.

If you're doing any comparisons at all, you're not minimizing the number
of comparisons.
ISTM a lot will depend on the distribution of differences between candidate
(equal-sized, obviously) files. If multi-GB files differ in the last byte only
(but this is not known), they will all have to be crunched to the last byte to
eliminate them from possible duplicate-set membership. If there is a priori knowledge
of highly probable difference locations (e.g., internal date stamp at some offset etc)
then obviously the candidates can be pre-classified into smaller candidate sets by some
simple heuristic probes.

If differences are likely to appear early in a sequential scan, perhaps developing hashes
in parallel would be a good strategy. But I doubt if blindly pre-computing full hashes would be optimal.
Compare to sort|uniq as a sieve. You wouldn't want to read through any files any further than you had to.
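
As an illustration of the "developing hashes in parallel" idea (a sketch only, not code from the thread; it assumes same-sized files and relies on the fact that hashlib objects can report a digest of the data seen so far and still be updated afterwards):

import hashlib
from collections import defaultdict

def prune_by_running_hash(paths, blocksize=64 * 1024):
    # Read same-sized files block by block, feeding each file's own running
    # hash; a file whose running digest matches nobody else's cannot be a
    # duplicate, so it is dropped without being read to the end.
    state = dict((p, hashlib.md5()) for p in paths)
    handles = dict((p, open(p, 'rb')) for p in paths)
    live = set(paths)
    try:
        at_eof = False
        while len(live) > 1 and not at_eof:
            at_eof = True
            for p in live:
                block = handles[p].read(blocksize)
                if block:
                    at_eof = False
                state[p].update(block)
            buckets = defaultdict(set)
            for p in live:
                buckets[state[p].hexdigest()].add(p)
            # keep only files that still share a running digest with another file
            live = set(p for b in buckets.values() if len(b) > 1 for p in b)
    finally:
        for f in handles.values():
            f.close()
    groups = defaultdict(list)
    for p in live:
        groups[state[p].hexdigest()].append(p)
    return [g for g in groups.values() if len(g) > 1]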

Regards,
Bengt Richter
 

John Machin

David said:
That's exactly what my program does.

If you're doing any comparisons at all, you're not minimizing the number
of comparisons.

Gah. You two should be lawyers. I don't know why you're arguing about
whatever you are arguing about.

Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time

1. d <= S
2. Anyone have a usable hash function that's faster than string
comparison?
3. Anyone have an answer to PU's request for advice where his algorithm
could be improved by using a hash method?

Temporary injunction in favour of PU.
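
(Plugging in some illustrative numbers: for N = 10 same-size files of S = 700 MB that first differ at d = 64 KB, the parallel-comparison approach reads and compares on the order of N*d = 640 KB before declaring them distinct, while hash-everything-first reads and hashes N*S = 7 GB. If the files really are duplicates, d = S and both approaches end up reading all 7 GB.)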
 

Patrick Useldinger

John said:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time


Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's a
large number of identical hashes, you'd still need to read all of these
files to make sure.

Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I came up with 2 possible reasons myself:

1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.

-pu
 

Scott David Daniels

Patrick said:
Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I came up with 2 possible reasons myself:

1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.
3) Probability is on your side. If two files match in length, but differ
in contents, a good hash will have probability (1 / max_hash) of
having a matching hash. For exactly two files of the same length,
calculating the hash is clearly a waste. For even three files of
the same length, you are most likely to find them distinct in three
comparisons. Using hashes, three file reads and three comparisons
of hash values. Without hashes, six file reads; you must read both
files to do a file comparison, so three comparisons is six files.
Naturally, as the group grows beyond three, the difference grows drastically.

-Scott David Daniels
 

Patrick Useldinger

Scott said:
comparisons. Using hashes, three file reads and three comparisons
of hash values. Without hashes, six file reads; you must read both
files to do a file comparison, so three comparisons is six files.

That's provided you always compare 2 files at a time. I compare n files
at a time, n being the number of files of the same size. That's quicker
than hashes because I have a fair chance of finding a difference before
the end of the files. Otherwise, it's like hashes without the computation
and without having to make a second pass to *really* compare them.

-pu
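
(Not the actual fdups code -- just a bare-bones sketch of that kind of n-way, block-at-a-time comparison with early exit:)

def all_identical(paths, blocksize=64 * 1024):
    # Compare n same-sized files in parallel, one block from each at a time,
    # and stop reading as soon as any block differs: a difference early in
    # the files is found after only a few blocks of each have been read.
    handles = [open(p, 'rb') for p in paths]
    try:
        while True:
            blocks = [f.read(blocksize) for f in handles]
            if any(b != blocks[0] for b in blocks[1:]):
                return False   # early exit: at least one file differs here
            if not blocks[0]:
                return True    # every file reached EOF with all blocks equal
    finally:
        for f in handles:
            f.close()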
 

François Pinard

[Patrick Useldinger]
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's
a large number of identical hashes, you'd still need to read all of
these files to make sure.

Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.

I once was over-cautious about relying on hashes only, without actually
comparing files. A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray. So,
my cautious attitude was by far, for all practical purposes, a waste.

Similar arguments apply, say, for the confidence we may have in
probabilistic algorithms for the determination of number primality.
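
(Rough numbers, assuming a 128-bit hash such as MD5 behaves like a random function: for n files, the chance of any accidental collision is about n(n-1)/2 divided by 2^128, so for a million files that is roughly 5*10^11 / 3.4*10^38, on the order of 10^-27 -- indeed far below the chance of an undetected hardware or memory glitch. Deliberately constructed collisions, mentioned later in the thread, are a separate issue.)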
 

Patrick Useldinger

François Pinard said:
Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.

We're talking about md5, sha1 or similar. They are all known not to be
100% perfect. I agree it's a rare case, but still, why settle on
something "about right" when you can have "right"?
I once was over-cautious about relying on hashes only, without actually
comparing files. A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray. So,
my cautious attitude was by far, for all practical purposes, a waste.

It was not my only argument for not using hashes. My algorithm also does
fewer reads, for example.

-pu
 

John Machin

Patrick said:
John said:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time


Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's a
large number of identical hashes, you'd still need to read all of these
files to make sure.

Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".

1. You are ahead already even before adding any extra time for
comparison on files with equal hashes.

2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.

Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.
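
(For what it's worth, on Unix-like systems the per-process limit can be queried directly; a sketch, not part of fdups -- the resource module does not exist on Windows, where empirical probing like the test further down the thread is the practical option:)

import resource   # Unix-only module

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('this process may open up to %d file descriptors (hard limit %d)'
      % (soft, hard))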

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?

Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?


[1] Yes, I am aware of the subject of this thread.

Regards,

John
 

Patrick Useldinger

John said:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".

Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.

Still, if you can get it 100% right automatically, why would you bother
checking manually? Why get back to arguments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer -
yes or no?

Anyway, fdups does nothing other than report duplicates.
Deleting, hardlinking or anything else might be an option depending on
the context in which you use fdups, but then we'd have to discuss the
context. I never assumed any context, in order to keep it as universal
as possible.
Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.

For the time being, the additional files will be ignored, and a warning
is issued. fdups does not quit, why are you saying this?

A fallback solution would be to open the file before every _block_ read,
and close it afterwards. In my mind, it would be a command-line option,
because it's difficult to determine the number of available file handles
in a multitasking environment.
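
(A sketch of that fallback mode, purely illustrative and not taken from fdups: keep only a pathname and an offset per candidate file, and reopen the file around each block read:)

def read_block_at(path, offset, blocksize=64 * 1024):
    # Fallback mode: no file handle is kept between reads. Each block read
    # opens the file, seeks to where the previous read stopped, reads one
    # block and closes again -- extra open/seek cost, but the number of
    # simultaneously open files stays tiny regardless of group size.
    f = open(path, 'rb')
    try:
        f.seek(offset)
        return f.read(blocksize)
    finally:
        f.close()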

Not difficult to implement, but I first wanted to refactor the code so
that it's a proper class that can be used in other Python programs, as
you also asked. That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I do changes by
(my own) priority.
You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?

I've had the case myself on my girlfriend's XP box. It was certainly
less than 500 files of the same length.
Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?

Sorry, I do not understand what you mean by this.

-pu
 

John Machin

Patrick said:
John said:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".

Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.

Still, if you can get it 100% right automatically, why would you bother
checking manually?

A human in their right mind is required to decide what to do with the
duplicates. The proponents of hashing -- of which I'm not one -- would
point out that any false-positives would be picked up as part of the
human scrutiny.
Why get back to arguments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer -
yes or no?

Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.
Anyway, fdups does nothing other than report duplicates.
Deleting, hardlinking or anything else might be an option depending on
the context in which you use fdups, but then we'd have to discuss the
context. I never assumed any context, in order to keep it as universal
as possible.

That's very good, but it wasn't under contention.
Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.

For the time being, the additional files will be ignored, and a warning
is issued. fdups does not quit, why are you saying this?

I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.
A fallback solution would be to open the file before every _block_ read,
and close it afterwards.

Ugh. Better use more memory, so fewer blocks!!
In my mind, it would be a command-line option,
because it's difficult to determine the number of available file handles
in a multitasking environment.

The pythonic way is to press ahead optimistically and recover if you
get bad news.
Not difficult to implement, but I first wanted to refactor the code so
that it's a proper class that can be used in other Python programs, as
you also asked.

I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You already had a singleton class; why
another? What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].
That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I do changes by
(my own) priority.


I've had the case myself on my girlfriend's XP box. It was certainly
less than 500 files of the same length.

Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
Sorry, I do not understand what you mean by this.

Test:
!for k in range(1000):
! open('foo' + str(k), 'w')
Announce:
"I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry."

Cheers,
John
 

Patrick Useldinger

John said:
Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.

Of course, but it's good to know that the computer is right, isn't it?
That leaves the human to make decisions instead of double-checking.
I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.

The buffer pool size is a parameter, and the default values allow for 4096
files of the same size. It's more likely to run out of file handles than
out of buffer space, don't you think?
The pythonic way is to press ahead optimistically and recover if you
get bad news.

You're right, that's what I thought about afterwards. The current idea is
to design a second class that opens/closes/reads the files and handles the
situation independently of the main class.
I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You already had a singleton class; why
another? What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].

That is what I have submitted to you. Are you sure that *I* am the
lawyer here?

See ;-)
Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?

Nothing I started manually, but the usual bunch of local firewall, virus
scanner (not doing a complete machine check at that time).
Test:
!for k in range(1000):
! open('foo' + str(k), 'w')

I'll try that.
Announce:
"I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry."

I wouldn't count on that in a multi-tasking environment, as I said. The
class I described earlier seems a cleaner approach.

Regards,
-pu
 

Patrick Useldinger

John said:
Test:
!for k in range(1000):
! open('foo' + str(k), 'w')

I ran that and watched it open 2 million files and keep going strong ...
until I realized that the files are closed by Python immediately because
there's no reference kept to them ;-)

Here's my code:

#!/usr/bin/env python
import os

print 'max number of file handles today is',
n = 0
h = []   # keep references so Python does not garbage-collect (and close) the files
try:
    while True:
        filename = 'mfh' + str(n)
        h.append((file(filename, 'w'), filename))
        n = n + 1
except:
    # opening one more file failed (typically "too many open files"):
    # report how many we managed to hold open at once
    print n
# clean up the probe files
for handle, filename in h:
    handle.close()
    os.remove(filename)

On Slackware 10.1, this yields 1021.
On WinXPSP2, this yields 509.

-pu
 

John J. Lee

Patrick Useldinger said:
John Machin wrote: [...]
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
[...]
Still, if you can get it 100% right automatically, why would you
bother checking manually? Why get back to arguments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer -
yes or no?
[...]

Well, as Francois pointed out, it is strictly not physically possible
to obtain a perfectly reliable answer, even if you *do* do the
comparison.

Even so, you're right on this point (though IIUC it's not practically
important ATM): regardless of wild flukes, people can deliberately
wangle files to get a hash collision. So comparison is better than
hashing from this PoV.


John
 

John J. Lee

Patrick Useldinger said:
If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always been significantly less
than m.

Hmm, Patrick's right, David, isn't he?

Except when m gets really BIG (say, M), in which case I suppose m is
no longer the worst case number of whole-file reads.

And I'm not sure what the trade off between disk seeks and disk reads
does to the problem, in practice (with caching and realistic memory
constraints).


John
 

David Eppstein

"John Machin said:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

I think this misses the point. It's easy to find the files that are
different. Just a file size comparison will get most of them, and most
of the rest can be detected in the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into a good duplicate detection code, and that any remaining tests that
would be performed are between files that are very likely to be the same
as each other.

The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.

The question to me is: when you have a group of m>2 likely-to-be-equal
files, is 2(m-1) reads and very little computation better than m reads
and a strong hash computation? I haven't yet seen an answer to this,
but it's a question for experimentation rather than theory.
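
(For reference, the m-read hashing approach in sketch form -- one sequential pass per file through a strong hash, then grouping by digest. Illustrative only; sha256 is used here as a stand-in for "a strong hashing method":)

import hashlib
from collections import defaultdict

def full_hash(path, blocksize=64 * 1024):
    # one sequential read of the whole file, fed through a strong hash
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def group_by_hash(paths):
    # m files -> m full reads and m hash computations; equal digests are
    # then trusted to mean equal contents
    groups = defaultdict(list)
    for p in paths:
        groups[full_hash(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]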
 

David Eppstein

Patrick Useldinger said:
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee.

When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.
 

Patrick Useldinger

David said:
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.

I am not an expert in this field. All I know is that MD5 and SHA1 can
create collisions. Are there stronger algorithms that do not? And, more
importantly, has it been *proved* that they do not?

-pu
 

Patrick Useldinger

David said:
The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.

If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always been significantly less than m.

-pu
 

Bengt Richter

John Machin said:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

I think this misses the point. It's easy to find the files that are
different. Just a file size comparison will get most of them, and most
of the rest can be detected in the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into a good duplicate detection code, and that any remaining tests that
would be performed are between files that are very likely to be the same
as each other.

The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.
What do you mean by "a comparison based method" ? Are you excluding the
possibility of parallel reading and comparing? The problem then becomes
verifying that m buffers containing corresponding segments of the files
are equal. You could do that by comparing hashes of the buffers (or updated
running hashes for the whole files so far read), or you could compare the
buffers in parallel byte by byte if that turns out efficient to implement
in the given language and os environment. Small buffers would probably
not be efficient for disk access, but too large might not be good either.
Maybe some automatic tuning algorithm could be developed to optimize
comparison operations in the shadow of i/o. But does the OS accept hints
about read-ahead buffering? Are the disks SCSI raid or USB external or
networked? Can automatic tuning achieve an optimum disregarding all that?

What is it that we're trying to optimize? Time to classify the
whole file set into groups of equivalent files? (note that there can be
multiple groups that are internally equal but unequal to members of other groups).

The problem of limitation on the number of simultaneously open files could
be virtualized away, and only have impact when the limit is actually encountered.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.
So the worst case will be reading all through to the end once in parallel.
But pre-hashing guarantees the worst case for all -- unless disk access patterns
happen to make that the most efficient way to get everything read and compared
when most are equal.
The question to me is: when you have a group of m>2 likely-to-be-equal
files, is 2(m-1) reads and very little computation better than m reads
and a strong hash computation? I haven't yet seen an answer to this,
but it's a question for experimentation rather than theory.
ISTM 2(m-1) reads is not necessary, so "the question" to me doesn't involve that ;-)
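
(An illustrative sketch of that single-parallel-pass classification, not code from the thread; it assumes the files have already been grouped by size: read one block from every file still in play, split each group wherever the blocks differ, and stop reading a file as soon as it is alone in its group.)

from collections import defaultdict

def classify_equal_files(paths, blocksize=64 * 1024):
    # Partition same-sized files into groups of identical content in (at
    # most) one parallel pass: read one block per file, split each group by
    # block contents, and abandon any file left alone in its group.
    handles = dict((p, open(p, 'rb')) for p in paths)
    try:
        groups = [list(paths)]
        finished = []
        while groups:
            next_groups = []
            for group in groups:
                buckets = defaultdict(list)
                for p in group:
                    buckets[handles[p].read(blocksize)].append(p)
                for block, members in buckets.items():
                    if len(members) < 2:
                        continue                  # unique so far: not a duplicate
                    if block == b'':
                        finished.append(members)  # reached EOF together: identical
                    else:
                        next_groups.append(members)
            groups = next_groups
        return finished
    finally:
        for f in handles.values():
            f.close()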

Regards,
Bengt Richter
 
