[perl-python] a program to delete duplicate files


Xah Lee

here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicates
(including files in sub-directories). More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed by the full paths of all other files that are copies of A. These
duplicates' full paths will be prefixed with the "rm " string. An empty
line follows each group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

Order does not matter (i.e. which file is left without the "rm " prefix
does not matter).

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves are not considered. That is, all files in dir1 are compared to
all files in dir2 (including subdirectories), and only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
dirs. One is certain that there are no duplicates within each dir
itself (imagine that del_dup.pl has already been run on each). Files in
dir1 have already been categorized into sub-directories by a human, so
that when there are duplicates between dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except that files in later dirs are the ones
that get the "rm " prefix. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

then c and d will both have the "rm " prefix for sure (which of the two
files in dir2 gets "rm " does not matter). Note that although dir2
doesn't compare files inside itself, duplicates may still be found
implicitly by indirect comparison, i.e. a==c and b==c, therefore a==b,
even though a and b are never compared.
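
To make the output rule concrete, here is a minimal Python sketch of just
the printing step (print_group and dup_groups are hypothetical names, and
the duplicate detection itself is the exercise):

    def print_group(group):
        """group is a list of (dir_index, path) pairs for one set of
        identical files.  The copy from the earliest command-line dir
        survives; every other copy gets the "rm " prefix, and a blank
        line ends the group.  (The finer points of the multi-dir case
        are ignored here.)"""
        group = sorted(group, key=lambda item: item[0])
        print(group[0][1])                 # the keeper
        for _, path in group[1:]:
            print('rm ' + path)
        print()

    # e.g. dup_groups = [[(0, 'dir1/a.jpg'), (1, 'dir2/b.jpg')], ...]
    def print_report(dup_groups):
        for group in dup_groups:
            print_group(group)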


--------------------------

Write a Perl or Python version of the program.

An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is part of the spec.

feel free to write it however you want. I'll post my version in a few
days.

http://www.xahlee.org/perl-python/python.html

Xah
(e-mail address removed)
http://xahlee.org/PageTwo_dir/more.html
 

Terry Hancock

here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.

For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use. I don't
think it exactly matches the spec, but it addresses the problem. I wrote
this to clean up a large tree of image files once. The exact behavior
described requires the '--exec="ls %s"' option as mentioned in the help.

#!/usr/bin/env python
# (C) 2003 Anansi Spaceworks
#---------------------------------------------------------------------------
# find_duplicates
"""
Utility to find duplicate files in a directory tree by
comparing their checksums.
"""
#---------------------------------------------------------------------------
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#---------------------------------------------------------------------------

import os, sys, md5, getopt


def file_walker(tbl, srcpath, files):
    """
    Visit a path and collect data (including checksum) for files in it.
    """
    for file in files:
        filepath = os.path.join(srcpath, file)
        if os.path.isfile(filepath):
            chksum = md5.new(open(filepath).read()).digest()
            if not tbl.has_key(chksum): tbl[chksum] = []
            tbl[chksum].append(filepath)

def find_duplicates(treeroot, tbl=None):
    """
    Find duplicate files in directory.
    """
    dup = {}
    if tbl is None: tbl = {}
    os.path.walk(treeroot, file_walker, tbl)
    for k, v in tbl.items():
        if len(v) > 1:
            dup[k] = v
    return dup

usage = """
USAGE: find_duplicates <options> [<path ...]

Find duplicate files (by matching md5 checksums) in a
collection of paths (defaults to the current directory).

Note that the order of the paths searched will be retained
in the resulting duplicate file lists. This can be used
with --exec and --index to automate handling.

Options:
    -h, -H, --help
        Print this help.

    -q, --quiet
        Don't print normal report.

    -x, --exec=<command string>
        Python-formatted command string to act on the indexed
        duplicate in each duplicate group found. E.g. try
        --exec="ls %s"

    -n, --index=<index into duplicates>
        Which in a series of duplicates to use. Begins with '1'.
        Default is '1' (i.e. the first file listed).

Example:
    You've copied many files from path ./A into path ./B. You want
    to delete all the ones you've processed already, but not
    delete anything else:

    % find_duplicates -q --exec="rm %s" --index=1 ./A ./B
"""

def main():
    action = None
    quiet = 0
    index = 1
    dup = {}

    opts, args = getopt.getopt(sys.argv[1:], 'qhHn:x:',
                               ['quiet', 'help', 'exec=', 'index='])

    for opt, val in opts:
        if opt in ('-h', '-H', '--help'):
            print usage
            sys.exit()
        elif opt in ('-x', '--exec'):
            action = str(val)
        elif opt in ('-n', '--index'):
            index = int(val)
        elif opt in ('-q', '--quiet'):
            quiet = 1

    if len(args) == 0:
        dup = find_duplicates('.')
    else:
        tbl = {}
        for arg in args:
            dup = find_duplicates(arg, tbl=tbl)

    for k, v in dup.items():
        if not quiet:
            print "Duplicates:"
            for f in v: print "\t%s" % f
        if action:
            os.system(action % v[index-1])

if __name__ == '__main__':
    main()
 

Christos TZOTZIOY Georgiou

For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use. I don't
think it exactly matches the spec, but it addresses the problem. I wrote
this to clean up a large tree of image files once. The exact behavior
described requires the '--exec="ls %s"' option as mentioned in the help.

The drawback of this method is that you have to read everything. For example,
if you have ten files less than 100KiB each and one file more than 2 GiB in
size, there is no need to read the 2 GiB file, is there?

If it's a one-shot attempt, I guess it won't matter a lot.

On POSIX filesystems, one also has to avoid comparing files having the same
(st_dev, st_inum), because you know that they are the same file.
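
A minimal sketch of both points in modern Python (the function name and the
roots argument are only illustrative): unique sizes are never read at all,
and extra hard links are skipped via (st_dev, st_ino):

    import os
    from collections import defaultdict

    def candidate_groups(roots):
        """Group regular files under the given directories by size,
        counting each (st_dev, st_ino) only once so hard links are not
        compared against themselves.  Only groups with two or more
        members can contain duplicates; nothing else needs reading."""
        seen = set()
        by_size = defaultdict(list)
        for root in roots:
            for dirpath, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    try:
                        st = os.stat(path)
                    except OSError:
                        continue
                    key = (st.st_dev, st.st_ino)
                    if key in seen:
                        continue        # another name for a file already listed
                    seen.add(key)
                    by_size[st.st_size].append(path)
        return {size: paths for size, paths in by_size.items() if len(paths) > 1}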
 

P

I've written a python GUI wrapper around some shell scripts:
http://www.pixelbeat.org/fslint/

the shell script logic is essentially:

exclude hard linked files
only include files where there are more than 1 with the same size
print files with matching md5sum

Pádraig.
 

Patrick Useldinger

Christos said:
On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.

I then have a bug here - I consider all files with the same inode equal,
but according to what you say I need to consider the tuple
(st_dev, st_inum). I'll have to fix that for 0.13.

Thanks ;-)
-pu
 

Patrick Useldinger

Christos said:
That's fast and good.

Nice to hear.
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).

I'll look into that.
Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?

No. And even then, I wouldn't know how to find out.

-pu
 

David Eppstein

"Xah Lee said:
An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is part of the spec.

You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.
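
For illustration, a minimal sketch of that idea in modern Python, trusting
the hash entirely (the directory walking and printing are only illustrative,
not part of David's suggestion):

    import hashlib, os, sys

    def sha256sum(path, bufsize=1 << 20):
        """Hash a file in chunks so large files need not fit in memory."""
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            while True:
                block = f.read(bufsize)
                if not block:
                    return h.hexdigest()
                h.update(block)

    seen = {}   # digest -> first path seen with that content
    for root in sys.argv[1:] or ['.']:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):
                    continue
                digest = sha256sum(path)
                if digest in seen:
                    print('rm ' + path)   # a copy of seen[digest]
                else:
                    seen[digest] = path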
 

John Bokma

David said:
You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.

I did it as follows (some time ago):

is filesize in hash?

calculate md5 (and store), if equal then compare
files.

store info in hash.

In some cases it might be faster to drop the md5 (since it reads all the data).
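
A sketch of that sequence in modern Python (names are hypothetical): md5 is
only computed once two files collide on size, and a final byte-for-byte
filecmp guards against hash collisions:

    import filecmp, hashlib, os

    first_of_size = {}   # size -> first path of that size, or None once hashed
    by_digest = {}       # (size, md5 hexdigest) -> first path with that content

    def md5sum(path):
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()

    def earlier_duplicate(path):
        """Return a previously seen duplicate of path, or None."""
        size = os.path.getsize(path)
        if size not in first_of_size:
            first_of_size[size] = path        # unique size so far: nothing is read
            return None
        deferred = first_of_size[size]
        if deferred is not None:              # lazily hash the first file of this size
            by_digest[(size, md5sum(deferred))] = deferred
            first_of_size[size] = None
        key = (size, md5sum(path))
        other = by_digest.get(key)
        if other is not None and filecmp.cmp(other, path, shallow=False):
            return other                      # same size, same md5, bytes identical
        by_digest.setdefault(key, path)
        return None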
 

Christos TZOTZIOY Georgiou

No. And even then, I wouldn't know how to find out.

MSDN is our friend for Windows stuff.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createhardlink.asp

and then

http://msdn.microsoft.com/library/d...us/fileio/base/getfileinformationbyhandle.asp

http://msdn.microsoft.com/library/d...ileio/base/by_handle_file_information_str.asp

The relevant parts from this last page:

st_dev <-> dwVolumeSerialNumber

st_ino <-> (nFileIndexHigh, nFileIndexLow)
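
A small sketch of the check itself, written against the stat fields; whether
os.stat() on a given Windows Python actually fills st_dev and st_ino from
dwVolumeSerialNumber and the file index is an assumption here (at the time of
this thread it would have required the Win32 extensions):

    import os

    def same_underlying_file(path_a, path_b):
        """True if both paths name the same file (e.g. NTFS hard links).
        Assumes the running Python fills st_dev/st_ino on this platform
        from the volume serial number and (nFileIndexHigh, nFileIndexLow);
        if it does, this is the same test POSIX code already uses."""
        st_a, st_b = os.stat(path_a), os.stat(path_b)
        return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)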
 

Christos TZOTZIOY Georgiou

I then have a bug here - I consider all files with the same inode equal,
but according to what you say I need to consider the tuple
(st_dev,ST_ium). I'll have to fix that for 0.13.

I have a bug here too-- I wrote st_inum meaning st_ino, but that would be quick
to find!
Thanks ;-)

You are very welcome.
 

Patrick Useldinger

Christos said:
The relevant parts from this last page:
st_dev <-> dwVolumeSerialNumber
st_ino <-> (nFileIndexHigh, nFileIndexLow)

I see. But if I am not mistaken, that would mean that I would have to
(1) detect NTFS volumes, and
(2) use non-standard libraries to find this information (like the
Python Win extensions).

I am not seriously motivated to do so, but if somebody is interested in
helping, I am open to it.

-pu
 

Patrick Useldinger

David said:
You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.

That's not very efficient. IMO, it only makes sense in network-based
operations such as rsync.

-pu
 

David Eppstein

You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.

That's not very efficient. IMO, it only makes sense in network-based
operations such as rsync.

Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.

More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
use a strong hash).

I'm assuming of course that there are too many files and/or they're too
large just to keep them all in core.

Anyone have any data on whether reading files and SHA-256'ing them (or
whatever other cryptographic hash you think is strong enough) is
I/O-bound or CPU-bound? That is, is three reads with very little CPU
overhead really cheaper than one read with a strong hash?

I guess it also depends on the number of files you expect to have
duplicates of. If most of the files exist in only one copy, it's clear
that the cheap hash will find them more cheaply than the expensive hash.
In that case you could combine the (file size, cheap hash) filtering
with the expensive hash and get only two reads per copy rather than
three.
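
As a minimal sketch of that scheme in modern Python: the cheap hash here is
a CRC32 of the first 64 KiB (an arbitrary choice), and filecmp stands in for
the byte-for-byte comparison against each group's representatives:

    import os, sys, zlib, filecmp
    from collections import defaultdict

    CHUNK = 64 * 1024   # the cheap hash reads only the first 64 KiB

    def cheap_key(path):
        """(size, CRC32 of the first chunk): cheap to compute, weak on purpose."""
        with open(path, 'rb') as f:
            head = f.read(CHUNK)
        return os.path.getsize(path), zlib.crc32(head)

    def duplicate_groups(roots):
        groups = defaultdict(list)
        for root in roots:
            for dirpath, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    if os.path.isfile(path):
                        groups[cheap_key(path)].append(path)
        # within each group, compare each file against one representative
        # of every distinct content found among the earlier files
        for files in groups.values():
            if len(files) < 2:
                continue
            reps = []                    # one representative per distinct content
            copies = defaultdict(list)   # representative -> its duplicates
            for f in files:
                for r in reps:
                    if filecmp.cmp(r, f, shallow=False):
                        copies[r].append(f)
                        break
                else:
                    reps.append(f)
            for r, dups in copies.items():
                yield [r] + dups

    if __name__ == '__main__':
        for group in duplicate_groups(sys.argv[1:] or ['.']):
            print(group[0])
            for dup in group[1:]:
                print('rm ' + dup)
            print()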
 

Terry Hancock

The drawback of this method is that you have to read everything. For example,
if you have ten files less than 100KiB each and one file more than 2 GiB in
size, there is no need to read the 2 GiB file, is there?

If it's a one-shot attempt, I guess it won't matter a lot.

On POSIX filesystems, one also has to avoid comparing files having the same
(st_dev, st_inum), because you know that they are the same file.

Two good points, but hey, it ran fast enough for me. ;-)

Cheers,
Terry
 

Patrick Useldinger

David said:
Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.

That's exactly what my program does.
More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
use a strong hash).

My point is: forget hashes. If you work with hashes, you do have to
read each file completely, cheap hash or not. My program normally reads
*at most* 100% of the files to analyse, but usually much less. Also, I
do plain comparisons, which are much cheaper than hash calculations.
I'm assuming of course that there are too many files and/or they're too
large just to keep them all in core.

I assume that file handles are sufficient to keep one open per file of
the same size. This led to trouble on Windows installations, but I
guess that's a parameter to change. On Linux, I never had the problem.

Regarding buffer size, I use a maximum which is then split up between
all open files.
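
A minimal sketch of that hash-free, block-by-block idea in modern Python,
assuming the paths passed in all have the same size and ignoring the
open-file-handle limits and buffer splitting mentioned above:

    from collections import defaultdict

    def identical_groups(paths, bufsize=1 << 20):
        """Split same-size files into groups of byte-identical files by
        reading them in parallel, one chunk at a time, and splitting a
        group as soon as the chunks differ.  No hashes; a group stops
        being read the moment it shrinks to a single member."""
        pending = [[(p, open(p, 'rb')) for p in paths]]
        results = []
        while pending:
            group = pending.pop()
            if len(group) < 2:                   # nothing left to compare against
                for _, fh in group:
                    fh.close()
                continue
            chunks = defaultdict(list)           # chunk bytes -> files producing it
            for path, fh in group:
                chunks[fh.read(bufsize)].append((path, fh))
            if len(chunks) == 1 and b'' in chunks:   # simultaneous EOF: all identical
                results.append([path for path, _ in group])
                for _, fh in group:
                    fh.close()
            else:
                pending.extend(chunks.values())  # keep reading, or split and go on
        return results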
Anyone have any data on whether reading files and SHA-256'ing them (or
whatever other cryptographic hash you think is strong enough) is
I/O-bound or CPU-bound? That is, is three reads with very little CPU
overhead really cheaper than one read with a strong hash?

It also depends on the OS. I found that my program runs much slower on
Windows, probably due to the way Linux anticipates reads and tries to
reduce head movement.
I guess it also depends on the number of files you expect to have
duplicates of. If most of the files exist in only one copy, it's clear
that the cheap hash will find them more cheaply than the expensive hash.
In that case you could combine the (file size, cheap hash) filtering
with the expensive hash and get only two reads per copy rather than
three.

Sorry, but I still cannot see a point in using hashes. Maybe you'll have
a look at my program and tell me where a hash could be useful?

It's available at http://www.homepages.lu/pu/fdups.html.

Regards,
-pu
 

David Eppstein

Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.

That's exactly what my program does.

If you're doing any comparisons at all, you're not minimizing the number
of comparisons.
 
