Please check this find/rm script I'm about to run as root

James Kanze · May 17, 2009

I created this C++ program to recursively search directories
for redundant files larger than 100MB and delete all but the
largest of them.

I've tested it myself as a normal user on a few dummy files
but am quite apprehensive to run it as root to clean up my
messy file systems.

Any comments on the program, advice on streamlining it or any
bugs spotted, etc.?

A wishlist for the program is to only delete the smaller files
if they are also older, i.e. the largest file that is
preserved must be newer than all other files. If the smaller
files are newer then the user is warned/prompted.

There's really nothing in standard C++ to support anything
involving directories, so you'll end up either resorting to a
third-party library (Boost or other) or using system-specific
requests. In either case, this isn't really the appropriate
forum (except maybe for Boost).

================================================
// find_duplicates.cpp

[...]

(I'm assuming Unix or a Unix-like system here, given the
commands used in the calls to system.)

Given all the calls to system in your code, my suggestion is, if
you want to take this approach, to write the program in shell;
it's much more natural for this sort of thing. See
comp.unix.shell.

Otherwise, you can use the Posix level interface: see
opendir/readdir/closedir, stat and unlink/rmdir. And if you
have questions, comp.unix.programming.
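
As a rough illustration of the Posix route named above, a directory walk built on opendir/readdir/closedir and lstat might look like the sketch below. The names found_file and collect_large_files are hypothetical, not taken from the original program, and the sketch only collects candidates; it deletes nothing. It uses lstat so that symbolic links are never followed.

```cpp
#include <dirent.h>
#include <sys/stat.h>
#include <string>
#include <vector>

// Hypothetical record type; not part of the original program.
struct found_file {
    std::string path;
    off_t size;
    time_t mtime;
};

// Hypothetical helper: append to `out` every regular file under `dir`
// whose size is strictly greater than min_bytes.
void collect_large_files(const std::string& dir, off_t min_bytes,
                         std::vector<found_file>& out) {
    DIR* d = opendir(dir.c_str());
    if (d == NULL)
        return;                         // missing or unreadable: skip it
    while (dirent* e = readdir(d)) {
        std::string name = e->d_name;
        if (name == "." || name == "..")
            continue;
        std::string path = dir + "/" + name;
        struct stat st;
        if (lstat(path.c_str(), &st) != 0)
            continue;                   // entry vanished or is inaccessible
        if (S_ISDIR(st.st_mode))
            collect_large_files(path, min_bytes, out);   // recurse
        else if (S_ISREG(st.st_mode) && st.st_size > min_bytes)
            out.push_back(found_file{path, st.st_size, st.st_mtime});
    }
    closedir(d);
}
```

Calling collect_large_files(".", 100 * 1024 * 1024, files) would fill files with everything over 100MB below the current directory; deciding what to unlink is a separate step.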

James Kanze · May 17, 2009

On May 16, 4:40 pm, Jules wrote:

I don't use shell scripts often enough to be able to justify
learning scripting in any detail, or retain what I would learn
for the purposes of this archival clean-up job.

Anytime you're working under Unix, at the command line
interface, you're using a shell "script". If you don't know the
shell, you might as well be using Windows. (In fact, if you
don't know how to use the shell effectively, Windows is a lot
more convenient.)

(Note that you've got 9/10ths of the script already written in
your calls to system.)

Jerry Coffin · May 22, 2009

I created this C++ program to recursively search directories for
redundant files larger than 100MB and delete all but the largest of
them.

I've tested it myself as a normal user on a few dummy files but am
quite apprehensive to run it as root to clean up my messy file
systems.

Any comments on the program, advice on streamlining it or any bugs
spotted, etc.?

A wishlist for the program is to only delete the smaller files if they
are also older, i.e. the largest file that is preserved must be newer
than all other files. If the smaller files are newer then the user is
warned/prompted.

Here's a general outline of how I'd do the job. I've included code to
interface to the file system in Win_find_files. As it stands, the
file_size_t and Win_find_files classes aren't portable (at all) but
_most_ of the rest should be (there's also the minor detail of the
"c:\\" in main, but that's mostly included for demonstration purposes
anyway).

#include <iostream>
#include <vector>
#include <algorithm>
#include <string>
#include <iterator>
#include <stdio.h>    // remove()

#include <windows.h>

class file_time_t {
    FILETIME ft;
public:
    file_time_t(FILETIME const &t) : ft(t) {}

    bool operator<(file_time_t const &other) const {
        if (ft.dwHighDateTime < other.ft.dwHighDateTime)
            return true;
        if (other.ft.dwHighDateTime < ft.dwHighDateTime)
            return false;
        return ft.dwLowDateTime < other.ft.dwLowDateTime;
    }
};

class file_size_t {
    unsigned __int64 size_;
public:
    file_size_t(unsigned high, unsigned low) {
        size_ = high;
        size_ <<= 32;
        size_ |= low;
    }

    file_size_t(unsigned low) : size_(low) {}

    operator unsigned __int64() const { return size_; }
};

// Join a directory and a name, adding a '/' only if one is needed.
std::string splice(std::string a, std::string const &b) {
    if (!a.empty() && a[a.size()-1] != '/')
        a += "/";
    a += b;
    return a;
}

struct file {
    std::string path_;
    std::string name_;
    file_time_t mod_date_;
    file_size_t size_;
    file(std::string path, std::string name, file_time_t mod_date,
         file_size_t size)
        : path_(path), name_(name), mod_date_(mod_date), size_(size)
    { }

    // Order by name, then by date (oldest first), then by size
    // (largest first).
    bool operator<(file const &b) const {
        if (name_ < b.name_)
            return true;
        if (b.name_ < name_)
            return false;

        // the names are equal -- look at dates
        if (mod_date_ < b.mod_date_)
            return true;
        if (b.mod_date_ < mod_date_)
            return false;

        // dates are equal -- look at sizes
        return b.size_ < size_;
    }
};

class Win_find_files {
    file_size_t min_;
    std::vector<file> &output_;

    // Assumes an ANSI (non-UNICODE) build, where FindFirstFile takes
    // char strings.
    void enumerate(std::string const &dir) const {
        WIN32_FIND_DATA data;
        HANDLE finder;

        std::string name = splice(dir, "*");

        finder = FindFirstFile(name.c_str(), &data);
        if (finder == INVALID_HANDLE_VALUE)
            return;
        do {
            // skips ".", ".." and any other name starting with a dot
            if (data.cFileName[0] == '.')
                continue;
            if (data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
                enumerate(splice(dir, data.cFileName));
            }
            else {
                file_size_t size(data.nFileSizeHigh, data.nFileSizeLow);
                if (size > min_)
                    output_.push_back(file(dir, data.cFileName,
                        data.ftLastWriteTime, size));
            }
        } while (FindNextFile(finder, &data));
        FindClose(finder);
    }
public:
    Win_find_files(file_size_t min_size, std::vector<file> &output)
        : min_(min_size), output_(output)
    {}

    virtual void operator()(std::string const &start) const {
        enumerate(start);
    }
};

// True if a is at least as large as b and b is older than a.
// Not called from main below.
bool del_file(file const &a, file const &b) {
    return !(a.size_ < b.size_) && b.mod_date_ < a.mod_date_;
}

bool warn(file const &a, file const &b) {
    std::cout << "Possible duplicate:\n"
        << splice(b.path_, b.name_)
        << "\nmay be a duplicate of: \n"
        << splice(a.path_, a.name_);
    std::cout << "\ndo you want to delete it?";
    char ch;
    std::cin >> ch;
    return ch == 'y' || ch == 'Y';
}

int main() {
    std::vector<file> files;

    typedef std::vector<file> collection;

    Win_find_files find(100*1024*1024, files);

    find("c:/");

    if (files.empty())
        return 0;    // nothing found; first+1 below would be invalid

    std::sort(files.begin(), files.end());

    collection::iterator first = files.begin();
    collection::iterator next = first+1;

    while (next != files.end()) {
        if (first->name_ != next->name_)
            first = next;
        else if (warn(*first, *next))
            remove(splice(next->path_, next->name_).c_str());
        ++next;
    }
}

It doesn't implement your specification precisely though -- instead of
generating a script to remove files, it always runs interactively and
removes files itself when/if you approve its doing so (but never removes
anything without asking).
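
The wishlist rule from the original post (delete a smaller file only if it is also older than the largest one, otherwise prompt) can be sketched separately from any file-system API. This is only one reading of that rule: candidate, action and plan are hypothetical names, a time_t stands in for FILETIME, and ties in size or date are assumed to go to the user rather than being deleted.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical stand-in for one file in a group of same-named files.
struct candidate {
    unsigned long long size;
    std::time_t mtime;
};

enum action { keep, delete_it, ask_user };

// Keep the largest file. Another file is deleted outright only if it is
// both strictly smaller and strictly older than the kept one; anything
// else (newer, same date, or same size) is left for the user to decide.
std::vector<action> plan(const std::vector<candidate>& group) {
    assert(!group.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < group.size(); ++i)
        if (group[i].size > group[best].size)
            best = i;

    std::vector<action> result(group.size(), ask_user);
    for (std::size_t i = 0; i < group.size(); ++i) {
        if (i == best)
            result[i] = keep;
        else if (group[i].size < group[best].size &&
                 group[i].mtime < group[best].mtime)
            result[i] = delete_it;
    }
    return result;
}
```

A caller would only prompt (as warn does above) for the entries marked ask_user and could remove the delete_it entries without asking.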

despen · May 22, 2009

Jerry Coffin said:
....
Here's a general outline of how I'd do the job. I've included code to
interface to the file system in Win_find_files. As it stands, the
file_size_t and Win_find_files classes aren't portable (at all) but
_most_ of the rest should be (there's also the minor detail of the
"c:\\" in main, but that's mostly included for demonstration purposes
anyway). ....
#include <windows.h> ....
It doesn't implement your specification precisely though -- instead of
generating a script to remove files, it always runs interactively and
removes files itself when/if you approve its doing so (but never removes
anything without asking).

I think you missed another requirement.

The OP wanted to recursively search directories and
your implementation can only search folders.

Maybe you missed the fact that this was cross posted to comp.os.linux.misc?

Jerry Coffin · May 22, 2009

I think you missed another requirement.

I don't think so.

The OP wanted to recursively search directories and
your implementation can only search folders.

It does search recursively. That's what this part does:

void enumerate(std::string const &dir) const {
[ ... ]
if (data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
enumerate(splice(dir, data.cFileName));

I.e. if what we've found is a directory, splice that name onto the
current directory name, and enumerate it by recursing.

Maybe you missed the fact that this was cross posted to comp.os.linux.misc?

Yes, I did miss that. I'm looking at it (and posting from)
comp.lang.c++. Fortunately, most of the changes are fairly minor name
changes (e.g. FindFirstFile, FindNextFile, and FindClose become opendir,
readdir and closedir respectively).

There are some other rather more substantial changes to make it work
really well though -- in particular, symbolic links can create cycles,
in which case the recursion could become infinite. Depending on the
situation, you can avoid that by ignoring symbolic links completely or
creating a set of the IDs of the directories you've already visited, and
return when attempting to enter a directory that's already in the set.

Fortunately, almost any decent book on POSIX (and I'd guess quite a few
web sites as well) features canned code for traversing directories using
POSIX, and this has a simple enough interface that the OP should be able
to plug one in with only minor surgery.
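
The second option above (a set of IDs of directories already visited) might be sketched as follows under POSIX, taking the device and inode numbers from stat as the ID. The names dir_id and walk_once are hypothetical. Unlike the lstat approach, this one follows symbolic links on purpose and relies on the set to stop the recursion.

```cpp
#include <dirent.h>
#include <sys/stat.h>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Device number plus inode number identifies a directory on a POSIX system.
typedef std::pair<dev_t, ino_t> dir_id;

// Hypothetical helper: walk `dir`, following symbolic links, but enter each
// physical directory at most once. Regular files found are appended to `files`.
void walk_once(const std::string& dir, std::set<dir_id>& visited,
               std::vector<std::string>& files) {
    struct stat st;
    if (stat(dir.c_str(), &st) != 0 || !S_ISDIR(st.st_mode))
        return;
    if (!visited.insert(dir_id(st.st_dev, st.st_ino)).second)
        return;                     // already in the set: a link led back here
    DIR* d = opendir(dir.c_str());
    if (d == NULL)
        return;
    while (dirent* e = readdir(d)) {
        std::string name = e->d_name;
        if (name == "." || name == "..")
            continue;
        std::string path = dir + "/" + name;
        if (stat(path.c_str(), &st) != 0)
            continue;               // e.g. a dangling symbolic link
        if (S_ISDIR(st.st_mode))
            walk_once(path, visited, files);
        else if (S_ISREG(st.st_mode))
            files.push_back(path);
    }
    closedir(d);
}
```

Note that a file reachable through two different links is still listed once per path it was reached by, unless the directory holding it was already visited; for a delete tool that alone is a reason to prefer ignoring symbolic links.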

despen · May 23, 2009

Jerry Coffin said:
I don't think so.

I still do.

It does search recursively. That's what this part does:

No, you miss my point. (And I was trying for some humor.)
I can see it's recursive. The issue I was raising is that
it's searching FOLDERS (windows), not DIRECTORIES (linux).

void enumerate(std::string const &dir) const {
[ ... ]
if (data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
enumerate(splice(dir, data.cFileName));

I.e. if what we've found is a directory, splice that name onto the
current directory name, and enumerate it by recursing.

Maybe you missed the fact that this was cross posted to comp.os.linux.misc?

Yes, I did miss that. I'm looking at it (and posting from)
comp.lang.c++. Fortunately, most of the changes are fairly minor name
changes (e.g. FindFirstFile, FindNextFile, and FindClose become opendir,
readdir and closedir respectively).

There are some other rather more substantial changes to make it work
really well though -- in particular, symbolic links can create cycles,
in which case the recursion could become infinite. Depending on the
situation, you can avoid that by ignoring symbolic links completely or
creating a set of the IDs of the directories you've already visited, and
return when attempting to enter a directory that's already in the set.

Fortunately, almost any decent book on POSIX (and I'd guess quite a few
web sites as well) features canned code for traversing directories using
POSIX, and this has a simple enough interface that the OP should be able
to plug one in with only minor surgery.

As you said, the code you posted is Windows only and not POSIX.

The original post was quite some time ago and at that time I
advised the OP that Perl is the right language for this type of
thing.

Jerry Coffin · May 23, 2009

[ ... ]

The original post was quite some time ago and at that time I
advised the OP that Perl is the right language for this type of
thing.

Now why would you do a thing like that? His cross-post may have been a
bit odd, but it hardly merits that kind of punishment!

despen · May 23, 2009

Jerry Coffin said:
Now why would you do a thing like that? His cross-post may have been a
bit odd, but it hardly merits that kind of punishment!

Yes, I know, quite stylish to bad-talk perl.

If I hadn't just spent a few hours rewriting a thousand lines of
C in about 100 lines of perl I might have been tempted to not respond.

In this case Perl has the File::Find module and can do all the rest of
the file management quite easily. It's pretty much ideal
for the given requirements.

I'm well aware of how easy it is to write unreadable code in
Perl, but that's easily avoidable. It's great stuff as far
as I'm concerned.

Anyway, language criticism should be a banned topic for articles
cross posted to a C++ group.

James Kanze · May 23, 2009

Fortunately, almost any decent book on POSIX (and I'd guess
quite a few web sites as well) features canned code for
traversing directories using POSIX, and this has a simple
enough interface that the OP should be able to plug one in
with only minor surgery.

If I had to do it quickly, under Posix, and portability wasn't a
concern, I'd just use ftw (traverse (walk) a file tree). The
interface is a bit hacky -- there's no void* for user data, for
example -- but for the problem of the original poster, it would
seem sufficient.
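
A minimal sketch of that approach follows. It swaps in nftw, the newer sibling of ftw, because its FTW_PHYS flag makes the walk skip symbolic links; the callback has the same limitation (no user-data pointer), so the results go through file-scope variables. The names g_large_files, g_min_bytes, visit and find_large_files are hypothetical.

```cpp
#include <ftw.h>
#include <sys/stat.h>
#include <string>
#include <vector>

namespace {
// nftw, like ftw, passes no user-data pointer to the callback, so the
// results and the size threshold live in file-scope variables.
std::vector<std::string> g_large_files;
off_t g_min_bytes = 100 * 1024 * 1024;

int visit(const char* path, const struct stat* st, int type, struct FTW*) {
    if (type == FTW_F && S_ISREG(st->st_mode) && st->st_size > g_min_bytes)
        g_large_files.push_back(path);
    return 0;   // a non-zero return would stop the walk
}
}

// Returns what nftw returns: 0 on success, -1 on error.
int find_large_files(const char* root) {
    g_large_files.clear();
    // 16 = maximum number of directories nftw may hold open at once;
    // FTW_PHYS = do not follow symbolic links.
    return nftw(root, visit, 16, FTW_PHYS);
}
```

After find_large_files("/some/dir") returns 0, g_large_files holds the paths of the regular files above the threshold.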

Nathan Keel · May 23, 2009

Yes, I know, quite stylish to bad-talk perl.

If I hadn't just spent a few hours rewriting a thousand lines of
C in about 100 lines of perl I might have been tempted to not respond.

In this case Perl has the File::Find module and can do all the rest of
the file management quite easily. It's pretty much ideal
for the given requirements.

C++ is great, but what's wrong with Perl? Not OO friendly enough? It's
the perfect tool for a lot of things that are done on *nix command
line. Of course, use whatever you like best, be it Perl, C++, Ruby,
Python, PHP, they all can make life easier.
