Program to retrieve all filenames in drive

jacob navia

Eric Sosman said:
What if the directories have multiple entries for the
same file, in the manner of those Windows versions that
support both "long" and "DOS" file names? If you return
all the entries you enumerate many files twice. If you
return only the "DOS" names you deprive the user of the
benefits of the "long" names. If you suppress the "DOS"
names of files that also have "long" names, you exasperate
the users who need to get at the "DOS" names. No one
strategy suits all scenarios, so suddenly you've got to
expand the API somehow.

I explained the solution in the tutorial article where I wrote
a function to scan files. (Section 1.33)

The solution is to use function pointers and let the user
figure out what he/she wants.

Easy isn't it?

No API. You just call a user defined function.

[snip]
The only language I've personally used that tried to
invent a directory abstraction wide enough to cover all these
cases (and more) was Common LISP -- and believe me, the
edifice was more impressive for its size and complexity than
for its utility. A lot of hard work by very smart people
produced something only a certified genius could use in any
but the simplest settings. Go look it up; it's eye-opening.

It is the wrong alley.

The solution is to use function pointers.

That is VERY EASY in C. It is one of the most powerful features
of C: you pass function tokens carrying an incredible
amount of context with just a few machine instructions.

There is NO BEST API, it is better that each usage involves
writing a small function that does exactly what you want.
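
A minimal sketch of the callback approach described above, assuming a POSIX system (opendir, readdir, lstat), since standard C has no directory functions. The names walk_tree, visit_fn and count_files are hypothetical and are not taken from the tutorial; unreadable directories are simply skipped here.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical callback type: return nonzero to stop the walk early. */
typedef int (*visit_fn)(const char *path, void *context);

/* Walk the tree rooted at dir and call visit for every entry that is
   not a directory. Directories that cannot be opened are skipped.
   Returns 0, or the nonzero value with which the callback stopped. */
int walk_tree(const char *dir, visit_fn visit, void *context)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int rc = 0;

    if (d == NULL)
        return 0;
    while (rc == 0 && (e = readdir(d)) != NULL) {
        char path[4096];
        struct stat st;
        int len;

        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        len = snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
        if (len < 0 || (size_t)len >= sizeof path)
            continue;               /* too long for this sketch: skip */
        if (lstat(path, &st) != 0)
            continue;               /* entry vanished or unreadable: skip */
        if (S_ISDIR(st.st_mode))
            rc = walk_tree(path, visit, context);   /* symlinks not followed */
        else
            rc = visit(path, context);
    }
    closedir(d);
    return rc;
}

/* Example callback: the context is just a counter of visited files. */
int count_files(const char *path, void *context)
{
    (void)path;
    ++*(unsigned long *)context;
    return 0;
}
```

A caller would write something like walk_tree(".", count_files, &n) with an unsigned long n set to 0. The void pointer is what carries the caller's context into the callback.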

www.cs.virginia.edu/~lcc-win32
 
Alan Balmer

Then you'd just return an empty list.
I can see that I should not have provided a starting point for the
thinking. I withdraw that, and change it to just

Think about it.
 
Dan Pop

In said:
No: the point is that the answer is *different* for Windows and
Linux. Moreover, neither the Windows solution nor the Linux solution
will work on VMS; and the VMS solution will not work on any of
those, nor on AmigaDOS; and so on.

Actually, a POSIX-based solution will work everywhere POSIX is implemented
and this includes Linux, Windows, VMS, MVS and many others.

It just happens that the right place for discussing a POSIX-based solution
is not c.l.c.

Dan
 
Eric Sosman

jacob said:
I explained the solution in the tutorial article where I wrote
a function to scan files. (Section 1.33)

The solution is to use function pointers and let the user
figure out what he/she wants.

Easy isn't it?

No API. You just call a user defined function.

The example in your tutorial (a recursive directory-tree
walker) simply passes the problem along to a user-supplied
callback function, still unsolved and without any information
to assist in its solution. The callback function (in your
design) receives nothing at all but the file name as taken
from the directory. The function receives no hint that the
name might be an alias for another name (and certainly no
indication of *which* other name).

You write, correctly, that "there are many options as to
what information should be provided to the user," -- so you
reduce the options by providing as little information as
you possibly can. You ask "Is he interested in the size of
the file" and you decline to provide the size. "Or in the
date," you ask, and omit the date. "Who knows?" Certainly
not the callback function, until and unless it decides to
engage in platform-specific non-portable shenanigans.

"The most flexible solution is the best," you write, but
this isn't flexibility: it's obliviousness. You haven't
solved the problems; you've just decided to ignore them.
 
jacob navia

"Eric Sosman" <[email protected]> wrote in message:
The example in your tutorial (a recursive directory-tree
walker) simply passes the problem along to a user-supplied
callback function, still unsolved and without any information
to assist in its solution. The callback function (in your
design) receives nothing at all but the file name as taken
from the directory. The function receives no hint that the
name might be an alias for another name (and certainly no
indication of *which* other name).

You can get ALL kinds of information starting with the file name
including size, aliases, dates, etc etc etc.

Your function is supposed to zero in on the parts of this
wealth of info that you need.
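
As a hedged illustration of a callback that starts from nothing but the name: on a POSIX system it could call stat() to get the size. This is exactly the kind of platform-specific step discussed in this thread, not portable C. The struct size_total and the function add_size are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>

/* Hypothetical context: running totals kept by the callback. */
struct size_total {
    unsigned long long bytes;
    unsigned long files;
};

/* Callback that receives only a file name and looks up the size itself.
   Always returns 0 so that a tree walk would continue. */
int add_size(const char *path, void *context)
{
    struct size_total *t = context;
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;                   /* cannot query this name: ignore it */
    if (S_ISREG(st.st_mode)) {      /* count regular files only */
        t->bytes += (unsigned long long)st.st_size;
        t->files++;
    }
    return 0;
}
```

The same pattern would apply to dates (st_mtime) on POSIX; on Win32 the callback would have to use a different set of calls.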

The only thing the find files function should do is just that:

FIND THOSE FILES. PERIOD.

There is no other sensible way as you yourself remarked in
the message you posted in this forum.

The solution I propose gives the user complete flexibility as to which
information should be asked for.

jacob
 
Malcolm

Eric Sosman said:
What if the directories have multiple entries for the
same file, in the manner of those Windows versions that
support both "long" and "DOS" file names? If you return
all the entries you enumerate many files twice. If you
return only the "DOS" names you deprive the user of the
benefits of the "long" names. If you suppress the "DOS"
names of files that also have "long" names, you exasperate
the users who need to get at the "DOS" names. No one
strategy suits all scenarios, so suddenly you've got to
expand the API somehow.

Or suppose files have version numbers, as in OpenVMS?
VMS itself, of course, has ways to get at all versions of
a file, just the oldest, just the newest, and so on -- but
to get at those capabilities, you've got to expand the API
yet again.

Or suppose files appear several times in a directory
using something like Unix' hard links. To deal with this
intelligently, you again need to expand the API, at least
with some means of determining whether two directory entries
refer to the same inode.
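
For the hard-link case, a sketch of the "same inode" test, assuming POSIX stat(): two names refer to the same file when both the device and the inode number match. The function name same_file is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>

/* Returns 1 if both names refer to the same file, 0 if they do not,
   and -1 if either name cannot be queried. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    /* An inode number is only unique within one device. */
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Note that stat() follows symbolic links, so a link and its target also compare as the same file here.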

The only language I've personally used that tried to
invent a directory abstraction wide enough to cover all these
cases (and more) was Common LISP -- and believe me, the
edifice was more impressive for its size and complexity than
for its utility. A lot of hard work by very smart people
produced something only a certified genius could use in any
but the simplest settings. Go look it up; it's eye-opening.

As far as I see it the concept is simple enough: you want to "list all
available files" that can be passed to fopen().
The problem is that on all but the smallest systems this produces a list
that is way too long, certainly for human usage and often for computer use
as well. So you need some system of reducing the number of files in scope,
which most OSes do by providing a tree-like hierarchy.

That leads you to problems such as trees containing links, the handling of
the directory itself (is it simply another file?), files with two names or
present in several versions, and probably more that you haven't enumerated
(zip files?).

I think what you would have to do is keep the simplicity of the interface

char **listfiles(int *N)

( list all files available for reading )

However you also need to provide a filter to cut the list down to size.

char **listfiles(struct filter *filt, int *N)

Passing NULL will literally list all available files (though you do need
some solution to the DOS two-names problem). Filling fields will enable you
to filter: an obvious one would be "list only current directory"; "list
only latest versions" and "don't list directories" could be other members.
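
One possible shape for such a filter, as a sketch only. The three members mirror the examples just given; struct entry and the functions filter_accepts and count_matching are hypothetical, and the platform code that would fill in each entry is not shown.

```c
#include <stddef.h>

/* Hypothetical description of one entry, filled in by platform code. */
struct entry {
    const char *name;
    int is_directory;
    int depth;          /* 0 = directly in the starting directory */
    int is_latest;      /* nonzero if this is the newest version */
};

struct filter {
    int current_directory_only;
    int latest_versions_only;
    int skip_directories;
};

/* A NULL filter accepts everything, matching the NULL convention above. */
int filter_accepts(const struct filter *filt, const struct entry *e)
{
    if (filt == NULL)
        return 1;
    if (filt->current_directory_only && e->depth > 0)
        return 0;
    if (filt->latest_versions_only && !e->is_latest)
        return 0;
    if (filt->skip_directories && e->is_directory)
        return 0;
    return 1;
}

/* Count the entries that pass the filter. */
size_t count_matching(const struct filter *filt,
                      const struct entry *entries, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (filter_accepts(filt, &entries[i]))
            count++;
    return count;
}
```

A listfiles() built this way would apply filter_accepts to each entry before adding its name to the result.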

Designing a really good filter wouldn't be easy and would require knowledge
of the many different systems out there, but I don't see producing something
usable as an insuperable problem.
 
Keith Thompson

jacob navia said:
"Eric Sosman" <[email protected]> wrote in message:

The example in your tutorial (a recursive directory-tree
walker) simply passes the problem along to a user-supplied
callback function, still unsolved and without any information
to assist in its solution. The callback function (in your
design) receives nothing at all but the file name as taken
from the directory. The function receives no hint that the
name might be an alias for another name (and certainly no
indication of *which* other name).

You can get ALL kinds of information starting with the file name
including size, aliases, dates, etc etc etc.

But you provide no clue about *how* to get that information.

You provide (I presume; I haven't looked at it) a generic interface,
using user-supplied callback functions, that lets a program traverse
the names of all the files in a directory tree. There's nothing wrong
with that, it sounds like a useful thing. (An interface that
generates a sequence of names without using callback functions would
probably be equally useful, and might give the calling program better
control.) But you brought it up in the context of a question about
how to deal with multiple entries for the same file, DOS names
vs. long names, etc., implying that you provide a solution for that.

How do you deal with names that are aliases for other names? The
answer is simple, just leave it up to the user. Of course the answer
is simple; it just doesn't answer the question.

Unfortunately, there is no answer that is simple and actually
addresses the question.
 
Keith Thompson

Malcolm said:
I think what you would have to do is keep the simplicity of the interface

char **listfiles(int *N)

( list all files available for reading )

However you also need to provide a filter to cut the list down to size.

char **listfiles(struct filter *filt, int *N)

Passing NULL will literally list all available files (though you do need
some solution to the DOS two-names problem). Filling fields will enable you
to filter: an obvious one would be "list only current directory"; "list
only latest versions" and "don't list directories" could be other members.

Designing a really good filter wouldn't be easy and would require knowledge
of the many different systems out there, but I don't see producing something
usable as an insuperable problem.

This is similar to what the Unix "find" program does. If you want to
design such an interface, take a look at the command-line options to
"find"; the ones that control which files are selected should indicate
a subset of the information you'll need in "struct filter". (Of
course, "find" is Unix-specific. Much of the interface is probably
generic enough for similar systems like Windows and VMS, but it's
going to break down on more exotic systems.)
 
jacob navia

Keith Thompson said:
But you provide no clue about *how* to get that information.

You provide (I presume; I haven't looked at it) a generic interface,
using user-supplied callback functions, that lets a program traverse
the names of all the files in a directory tree. There's nothing wrong
with that, it sounds like a useful thing. (An interface that
generates a sequence of names without using callback functions would
probably be equally useful, and might give the calling program better
control.)

I thought about that but the generated list is several MB of storage
for small drives. For big drives with 40-50Gig in a partition
or even those 120GB partitions now possible that would scale
very badly.

I thought that a solution without that much intermediate storage
would be more efficient and I think it is...

I reflected about this problem really.

But you brought it up in the context of a question about
how to deal with multiple entries for the same file, DOS names
vs. long names, etc., implying that you provide a solution for that.

There is an API for that under Win32: GetShortPathName.
There are several other APIs for each thing you could ever
think about asking in this context. Finding information about
a file is a different (and simpler) problem than finding the files.

Modularity implies keeping a routine centered on its main task
and avoiding those "do it all" routines. Finding the aliases is just an API
call away, and so is finding the file size, etc.

How do you deal with names that are aliases for other names? The
answer is simple, just leave it up to the user. Of course the answer
is simple; it just doesn't answer the question.

A file can have a short and a long name (if you use the old and
obsolete FAT32 stuff). There is an API for that.

Unfortunately, there is no answer that is simple and actually
addresses the question.

My thesis is that "the question" can't be answered with a single
"do it all" solution. There are SO MANY possibilities that it is just
not doable, see the comparison with common lisp in another
message.

This is the solution for qsort too. There are too many ways of
answering the question "which comes first" for any data structure
whatsoever. A user defined function is the only way out.
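
The qsort comparison can be made concrete with a short sketch. The same array is sorted two ways just by swapping the comparison function. The struct file_info and the comparators by_name and by_size are hypothetical names used only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record describing one file. */
struct file_info {
    const char *name;
    long size;
};

/* "Which comes first" by name. */
int by_name(const void *a, const void *b)
{
    const struct file_info *x = a;
    const struct file_info *y = b;

    return strcmp(x->name, y->name);
}

/* "Which comes first" by size; written to avoid overflow on subtraction. */
int by_size(const void *a, const void *b)
{
    const struct file_info *x = a;
    const struct file_info *y = b;

    return (x->size > y->size) - (x->size < y->size);
}
```

Calling qsort(files, n, sizeof files[0], by_name) or the same call with by_size changes the order without touching qsort itself.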
 
Keith Thompson

jacob navia said:
"Keith Thompson" <[email protected]> wrote in message:

I thought about that but the generated list is several MB of storage
for small drives. For big drives with 40-50Gig in a partition
or even those 120GB partitions now possible that would scale
very badly.

I thought that a solution without that much intermediate storage
would be more efficient and I think it is...

I reflected about this problem really.

There's no reason the interface has to provide the whole list at once,
any more than fread() has to provide the entire content of a file at
once. You could provide a function that initializes the query and
returns some kind of handle encoding its current state, and another
function that, given a handle, returns the next file name (or NULL if
there are no more). This makes it easier for the client to do things
like terminate the traversal early (though you can certainly design a
callback interface to handle that).
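
A rough sketch of that handle-based interface, assuming POSIX opendir and readdir underneath (which themselves follow the same open/next/close pattern). The names fq_open, fq_next and fq_close are hypothetical, and this version lists a single directory rather than a whole tree.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Opaque-style handle that encodes the current state of the query. */
struct file_query {
    DIR *dir;
};

/* Start a query; returns NULL if the directory cannot be opened. */
struct file_query *fq_open(const char *path)
{
    struct file_query *q = malloc(sizeof *q);

    if (q == NULL)
        return NULL;
    q->dir = opendir(path);
    if (q->dir == NULL) {
        free(q);
        return NULL;
    }
    return q;
}

/* Return the next name, or NULL when there are no more. The pointer is
   only valid until the next call to fq_next or fq_close on this handle. */
const char *fq_next(struct file_query *q)
{
    struct dirent *e;

    while ((e = readdir(q->dir)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            return e->d_name;
    }
    return NULL;
}

/* End the query; safe to call with NULL. */
void fq_close(struct file_query *q)
{
    if (q != NULL) {
        closedir(q->dir);
        free(q);
    }
}
```

With this shape the client terminates early simply by breaking out of its own loop and calling fq_close, and no list of names is ever held in memory.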

But regardless of the details of how you provide the sequence, there
are still a number of implementation-specific issues that have to be
addressed, even before you worry about how the client is going to
handle each name. In what order are the entries going to be returned,
and can the user control the order? If there are multiple entries for
a single physical file, does the sequence include all of them or just
one? Do you include directory names? What about other entities that
might exist in the filesystem's namespace (named pipes, devices,
etc. ad nauseam)? What about "hidden" files, whatever that might mean
for a given system? What about multiple versions of the same file, as
in VMS? What if files are added or removed during the traversal?

These are all rhetorical questions; I ask them not because I'm looking
for answers, but to illustrate the complexity of the task.

Separating the task into two parts, getting a list of files and doing
whatever you want with each one, is a sensible approach, but each part
of the task is still extraordinarily complex if you're trying to do it
portably. A non-portable solution is likely to be much simpler but
inappropriate for this newsgroup.
 
kal

jacob navia said:
My thesis is that "the question" can't be answered with a single
"do it all" solution. There are SO MANY possibilities that it is just
not doable, see the comparison with common lisp in another
message.

The OP has taken the advice given here and posted the question
at a different forum, where he has been given platform-specific
suggestions.

IMHO the suggestions are simple and sufficient.
 
Gordon Burditt

You provide (I presume; I haven't looked at it) a generic interface,
I thought about that but the generated list is several MB of storage
for small drives. For big drives with 40-50Gig in a partition
or even those 120GB partitions now possible that would scale
very badly.

It is possible, for extremely pathological circumstances, for the
list of file names on a drive to exceed the capacity of that drive.
(or all of its drives plus RAM plus CD-ROM drives plus ROM).

Consider a lot of small files very deep in the directory structure,
so that most of the file names begin with (this is really supposed
to be all on one line, but split up for posting):
/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
/ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
/ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
/fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
/ggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg
/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
/iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
/jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
/kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
/something

There are also some very *REAL* systems where getting a list of all
file names takes hours or even days (think about a large news server,
especially one where every post is kept in a separate file, with
terabytes of storage).

Gordon L. Burditt
 
Christian Bau

Alan Balmer said:
Think about it. What if there are no directories?

Or what if you have 200,000 files in 50,000 directories, and while your
code recursively searches through all directories, some other program
moves a few directories around?
 
