walking a directory with very many files


tom

i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

in other languages one can avoid generating such an object by walking
a directory as a linked list. for example, in c, perl or php one can
use opendir() and then repeatedly readdir() until getting to the end
of the file list. it seems this could be more efficient in some
applications.

is there a way to do this in python? i'm relatively new to the
language. i looked through the documentation and tried googling but
came up empty.
 

Tim Golden

tom said:
i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

in other languages one can avoid generating such an object by walking
a directory as a linked list. for example, in c, perl or php one can
use opendir() and then repeatedly readdir() until getting to the end
of the file list. it seems this could be more efficient in some
applications.

is there a way to do this in python? i'm relatively new to the
language. i looked through the documentation and tried googling but
came up empty.

If you're on Windows, you can use the win32file.FindFilesIterator
function from the pywin32 package, which wraps the Win32 API
FindFirstFile / FindNextFile pattern.

TJG
 

tom

If you're on Windows, you can use the win32file.FindFilesIterator
function from the pywin32 package, which wraps the Win32 API
FindFirstFile / FindNextFile pattern.

thanks, tim.

however, i'm not using windows. freebsd and os x.
 

Tim Golden

tom said:
thanks, tim.

however, i'm not using windows. freebsd and os x.

Presumably, if Perl etc. can do it then it should be simple
enough to drop into ctypes and call the same library code, no?
(I'm not a BSD / OS X person, I'm afraid, so perhaps this isn't
so easy...)
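
A minimal sketch of that idea: a ctypes-based generator over opendir()/readdir(). Everything here is illustrative, not code from this thread, and the struct layout matches glibc on x86-64 Linux only - POSIX specifies the dirent member names but not their order or padding, so FreeBSD and OS X would need different field definitions.

```python
import ctypes
import ctypes.util

class Dirent(ctypes.Structure):
    # Layout matches glibc on x86-64 Linux ONLY; POSIX does not pin
    # down field order or alignment, so this is not portable.
    _fields_ = [
        ("d_ino", ctypes.c_uint64),
        ("d_off", ctypes.c_int64),
        ("d_reclen", ctypes.c_uint16),
        ("d_type", ctypes.c_uint8),
        ("d_name", ctypes.c_char * 256),
    ]

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6",
                   use_errno=True)
libc.opendir.restype = ctypes.c_void_p
libc.opendir.argtypes = [ctypes.c_char_p]
libc.readdir.restype = ctypes.POINTER(Dirent)
libc.readdir.argtypes = [ctypes.c_void_p]
libc.closedir.argtypes = [ctypes.c_void_p]

def iterdir(path):
    """Yield one filename at a time instead of building a big list."""
    dirp = libc.opendir(path.encode())
    if not dirp:
        raise OSError(ctypes.get_errno(), "opendir failed: %r" % path)
    try:
        while True:
            entry = libc.readdir(dirp)
            if not entry:          # NULL pointer means end of directory
                break
            name = entry.contents.d_name.decode()
            if name not in (".", ".."):
                yield name
    finally:
        libc.closedir(dirp)
```

Memory use stays constant no matter how many entries the directory holds, since only one dirent is materialized at a time.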

TJG
 

Andre Engels

i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

in other languages one can avoid generating such an object by walking
a directory as a linked list. for example, in c, perl or php one can
use opendir() and then repeatedly readdir() until getting to the end
of the file list. it seems this could be more efficient in some
applications.

is there a way to do this in python? i'm relatively new to the
language. i looked through the documentation and tried googling but
came up empty.

What kind of directories are those where just a list of the files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...
 

Terry Reedy

tom said:
i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

in other languages one can avoid generating such an object by walking
a directory as a linked list. for example, in c, perl or php one can
use opendir() and then repeatedly readdir() until getting to the end
of the file list. it seems this could be more efficient in some
applications.

is there a way to do this in python? i'm relatively new to the
language. i looked through the documentation and tried googling but
came up empty.

You did not specify a version. In Python3, os.walk has become a generator
function. So, to answer your question, use 3.1.
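
A small illustration of the generator behaviour - os.walk() yields one (dirpath, dirnames, filenames) tuple per directory, so you can consume just the first one without traversing the whole tree (helper name is mine, not from the thread):

```python
import os

def top_entries(path):
    """Pull only the first tuple from the os.walk() generator:
    the top directory itself, its subdirectories and its files."""
    dirpath, dirnames, filenames = next(os.walk(path))
    return dirpath, sorted(dirnames), sorted(filenames)
```

Note that filenames is still a plain list built with os.listdir(), so this laziness saves memory across directories, not within one huge directory.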

tjr
 

MRAB

Christian said:
Some time ago we had a discussion about turning os.listdir() into a
generator. No conclusion was agreed on. We also thought about exposing
the functions opendir(), readdir(), closedir() and friends but as far as
I know and as far as I've checked the C code in Modules/posixmodule.c
none of the functions has been added.
Perhaps if there's a generator it should be called iterdir(). Or would
it be unPythonic to have listdir() and iterdir()? Probably.
 

Lawrence D'Oliveiro

I suppose it depends how well-liked it is. Nerdy lists may work better, but
they tend not to be liked.
What kind of directories are those that just a list of files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...

I worked on an application system which, at one point, routinely dealt with
directories containing hundreds of thousands of files. But even that kind of
directory contents only adds up to a few megabytes.
 

Tim Chase

You did not specify a version. In Python3, os.walk has become a generator
function. So, to answer your question, use 3.1.

Since at least 2.4, os.walk has itself been a generator.
However, the contents of the directory (the 3rd element of the
yielded tuple) is a list produced by listdir() instead of a
generator. Unless listdir() has been changed to a generator
instead of a list (which other respondents seem to indicate has
not been implemented), this doesn't address the OP's issue of
"lots of files in a single directory".

-tkc
 

Steven D'Aprano

What kind of directories are those that just a list of files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...


You haven't looked very hard :)

$ pwd
/home/steve/.thumbnails/normal
$ ls | wc -l
33956

And I periodically delete thumbnails, to prevent the number of files
growing to hundreds of thousands.
 

Hrvoje Niksic

Terry Reedy said:
You did not specify a version. In Python3, os.walk has become a
generator function. So, to answer your question, use 3.1.

os.walk has been a generator function all along, but that doesn't help
OP because it still uses os.listdir internally. This means that it
both creates huge lists for huge directories, and holds on to those
lists until the iteration over the directory (and all subdirectories)
is finished.

In fact, os.walk is not suited for this kind of memory optimization
because yielding a *list* of files (and a separate list of
subdirectories) is specified in its interface. This hasn't changed in
Python 3.1:

    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs
 

Hrvoje Niksic

Nick Craig-Wood said:
Here is a ctypes generator listdir for unix-like OSes.

ctypes code scares me with its duplication of the contents of system
headers. I understand its use as a proof of concept, or for hacks one
needs right now, but can anyone seriously propose using this kind of
code in a Python program? For example, this seems much more
"Linux-only", or possibly even "32-bit-Linux-only", than "unix-like":
 

Diez B. Roggisch

tom said:
i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

if we assume the number of files to be a million (which certainly qualifies
as one of the larger directory sizes one encounters...) and an average
filename length of 20, you'd end up with 20 megs of data.

Is that really a problem on today's multi-gigabyte machines? And we are
talking about a rather freakish case here.
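
A quick sanity check of that estimate (the exact numbers are CPython-specific and vary by version and platform): each Python string object carries tens of bytes of overhead beyond its characters, so the real list is several times larger than 20 bytes per name, though still modest:

```python
import sys

# Build 100,000 fake 20-character filenames and measure what the list
# actually costs, rather than counting only the characters themselves.
names = ["f%019d" % i for i in range(100000)]
per_name = sys.getsizeof(names[0])   # object header + characters, well over 20
pointers = sys.getsizeof(names)      # the list's internal pointer array
total = pointers + per_name * len(names)
print("bytes per name: %d, total: ~%.1f MB" % (per_name, total / 1e6))
```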

Diez
 

Terry Reedy

Christian said:
I'm sorry to inform you that Python 3.x still returns a list, not a
generator.
<class 'generator'>

However, it is a generator of directory tuples that include a filename
list produced by listdir, rather than a generator of filenames
themselves, as I was thinking. I wish listdir had been changed in 3.0
along with map, filter, and range, but I made no effort and hence cannot
complain.

tjr
 

Mike Kazantsev

<class 'generator'>

However, it is a generator of directory tuples that include a filename
list produced by listdir, rather than a generator of filenames
themselves, as I was thinking. I wish listdir had been changed in 3.0
along with map, filter, and range, but I made no effort and hence cannot
complain.

Why? We have itertools.imap, itertools.ifilter and xrange already.

--
Mike Kazantsev // fraggod.net

 

Hrvoje Niksic

Nick Craig-Wood said:
It can be done properly with gccxml though which converts structures
into ctypes definitions.

That sounds interesting.
That said the dirent struct is specified by POSIX so if you get the
correct types for all the individual members then it should be
correct everywhere. Maybe ;-)

AFAIK POSIX specifies the names and types of the members, but not
their order in the structure, nor alignment.
 

thebjorn

You haven't looked very hard :)

$ pwd
/home/steve/.thumbnails/normal
$ ls | wc -l
33956

And I periodically delete thumbnails, to prevent the number of files
growing to hundreds of thousands.

Steven

Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a database
field... *sigh*

Oddly enough, I'm relieved that others have had similar folder sizes
(I've been waiting for this to burst to the top of my list for a while
now).

Bjorn
 

Lawrence D'Oliveiro

thebjorn said:
Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a database
field... *sigh*

Why not put the images themselves into database fields?
Oddly enough, I'm relieved that others have had similar folder sizes ...

One of my past projects had 400000-odd files in a single folder. They were
movie frames, to allow assembly of movie sequences on demand.
 

Mike Kazantsev

Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a
database field... *sigh*

Why not put the images themselves into database fields?
Oddly enough, I'm relieved that others have had similar folder
sizes ...

One of my past projects had 400000-odd files in a single folder. They
were movie frames, to allow assembly of movie sequences on demand.

For both scenarios:
Why not use hex representation of md5/sha1-hashed id as a path,
arranging them like /path/f/9/e/95ea4926a4 ?

That way, you won't have to deal with many-files-in-path problem, and,
since there's thousands of them anyway, name readability shouldn't
matter.

In fact, on modern filesystems it doesn't matter whether you're accessing
/path/f9e95ea4926a4 with a million files in /path or /path/f/9/e/95ea
with only a hundred of them in each path. The former case (all-in-one-path)
would even outperform the latter with ext3 or reiserfs by a small
margin.
Sadly, that's not the case with filesystems like FreeBSD ufs2 (at least
in the 6.x branch), so it's better to play safe and create subdirs if the
app might be run on different machines, rather than keeping everything in
one path.
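
A minimal sketch of that layout (the function name and depth are illustrative, not from this post): hash the id, use the first few hex digits as nested directory names, and keep the remaining digits as the filename:

```python
import hashlib
import os

def shard_path(root, key, depth=3):
    """Map an id onto a sharded path like root/f/9/e/95ea4926a4...,
    one hex digit of its SHA-1 per directory level."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    parts = list(digest[:depth]) + [digest[depth:]]
    return os.path.join(root, *parts)
```

With three levels of one hex digit each there are 4096 leaf directories, so a million files average out to roughly 250 per directory.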

--
Mike Kazantsev // fraggod.net

 

Lie Ryan

Mike said:
Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a
database field... *sigh*
Why not put the images themselves into database fields?
Oddly enough, I'm relieved that others have had similar folder
sizes ...
One of my past projects had 400000-odd files in a single folder. They
were movie frames, to allow assembly of movie sequences on demand.

For both scenarios:
Why not use hex representation of md5/sha1-hashed id as a path,
arranging them like /path/f/9/e/95ea4926a4 ?

That way, you won't have to deal with many-files-in-path problem, and,
since there's thousands of them anyway, name readability shouldn't
matter.

In fact, on modern filesystems it doesn't matter whether you're accessing
/path/f9e95ea4926a4 with a million files in /path or /path/f/9/e/95ea
with only a hundred of them in each path. The former case (all-in-one-path)
would even outperform the latter with ext3 or reiserfs by a small
margin.
Sadly, that's not the case with filesystems like FreeBSD ufs2 (at least
in the 6.x branch), so it's better to play safe and create subdirs if the
app might be run on different machines, rather than keeping everything in
one path.

It might not matter to the filesystem, but the file explorer (and ls)
would still suffer. A subfolder structure would be much better, and much
easier to navigate manually when you need to.
 
