Processing huge datasets

Anders =?UTF-8?B?U8O4bmRlcmdhYXJk?= · May 10, 2004

Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.
Now I simply get MemoryError.

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer advance the problem?

I'm looking at rewriting the code to operate on parts of the hierarchy at a
time and store the processed data structure in another Berkeley DB so I can
query that afterwards. But I'd really prefer keeping all things in memory
due to the huge performance gain.

Any pointers?

Cheers, Anders

Aahz · May 10, 2004

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.

Well, Don't Do That. ;-)

I don't understand the data structure you're describing; you'll either
need to choose something more appropriate for recursive processing or
switch to iterative processing (probably using generators).

Thomas Guettler · May 10, 2004

Am Mon, 10 May 2004 12:00:03 +0000 schrieb Anders Søndergaard:

Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.
Now I simply get MemoryError.

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer advance the problem?

Hi Anders,

I use ZODB.
http://zope.org/Wikis/ZODB/FrontPage/guide/index.html

HTH,
Thomas

Dennis Lee Bieber · May 10, 2004

Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

Assuming the modification time is a 32-bit integer, and sizes
are a 32-bit integer, you've got 8-bytes per file right there... or
160MB just for the timestamp/size for your 20million files... Add in
overhead for the data structures themselves (are you also keeping file
names? Even a fixed 8.3 format -- no count/terminator/"." -- will add
another 220MB, giving 380MB without overhead).

--

Terry Reedy · May 10, 2004

Anders SÃ¸ndergaard said:
I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer advance the problem?

I would start with 2 gigs of RAM, which would allow about 90 bytes per
entry after allowing 200 megs for os and interpreter. Even that might not
be enough.

tjr

Anders S. Jensen · May 11, 2004

Aahz said:
Well, Don't Do That. ;-)

I don't understand the data structure you're describing; you'll either
need to choose something more appropriate for recursive processing or
switch to iterative processing (probably using generators).

Let me explain the problem a bit more in detail.
(Sorry for confusing things by using my private email address)

I'm trying to produce a system that will analyze a filesystem
for 'dead leafs', large data consumption, pr. UID volume consumption,
trend analysis, etc.

To do that I first save each directory along with sum of C, M and A
timestamps for the files in the directory (all in epoch secs.), number
of files, volume of files, list of UID's, list of GID's, dict of volume
pr. UID, dict of volume pr. GID and newest file.

Then I build a hierarchy of instances that knows it's parents, children
and siblings.
Each object is populated with the summarized file information.
When that is done, I traverse the hierarchy from the bottom up,
accumulating average C, M and A times, Volumes and number of files.

This hierarchy allows me to instantly query, say, the average
modification time for any given point in the directory structure and
below. That'll show where files that havent been modified in a long time
hides, and how much space they take amongst other things.

The LCRS tree lends itself very well for recursion in terms of beauty
and elegance of the code. However keeping that amount of data in memory
obviously just doesn't fly.

I'm at a point where the system actually just *barely* might work. I'll
know tomorrow when it's done. But the system might be used to work on
much larger filesystems, and then the party is over.

I'm looking for the best suited way of attacking the problem. Is it
a memory map file? A Berkeley DB? A specially crafted metadata filesystem?
(that would be fun but probably overkill..

or something completely
different.

The processing might be switched to iterative, but that's a 'minor' concern.
The main problem is how to handle the data in the fastest possible way.

Thanks for your pointers!

Cheers,
Anders

William Park · May 12, 2004

Anders S. Jensen said:
I'm trying to produce a system that will analyze a filesystem for
'dead leafs', large data consumption, pr. UID volume consumption,
trend analysis, etc.

To do that I first save each directory along with sum of C, M and A
timestamps for the files in the directory (all in epoch secs.), number
of files, volume of files, list of UID's, list of GID's, dict of
volume pr. UID, dict of volume pr. GID and newest file.

Then I build a hierarchy of instances that knows it's parents,
children and siblings. Each object is populated with the summarized
file information. When that is done, I traverse the hierarchy from
the bottom up, accumulating average C, M and A times, Volumes and
number of files.

This hierarchy allows me to instantly query, say, the average
modification time for any given point in the directory structure and
below. That'll show where files that havent been modified in a long
time hides, and how much space they take amongst other things.

The LCRS tree lends itself very well for recursion in terms of beauty
and elegance of the code. However keeping that amount of data in
memory obviously just doesn't fly.

I'm at a point where the system actually just *barely* might work.
I'll know tomorrow when it's done. But the system might be used to
work on much larger filesystems, and then the party is over.

I'm looking for the best suited way of attacking the problem. Is it a
memory map file? A Berkeley DB? A specially crafted metadata
filesystem? (that would be fun but probably overkill.. or something
completely different.

How about good old textfile in filesystem, ie.
/some/dir/.timestamp
? This way, you don't use up memory, you get hierarchial data structure
automatically, no need to mess with database interface, simple search
mechanism, etc. And, most of all, no need to ask other people who have
absolutely no idea what you're talking about...

Processing large CSV files - how to maximise throughput?	11	Oct 25, 2013
a huge shared read-only data in parallel accesses -- How?multithreading? multiprocessing?	9	Dec 9, 2009
Text processing	29	Sep 26, 2011
Processing a huge xml file	12	Jul 23, 2007
Processing huge files	0	Jun 27, 2006
emacs lisp text processing example (html5 figure/figcaption)	7	Jul 4, 2011
Huge files manipulation	12	Nov 10, 2008
Efficient processing of large nuumeric data file	12	Jan 18, 2008

Processing huge datasets

Anders =?UTF-8?B?U8O4bmRlcmdhYXJk?=

Aahz

Thomas Guettler

Dennis Lee Bieber

Terry Reedy

Anders S. Jensen

William Park

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads