Anders Søndergaard
Hi,
I'm trying to process a large filesystem (20+ million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.
I'm keeping it in a Left-Child-Right-Sibling instance structure that I
operate on recursively.
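Roughly, the structure and the kind of recursive pass I run over it look
like this (a simplified sketch; the real fields and summaries differ):

    class Node(object):
        # __slots__ keeps per-instance overhead down, which matters
        # with 20+ million objects (no per-instance __dict__).
        __slots__ = ("name", "size", "mtime", "child", "sibling")

        def __init__(self, name, size=0, mtime=0):
            self.name = name
            self.size = size
            self.mtime = mtime
            self.child = None    # leftmost child
            self.sibling = None  # next sibling to the right

    def total_size(node):
        # Recursive summary pass over the LCRS tree; on a deep
        # hierarchy this is what hits the recursion limit.
        if node is None:
            return 0
        return node.size + total_size(node.child) + total_size(node.sibling)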
First I banged my head on the recursion limit, which could luckily be
adjusted.
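That is, something along these lines:

    import sys

    # The default limit is around 1000 frames; a deep directory tree
    # traversed recursively blows past that.  Note that setting the
    # limit too high trades the exception for a possible C-stack crash.
    sys.setrecursionlimit(50000)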
Now I simply get MemoryError.
Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?
I'm looking at rewriting the code to operate on one part of the hierarchy at
a time and store the processed data structure in another Berkeley DB, so I
can query that afterwards (rough sketch below). But I'd really prefer to keep
everything in memory because of the huge performance gain.
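Roughly what I have in mind, using the stdlib shelve module as a stand-in
for the Berkeley DB layer (summarize_dir is a hypothetical placeholder for
whatever builds the per-directory summary):

    import shelve

    def process_in_chunks(dir_paths, summarize_dir, dbfile="summaries.db"):
        # Summarize one subtree at a time, persist the result, and let
        # that subtree's node objects be freed before moving on.
        # summarize_dir() is hypothetical: it builds and reduces the
        # summary for a single directory subtree.
        out = shelve.open(dbfile)
        try:
            for path in dir_paths:
                out[path] = summarize_dir(path)  # store summary, drop tree
        finally:
            out.close()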
Any pointers?
Cheers, Anders