Creating Long Lists

Kelson Zawack

I have a large (10 GB) data file for which I want to parse each line
into an object, append that object to a list, and then sort and further
process the list. I have noticed, however, that as the length of the
list increases, the rate at which objects are added to it decreases
dramatically.

My first thought was that I was nearing the memory capacity of the
machine and that the slowdown was due to the OS swapping things in and
out of memory. When I looked at the memory usage, this was not the
case: my process was the only job running, it was consuming 40 GB of
the total 130 GB, and no swapping processes were running. To make sure
there was not some problem with the rest of my code, or with the
server's file system, I ran my program again without the line that
appends items to the list, and it completed without problem, indicating
that the slowdown comes from some part of the append operation.

Since other people have observed this problem as well
(http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
http://stackoverflow.com/questions/...n-list-append-becoming-progressively-slower-i)
I did not bother to further analyze or benchmark it. Since the answers
in those forums do not seem very definitive, I thought I would inquire
here about the reason for this decrease in performance, and whether
there is a way, or another data structure, that would avoid the
problem.
 
alex23

I did not bother to further analyze or benchmark it. Since the answers
in those forums do not seem very definitive, I thought I would inquire
here about the reason for this decrease in performance, and whether
there is a way, or another data structure, that would avoid the
problem.

The first link is 6 years old and refers to Python 2.4. Unless you're
using 2.4 you should probably ignore it.

The first answer on the stackoverflow link was accepted by the poster
as resolving his issue. Try disabling garbage collection.
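
A minimal sketch of that suggestion, assuming a hypothetical parse_line
parser and file name (neither is from the thread):

    import gc

    def parse_line(line):
        # stand-in parser; the real one would build the poster's objects
        return line.split('\t')

    records = []
    gc.disable()              # suspend cyclic GC while building the list
    try:
        with open('data.txt') as f:
            for line in f:
                records.append(parse_line(line))
    finally:
        gc.enable()           # restore collection after the bulk appends

    records.sort()

The try/finally just ensures collection is re-enabled even if parsing
raises.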
 
John Bokma

alex23 said:
The first link is 6 years old and refers to Python 2.4. Unless you're
using 2.4 you should probably ignore it.

The first answer on the stackoverflow link was accepted by the poster
as resolving his issue. Try disabling garbage collection.

I just read http://bugs.python.org/issue4074, which discusses a patch
that was merged two years ago. So a recent Python 2.x shouldn't have
this issue either?
 
Kelson Zawack

The answer, it turns out, is the garbage collector. When I disable the
garbage collector before the loop that loads the data into the list and
then enable it after the loop, the program runs without issue. This
raises a question, though: can the logic of the garbage collector be
changed so that it is not triggered in cases like this, where you
really do want to put lots and lots of stuff in memory? Turning the
garbage collector off and on is not a big deal, but it would obviously
be nicer not to have to.
 
Kelson Zawack

I am using Python 2.6.2, so it may no longer be a problem.

I am open to using another data type, but the way I read the
documentation, array.array only supports numeric types, not arbitrary
objects. I also tried playing around with numpy arrays, albeit only for
a short time, and it seems that although they do support arbitrary
objects, they are geared toward numbers, and I found it cumbersome to
manipulate objects with them. It could be, though, that if I understood
them better they would work fine. Also, do numpy arrays support sorting
arbitrary objects? I only saw a method that sorts numbers.
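
For reference, a sketch of both points: array.array indeed accepts only
numeric typecodes, while a numpy array with dtype=object will sort
using the objects' own comparison methods. Record here is a made-up
class, not anything from the thread:

    import numpy as np

    class Record(object):
        """Made-up example object standing in for a parsed line."""
        def __init__(self, key, payload):
            self.key = key
            self.payload = payload
        def __lt__(self, other):
            # numpy's sort on dtype=object arrays uses this comparison
            return self.key < other.key

    records = [Record(3, 'c'), Record(1, 'a'), Record(2, 'b')]

    arr = np.array(records, dtype=object)   # object dtype, not numeric
    arr.sort()                               # in-place, via Record.__lt__

    # A plain list plus a key function avoids defining __lt__ at all:
    records.sort(key=lambda r: r.key)

For sorting alone, the plain list with a key function is usually
simpler than going through numpy.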
 
Terry Reedy

The answer, it turns out, is the garbage collector. When I disable the
garbage collector before the loop that loads the data into the list and
then enable it after the loop, the program runs without issue. This
raises a question, though: can the logic of the garbage collector be
changed so that it is not triggered in cases like this, where you
really do want to put lots and lots of stuff in memory? Turning the
garbage collector off and on is not a big deal, but it would obviously
be nicer not to have to.

Heuristics, by their very nature, are not correct in all situations.
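
That said, the collector's trigger points can be tuned rather than
switched off outright, via gc.set_threshold. A minimal sketch, with an
arbitrary example value:

    import gc

    print(gc.get_threshold())    # the default is (700, 10, 10)

    # Raise the generation-0 allocation threshold so collection runs far
    # less often during an allocation-heavy loading phase.  100000 is an
    # arbitrary value chosen purely for illustration.
    gc.set_threshold(100000, 10, 10)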
 
Jorgen Grahn

What is the nature of the further processing?

Does that further processing access the items sequentially? If so, they
don't all need to be in memory at once, and you can produce them with a
generator <URL:http://docs.python.org/glossary.html#term-generator>.

He mentioned sorting them -- you need all of them for that.

If that's the *only* such use, I'd experiment with writing them as
sortable text to file, and run GNU sort (the Unix utility) on the file.
It seems to have a clever file-backed sort algorithm.

/Jorgen
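
A sketch of the generator approach suggested above, with stand-in
parse_line and process functions (both hypothetical):

    def parse_line(line):
        # stand-in parser; the real one would build the poster's objects
        return line.split('\t')

    def process(record):
        pass                     # stand-in for the downstream work

    def iter_records(path):
        """Yield parsed objects one at a time instead of building a list."""
        with open(path) as f:
            for line in f:
                yield parse_line(line)

    for record in iter_records('data.txt'):   # placeholder file name
        process(record)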
 
Tim Wintle

If that's the *only* such use, I'd experiment with writing them as
sortable text to file, and run GNU sort (the Unix utility) on the file.
It seems to have a clever file-backed sort algorithm.

+1 - and experiment with the different flags to sort (compression of
intermediate results, intermediate batch size, etc.).

Tim
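
A sketch of driving GNU sort from Python with some of those flags; the
flag values are illustrative, not tuned, and the file names are
placeholders:

    import subprocess

    # -S sets the in-memory buffer size, --batch-size caps how many
    # intermediate files are merged at once, and --compress-program
    # compresses the intermediate runs.
    subprocess.check_call([
        'sort',
        '-S', '4G',
        '--batch-size=64',
        '--compress-program=gzip',
        '-o', 'sorted.txt',      # output file
        'records.txt',           # input: one sortable record per line
    ])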
 
