gc penalty of 30-40% when manipulating large data structures?

Aaron Watters

Poking around I discovered somewhere someone saying that
Python gc adds a 4-7% speed penalty.

So since I was pretty sure I was not creating
reference cycles in nucular I tried running the tests with garbage
collection disabled.

To my delight I found that index builds run 30-40% faster without
gc. This is really nice because calling gc.collect() afterward finds
nothing to collect, so gc was not actually doing anything useful.
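A rough way to repeat this experiment on another workload is sketched below. It is not the nucular code: `build_index` and `timed_build` are hypothetical stand-ins, and the timings will depend heavily on the Python version and on the data.

```python
import gc
import time


def build_index(n):
    # Hypothetical stand-in for an index build: many small container objects.
    return {i: [i, str(i)] for i in range(n)}


def timed_build(n, use_gc):
    """Build the structure with gc switched on or off; return (result, seconds)."""
    was_enabled = gc.isenabled()
    if use_gc:
        gc.enable()
    else:
        gc.disable()
    try:
        start = time.perf_counter()
        index = build_index(n)
        elapsed = time.perf_counter() - start
    finally:
        # Put the collector back the way we found it.
        if was_enabled:
            gc.enable()
        else:
            gc.disable()
    return index, elapsed


index, with_gc = timed_build(200_000, use_gc=True)
del index
index, without_gc = timed_build(200_000, use_gc=False)

# gc.collect() returns the number of unreachable objects it found.
# A result at or near zero suggests the build created no reference cycles.
unreachable = gc.collect()
print(f"with gc: {with_gc:.3f}s  without gc: {without_gc:.3f}s  "
      f"unreachable after build: {unreachable}")
```

If `gc.collect()` reports a large number, the code does create cycles and leaving gc off for long runs will leak memory.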

I haven't analyzed memory consumption but I suspect that should
be significantly improved also, since the index builds construct
some fairly large data structures with lots of references for a
garbage collector to keep track of.

Somewhere someone should mention the possibility that disabling
gc can greatly improve performance with no downside if you
don't create reference cycles. I couldn't find anything like this
on the Python site or elsewhere. As Paul (I think) said, this should
be a FAQ.

Further, maybe Python should include some sort of "backoff"
heuristic which might go like this: If gc didn't find anything and
memory size is stable, wait longer for the next gc cycle. It's
silly to have gc kicking in thousands of times in a multi-hour
run, finding nothing every time.
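To make the idea concrete, here is a minimal user-level sketch of such a backoff, assuming automatic gc is disabled and the program calls `tick()` once per unit of work. `BackoffCollector` and its parameters are hypothetical names, this is not how the interpreter works, and it ignores the "memory size is stable" part of the idea.

```python
import gc


class BackoffCollector:
    """Run gc.collect() manually, waiting longer each time it finds nothing."""

    def __init__(self, initial_interval=1000, max_interval=1_000_000,
                 collect=gc.collect):
        self.initial_interval = initial_interval
        self.max_interval = max_interval
        self.interval = initial_interval
        self.collect = collect  # injectable so the policy can be tested
        self.ticks = 0

    def tick(self):
        """Call once per unit of work. Returns True if a collection ran."""
        self.ticks += 1
        if self.ticks < self.interval:
            return False
        self.ticks = 0
        found = self.collect()
        if found == 0:
            # Nothing was unreachable: back off and wait twice as long.
            self.interval = min(self.interval * 2, self.max_interval)
        else:
            # Garbage was found: return to the short interval.
            self.interval = self.initial_interval
        return True


was_enabled = gc.isenabled()
gc.disable()
try:
    collector = BackoffCollector()
    data = []
    for i in range(10_000):
        data.append([i])
        collector.tick()
finally:
    if was_enabled:
        gc.enable()
```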

Just my 2c.
-- Aaron Watters

nucular full text fielded indexing: http://nucular.sourceforge.net
===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=dingus fish
 
Chris Mellon

> Further, maybe Python should include some sort of "backoff"
> heuristic which might go like this: If gc didn't find anything and
> memory size is stable, wait longer for the next gc cycle. It's
> silly to have gc kicking in thousands of times in a multi-hour
> run, finding nothing every time.

The GC has a heuristic where it kicks in when (allocations -
deallocations) exceeds a certain threshold, which has (sometimes quite
severe) implications for building large indexes. This doesn't seem to
be very well known (it's come up at least 3-4 times on this list in
the last 6 months) and the heuristic is probably not a very good one.
If you have some ideas for improvements, you can read about the
current GC in the gc module docs (as well as in the source) and can
post them on python-ideas.
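For anyone who wants to look at or adjust that threshold, the gc module exposes it directly. This is only a sketch: the value 100_000 is an arbitrary illustration, not a recommendation, and the defaults differ between Python versions.

```python
import gc

# (threshold0, threshold1, threshold2); threshold0 is the
# allocations-minus-deallocations limit for the youngest generation.
original = gc.get_threshold()
print("thresholds:", original)
print("current counts:", gc.get_count())

# Raise threshold0 so collections are triggered far less often
# while a large structure is being built.
gc.set_threshold(100_000, original[1], original[2])
try:
    big = [[i] for i in range(500_000)]
finally:
    # Restore whatever thresholds were in effect before.
    gc.set_threshold(*original)
```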
 
Istvan Albert

> The GC has a heuristic where it kicks in when (allocations -
> deallocations) exceeds a certain threshold,

As the available RAM increases this threshold can be more easily
reached. Ever since I moved to 2 GB of RAM I have stumbled upon issues
that were easily solved by turning the gc off (the truth is that more
RAM made me lazier; I'm a little less keen to keep memory consumption
down for occasional jobs, and am overly cavalier about generating lists
1 GB in size...)

One example: when moving from a list size of 1 million to 10 million
I hit this threshold. Nowadays I disable the gc during data
initialization.
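One way to do that without forgetting to turn the gc back on is a small context manager. This is a sketch only; `gc_paused` is a hypothetical helper name, not something in the standard library.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable automatic gc inside the block, then restore the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


with gc_paused():
    # Data initialization: lots of allocations, no reference cycles.
    records = [(i, str(i)) for i in range(1_000_000)]
```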

i.
 
