GC is very expensive: am I doing something wrong?

Weeble

I am loading a dictionary from a text file and constructing a trie
data structure in memory. However, it takes longer than I'm happy with
- about 12 seconds on my computer. I profiled it, came up with some
clever ideas to cut down on the work (such as by exploiting the fact
that the dictionary is sorted) and was only able to shave a small
fraction of the time off. However, then I tried calling gc.disable()
before loading the trie and it halved the running time! I was
surprised. Is that normal? I thought that the cost of garbage
collection would be in some way proportional to the amount of garbage
created, but I didn't think I was creating any: as far as I can tell
the only objects deallocated during the load are strings, which could
not be participating in cycles.
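
One way to see how often the collector actually runs during the load would
be gc.set_debug(gc.DEBUG_STATS), which makes every collection pass print its
statistics. A rough sketch (same module and file as in the transcript below):

import gc
gc.set_debug(gc.DEBUG_STATS)   # print a stats line for every collection pass

import trie
t = trie.Trie()
t.load(open('sowpods.txt'))    # watch how many gen-0/1/2 collections happen

gc.set_debug(0)                # switch the debug output off again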

I have spent a lot of time in C#, where the standard advice is not to
mess about with the garbage collector because you'll probably just
make things worse. Does that advice apply in Python? Is it a bad idea
to call gc.disable() before loading the trie and then enable it again
afterwards? Does the fact that the garbage collector is taking so much
time indicate I'm doing something particularly bad here? Is there some
way to give the garbage collector a hint that the whole trie is going
to be long-lived and get it promoted straight to generation 2 rather
than scanning it over and over?
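
Concretely, the pattern I'm wondering about is just a sketch like this (the
real code builds the trie exactly as in trie.py below):

import gc

gc.disable()                   # no cyclic collection while building
try:
    import trie
    t = trie.Trie()
    t.load(open('sowpods.txt'))
finally:
    gc.enable()                # restore normal collection afterwards
    gc.collect()               # one full pass; anything that survives it
                               # sits in the oldest generation and is not
                               # rescanned by the frequent gen-0 collections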

$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.

$ time python -c "import trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m12.523s
user 0m12.380s
sys 0m0.140s

$ time python -c "import trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m12.592s
user 0m12.480s
sys 0m0.110s

$ time python -c "import gc;gc.disable();import trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m6.176s
user 0m5.980s
sys 0m0.190s

$ time python -c "import gc;gc.disable();import trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m6.331s
user 0m5.530s
sys 0m0.170s


=== trie.py ===

class Trie(object):
    __slots__ = ("root", "active")
    def __init__(self):
        self.root = []
        self.active = False
    def insert(self, word):
        if len(word) == 0:
            self.active = True
        else:
            head = word[0]
            for ch, child in reversed(self.root):
                if ch == head:
                    child.insert(word[1:])
                    return
            child = Trie()
            self.root.append((head, child))
            child.insert(word[1:])
    def seek(self, word):
        if len(word) == 0:
            return self
        head = word[0]
        for ch, child in self.root:
            if ch == head:
                return child.seek(word[1:])
        return EMPTY_TRIE
    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())
    def empty(self):
        return (not self.root) and not self.active
    def endings(self, prefix=""):
        if self.active:
            yield prefix
        for ch, child in self.root:
            for ending in child.endings(prefix + ch):
                yield ending

EMPTY_TRIE = Trie()
 
Terry Reedy

I thought that the cost of garbage
collection would be in some way proportional to the amount of garbage
created, but I didn't think I was creating any: as far as I can tell
the only objects deallocated during the load are strings, which could
not be participating in cycles.

I have spent a lot of time in C#, where the standard advice is not to
mess about with the garbage collector because you'll probably just
make things worse. Does that advice apply in Python? Is it a bad idea
to call gc.disable() before loading the trie and then enable it again
afterwards?

I believe not. It is known that certain patterns of object creation and
destruction can lead to bad gc behavior. No one has discovered a setting
of the internal tuning parameters for which there are no bad patterns,
and I suspect no such setting exists. This does not negate Xavier's
suggestion that a code change might also solve your problem.
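
(The tuning parameters in question are exposed through gc.get_threshold()
and gc.set_threshold(). A sketch of raising them for a large build instead
of disabling collection outright; the numbers are purely illustrative, not
a recommendation:)

import gc

print gc.get_threshold()           # the defaults are (700, 10, 10)
gc.set_threshold(100000, 10, 10)   # make gen-0 collections much rarer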

tjr
 
Patrick Maupin

I am loading a dictionary from a text file and constructing a trie
data structure in memory. However, it takes longer than I'm happy with
- about 12 seconds on my computer. I profiled it, came up with some
clever ideas to cut down on the work (such as by exploiting the fact
that the dictionary is sorted) and was only able to shave a small
fraction of the time off. However, then I tried calling gc.disable()
before loading the trie and it halved the running time! I was
surprised. Is that normal? I thought that the cost of garbage
collection would be in some way proportional to the amount of garbage
created, but I didn't think I was creating any: as far as I can tell
the only objects deallocated during the load are strings, which could
not be participating in cycles.

Well, you are creating and destroying a lot of objects in the process,
so that will provoke the garbage collector. But you are also doing
reversing and searching, and that's slow. Does your application
really need to be able to keep things in order like this, or do you
just want to know if a word is in the dictionary? If you just want to
load up a dictionary and be able to see if words are in it, I would
use a dict instead of a list.
Even if you want to be able to print out the data in order, you might
consider using a dict instead of a list. The print operation could
use one sort at the end, instead of searching all the nodes on
insertion.

You could also use a character that doesn't appear in your data as a
sentinel, and then you don't need a separate active indicator -- every
leaf node will be empty and be referred to by the sentinel character.

You are also doing a lot of recursive operations that could be done
without recursing.

Finally, do you really need to keep an additional object around for
each node in the tree?

I have modified your trie code to use a dict and a sentinel, while
keeping basically the same API. This may or may not be the best way
to do it, depending on the other code that uses this data structure.
It could probably be made faster still, for example by removing the
setdefault and by not re-building objects when you need them, and even
this implementation will load faster if you disable the gc. In any
case, it might give you some ideas about how to make your code go
faster.

Regards,
Pat


from collections import defaultdict

class TrieTop(object):
    sentinel = ' '  # Something not in the data

    def __init__(self, data=None):
        def defaultrecurse():
            return defaultdict(defaultrecurse)
        if data is None:
            data = defaultrecurse()
        self.data = data
    def insert(self, word):
        data = self.data
        for ch in word:
            data = data[ch]
        data[self.sentinel]
    def seek(self, word):
        data = self.data
        for ch in word:
            data = data.get(ch)
            if data is None:
                return EMPTY_TRIE
        return TrieTop(data)
    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())
    def empty(self):
        return (not self.data)
    def endings(self, prefix=""):
        def recurse(data, prefix):
            if not data:
                yield prefix[:-1]
                return
            for ch, data in data.iteritems():
                for result in recurse(data, prefix + ch):
                    yield result
        return sorted(recurse(self.data, prefix))

EMPTY_TRIE = TrieTop()
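
A quick usage sketch for the class above (the word-list file name is taken
from the original transcript; the prefix is arbitrary):

t = TrieTop()
t.load(open('sowpods.txt'))
node = t.seek('zyg')
if not node.empty():
    # endings() returns a sorted list of the complete words under the prefix
    print node.endings('zyg')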
 
Lawrence D'Oliveiro

Terry said:
No one has discovered a setting
of the internal tuning parameters for which there are no bad patterns
and I suspect there are not any such. This does not negate Xavier's
suggestion that a code change might also solve your problem.

Could it be that for implementing a structure like a trie as the OP is,
where a lot of CPU cycles can be spent manipulating the structure, a high-
level language like Python, Perl or Ruby just gets in the way?

My feeling would be, try to get the language to do as much of the work for
you as possible. If you can’t do that, then you might be better off with a
lower-level language.
 
Stefan Behnel

Lawrence D'Oliveiro, 22.03.2010 00:36:
Could it be that for implementing a structure like a trie as the OP is,
where a lot of CPU cycles can be spent manipulating the structure, a high-
level language like Python, Perl or Ruby just gets in the way?

I would rather say that the specific problem of the trie data structure is
that it has extremely little benefit over other available data structures.
There may still be a couple of niches where it makes sense to consider it
as an alternative, but given that dicts are so heavily optimised in Python,
it'll be hard for tries to compete even when written in a low-level language.

Remember that the original use case was to load a dictionary from a text
file. For this use case, a trie can be very wasteful in terms of memory and
rather CPU cache unfriendly on traversal, whereas hash values are a) rather
fast to calculate for a string, and b) often just calculated once and then
kept alive in the string object for later reuse.
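
(For the exact-membership use case, the hash-based alternative boils down to
a set/dict lookup. A sketch, again with the OP's file name:)

# load the word list into a set (hash-based, like dict) and test membership
words = set(line.strip().lower() for line in open('sowpods.txt'))
print 'zygote' in words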

My feeling would be, try to get the language to do as much of the work for
you as possible. If you can’t do that, then you might be better off with a
lower-level language.

I agree with the first sentence, but I'd like to underline the word 'might'
in the second. As this newsgroup shows, very often it's enough to look for
a better algorithmic approach first.

Stefan
 
tan

Stefan Behnel said:
I would rather say that the specific problem of the trie data structure is
that it has extremely little benefit over other available data structures.

Not true.

There may still be a couple of niches where it makes sense to consider it
as an alternative, but given that dicts are so heavily optimised in Python,
it'll be hard for tries to compete even when written in a low-level language.

It depends. If your data is not in nearly sorted order,
trees are some of the best mechanisms available.

Remember that the original use case was to load a dictionary from a text
file. For this use case, a trie can be very wasteful in terms of memory and
rather CPU cache unfriendly on traversal, whereas hash values are a) rather
fast to calculate for a string, and b) often just calculated once and then
kept alive in the string object for later reuse.

You still have to walk the bucket in a hash map/table.
Performance may be orders of magnitude worse than for trees.

 
Antoine Pitrou

On Mon, 22 Mar 2010 23:40:16 +0000, tan wrote:
You still have to walk the bucket in a hash map/table. Performance may
be orders of magnitude worse than for trees.

"walk the bucket" shouldn't be a significant cost factor here, especially
if you are doing meaningful work with the traversed items. In the CPython
implementation the total hash table size is less than a constant times
the number of actual items. Moreover, it's a linear scan over an array
rather than having to dereference pointers as in a tree.

"Orders of magnitude worse", in any case, sounds very exaggerated.

(and, of course, as the OP said, there's the major difference that the
dict type is implemented in C, which makes constant factors an order of
magnitude smaller than for a Python trie implementation)
 
Paul Rubin

Antoine Pitrou said:
"Orders of magnitude worse", in any case, sounds very exaggerated.

The worst case can lose orders of magnitude if a lot of values hash
to the same bucket.
 
Steven D'Aprano

The worst case can lose orders of magnitude if a lot of values hash to
the same bucket.


Well, perhaps one order of magnitude.

>>> for i in xrange(100):
...     n = 32*i+1
...     assert hash(2**n) == hash(2)
...
>>> d1 = dict.fromkeys(xrange(100))
>>> d2 = dict.fromkeys([2**(32*i+1) for i in xrange(100)])
>>> from timeit import Timer
>>> setup = "from __main__ import d1, d2"
>>> t1 = Timer("for k in d1.keys(): x = d1[k]", setup)
>>> t2 = Timer("for k in d2.keys(): x = d2[k]", setup)
>>> min(t1.repeat(number=1000, repeat=5))
0.026707887649536133
>>> min(t2.repeat(number=1000, repeat=5))
0.33103203773498535
 
Peter Otten

Steven said:
The worst case can lose orders of magnitude if a lot of values hash to
the same bucket.


Well, perhaps one order of magnitude.

>>> for i in xrange(100):
...     n = 32*i+1
...     assert hash(2**n) == hash(2)
...
>>> d1 = dict.fromkeys(xrange(100))
>>> d2 = dict.fromkeys([2**(32*i+1) for i in xrange(100)])
>>> from timeit import Timer
>>> setup = "from __main__ import d1, d2"
>>> t1 = Timer("for k in d1.keys(): x = d1[k]", setup)
>>> t2 = Timer("for k in d2.keys(): x = d2[k]", setup)
>>> min(t1.repeat(number=1000, repeat=5))
0.026707887649536133
>>> min(t2.repeat(number=1000, repeat=5))
0.33103203773498535

But the ratio grows with the number of collisions:

$ python extrapolate.py
10
0.00120401382446
0.00753307342529
ratio: 6.25663366337

100
0.00542402267456
0.316139936447
ratio: 58.2851428571

1000
0.00553417205811
3.36690688133
ratio: 608.384930209

$ cat extrapolate.py
from timeit import Timer

class Item(object):
    def __init__(self, value, hash=None):
        self.value = value
        self.hash = value if hash is None else hash
    def __eq__(self, other):
        return self.value == other.value
    def __hash__(self):
        return self.hash

setup = "from __main__ import d"
bench = "for k in d: x = d[k]"

for n, number in (10, 100), (100, 100), (1000, 10):
    print n
    d1 = dict.fromkeys(Item(i) for i in xrange(n))
    d2 = dict.fromkeys(Item(i, 0) for i in xrange(n))
    ab = []
    for d in d1, d2:
        t = Timer(bench, setup)
        ab.append(min(t.repeat(number=number, repeat=3)))
        print ab[-1]
    print "ratio:", ab[1]/ab[0]
    print


See also http://xkcd.com/605/

Peter
 
Paul Rubin

Stefan Behnel said:
While this is theoretically true, and it's good to be aware of this
possibility, common string hash functions make it so rare in practice
that a hash table will almost always outperform a trie for exact
lookups. If it happens, it will either show up clearly enough in
benchmarks or not be worth bothering.

It is unlikely to happen by accident. You might care that it can
happen on purpose. See: http://www.cs.rice.edu/~scrosby/hash/
that I cited in another post. The article shows some sample attacks
on Python cgi's.
 
Stefan Behnel

Paul Rubin, 23.03.2010 06:05:
The worst case can lose orders of magnitude if a lot of values hash
to the same bucket.

While this is theoretically true, and it's good to be aware of this
possibility, common string hash functions make it so rare in practice that
a hash table will almost always outperform a trie for exact lookups. If it
happens, it will either show up clearly enough in benchmarks or not be
worth bothering.

Stefan
 
Antoine Pitrou

On Tue, 23 Mar 2010 02:57:56 -0700, Paul Rubin wrote:
It is unlikely to happen by accident. You might care that it can happen
on purpose. See: http://www.cs.rice.edu/~scrosby/hash/ that I cited in
another post. The article shows some sample attacks on Python cgi's.

Certainly interesting from a purely academic point of view, but in real
life if you want to cause a denial of service by overwhelming a server,
there are far more obvious options than trying to guess the server's use
of hash tables and trying to cause lots of collisions in them.
 
Paul Rubin

Antoine Pitrou said:
Certainly interesting in a purely academic point of view, but in real
life if you want to cause a denial of service by overwhelming a server,
there are far more obvious options than trying to guess the server's use
of hash tables and trying to cause lots of collisions in them.

If you look at the very low bandwidth used in some of those hashtable
attacks, it's hard to see any other somewhat-generic attack that's
comparably effective. Usually we think of "DOS" as involving massive
botnets and the like, not a dribble of a few hundred characters per
second.
 
