W
Weeble
I am loading a dictionary from a text file and constructing a trie
data structure in memory. However, it takes longer than I'm happy with
- about 12 seconds on my computer. I profiled it, came up with some
clever ideas to cut down on the work (such as by exploiting the fact
that the dictionary is sorted) and was only able to shave a small
fraction of the time off. However, then I tried calling gc.disable()
before loading the trie and it halved the running time! I was
surprised. Is that normal? I thought that the cost of garbage
collection would be in some way proportional to the amount of garbage
created, but I didn't think I was creating any: as far as I can tell
the only objects deallocated during the load are strings, which could
not be participating in cycles.
I have spent a lot of time in C#, where the standard advice is not to
mess about with the garbage collector because you'll probably just
make things worse. Does that advice apply in Python? Is it a bad idea
to call gc.disable() before loading the trie and then enable it again
afterwards? Does the fact that the garbage collector is taking so much
time indicate I'm doing something particularly bad here? Is there some
way to give the garbage collector a hint that the whole trie is going
to be long-lived and get it promoted straight to generation 2 rather
than scanning it over and over?
$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m12.523s
user 0m12.380s
sys 0m0.140s
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m12.592s
user 0m12.480s
sys 0m0.110s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m6.176s
user 0m5.980s
sys 0m0.190s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m6.331s
user 0m5.530s
sys 0m0.170s
=== trie.py ===
class Trie(object):
__slots__=("root", "active")
def __init__(self):
self.root=[]
self.active=False
def insert(self, word):
if len(word) == 0:
self.active=True
else:
head = word[0]
for ch, child in reversed(self.root):
if ch == head:
child.insert(word[1:])
return
child = Trie()
self.root.append((head, child))
child.insert(word[1:])
def seek(self, word):
if len(word) == 0:
return self
head = word[0]
for ch, child in self.root:
if ch == head:
return child.seek(word[1:])
return EMPTY_TRIE
def load(self, file):
for line in file:
self.insert(line.strip().lower())
def empty(self):
return (not self.root) and not self.active
def endings(self, prefix=""):
if self.active:
yield prefix
for ch, child in self.root:
for ending in child.endings(prefix+ch):
yield ending
EMPTY_TRIE = Trie()
data structure in memory. However, it takes longer than I'm happy with
- about 12 seconds on my computer. I profiled it, came up with some
clever ideas to cut down on the work (such as by exploiting the fact
that the dictionary is sorted) and was only able to shave a small
fraction of the time off. However, then I tried calling gc.disable()
before loading the trie and it halved the running time! I was
surprised. Is that normal? I thought that the cost of garbage
collection would be in some way proportional to the amount of garbage
created, but I didn't think I was creating any: as far as I can tell
the only objects deallocated during the load are strings, which could
not be participating in cycles.
I have spent a lot of time in C#, where the standard advice is not to
mess about with the garbage collector because you'll probably just
make things worse. Does that advice apply in Python? Is it a bad idea
to call gc.disable() before loading the trie and then enable it again
afterwards? Does the fact that the garbage collector is taking so much
time indicate I'm doing something particularly bad here? Is there some
way to give the garbage collector a hint that the whole trie is going
to be long-lived and get it promoted straight to generation 2 rather
than scanning it over and over?
$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m12.523s
user 0m12.380s
sys 0m0.140s
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m12.592s
user 0m12.480s
sys 0m0.110s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m6.176s
user 0m5.980s
sys 0m0.190s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m6.331s
user 0m5.530s
sys 0m0.170s
=== trie.py ===
class Trie(object):
__slots__=("root", "active")
def __init__(self):
self.root=[]
self.active=False
def insert(self, word):
if len(word) == 0:
self.active=True
else:
head = word[0]
for ch, child in reversed(self.root):
if ch == head:
child.insert(word[1:])
return
child = Trie()
self.root.append((head, child))
child.insert(word[1:])
def seek(self, word):
if len(word) == 0:
return self
head = word[0]
for ch, child in self.root:
if ch == head:
return child.seek(word[1:])
return EMPTY_TRIE
def load(self, file):
for line in file:
self.insert(line.strip().lower())
def empty(self):
return (not self.root) and not self.active
def endings(self, prefix=""):
if self.active:
yield prefix
for ch, child in self.root:
for ending in child.endings(prefix+ch):
yield ending
EMPTY_TRIE = Trie()