optimizing large dictionaries


Per Freem

hello

i have an optimization question about python. i am iterating through a
file and counting the number of repeated elements. the file has on the
order of tens of millions of elements...

i create a dictionary that maps elements of the file that i want to
count to their number of occurrences. so i iterate through the file and
for each line extract the element (a simple text operation) and see if
it has an entry in the dict:

for line in file:
    try:
        elt = MyClass(line)  # extract elt from line...
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

i am using try/except since it is supposedly faster (though i am not
sure about this; is this really true in Python 2.5?).

the only 'twist' is that my elt is an instance of a class (MyClass)
with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
works well. the __repr__ and __hash__ methods of my class simply return
the str() representation of self, while __str__ just joins the numeric
fields into a concatenated string:

class MyClass:

    def __str__(self):
        return "%s-%s-%s" % (self.field1, self.field2, self.field3)

    def __repr__(self):
        return str(self)

    def __hash__(self):
        return hash(str(self))


is there anything that can be done to speed up this simple code? right
now it is taking well over 15 minutes to process, on a 3 GHz machine
with lots of RAM (though this is all CPU-bound, not RAM-bound, at this
point).

any general advice on how to optimize large dicts would be great too.

thanks for your help.
 

Matimus

Per Freem said:
the __repr__ and __hash__ methods of my class simply return the str()
representation of self ... is there anything that can be done to speed
up this simple code?

You can use a tuple instead of a string, which should be a little
quicker:

def __hash__(self):
    return hash((self.field1, self.field2, self.field3))

You could speed it up even more if you just saved a single attribute
"fields" as a tuple to begin with.

Also, you can use defaultdict and get rid of the try/except. I don't
think try/except is slow, but avoiding it will give you a speed up.

from collections import defaultdict
my_dict = defaultdict(int)

for line in file:
    elt = MyClass(line)  # extract elt from line...
    my_dict[elt] += 1


You might even consider turning "MyClass" into just a function that
extracts the values from the line and returns a tuple, which should
give you even more of a boost since a tuple is completely implemented
in C.
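Roughly like this, for instance (again the parsing is just a
placeholder for the real text operation):

from collections import defaultdict

def extract(line):
    # placeholder parsing; substitute the real text operation
    a, b, c = line.split()[:3]
    return (int(a), int(b), int(c))

my_dict = defaultdict(int)
for line in file:
    my_dict[extract(line)] += 1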


Matt
 

bearophileHUGS

Matimus, your suggestions are all good.

Try-except is slower than:
if x in adict: ... else: ...
A defaultdict is generally faster (there are some conditions where it's
not faster, but they aren't very common; I think it's when the ratio of
duplicates is really low), creating just a tuple instead of a class
helps a lot, and when the CPU/OS allow it, Psyco may help some here too.

If the resulting speed isn't enough yet, consider that Python dicts are
quite fast, so you may need a lot of care to write D/C/C++/Clisp code
that's faster for this problem.

I also suspect that when they become very large, Python dicts lose some
of their efficiency. If this is true, you may switch to a new dictionary
for every chunk of the file, and then merge the dicts at the end. I
don't actually know if this speeds up your Python code even more (if you
experiment with this, I'd like to know whether it turns out to be false).
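If you want to try it, a rough sketch (untested; the chunk size is
arbitrary and extract() is the hypothetical parser from the sketch
above):

from collections import defaultdict
from itertools import islice

def count_in_chunks(file, chunk_size=1000000):
    partials = []
    while True:
        chunk = list(islice(file, chunk_size))
        if not chunk:
            break
        counts = defaultdict(int)
        for line in chunk:
            counts[extract(line)] += 1
        partials.append(counts)
    # merge the per-chunk dicts at the end
    total = defaultdict(int)
    for counts in partials:
        for key, n in counts.iteritems():
            total[key] += n
    return total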

Bye,
bearophile
 

Steven D'Aprano

class MyClass(object):
    # a new style class with slots saves some memory
    __slots__ = ("field1", "field2", "field3")

I was curious whether using slots would speed up attribute access.

>>> class Parrot(object):
...     def __init__(self, a, b, c):
...         self.a = a
...         self.b = b
...         self.c = c
...
>>> class SlottedParrot(object):
...     __slots__ = 'a', 'b', 'c'
...     def __init__(self, a, b, c):
...         self.a = a
...         self.b = b
...         self.c = c
...
>>> p = Parrot(23, "something", [1, 2, 3])
>>> sp = SlottedParrot(23, "something", [1, 2, 3])
>>> from timeit import Timer
>>> setup = "from __main__ import p, sp"
>>> t1 = Timer('p.a, p.b, p.c', setup)
>>> t2 = Timer('sp.a, sp.b, sp.c', setup)
>>> min(t1.repeat())
0.83308887481689453
>>> min(t2.repeat())
0.62758088111877441


That's not a bad improvement. I knew that __slots__ was designed to
reduce memory consumption, but I didn't realise they were faster as well.
 

Per Freem

thanks to everyone for the excellent suggestions. a few follow up q's:

1] is try/except really slower? my dict actually has two layers, so
my_dict[aKey][bKey]. the aKeys are few (less than 100) whereas the
bKeys are the ones that are in the millions. so in that case, doing a
try/except on aKey should be very efficient, since often it will not
fail, whereas if i do "if aKey in my_dict", that test will get executed
for every aKey. can someone say definitively whether try/except is
faster or not? my benchmarks aren't conclusive and i hear it both ways
from several people (though the majority thinks try/except is faster).

2] is there an easy way to have nested defaultdicts? ie i want to say
my_dict = defaultdict(defaultdict(int)) to reflect the fact that
my_dict is a dictionary whose values are dictionaries that map to
ints. but that syntax is not valid.

3] more importantly, is there likely to be a big improvement from
splitting up one big dictionary into several smaller ones? if so, is
there a straightforward, elegant way to implement this? the way i am
thinking is to just fix a number of dicts and populate them with
elements. then during retrieval, try the first dict; if that fails, try
the second; if not, the third, etc... but i can imagine how that's more
likely to lead to bugs / debugging given the way my code is set up, so
i am wondering whether it is really worth it. if it can lead to a
factor of 2 difference, i will definitely implement it. does anyone
have experience with this?

 

Paul Rubin

Per Freem said:
2] is there an easy way to have nested defaultdicts? ie i want to say
my_dict = defaultdict(defaultdict(int)) to reflect the fact that
my_dict is a dictionary whose values are dictionaries that map to
ints. but that syntax is not valid.

my_dict = defaultdict(lambda: defaultdict(int))
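For example (aKey/bKey here are just placeholder keys):

from collections import defaultdict

my_dict = defaultdict(lambda: defaultdict(int))
my_dict["aKey"]["bKey"] += 1   # inner dict and counter are created on demand
print my_dict["aKey"]["bKey"]  # prints 1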
 

Steven D'Aprano

Matimus, your suggestions are all good.

Try-except is slower than:
if x in adict: ... else: ...



Not according to my tests. I timed a lookup wrapped in try/except:

def lookup_try(D, key):
    try:
        return D[key]
    except KeyError:
        return None

against the equivalent membership-test version:

def lookup_in(D, key):
    if key in D:
        return D[key]
    else:
        return None

With a key that is found, the two come out close and try/except is not
the loser; one of the timeit results on my machine was
0.71856999397277832 seconds (for timeit's default one million calls).

Perhaps what you meant to say is that try...except is slower *if the key
is not found*. And in that case, there's a truly massive difference:
raising and catching KeyError on every miss costs far more, while the
membership test stays down around 0.6 seconds (0.5983281135559082).

Which solution is faster on average will depend on the ratio of keys
found versus keys not found. On my PC, based on the above results,
if...else becomes faster when just 2% of keys are not found.

But if you can arrange matters so that nearly all keys are found, then
try...except can be faster.


There's another solution nobody has mentioned (as far as I can see): the
get method. In my test I call get() from inside a function, rather than
directly, so that the timing results include the function call overhead
for all three strategies:

def lookup_get(D, key):
    return D.get(key)

That one comes in at 0.87782382965087891 seconds, slower than both of
the timings above.
 

Paul McGuire

...the aKeys are few (less than 100) whereas the bKeys are the ones
that are in the millions. so in that case, doing a try/except on aKey
should be very efficient, since often it will not fail, ...

Do you know the aKeys in advance? If so, then just initialize the
first layer, and it will never fail.
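For example (a sketch; known_akeys is a hypothetical list of the aKeys,
and parse_keys() stands in for whatever parsing produces the two keys):

from collections import defaultdict

# build the outer layer once, so the aKey lookup never fails
my_dict = dict((akey, defaultdict(int)) for akey in known_akeys)

for line in file:
    akey, bkey = parse_keys(line)
    my_dict[akey][bkey] += 1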

-- Paul
 

pruebauno

Per Freem said:
is there anything that can be done to speed up this simple code? right
now it is taking well over 15 minutes to process, on a 3 GHz machine
with lots of RAM.

I am willing to bet a beer that the slow performance has nothing to do
with the dict but is either because of MyClass or the read from disk.
 

Matthias Julius

Per Freem said:
the only 'twist' is that my elt is an instance of a class (MyClass)
with 3 fields, all numeric. the class is hashable, and so
my_dict[elt] works well. the __repr__ and __hash__ methods of my
class simply return the str() representation of self,

which just calls __str__(). I guess you are aware of that, but you
could call self.__str__() directly. Maybe that saves something when
you do it 10 million times.

while __str__ just joins the numeric fields into a concatenated string:

class MyClass:

    def __str__(self):
        return "%s-%s-%s" % (self.field1, self.field2, self.field3)

    def __repr__(self):
        return str(self)

    def __hash__(self):
        return hash(str(self))

Maybe it would be faster to numerically combine the three fields
instead of hashing the string representation.
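For example, something along these lines (just a sketch; the 2**20
packing factor is an arbitrary guess at the field ranges, and hashing a
tuple of the fields would work too):

def __hash__(self):
    # pack the three numeric fields into one number and hash that,
    # rather than formatting and hashing a string
    return hash((self.field1 * 1048576 + self.field2) * 1048576 + self.field3)

Whatever __hash__ does, __eq__ should compare the same fields so that
equal elements end up in the same dict entry.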

Matthias
 

Luis M. González

Per Freem said:

for line in file:
    try:
        elt = MyClass(line)  # extract elt from line...
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

How about this?:

for line in file:
    elt = MyClass(line)
    my_dict[elt] = my_dict.get(elt, 0) + 1
 
