Defaultdict and speed

bearophileHUGS · Nov 3, 2006

This post sums up some things I have written in another Python newsgroup.
More than 40% of the time, I use defaultdict like this, to count
things:

>>> from collections import defaultdict as DD
>>> s = "abracadabra"
>>> d = DD(int)
>>> for c in s: d[c] += 1
...
>>> d
defaultdict(<type 'int'>, {'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1})

But I have seen that if the keys are quite sparse, and int() gets called
too often, then code like this is faster:

>>> d = {}
>>> for c in s:
...     if c in d: d[c] += 1
...     else: d[c] = 1
...
>>> d
{'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1}

So, to improve the speed in this special but common situation,
defaultdict could handle the default_factory=int case in a different
and faster way.

Bye,
bearophile

Klaas · Nov 4, 2006

Benchmarks? I doubt it is worth complicating defaultdict's code (and
slowing down other uses of the class) for this improvement...
especially when the faster alternative is so easy to code. If that
performance difference matters, you would likely find more fruitful
gains in coding it in C, using PyDict_SET.

-Mike

bearophileHUGS · Nov 4, 2006

Klaas said:
Benchmarks?

There is one (fixed in a later post) in the original thread I was
referring to:
http://groups.google.com/group/it.comp.lang.python/browse_thread/thread/aff60c644969f9b/
If you want, I can give more of them (a bit less silly, with strings
too, etc.).

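from time import clock
from collections import defaultdict

def ddict(n):
    # Count n distinct integer keys with defaultdict(int).
    t = clock()
    d = defaultdict(int)
    for i in xrange(n):
        d[i] += 1
    print round(clock()-t, 2)

def ndict(n):
    # Same count with a plain dict and an explicit membership test.
    t = clock()
    d = {}
    for i in xrange(n):
        if i in d:
            d[i] += 1
        else:
            d[i] = 1
    print round(clock()-t, 2)

ddict(300000)
ndict(300000)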
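
A re-runnable variant of the benchmark above, sketched with timeit so that each version is timed several times and the best run is kept. This is an illustration, not the code from the original thread: it is written for Python 3 (range instead of xrange, print as a function), and the names count_defaultdict, count_plain and best_time are made up for this sketch. It also accepts any sequence of keys, so the same functions can be tried on repeated string keys as well as on unique integers.

```python
from collections import defaultdict
from timeit import repeat


def count_defaultdict(keys):
    """Count keys with defaultdict(int); missing keys go through the factory."""
    d = defaultdict(int)
    for k in keys:
        d[k] += 1
    return d


def count_plain(keys):
    """Count keys with a plain dict and an explicit membership test."""
    d = {}
    for k in keys:
        if k in d:
            d[k] += 1
        else:
            d[k] = 1
    return d


def best_time(func, keys, runs=3):
    """Return the best wall-clock time in seconds over `runs` single calls."""
    return min(repeat(lambda: func(keys), number=1, repeat=runs))


if __name__ == "__main__":
    sparse = list(range(300000))         # every key is seen exactly once
    dense = list("abracadabra") * 30000  # few distinct keys, many repeats
    for name, keys in (("sparse ints", sparse), ("dense chars", dense)):
        print(name,
              round(best_time(count_defaultdict, keys), 3),
              round(best_time(count_plain, keys), 3))
```

The absolute numbers depend on the interpreter version and the machine, so they are best read as a comparison between the two columns on one system.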

Klaas said:
(and slowing down other uses of the class)

All it has to do is check whether the default_factory is int. It's
just an "if" done only once, so I don't think it slows down the other
cases significantly.

Klaas said:
especially when the faster alternative is so easy to code.

The faster alternative is easy to create, but the best faster
alternative can't be coded, because if you code it in Python you need
two hash accesses, while the defaultdict can require only one of them:

if n in d:
    d[n] += 1
else:
    d[n] = 1
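
One way to look at the lookup question is to count hash computations directly. The sketch below is an illustration and not something from the thread: CountingKey is a hypothetical key class whose __hash__ increments a counter, and hashes_per_update is a made-up helper. The counts it prints reflect how a given CPython build executes each statement, so they may differ on other implementations or versions.

```python
from collections import defaultdict


class CountingKey:
    """Hypothetical key type that records how often it is hashed."""

    def __init__(self, name):
        self.name = name
        self.hash_calls = 0

    def __hash__(self):
        self.hash_calls += 1
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.name == other.name


def hashes_per_update(make_dict, update):
    """Return (hash calls for a missing key, hash calls for an existing key)."""
    d = make_dict()
    key = CountingKey("a")
    update(d, key)                       # first update: key is missing
    missing = key.hash_calls
    update(d, key)                       # second update: key is present
    existing = key.hash_calls - missing
    return missing, existing


def plain_update(d, key):
    # Plain dict with an explicit membership test.
    if key in d:
        d[key] += 1
    else:
        d[key] = 1


def default_update(d, key):
    # Relies on defaultdict to create the missing entry.
    d[key] += 1


if __name__ == "__main__":
    print("plain dict :", hashes_per_update(dict, plain_update))
    print("defaultdict:", hashes_per_update(lambda: defaultdict(int), default_update))
```

Note that d[key] += 1 is itself a read followed by a write, so neither spelling should be expected to get down to a single hash computation per update.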

Klaas said:
If that performance difference matters,

With Python it's usually difficult to tell if some performance
difference matters. Probably in some programs it may matter, but in
most other programs it doesn't matter. This is probably true for all
the performance tweaks I may invent in the future too.

Klaas said:
you would likely find more fruitful
gains in coding it in c, using PyDict_SET

I've just started creating a C lib for related purposes. I'd like to
show it to you all on c.l.p, but first I have to find a place to put
it (it's not easy to find a suitable one: it's Python + C + pyd, and
it's mostly an exercise).

Bye,
bearophile

Klaas · Nov 5, 2006

bearophileHUGS said:
There is one (fixed in a later post) in the original thread I was
referring to:
http://groups.google.com/group/it.comp.lang.python/browse_thread/thread/aff60c644969f9b/
If you want, I can give more of them (a bit less silly, with strings
too, etc.).

Sorry, I didn't see any numbers. I ran it myself and found the
defaultdict version to be approximately twice as slow. This, as you
suggest, is the worst case, as you are using integers as hash keys
(essentially no hashing cost) and are accessing each key exactly once.

bearophileHUGS said:
All it has to do is check whether the default_factory is int. It's
just an "if" done only once, so I don't think it slows down the other
cases significantly.

Once it makes that check, surely it must check a flag or some such
every time it is about to invoke the key constructor function?

Klaas said:
especially when the faster alternative is so easy to code.

bearophileHUGS said:
The faster alternative is easy to create, but the best faster
alternative can't be coded, because if you code it in Python you need
two hash accesses, while the defaultdict can require only one of them:

if n in d:
    d[n] += 1
else:
    d[n] = 1

How do you think that defaultdict is implemented? It must perform the
dictionary access to determine that the value is missing. It must then
go through the method dispatch machinery to look for the __missing__
method, and execute it. If you _really_ want to make this fast, you
should write a custom dictionary subclass which accepts an object (not
a function) as the default value, and assigns it directly.
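
A pure-Python sketch of the kind of subclass described above, which stores a default object rather than a factory. The name ConstantDefaultDict and its constructor signature are assumptions made for illustration. It still goes through __missing__, so it shows the shape of the idea and not a guaranteed speed-up, and sharing one default object is only safe when that object is immutable (such as 0).

```python
class ConstantDefaultDict(dict):
    """dict that inserts a fixed default object for missing keys."""

    def __init__(self, default, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default = default

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent: store the
        # default object itself instead of calling a factory such as int().
        self[key] = self.default
        return self.default


if __name__ == "__main__":
    counts = ConstantDefaultDict(0)
    for c in "abracadabra":
        counts[c] += 1
    print(counts)
```

Whether this beats a vanilla dict with an explicit membership test would have to be measured; the sketch only shows that the factory call can be removed.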

bearophileHUGS said:
With Python it's usually difficult to tell if some performance
difference matters. Probably in some programs it may matter, but in
most other programs it doesn't matter. This is probably true for all
the performance tweaks I may invent in the future too.

In general, I agree, but in this case it is quite clear. The only
possible speed-up is for defaultdict(int). The rewrite using regular
dicts is trivial; hence, for a given piece of code, it is quite clear
whether the performance gain is important. This is not an
interpreter-wide change, after all.

Consider also that the performance gains would be relatively
insubstantial when more complicated keys and a more realistic data
distribution are used. Consider further that the __missing__ machinery
would still be called. Would the resulting construct be faster than
the use of a vanilla dict? I doubt it.

But you can prove me wrong by implementing it and benchmarking it.

bearophileHUGS said:
I've just started creating a C lib for related purposes, I'd like to
show it to you all on c.l.p, but first I have to find a place to put it
on (It's not easy to find a suitable place, it's a python + c +
pyd, and it's mostly an exercise).

Would suggesting a webpage be too trite?

-Mike
