how fast is Python code - another detail

Gregor Lingl

"Quality of design" may - for instance - consist
in deleting 7 characters:

Today I made - by accident - an interesting
observation. The following is a snippet from
a program which computes anagrams from a wordlist:

def find_anagrams(wordlist):
    d = {}
    for word in wordlist:
        sig = anagram_signature(word)
        if sig not in d:
            d[sig] = []  # d.setdefault intentionally not used
    return d

On my machine it computes d
for a wordlist of 10000 items in less than 0.13s
and for a wordlist of 20000 items in less than 0.25s
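
A minimal harness along these lines can reproduce such a measurement
(a sketch; anagram_signature is assumed here to be the usual
sorted-letters signature, since its implementation is not shown):

    import random
    import string
    import time

    def anagram_signature(word):
        # assumption: the signature is the word's letters in sorted order
        return ''.join(sorted(word))

    def find_anagrams(wordlist):
        d = {}
        for word in wordlist:
            sig = anagram_signature(word)
            if sig not in d:  # fast: hash lookup directly in the dict
                d[sig] = []
        return d

    # a synthetic wordlist stands in for the real 20000-word list
    words = [''.join(random.choice(string.ascii_lowercase) for _ in range(7))
             for _ in range(20000)]

    t0 = time.time()
    find_anagrams(words)
    print('built d in %.2fs' % (time.time() - t0))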

Today one of my students inadvertently wrote in the fifth line

if sig not in d.keys():

and now timing of find_anagrams had the following results:

for a wordlist of 10000 items approx. 9.33s, i.e. a factor
of approx. 65 slower,

and for a wordlist of 20000 items approx. 60s, i.e. a factor
of approx. 250 slower. (!!!) (Imagine that the full wordlist
in this example has 45000 entries, so an additional factor of
16 applies ...)

I think this is a very astonishing and relevant difference!
I think I understand why it is the case - the run time
is proportional to the square of len(wordlist), because
on every iteration the list d.keys() has to be built anew
and then searched linearly for the key sig,
while sig in d uses some built-in lookup property
of dictionaries ... (?)

*on the other hand* I thought that those two idioms
were semantically identical. So my question: why does the
Python interpreter not recognize this identity and internally
replace the latter with the former in order to get *much*
better performance?

- would this lead to wrong results in some cases?
- or would it be too difficult to implement?
- or is this considered a problem of minor importance?
- or was it only forgotten to do?
- or what else?

Regards,
Gregor

P.S.: Or was the 'key in dict' idiom introduced exactly
in order to solve this problem?
 
Peter Otten

Gregor said:
I think I understand why it is the case - the run time
is proportional to the square of len(wordlist), because
on every iteration the list d.keys() has to be built anew
and then searched linearly for the key sig,
while sig in d uses some built-in lookup property
of dictionaries ... (?)

I think you do not understand. The change from

key in theDict # old style would be theDict.has_key(key)

to

# iterkeys() can sometimes be used to avoid the creation of
# the list with all keys; you would have to introduce a for loop.
# But keys() vs. iterkeys() is not the relevant difference here.
key in theDict.keys()


changes the lookup from constant time (a hash value is calculated
which is used, if all goes well, as an index into a table of values)
to time proportional to len(theDict), because on average half of the
list has to be searched before the matching entry is found.
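
A sketch with the standard timeit module makes the two growth rates
visible; list(d.keys()) is used so that the same linear scan happens
on Python 3 as well, where keys() returns a view with constant-time
membership:

    import timeit

    for n in (1000, 10000, 100000):
        setup = 'd = dict.fromkeys(range(%d)); missing = -1' % n
        fast = timeit.timeit('missing in d', setup=setup, number=1000)
        slow = timeit.timeit('missing in list(d.keys())', setup=setup, number=1000)
        # 'missing in d' stays flat; the list scan grows with n
        print('n=%6d  in d: %.4fs  in list(d.keys()): %.4fs' % (n, fast, slow))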

Peter
 
Andrew Koenig

key in theDict.keys()
changes the lookup from constant time (a hash value is calculated
which is used, if all goes well, as an index into a table of values)
to time proportional to len(theDict), because on average half of the
list has to be searched before the matching entry is found.

Worse: Evaluating theDict.keys() requires building a list from all the
keys. So even if the key is the first element of the list, the run time is
O(len(theDict)).

Moreover, there's no way for the compiler to optimize such expressions in
general, because it doesn't know the type of theDict during compilation.
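
A sketch that isolates the list-building cost: the key looked up below
is found immediately in both cases, yet the keys() version still pays
for materializing the whole list on every test. (list(d.keys()) stands
in for Python 2's keys(), which returned a fresh list; Python 3's
keys() returns a view instead.)

    import timeit

    setup = 'd = dict.fromkeys(range(100000)); first = 0'
    # the key is found at once in both cases ...
    hit_dict = timeit.timeit('first in d', setup=setup, number=100)
    hit_list = timeit.timeit('first in list(d.keys())', setup=setup, number=100)
    # ... but the second still builds a 100000-element list per test
    print('in d: %.6fs   in list(d.keys()): %.6fs' % (hit_dict, hit_list))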
 
Gregor Lingl

Peter said:

I think you do not understand. The change from

key in theDict # old style would be theDict.has_key(key)

to

# iterkeys() can sometimes be used to avoid the creation of
# the list with all keys; you would have to introduce a for loop.
# But keys() vs. iterkeys() is not the relevant difference here.
key in theDict.keys()

Ah, yes, I didn't think of this. This is indeed nearly as
fast as key in dict.

changes the lookup from constant time (a hash value is calculated
which is used, if all goes well, as an index into a table of values)
to time proportional to len(theDict), because on average half of the
list has to be searched before the matching entry is found.

Yes, this is what I meant in my first posting.

Thanks
Gregor
 
Peter Otten

Andrew said:
Worse: Evaluating theDict.keys() requires building a list from all the
keys. So even if the key is the first element of the list, the run time
is O(len(theDict)).

It might be interesting to measure the impact of both effects separately,
by keeping a copy of the keys that is updated in the if clause.
Moreover, there's no way for the compiler to optimize such expressions in
general, because it doesn't know the type of theDict during compilation.

Conclusion: the outward similarity of the

value in theList

and

key in theDict

tests *is* dangerous if you are concerned about speed.

Peter
 
vincent wehren

----- Original Message -----
From: "Gregor Lingl" <[email protected]>
Newsgroups: comp.lang.python
Sent: Thursday, March 04, 2004 7:43 PM
Subject: how fast is Python code - another detail

"Quality of design" may - for instance - consist
in deleting 7 characters:

Today I made - by accident - an interesting
observation. The following is a snippet from
a program which computes anagrams from a wordlist:

def find_anagrams(wordlist):
d = {}
for word in wordlist:
sig = anagram_signature(word)
if sig not in d:
d[sig] = [] # d.setdefault intentionally not used
return d

On my machine it computes d
for a wordlist of 10000 items in less than 0.13s
and for a wordlist of 20000 items in less than 0.25s

Today one of my students inadvertedly wrote in the fifth line

if sig not in d.keys():

and now timing of find_anagrams had the following results:

for a wordlist of 10000 items approx. 9.33s, i.e a factor
of approx 65 slower

and for a wordlist of 20000 items approx. 60s, i.e. a factor
of approx 250 slower. (!!!) (Imagine that the full wordlist
in this example has 45000 entires, so an additional factor of
16 applies ...)

I think this is a very astonishing and relevant difference!
I think I understand why it is the case - the timing factor
is proportional to the square of len(wordlist), because
in every loop a list d.keys() has to be computed and now
this has to be used for searching for the key sig,
while for sig in d uses some built-in generator-property
of dictionaries ... (?)

*on_the_other_ hand* I thought, that those two idioms
are semantically identical.

Nope, they are not. "if sig not in d" equates to "not d.has_key(sig)"
(O(1) in terms of performance), while d.keys() first makes a copy of
d's entire list of keys (giving you O(N)).

You might also consider using:

try:
    d[sig]
except KeyError:
    d[sig] = []
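
Dropped into the original function, that pattern might look like this
(a sketch, again assuming the anagram_signature from the first post):

    def find_anagrams(wordlist):
        d = {}
        for word in wordlist:
            sig = anagram_signature(word)
            try:
                d[sig]  # EAFP: attempt the lookup directly
            except KeyError:
                d[sig] = []  # first time we see this signature
        return d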

HTH,

Vincent Wehren

 
Terry Reedy

Gregor Lingl said:
So my question: why does the
Python interpreter not recognize this identity and internally
replace the latter with the former in order to get *much*
better performance?

The interpreter does not do optimization. 'Optimizing' compilers are
often buggy, which is to say, grossly dis-optimized -- as has been
discovered all too often, from reports here and on PyDev, when compiling
the interpreter with 'optimizing' C compilers.
- would this lead to wrong results in some cases?

Yes. The identity is only an identity in this context, and even then
only dependably so for actual dict objects. You can easily write a class
with __contains__() and keys() methods for which the optimization would
be wrong.
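
For instance, a contrived sketch of such a class, where the two
membership tests legitimately disagree:

    class Fussy:
        """Mapping-like object whose __contains__ ignores keys()."""
        def __init__(self):
            self._data = {'a': 1}
        def keys(self):
            return list(self._data)  # ['a']
        def __contains__(self, item):
            return False  # membership always says no

    f = Fussy()
    print('a' in f.keys())  # True  - linear scan of the keys() list
    print('a' in f)         # False - __contains__ wins
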
- or would it be too difficult to implement?

Yes, writing a significantly optimizing compiler that never introduces
errors seems to be more difficult than writing a basic, straightforward
compiler. There has been energy put into improving the byte code set that
Python is (straightforwardly) translated into, and also into improving
the C code that is executed for each byte code.
- or is this considered a problem of minor importance?

Speed is secondary to correctness, appropriately defined.

Terry J. Reedy
 
beliavsky

Terry Reedy said:
The interpreter does not do optimization. 'Optimizing' compilers are
often buggy, which is to say, grossly dis-optimized -- as has been
discovered all too often, from reports here and on PyDev, when compiling
the interpreter with 'optimizing' C compilers.

My experience with Fortran compilers is that the ones that can
aggressively optimize are not more buggy than the other ones, perhaps
in part because the fast compilers attract more users, who report more
bugs, which then get ironed out. In particular, Compaq Visual Fortran is a
highly optimizing, non-buggy compiler that handily beats Python in
speed comparisons of numerical code. C is more difficult to optimize
than Fortran because of its use of pointers. Optimizing compilers
always have a switch that lets you compile with no optimization. When
using optimization, the user should check that results are the same as
with no optimization.

Speed is secondary to correctness, appropriately defined.

In production use, both are often vitally important. If you are
modelling a lot of data, a slow compiler or programming language may
force you to use an oversimplified model.
 
Peter Hansen

beliavsky said:

In production use, both are often vitally important. If you are
modelling a lot of data, a slow compiler or programming language may
force you to use an oversimplified model.

Terry's still right: correctness comes *first*, even if both are
"vitally important". Only once you have the thing correct (as proven by
passing tests) should you focus on speed.

This leaves aside the possibility that you should consider speed before
even picking your programming language, architecture, hardware, and
such, which of course for a problem requiring *real* speed would come
first. This thread was about a small detail of Python performance and
the possibility of having the compiler do some optimizing, however, and
these are not concerns that would show up in the initial stage of such a
project.

-Peter
 
