Gregor Lingl
"Quality of design" may - for instance - consist
in deleting 7 characters:
Today I made - by accident - an interesting
observation. The following is a snippet from
a program which computes anagrams from a wordlist:
def find_anagrams(wordlist):
    d = {}
    for word in wordlist:
        sig = anagram_signature(word)
        if sig not in d:
            d[sig] = []  # d.setdefault intentionally not used
    return d
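The snippet calls anagram_signature, which isn't shown in the post. A common choice - an assumption here, not necessarily Gregor's version - is the word's letters in sorted order; with that, and with the grouping step filled in so the sketch is self-contained, the whole thing looks like:

```python
def anagram_signature(word):
    # Hypothetical helper: two words are anagrams iff their
    # sorted letter sequences are equal.
    return "".join(sorted(word))

def find_anagrams(wordlist):
    d = {}
    for word in wordlist:
        sig = anagram_signature(word)
        if sig not in d:
            d[sig] = []  # d.setdefault intentionally not used
        d[sig].append(word)  # collect the word into its anagram group
    return d
```

For example, find_anagrams(["eat", "tea", "ate", "bat"]) groups the first three words under the signature "aet" and "bat" under "abt".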
On my machine it computes d
for a wordlist of 10000 items in less than 0.13s
and for a wordlist of 20000 items in less than 0.25s
Today one of my students inadvertently wrote in the fifth line
if sig not in d.keys():
and now the timing of find_anagrams gave the following results:
for a wordlist of 10000 items approx. 9.33s, i.e. a factor
of approx. 65 slower,
and for a wordlist of 20000 items approx. 60s, i.e. a factor
of approx. 250 slower. (!!!) (Imagine that the full wordlist
in this example has 45000 entries, so an additional factor of
16 applies ...)
I think this is a very astonishing and relevant difference!
I think I understand why this is the case - the slowdown
is proportional to the square of len(wordlist), because
on every loop iteration the list d.keys() has to be built anew
and then searched linearly for the key sig,
while sig not in d uses the dictionary's built-in
hash lookup directly ... (?)
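The effect can be reproduced without d.keys() by testing membership in an explicit list of the keys - which is exactly what d.keys() returned in the Python of this post - and comparing it with direct dict membership. A minimal sketch (count_unique_* are illustrative names, not from the original program):

```python
def count_unique_list(words):
    # Emulates "sig not in d.keys()" under old-style keys():
    # a fresh list of keys is built and scanned linearly on
    # every iteration, so the whole loop is O(n**2).
    d = {}
    for w in words:
        if w not in list(d):  # linear scan of a newly built list
            d[w] = None
    return len(d)

def count_unique_dict(words):
    # "sig not in d": a single hash lookup, O(1) on average,
    # so the whole loop is O(n).
    d = {}
    for w in words:
        if w not in d:
            d[w] = None
    return len(d)
```

Both functions return the same answer; only the cost differs, which matches the observed quadratic blow-up. (In today's Python 3, dict.keys() returns a view whose `in` test is also a hash lookup, so both spellings are fast; the list-building behaviour described here is that of Python 2's keys().)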
*on the other hand* I thought that those two idioms
are semantically identical. So my question: why does the
Python interpreter not recognize this identity and internally
replace the latter with the former in order to get *much*
better performance?
- would this lead to wrong results in some cases?
- or would it be too difficult to implement?
- or is this considered a problem of minor importance?
- or was it simply forgotten?
- or what else?
Regards,
Gregor
P.S.: Or was the `key in dict` idiom introduced exactly
in order to solve this problem?