str.count is slow

chrisperkins99

Feb 27, 2006

#1

It seems to me that str.count is awfully slow. Is there some reason
for this?
Evidence:

######## str.count time test ########
import string
import time
import array

s = string.printable * int(1e5) # 10**7 character string
a = array.array('c', s)
u = unicode(s)
RIGHT_ANSWER = s.count('a')

def main():
    print 'str: ', time_call(s.count, 'a')
    print 'array: ', time_call(a.count, 'a')
    print 'unicode:', time_call(u.count, 'a')

def time_call(f, *a):
    start = time.clock()
    assert RIGHT_ANSWER == f(*a)
    return time.clock()-start

if __name__ == '__main__':
    main()

###### end ########

On my machine, the output is:

str: 0.29365715475
array: 0.448095498171
unicode: 0.0243757237303

If a unicode object can count characters so fast, why should an str
object be ten times slower? Just curious, really - it's still fast
enough for me (so far).

This is with Python 2.4.1 on WinXP.

Chris Perkins
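
For anyone rerunning this measurement today, here is a rough sketch of the same test on Python 3 using timeit. It is an assumption-laden port, not the original script: Python 3 has no unicode() builtin and no array type code 'c', so it compares str against bytes instead, uses a smaller string to keep the run short, and the helper name best_time is made up for illustration.

```python
import string
import timeit

# Assumption: Python 3, where str is Unicode text and bytes plays the role
# of the old 8-bit str. The string is shorter than in the post (10**6 chars).
TEXT = string.printable * 10 ** 4
DATA = TEXT.encode("ascii")


def best_time(func, *args, repeat=5):
    """Return the fastest of `repeat` single-call timings, in seconds."""
    return min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))


def main():
    # Check correctness outside the timed region so it does not skew results.
    assert DATA.count(b"a") == TEXT.count("a")
    print("str:  ", best_time(TEXT.count, "a"))
    print("bytes:", best_time(DATA.count, b"a"))


if __name__ == "__main__":
    main()
```

Taking the minimum of several runs is a common way to reduce noise from other processes; the absolute numbers will differ from the 2.4.1 figures above and should not be compared with them directly.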

Ben Cartwright

Feb 27, 2006

#2

Your evidence points to some unoptimized code in the underlying C
implementation of Python. As such, this should probably go to the
python-dev list (http://mail.python.org/mailman/listinfo/python-dev).

The problem is that the C library function memcmp is slow, and
str.count calls it frequently. See lines 2165+ in stringobject.c
(inside function string_count):

r = 0;
while (i < m) {
    if (!memcmp(s+i, sub, n)) {
        r++;
        i += n;
    } else {
        i++;
    }
}

This could be optimized as:

r = 0;
while (i < m) {
    /* compare the first character before paying for memcmp */
    if (s[i] == *sub && !memcmp(s+i, sub, n)) {
        r++;
        i += n;
    } else {
        i++;
    }
}

This tactic typically avoids most (sometimes all) of the calls to
memcmp. Other string search functions, including unicode.count,
unicode.index, and str.index, use this tactic, which is why you see
unicode.count performing better than str.count.
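
To make the "avoids most (sometimes all) of the calls" claim concrete, here is a small Python model of the two loops that counts how many full comparisons each one performs. This is only a sketch: the function names are invented, a slice comparison stands in for memcmp, and it assumes m is one past the last valid start position (len(data) - len(sub) + 1) and that sub is non-empty.

```python
import string


def count_naive(data, sub):
    """Model of the original loop: one full comparison per position."""
    n = len(sub)
    m = len(data) - n + 1          # assumed meaning of m in the C code
    i = hits = compares = 0
    while i < m:
        compares += 1              # stands in for one memcmp call
        if data[i:i + n] == sub:
            hits += 1
            i += n
        else:
            i += 1
    return hits, compares


def count_prefiltered(data, sub):
    """Model of the proposed loop: compare only where the first byte matches."""
    n = len(sub)
    m = len(data) - n + 1
    first = sub[0]                 # requires a non-empty pattern
    i = hits = compares = 0
    while i < m:
        if data[i] == first:
            compares += 1          # memcmp is reached only on this path
            if data[i:i + n] == sub:
                hits += 1
                i += n
                continue
        i += 1
    return hits, compares


data = (string.printable * 1000).encode("ascii")
print(count_naive(data, b"a"))        # (1000, 100000)
print(count_prefiltered(data, b"a"))  # (1000, 1000)
```

In this model the first-character test cuts the number of full comparisons from one per position to one per candidate match, and to zero when the first character never occurs in the data.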

The above might be optimized further for cases such as yours, where a
single character appears many times in the string:

r = 0;
if (n == 1) {
    /* optimize for a single character */
    while (i < m) {
        if (s[i] == *sub)
            r++;
        i++;
    }
} else {
    while (i < m) {
        if (s[i] == *sub && !memcmp(s+i, sub, n)) {
            r++;
            i += n;
        } else {
            i++;
        }
    }
}

Note that there might be some subtle reason I'm unaware of why neither
of these optimizations is done... in which case a comment in the C
source would help.

--Ben
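
As a sanity check on the combined version (single-character fast path plus the first-character test), the following Python sketch mirrors its control flow and compares the result with the built-in bytes.count. The function name count_optimized is hypothetical, and rejecting an empty pattern is a simplification of this sketch, not something taken from the C source.

```python
def count_optimized(data, sub):
    """Count non-overlapping occurrences of a non-empty `sub` in `data`."""
    n = len(sub)
    if n == 0:
        raise ValueError("this sketch does not handle an empty pattern")
    first = sub[0]
    hits = 0
    if n == 1:
        # Single-character fast path: no slice comparison at all.
        for byte in data:
            if byte == first:
                hits += 1
        return hits
    i = 0
    m = len(data) - n + 1          # one past the last valid start position
    while i < m:
        # Cheap first-byte test guards the full comparison.
        if data[i] == first and data[i:i + n] == sub:
            hits += 1
            i += n                 # skip the match: occurrences do not overlap
        else:
            i += 1
    return hits


samples = [(b"banana", b"a"), (b"banana", b"ana"),
           (b"aaaa", b"aa"), (b"abc", b"abcd")]
for data, sub in samples:
    assert count_optimized(data, sub) == data.count(sub)
```

Agreement on a handful of inputs is not a proof, but it covers the cases most likely to go wrong: overlapping candidates and a pattern longer than the data.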

Fredrik Lundh

Feb 28, 2006

#3

Ben Cartwright said:
This tactic typically avoids most (sometimes all) of the calls to
memcmp. Other string search functions, including unicode.count,
unicode.index, and str.index, use this tactic, which is why you see
unicode.count performing better than str.count.

it's about time that someone sat down and merged the string and unicode
implementations into a single "stringlib" code base (see the SRE sources for
an efficient way to do this in plain C).

moving to (basic) C++ might also be a good idea (in 3.0, perhaps). is
anyone still stuck with pure C89 these days ?

</F>

Terry Reedy

Feb 28, 2006

#4

Ben Cartwright said:
This tactic typically avoids most (sometimes all) of the calls to
memcmp. Other string search functions, including unicode.count,
unicode.index, and str.index, use this tactic, which is why you see
unicode.count performing better than str.count.

If not doing the same in str.count is indeed an oversight, a patch should
be welcome (on the SF tracker).
