Performance of int/long in Python 3

rusi · Apr 3, 2013

Its a reference to a comment by Jamie Zawinski (relatively famous
developer of Netscape Navigator and other things):

And Xemacs (which is famous in the free sw world for other things!)

"Every program attempts to expand until it can read mail. Those
programs which cannot so expand are replaced by ones which can."

Ok got it

Steven D'Aprano · Apr 3, 2013

Sorting a million string list (all the file paths on a particular
computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3 so
we're out of the 'not noticeable by humans' range. Perhaps this is still
a 'micro-benchmark' - I'd just like to avoid adding email access to get
this over the threshold.

I cannot confirm this performance regression. On my laptop (Debian Linux,
not Windows), I can sort a million file names in approximately 1.2
seconds in both Python 3.2 and 3.3. There is no meaningful difference in
speed between the two versions.

Terry Jan Reedy · Apr 3, 2013

What system *and* what compiler and compiler options. Unless 3.2 and 3.3
are both compiler with the same compiler and settings, we do not know
the source of the difference.

I cannot confirm this performance regression. On my laptop (Debian Linux,
not Windows), I can sort a million file names in approximately 1.2
seconds in both Python 3.2 and 3.3. There is no meaningful difference in
speed between the two versions.

I am guessing that Neil's undisclosed system (that I can see) is
Windows, since other benchmarks have been more different on Windows than
on *nix. Given that we *know* that the 3.2 and 3.3 distribution are
compiled with different compilers and run with different C runtimes, it
is possible that some of the difference is from that and not from python
at all.

tjr

Chris Angelico · Apr 3, 2013

rusi wrote:
"Every program attempts to expand until it can read mail. Those programs
which cannot so expand are replaced by ones which can."

In my personal experience, it's calculators. I put command-line
calculators into *everything*... often in the form of more general
executors, and thus restricted to admins, but it's still a calculator.

For some reason, the ability to type "calc 1+2" and get back 3 is very
satisfying to me. You know, in case I ever forget what one plus two
makes.

ChrisA

Chris Angelico · Apr 3, 2013

I cannot confirm this performance regression. On my laptop (Debian Linux,
not Windows), I can sort a million file names in approximately 1.2
seconds in both Python 3.2 and 3.3. There is no meaningful difference in
speed between the two versions.

I'd be curious to know the sorts of characters used. Given that it's
probably a narrow-vs-wide Python difference we're talking here, the
actual distribution of codepoints may well make a difference.

ChrisA

Chris Angelico · Apr 3, 2013

Chris Angelico:

I was going to upload it but then I thought of potential client
-confidentiality problems and the need to audit a list that long.

Hmm. I was about to say "Can you just do a quick collections.Counter()
of the string widths in 3.3, as an easy way of seeing which ones use
BMP or higher characters", but I can't find a simple way to query a
string's width. Can't see it as a method of the string object, nor in
the string or sys modules. It ought to be easy enough at the C level -
just look up the two bits representing 'kind' - but I've not found it
exposed to Python. Is there anything?

ChrisA

Ian Kelly · Apr 3, 2013

Hmm. I was about to say "Can you just do a quick collections.Counter()
of the string widths in 3.3, as an easy way of seeing which ones use
BMP or higher characters", but I can't find a simple way to query a
string's width. Can't see it as a method of the string object, nor in
the string or sys modules. It ought to be easy enough at the C level -
just look up the two bits representing 'kind' - but I've not found it
exposed to Python. Is there anything?

4 if max(map(ord, s)) > 0xffff else 2 if max(map(ord, s)) > 0xff else 1

Chris Angelico · Apr 3, 2013

4 if max(map(ord, s)) > 0xffff else 2 if max(map(ord, s)) > 0xff else 1

Yeah, that's iterating over the whole string (twice, if it isn't width
4). The system already knows what the size is, I was hoping for an
uber-quick inspection of the string header.

ChrisA

Steven D'Aprano · Apr 3, 2013

Yeah, that's iterating over the whole string (twice, if it isn't width
4).

Then don't write it as a one-liner

n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

Here's another way:

(sys.getsizeof(s) - sys.getsizeof(''))/len(s)

should work.

There's probably also a way to do it using ctypes.

The system already knows what the size is, I was hoping for an
uber-quick inspection of the string header.

I'm not sure that I would want strings to have a method reporting this,
but it might be nice to have a function in the inspect module to do so.

Chris Angelico · Apr 3, 2013

Here's another way:

(sys.getsizeof(s) - sys.getsizeof(''))/len(s)

should work.

Hmm, I had been under the impression that there was a certain "base
length" below which strings all had the same size. Yes, that also
works; though again, it's something that can be directly queried, at
the C level.

There's probably also a way to do it using ctypes.

I'm not sure that I would want strings to have a method reporting this,
but it might be nice to have a function in the inspect module to do so.

Yeah, that's why I also looked in 'sys'; 'inspect' might well be a
good place for it, too. But it seems such a function doesn't exist,
which is what I was asking.

ChrisA

rusi · Apr 3, 2013

Reran the programs taking a bit more care with the encoding of the
file. This had no effect on the speeds. There are only a small amount of
paths that don't fit into ASCII:

ASCII 1076101
Latin1 218
BMP 113
Astral 0

# encoding:utf-8
import codecs, os, time
from os.path import join, getsize
with codecs.open("filelist.txt", "r", "utf-8") as f:
paths = f.read().split("\n")
bucket = [0,0,0,0]
for p in paths:
b = 0
maxChar = max([ord(ch) for ch in p])
if maxChar >= 65536:
b = 3
elif maxChar >= 256:
b = 2
elif maxChar >= 128:
b = 1
bucket = bucket + 1
print("ASCII", bucket[0])
print("Latin1", bucket[1])
print("BMP", bucket[2])
print("Astral", bucket[3])

Neil

Can you please try one more experiment Neil?
Knock off all non-ASCII strings (paths) from your dataset and try
again.

[It should take little more than converting your above code to a
filter:
if b == 0: print
if b > 0: ignore
]

jmfauth · Apr 3, 2013

--------

This FSR is wrong by design. A naive way to embrace Unicode.

jmf

Dave Angel · Apr 3, 2013

rusi:

Results are the same 0.40 (well, 0.001 less but I don't think the
timer is that accurate) for Python 3.2 and 0.78 for Python 3.3.

Neil

That would seem to imply that the speed regression on your data is NOT
caused by the differing size encodings. Perhaps it is the difference in
MSC compiler version, or other changes made between 3.2 and 3.3

Of course, I can't then explain why Steven didn't get the same results.
Perhaps the difference between 32bit Python and 64 on Windows? Or
perhaps you have significantly more (or significantly fewer)
"collisions" than Steven did.

Before I saw this message, I was thinking of suggesting that you supply
a key= parameter to sort, specifying as a key the Unicode character
65536 higher than the one supplied. That way all the keys to be sorted
would be 32 bits in size. If this made the timings change noticeably,
it could be a big clue.

Mark Lawrence · Apr 3, 2013

--------

This FSR is wrong by design. A naive way to embrace Unicode.

jmf

The hole you're digging for yourself is getting bigger and bigger and
I'm loving it

Dave Angel · Apr 3, 2013

Dave Angel:

That would seem to imply that the speed regression on your data is NOT
caused by the differing size encodings. Perhaps it is the difference in
MSC compiler version, or other changes made between 3.2 and 3.3

Click to expand...

Its not caused by there actually being different size encodings but
that the code is checking encoding size 2-4 times for each character.

Back in 3.2 the comparison loop looked like:

while (len1 > 0 && len2 > 0) {
Py_UNICODE c1, c2;

c1 = *s1++;
c2 = *s2++;

if (c1 != c2)
return (c1 < c2) ? -1 : 1;

len1--; len2--;
}

For 3.3 this has changed to

for (i = 0; i < len1 && i < len2; ++i) {
Py_UCS4 c1, c2;
c1 = PyUnicode_READ(kind1, data1, i);
c2 = PyUnicode_READ(kind2, data2, i);

if (c1 != c2)
return (c1 < c2) ? -1 : 1;
}

with PyUnicode_READ being

#define PyUnicode_READ(kind, data, index) \
((Py_UCS4) \
((kind) == PyUnicode_1BYTE_KIND ? \
((const Py_UCS1 *)(data))[(index)] : \
((kind) == PyUnicode_2BYTE_KIND ? \
((const Py_UCS2 *)(data))[(index)] : \
((const Py_UCS4 *)(data))[(index)] \
) \
))

There are either 1 or 2 kind checks in each call to PyUnicode_READ
and 2 calls to PyUnicode_READ inside the loop. A compiler may decide to
move the kind checks out of the loop and specialize the loop but MSVC
2010 appears to not do so.

I don't know how good MSC's template logic is, but it seems this would
be a good case for an explicit template, typed on the 'kind's values.
Or are all C++ features disabled when compiling Python? Failing that,
just code up 9 cases, and do a switch on the kinds.

I'm also puzzled. I thought that the sort algorithm used a hash of all
the items to be sorted, and only reverted to a raw comparison of the
original values when the hash collided. Is that not the case? Or is
the code you post here only used when the hash collides?

The assembler (32-bit build) for each

PyUnicode_READ looks like

mov ecx, DWORD PTR _kind1$[ebp]
cmp ecx, 1
jne SHORT $LN17@unicode_co@2
lea ecx, DWORD PTR [ebx+eax]
movzx edx, BYTE PTR [ecx+edx]
jmp SHORT $LN16@unicode_co@2
$LN17@unicode_co@2:
cmp ecx, 2
jne SHORT $LN15@unicode_co@2
movzx edx, WORD PTR [ebx+edi]
jmp SHORT $LN16@unicode_co@2
$LN15@unicode_co@2:
mov edx, DWORD PTR [ebx+esi]
$LN16@unicode_co@2:

It appears that the compiler is keeping the three pointers in three
separate registers (eax, esi and edi) even though those are 3 aliases
for the same pointer. This is preventing it from putting other values
in those registers.

It'd probably do better if the C code manipulated the pointers, rather
than using an index i each time. But if it did, perhaps gcc would
generate worse code.

If I were coding the assembler by hand (Intel only), I'd be able to
avoid the multiple cmp operations, simply by comparing first to 2, then
doing a jne and a ja. I dunno whether the compiler would notice if I
coded the equivalent in C. (make both comparisons to 2, one for less,
and one for more)

The kind1/kind2 variables aren't even going into registers and at
least one test+branch and a jump are executed for every character. Two
tests for 2 and 4 byte kinds. len1 and len2 don't get to go into
registers either.

Here's the full assembler output for unicode_compare:

; COMDAT _unicode_compare
_TEXT SEGMENT
_kind2$ = -20 ; size = 4
_kind1$ = -16 ; size = 4
_len2$ = -12 ; size = 4
_len1$ = -8 ; size = 4
_data2$ = -4 ; size = 4
_unicode_compare PROC ; COMDAT
; _str1$ = ecx
; _str2$ = eax

; 10417: {

push ebp
mov ebp, esp
sub esp, 20 ; 00000014H
push ebx
push esi
mov esi, eax

; 10418: int kind1, kind2;
; 10419: void *data1, *data2;
; 10420: Py_ssize_t len1, len2, i;
; 10421:
; 10422: kind1 = PyUnicode_KIND(str1);

mov eax, DWORD PTR [ecx+16]
mov edx, eax
shr edx, 2
and edx, 7
push edi
mov DWORD PTR _kind1$[ebp], edx

; 10423: kind2 = PyUnicode_KIND(str2);

mov edx, DWORD PTR [esi+16]
mov edi, edx
shr edi, 2
and edi, 7
mov DWORD PTR _kind2$[ebp], edi

; 10424: data1 = PyUnicode_DATA(str1);

test al, 32 ; 00000020H
je SHORT $LN9@unicode_co@2
test al, 64 ; 00000040H
je SHORT $LN7@unicode_co@2
lea ebx, DWORD PTR [ecx+24]
jmp SHORT $LN10@unicode_co@2
$LN7@unicode_co@2:
lea ebx, DWORD PTR [ecx+36]
jmp SHORT $LN10@unicode_co@2
$LN9@unicode_co@2:
mov ebx, DWORD PTR [ecx+36]
$LN10@unicode_co@2:

; 10425: data2 = PyUnicode_DATA(str2);

test dl, 32 ; 00000020H
je SHORT $LN13@unicode_co@2
test dl, 64 ; 00000040H
je SHORT $LN11@unicode_co@2
lea edx, DWORD PTR [esi+24]
jmp SHORT $LN30@unicode_co@2
$LN11@unicode_co@2:
lea eax, DWORD PTR [esi+36]
mov DWORD PTR _data2$[ebp], eax
mov edx, eax
jmp SHORT $LN14@unicode_co@2
$LN13@unicode_co@2:
mov edx, DWORD PTR [esi+36]
$LN30@unicode_co@2:
mov DWORD PTR _data2$[ebp], edx
$LN14@unicode_co@2:

; 10426: len1 = PyUnicode_GET_LENGTH(str1);

mov edi, DWORD PTR [ecx+8]

; 10427: len2 = PyUnicode_GET_LENGTH(str2);

mov ecx, DWORD PTR [esi+8]

; 10428:
; 10429: for (i = 0; i < len1 && i < len2; ++i) {

xor eax, eax
mov DWORD PTR _len1$[ebp], edi
mov DWORD PTR _len2$[ebp], ecx
test edi, edi
jle SHORT $LN2@unicode_co@2

; 10426: len1 = PyUnicode_GET_LENGTH(str1);

mov esi, edx
mov edi, edx

; 10428:
; 10429: for (i = 0; i < len1 && i < len2; ++i) {

sub ebx, edx
jmp SHORT $LN4@unicode_co@2
$LL28@unicode_co@2:
mov edx, DWORD PTR _data2$[ebp]
$LN4@unicode_co@2:
cmp eax, ecx
jge SHORT $LN29@unicode_co@2

; 10430: Py_UCS4 c1, c2;
; 10431: c1 = PyUnicode_READ(kind1, data1, i);

mov ecx, DWORD PTR _kind1$[ebp]
cmp ecx, 1
jne SHORT $LN17@unicode_co@2
lea ecx, DWORD PTR [ebx+eax]
movzx edx, BYTE PTR [ecx+edx]
jmp SHORT $LN16@unicode_co@2
$LN17@unicode_co@2:
cmp ecx, 2
jne SHORT $LN15@unicode_co@2
movzx edx, WORD PTR [ebx+edi]
jmp SHORT $LN16@unicode_co@2
$LN15@unicode_co@2:
mov edx, DWORD PTR [ebx+esi]
$LN16@unicode_co@2:

; 10432: c2 = PyUnicode_READ(kind2, data2, i);

mov ecx, DWORD PTR _kind2$[ebp]
cmp ecx, 1
jne SHORT $LN21@unicode_co@2
mov ecx, DWORD PTR _data2$[ebp]
movzx ecx, BYTE PTR [eax+ecx]
jmp SHORT $LN20@unicode_co@2
$LN21@unicode_co@2:
cmp ecx, 2
jne SHORT $LN19@unicode_co@2
movzx ecx, WORD PTR [edi]
jmp SHORT $LN20@unicode_co@2
$LN19@unicode_co@2:
mov ecx, DWORD PTR [esi]
$LN20@unicode_co@2:

; 10433:
; 10434: if (c1 != c2)

cmp edx, ecx
jne SHORT $LN31@unicode_co@2
mov ecx, DWORD PTR _len2$[ebp]
inc eax
add edi, 2
add esi, 4
cmp eax, DWORD PTR _len1$[ebp]
jl SHORT $LL28@unicode_co@2
$LN29@unicode_co@2:
mov edi, DWORD PTR _len1$[ebp]
$LN2@unicode_co@2:

; 10436: }
; 10437:
; 10438: return (len1 < len2) ? -1 : (len1 != len2);

cmp edi, ecx
jge SHORT $LN23@unicode_co@2
pop edi
pop esi
or eax, -1
pop ebx

; 10439: }

mov esp, ebp
pop ebp
ret 0
$LN31@unicode_co@2:

; 10435: return (c1 < c2) ? -1 : 1;

sbb eax, eax
pop edi
and eax, -2 ; fffffffeH
pop esi
inc eax
pop ebx

; 10439: }

mov esp, ebp
pop ebp
ret 0
$LN23@unicode_co@2:

; 10436: }
; 10437:
; 10438: return (len1 < len2) ? -1 : (len1 != len2);

xor eax, eax
cmp edi, ecx
pop edi
pop esi
setne al
pop ebx

; 10439: }

mov esp, ebp
pop ebp
ret 0
_unicode_compare ENDP

Neil

Roy Smith · Apr 3, 2013

Neil Hodgson said:
Roy Smith:

About 2 minutes. But that's just getting an example data set. Other
data sets may be loaded more quickly from databases or files or be
created by processing. Reading the example data from a file takes around
the same time as sorting.

Fair enough. In fact, given that reading the file from disk is O(n) and
sorting it is O(n log n), at some point, the sort will totally swamp the
input time. Your original example just happened to be one of the
unusual cases where the sort time is not the rate limiting factor in the
overall process.

I remember reading somewhere that more CPU cycles in the entire history
of computing have been spend doing sorting than anything else.

Roy Smith · Apr 3, 2013

Chris Angelico said:
In my personal experience, it's calculators. I put command-line
calculators into *everything*... often in the form of more general
executors, and thus restricted to admins, but it's still a calculator.

For some reason, the ability to type "calc 1+2" and get back 3 is very
satisfying to me. You know, in case I ever forget what one plus two
makes.

I discovered recently that Spotlight (the OSX built-in search engine)
can do this.

Chris Angelico · Apr 3, 2013

Fair enough. In fact, given that reading the file from disk is O(n) and
sorting it is O(n log n), at some point, the sort will totally swamp the
input time.

But given the much larger fixed cost of disk access, that might take
an awful lot of strings...

ChrisA

Chris Angelico · Apr 3, 2013

I discovered recently that Spotlight (the OSX built-in search engine)
can do this.

Good feature, not surprising. Google Search has had that feature for a
while, and it just "feels right" to be able to look up information the
same way regardless of its source.

ChrisA

Roy Smith · Apr 3, 2013

Steven D'Aprano said:
Then don't write it as a one-liner

n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no? I posted (essentially) this
a few days ago:

if all(ord(c) <= 0xffff for c in s):
return "it's all bmp"
else:
return "it's got astral crap in it"

I'm reasonably sure all() is smart enough to stop at the first False
value.

(sys.getsizeof(s) - sys.getsizeof(''))/len(s)

I wouldn't trust getsizeof() to return exactly what you're looking for.

Do you know any other interesting features about coding in Python?	5	Sep 17, 2023
range() vs xrange() Python2\|3 issues for performance	11	Aug 2, 2011
Python battle game help	2	Feb 23, 2023
Python code problem	2	Apr 23, 2023
Why is Python telling me variable is local not global?	3	Sep 2, 2023
Rock paper scissors in python with "algorithm"	1	Feb 27, 2022
performance of tight loop	8	Dec 14, 2010
performance of script to write very long lines of random chars	15	Apr 11, 2013

Performance of int/long in Python 3

rusi

Steven D'Aprano

Terry Jan Reedy

Chris Angelico

Chris Angelico

Chris Angelico

Ian Kelly

Chris Angelico

Steven D'Aprano

Chris Angelico

rusi

jmfauth

Dave Angel

Mark Lawrence

Dave Angel

Roy Smith

Roy Smith

Chris Angelico

Chris Angelico

Roy Smith

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads