Performance of int/long in Python 3


Steven D'Aprano

Ian Foote:


Many common string operations do not require indexing by character
which reduces the impact of this inefficiency.

Which common string operations do you have in mind?

Specifically in Python's case, the most obvious counter-example is the
length of a string. But that's only because Python strings are immutable
objects, and include a field that records the length. So once the string
is created, checking its length takes constant time.

Some string operations need to inspect every character, e.g. str.upper().
Even for them, the increased complexity of a variable-width encoding
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
4 bytes per character. You have to walk the string grabbing 1 byte at a
time, and then decide whether you need another 1, 2 or 3 bytes. Even
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost.

Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.
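
For illustration, here is a sketch of what resolving a character offset
costs when the underlying storage is UTF-8 bytes (this is my own toy
function, not any language's real API):

def byte_offset(utf8_bytes, char_offset):
    # Walk from the start, counting lead bytes (anything that is not a
    # 0b10xxxxxx continuation byte), until char_offset characters have
    # been skipped. Cost is O(char_offset), not O(1).
    seen = 0
    for i, b in enumerate(utf8_bytes):
        if b & 0xC0 != 0x80:        # lead byte: a new character starts here
            if seen == char_offset:
                return i
            seen += 1
    raise IndexError("character offset out of range")

print(byte_offset("naïve café".encode("utf-8"), 6))   # prints 7, not 6: 'ï' used two bytes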

UTF-8 seems like a
reasonable choice for an internal representation to me.

It's not. Haskell, for example, uses UTF-8 internally, and it warns that
this makes string operations O(N) instead of O(1) precisely because of
the need to walk the string inspecting every byte.

Remember, when your string primitives are O(N), it is very easy to write
code that becomes O(N**2). Using UTF-8 internally is just begging for
user-written code to be O(N**2).
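
Here is a sketch of how that happens with a hypothetical UTF-8-backed
string type: each individual index lookup has to walk from the start,
so the innocent-looking loop at the bottom is O(N**2) overall.

def char_at(utf8_bytes, index):
    # O(index): scan lead bytes from the start to find the index-th character.
    count = -1
    for i, b in enumerate(utf8_bytes):
        if b & 0xC0 != 0x80:                 # lead byte of a character
            count += 1
            if count == index:
                end = i + 1
                while end < len(utf8_bytes) and utf8_bytes[end] & 0xC0 == 0x80:
                    end += 1                 # swallow continuation bytes
                return utf8_bytes[i:end].decode("utf-8")
    raise IndexError(index)

data = "déjà vu " * 100
encoded = data.encode("utf-8")
# O(N) lookups, each costing O(N): quadratic, although every line looks cheap.
assert "".join(char_at(encoded, i) for i in range(len(data))) == data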

One benefit of
UTF-8 over Python's flexible representation is that it is, on average,
more compact over a wide set of samples.

Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1, Python will use one byte per character, while
UTF-8 will use two bytes per character.
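
For example (the size of the Python object includes a fixed header and
varies by version, so treat the second number as illustrative):

import sys

latin = "déjà vu"                  # non-ASCII, but entirely within the Latin-1 range
print(len(latin.encode("utf-8")))  # 9: each accented character costs 2 bytes in UTF-8
print(sys.getsizeof(latin))        # PEP 393 stores it at 1 byte per character plus a header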
 

jmfauth

-----

You really REALLY need to sort out in your head the difference between
correctness and performance. I still haven't seen one single piece of
evidence from you that Python 3.3 fails on any point of Unicode
correctness.

That's because you are not understanding unicode. Unicode takes
you from the character to the unicode transformation format via
the code point, working with a unique set of characters with
a contiguous range of code points.
Then it is up to the "implementors" (languages, compilers, ...)
to implement this utf.
Covering the whole range of Unicode has never been a
problem.

.... for all those who are following the scheme explained above.
And it magically works smoothly. Of course, there are some variations
due to the Character Encoding Form, which is later influenced by the
Character Encoding Scheme (the serialization of the Character Encoding
Form).

Rough explanation in other words.
It does not matter if you are using utf-8, -16, -32, ucs2 or ucs4.
All the single characters are handled in the same way with the "same
algorithm".

---

The flexible string representation takes the problem from the
other side, it attempts to work with the characters by using
their representations and it (can only) fails...

PS I never proposed to use utf-8. I only spoke about utf-8
as an example. If you start to discuss indexing, you are off-topic.

jmf
 

jmfauth

Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1, Python will use one byte per character, while
UTF-8 will use two bytes per character.

This flexible string representation is so absurd that not only
"it" does not know you can not write Western European Languages
with latin-1, "it" penalizes you by just attempting to optimize
latin-1. Shown in my multiple examples.

(This is a similar case to the long and short int question/discussion
Chris Angelico opened).


PS1: I received plenty of private mails. I'm surprised how the devs
do not understand unicode.

PS2: Question I received once from a registered French Python
Developer (in another context). What are those French characters
you can handle with cp1252 and not with latin-1?

jmf
 

Chris Angelico

This flexible string representation is so absurd that not only
"it" does not know you can not write Western European Languages
with latin-1, "it" penalizes you by just attempting to optimize
latin-1. Shown in my multiple examples.

PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?) Both are
optimized down to a single byte per character.

Option 2 is optimized to two bytes per character.

Option 3 is stored in UTF-32.
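
You can see those buckets from Python itself. The totals below include a
fixed per-object header and vary by CPython version, but the growth in
per-character width is what matters:

import sys

for s in ("abcd", "abc\xe9", "abc\u20ac", "abc\U0001F600"):
    # 1, 1, 2 and 4 bytes per character respectively
    print(ascii(s), sys.getsizeof(s))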

Once again, jmf, you are forgetting that option 2 is a safe and
bug-free optimization.

ChrisA
 

MRAB

Ian Foote:


Many common string operations do not require indexing by character
which reduces the impact of this inefficiency. UTF-8 seems like a
reasonable choice for an internal representation to me. One benefit
of UTF-8 over Python's flexible representation is that it is, on
average, more compact over a wide set of samples.
Implementing the regex module (http://pypi.python.org/pypi/regex) would
have been more difficult if the internal representation had been UTF-8,
because of the need to decode, and the implementation would also have
been slower for that reason.
 

Chris Angelico

Implementing the regex module (http://pypi.python.org/pypi/regex) would
have been more difficult if the internal representation had been UTF-8,
because of the need to decode, and the implementation would also have
been slower for that reason.

In fact, nearly ALL string parsing operations would need to be done
differently. The only method that I can think of that wouldn't be
impacted is a linear state-machine parser - something that could be
written inside a "for character in string" loop.

string = "<p>Hello, <b>world</b>!</p>"   # sample input to strip tags from

text = []

def initial(c):
    global state
    if c == '<': state = tag
    else: text.append(c)

def tag(c):
    global state
    if c == '>': state = initial

state = initial
for character in string:
    state(character)

print(''.join(text))


I'm pretty sure this will run in O(N) time, even with UTF-8 strings.
But it's an *extremely* simple parser.

ChrisA
 

jmfauth

PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?) Both are
optimized down to a single byte per character.

Option 2 is optimized to two bytes per character.

Option 3 is stored in UTF-32.

Once again, jmf, you are forgetting that option 2 is a safe and
bug-free optimization.

ChrisA

As long as you are attempting to divide a set of characters into
chunks and try to handle them separately, it will never work.

Read my previous post about the unicode transformation format.
I know what pep393 does.

jmf
 

Chris Angelico

As long as you are attempting to divide a set of characters into
chunks and try to handle them separately, it will never work.

Okay. Let's look at integers. To properly represent the Python 3 'int'
type (or the Python 2 'long'), we need to be able to encode ANY
integer. And of course, any attempt to divide them up into chunks will
never work. So we need a single representation that will cover ANY
integer, right?

Perfect. We already have one of those, detailed in RFC 2795. (It's
coming up to its thirteenth anniversary in a day or two. Very
appropriate.)

http://tools.ietf.org/html/rfc2795#section-4

Are you saying Python's integers should be stored as I-TAGs?

ChrisA
 

jmfauth

As long as you are attempting to divide a set of characters into
chunks and try to handle them separately, it will never work.

Read my previous post about the unicode transformation format.
I know what pep393 does.

jmf

Addendum.

This is what you correctly perceived in another thread.
You qualified it as a "switch". Now you have to understand
where this "switch" is coming from.

jmf

 

Ian Kelly

Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.

Go does this. I remember because it came up in one of these threads,
where jmf (or was it Ranting Rick?) was praising Go for just getting
Unicode "right".
 

Terry Reedy

PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?)

Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding
amounts to a simple copy of the internal bytes. I do not know if *all*
the codecs for such encodings are 393-aware, but I do know that the
utf-8 and latin-1 group are. This is one operation that 3.3+ does much
faster than 3.2-
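
A rough way to see this on your own machine (numbers vary; this is only a
sketch, and both encodes below should be close to a plain copy on 3.3+
because the source string is ASCII-only):

import timeit

setup = "s = 'a' * 100000"
print(timeit.timeit("s.encode('latin-1')", setup=setup, number=10000))
print(timeit.timeit("s.encode('utf-8')", setup=setup, number=10000))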
 

Ian Kelly

PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?) Both are
optimized down to a single byte per character.

The only difference for ASCII-only strings is that they are kept in a
struct with a smaller header. The smaller header omits the utf8
pointer (which optionally points to an additional UTF-8 representation
of the string) and its associated length variable. These are not
needed for ASCII-only strings because an ASCII string can be directly
interpreted as a UTF-8 string for the same result. The smaller header
also omits the "wstr_length" field which, according to the PEP,
"differs from length only if there are surrogate pairs in the
representation." For an ASCII string, of course there would not be
any surrogate pairs.
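
The smaller header is visible from Python; the exact byte counts depend on
the CPython version, but the ASCII-only string carries less fixed overhead
even though both strings below store one byte per character:

import sys

print(sys.getsizeof("a" * 100))      # ASCII-only: compact header + 100 bytes
print(sys.getsizeof("\xe9" * 100))   # Latin-1 but non-ASCII: bigger header + 100 bytes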
 

Chris Angelico

Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding
amounts to a simple copy of the internal bytes. I do not know if *all* the
codecs for such encodings are 393-aware, but I do know that the utf-8 and
latin-1 group are. This is one operation that 3.3+ does much faster than
3.2-

Thanks Terry. So that's not so much a representation difference as a
flag that costs little or nothing to retain, and can improve
performance in the encode later on. Sounds like a useful tweak to the
basics of flexible string representation, without being particularly
germane to jmf's complaints.

ChrisA
 

Ian Kelly

The flexible string representation takes the problem from the
other side, it attempts to work with the characters by using
their representations and it (can only) fails...

This is false. As I've pointed out to you before, the FSR does not
divide characters up by representation. It divides them up by
codepoint -- more specifically, by the *bit-width* of the codepoint.
We call the internal format of the string "ASCII" or "Latin-1" or
"UCS-2" for conciseness and a point of reference, but fundamentally
all of the FSR formats are simply byte arrays of *codepoints* -- you
know, those things you keep harping on. The major optimization
performed by the FSR is to consistently truncate the leading zero
bytes from each codepoint when it is possible to do so safely. But
regardless of to what extent this truncation is applied, the string is
*always* internally just an array of codepoints, and the same
algorithms apply for all representations.
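
A quick check of that claim on a 3.3+ build (the characters are just
examples spanning the three internal widths):

for s in ("ab\xe9", "ab\u20ac", "ab\U0001F600"):   # 1-, 2- and 4-byte-wide strings
    print(hex(ord(s[2])))   # same indexing code, correct codepoint every time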
 

jmfauth

Chris,

Your problem with int/long, the start of this thread, is
very interesting.

This is not a demonstration or a proof, rather an illustration.

Assume you have a set of integers {0...9} and an operator,
let's say, the addition.

Idea.
Just divide this set into two chunks, {0...4} and {5...9},
and work hard to optimize the addition of 2 operands in
the set {0...4}.

The problems.
- When optimizing "{0...4}", your algorithm will most probably
weaken "{5...9}".
- When using "{5...9}", you do not benefit from your algorithm; you
will be penalized just by the fact you have optimized "{0...4}".
- And the first mistake: you are penalized and impacted by the
fact you have to select in which subset your operands are when
working with "{0...9}".

Very interestingly, working with the representation (bytes) of
these integers will not help. You have to consider conceptually
{0..9} as numbers.

Now, replace numbers by characters, bytes by "encoded code points",
and you have qualitatively the flexible string representation.

In Unicode, there is one more level of abstraction: one conceptually
neither works with characters, nor with "encoded code points", but
with unicode transformation format "entities". (see my previous post).

That means you can work very hard on the "bytes level",
you will never solve the problem, which is one level higher
in the unicode hierarchy:
character -> code point -> utf -> bytes (implementation)
with the important fact that this construct can only go
from left to right.

---

In fact, by proposing a flexible representation of ints, you may
just fall in the same trap the flexible string representation
presents.

----

All this stuff is explained in good books about the coding of
characters and/or unicode.
The unicode.org documentation explains it too. It is a little
bit harder to discover, because the doc always presents
this stuff from a "technical" perspective.
You get it when reading a large part of the Unicode doc.

jmf
 

Chris Angelico

Assume you have a set of integers {0...9} and an operator,
let's say, the addition.

Idea.
Just divide this set into two chunks, {0...4} and {5...9},
and work hard to optimize the addition of 2 operands in
the set {0...4}.

The problems.
- When optimizing "{0...4}", your algorithm will most probably
weaken "{5...9}".
- When using "{5...9}", you do not benefit from your algorithm; you
will be penalized just by the fact you have optimized "{0...4}".
- And the first mistake: you are penalized and impacted by the
fact you have to select in which subset your operands are when
working with "{0...9}".

Very interestingly, working with the representation (bytes) of
these integers will not help. You have to consider conceptually
{0..9} as numbers.

Yeah, and there's an easy representation of those numbers. But let's
look at Python's representations of integers. I have a sneaking
suspicion something takes note of how large the number is before
deciding how to represent it. Look at the memory footprint of a small
int versus a huge one.
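
For instance (a sketch; the exact byte counts depend on the CPython
version and platform):

import sys

print(sys.getsizeof(7))           # a small int: a few dozen bytes
print(sys.getsizeof(10 ** 5000))  # a huge int: a couple of thousand bytes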

Small numbers are represented more compactly than large ones! And it's
not like in REXX, where all numbers are stored as strings.

Go fork CPython and make the changes you suggest. Then run real-world
code on it and see how it performs. Or at very least, run plausible
benchmarks like the strings benchmark from the standard tests.

My original post about integers was based on two comparisons: Python 2
and Pike. Both languages have an optimization for "small" integers
(where "small" is "within machine word" - on rechecking some of my
stats, I find that I perhaps should have used a larger offset, as the
64-bit Linux Python I used appeared to be a lot faster than it should
have been), which Python 3 doesn't have. Real examples, real
statistics, real discussion. (I didn't include Pike stats in what I
posted, for two reasons: firstly, it would require a reworking of the
code, rather than simply "run this under both interpreters"; and
secondly, Pike performance is completely different from CPython
performance, and is non-comparable. Pike is more similar to PyPy, able
to compile - in certain circumstances - to machine code. So the
comparisons were Py2 vs Py3.)

ChrisA
 

jmfauth

This is false.  As I've pointed out to you before, the FSR does not
divide characters up by representation.  It divides them up by
codepoint -- more specifically, by the *bit-width* of the codepoint.
We call the internal format of the string "ASCII" or "Latin-1" or
"UCS-2" for conciseness and a point of reference, but fundamentally
all of the FSR formats are simply byte arrays of *codepoints* -- you
know, those things you keep harping on.  The major optimization
performed by the FSR is to consistently truncate the leading zero
bytes from each codepoint when it is possible to do so safely.  But
regardless of to what extent this truncation is applied, the string is
*always* internally just an array of codepoints, and the same
algorithms apply for all representations.

-----

You know, we can discuss this ad nauseam. What is important
is Unicode.

You have transformed Python back into an ascii-oriented product.

If Python had implemented Unicode correctly, there would
be no difference in using an "a", "é", "€" or any character,
what the narrow builds did.

If I am practically the only one who speaks/discusses
this, I can assure you, this has been noticed.

Now, it's time to prepare the Asparagus, the "jambon cru"
and a good bottle of dry white wine.

jmf
 

Chris Angelico

If Python had implemented Unicode correctly, there would
be no difference in using an "a", "é", "€" or any character,
what the narrow builds did.

I'm not following your grammar perfectly here, but if Python were
implementing Unicode correctly, there would be no difference between
any of those characters, which is the way a *wide* build works. With a
narrow build, there is a difference between BMP and non-BMP
characters.

ChrisA
 

rurpy

Of course not. By definition, if it helps, it wasn't *excessively* strong
language.

For someone who delights in pointing out the logical errors
of others you are often remarkably sloppy in your own logic.

Of course language can be both helpful and excessively strong.
That is the case when language less strong would be
equally or more helpful.
Insults are not ad hominem attacks.

Insults may or may not be ad hominem attacks. There is nothing
mutually exclusive about those terms.
"You sir, are a bounder and a cad. Furthermore, your
argument is wrong, because of reasons."

may very well be an insult, but it also may be correct, and the reasons
logically valid.

Those are two different statements. The first is an ad hominem
attack and is not welcome here. The second is an acceptable
response.
"Your argument is wrong, because you are a bounder
and a cad."

is an ad hominem fallacy, because even bounders and cads may tell the
truth occasionally, or be correct by accident.

That it is a fallacy does not mean it is not also an attack.
I find it interesting that nobody has yet felt the need to defend JMF,
and tell me I was factually incorrect about him (as opposed to merely
impolite or mean-spirited).

Nothing "interesting" about it at all. Most of us (perhaps
unlike you) are not interested in discussing the personal
characteristics of posters here (in contrast to discussing
the technical opinions they post).

Further, "liar" is both so non-objective and so pejoratively
emotive that it is a word much more likely to be used by
someone interested in trolling than in a serious discussion,
so most sensible people here likely would not bite.
[...]
I would rather that you called me a liar to my face
and gave me the opportunity to respond, than for you to ignore everything
I said.

Even if you personally would prefer someone to respond by
calling you a liar, your personal preferences do not form
a basis for desirable posting behavior here.

But again you're creating a false dichotomy. Those are not
the only two choices. A third choice is neither ignore you
nor call you a liar but to factually point out where you are
wrong, or (if it is a matter of opinion) why one holds a
different opinion. That was the point Ned Deily was making
I believe.
I hope that we all agree that we want a nice, friendly, productive
community where everyone is welcome.

I hope so too but it is likely that some people want a place
to develop and assert some sense of influence, engage in verbal
duels, instigate arguments, etc. That can be true of regulars
here as well as drive-by posters.
But some people simply cannot or
will not behave in ways that are compatible with those community values.
There are some people whom we *do not want here*

In other words, everyone is NOT welcome.
-- spoilers and messers,
vandals and spammers and cheats and liars and trolls and crackpots of all
sorts.

Where those terms are defined by you and a handful of other
voracious posters. "Troll" in particular is often used to
mean someone who disagrees with the borg mind here, or who
says anything negative about Python, or who, due to attitude or
lack of full English fluency, do not express themselves in
a sufficiently submissive way.
We only disagree as to the best way to make it clear to them that
they are not welcome so long as they continue their behaviour.

No, we disagree on who fits those definitions and even
how tolerant we are to those who do fit the definitions.
The policing that you and a handful of other self-appointed
net-cops try to do is far more obnoxious than the original
posts are.
[1] Although sadly, given the reality of communication on the Internet,
sometimes kill-filing is the least-worst option.

Please, please, killfile jmfauth, ranting rick, xaw lee and
anyone else you don't like so that the rest of us can be spared
the orders of magnitude larger, more disruptive and more offensive
posts generated by your (plural) responses to them.

Believe or not, most of the rest of us here are smart enough to
form our own opinions of such posters without you and the other
c.l.p truthsquad members telling us what to think.
 

Ned Deily

I'd rather this list have some vinegar than it devolve into
uselessness. Or, worse, if there's a hard-and-fast rule about
courtesy, devolve into aspartame... everyone's courteous in words but
hates each other underneath. Or am I taking the analogy too far? :)

I think you are positing false choices. No one - at least I'm not - is
advocating to avoid challenging false or misleading statements in the
interests of maintaining some false "see how well we all get along"
facade. The point is we can have meaningful, hard-nosed discussions
without resorting to personal insults, i.e. flaming. I think the
discussion in this topic over the past 24 hours or so demonstrates that.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,091
Messages
2,570,605
Members
47,225
Latest member
DarrinWhit

Latest Threads

Top