Performance of int/long in Python 3


rusi

 Steven D'Aprano said:
[...]
OK, that leads to the next question.  Is there any way I can (in Python
2.7) detect when a string is not entirely in the BMP?  If I could find
all the non-BMP characters, I could replace them with U+FFFD
(REPLACEMENT CHARACTER) and life would be good (enough).
Of course you can do this, but you should not. If your input data
includes character C, you should deal with character C and not just throw
it away unnecessarily. That would be rude, and in Python 3.3 it should be
unnecessary.
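
(An aside on the question quoted above: a minimal sketch of how one might detect non-BMP characters in Python 2.7, allowing for both narrow and wide builds. The function name is illustrative, not from the thread.)

import sys

def has_non_bmp(u):
    # Return True if the unicode string u contains any code point
    # outside the Basic Multilingual Plane (above U+FFFF).
    if sys.maxunicode > 0xFFFF:
        # "Wide" build: one item per code point.
        return any(ord(c) > 0xFFFF for c in u)
    # "Narrow" build: astral characters are stored as surrogate pairs,
    # so look for a high surrogate instead.
    return any(0xD800 <= ord(c) <= 0xDBFF for c in u)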

The import job isn't done yet, but so far we've processed 116 million
records and had to clean up four of them.  I can live with that.
Sometimes practicality trumps correctness.

That works out to 0.000003%. Of course I assume it is US-only data.
Still, it's good to know how skewed the distribution is.
 

Steven D'Aprano

That works out to 0.000003%. Of course I assume it is US-only data.
Still, it's good to know how skewed the distribution is.

If the data included Japanese names, or used Emoji, it would be much
closer to 100% than 0.000003%.
 

Steven D'Aprano

Steven D'Aprano said:
[...]
OK, that leads to the next question. Is there any way I can (in
Python 2.7) detect when a string is not entirely in the BMP? If I
could find all the non-BMP characters, I could replace them with
U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).

Of course you can do this, but you should not. If your input data
includes character C, you should deal with character C and not just
throw it away unnecessarily. That would be rude, and in Python 3.3 it
should be unnecessary.

The import job isn't done yet, but so far we've processed 116 million
records and had to clean up four of them. I can live with that.
Sometimes practicality trumps correctness.

Well, true. It has to be said that few programming languages (and
databases) make it easy to do the right thing. On the other hand, you're
a programmer. Your job is to write correct code, not easy code.

It turns out, the problem is that the version of MySQL we're using

Well there you go. Why don't you use a real database?

http://www.postgresql.org/docs/9.2/static/multibyte.html

:)

Postgresql has supported non-broken UTF-8 since at least version 8.1.

doesn't support non-BMP characters. Newer versions do (but you have to
declare the column to use the utf8mb4 character set). I could upgrade
to a newer MySQL version, but it's just not worth it.

My brain just broke. So-called "UTF-8" in MySQL only includes up to a
maximum of three-byte characters. There has *never* been a time where
UTF-8 excluded four-byte characters. What were the developers thinking,
arbitrarily cutting out support for 50% of UTF-8?


Actually, I did try spinning up a 5.5 instance (one of the nice things
of being in the cloud) and experimented with that, but couldn't get it
to work there either. I'll admit that I didn't invest a huge amount of
effort to make that work before just writing this:

def bmp_filter(self, s):
"""Filter a unicode string to remove all non-BMP (basic
multilingual plane) characters. All such characters are
replaced with U+FFFD (Unicode REPLACEMENT CHARACTER).

"""

I expect that in 5-10 years, applications that remove or mangle non-BMP
characters will be considered as unacceptable as applications that mangle
BMP characters. Or for that matter, applications that cannot handle names
with apostrophes.

Hell, if your customer base is in Asia, chances are that mangling non-BMP
characters is *already* considered unacceptable.
 

Chris Angelico

Well there you go. Why don't you use a real database?

http://www.postgresql.org/docs/9.2/static/multibyte.html

:)

Postgresql has supported non-broken UTF-8 since at least version 8.1.

Not only that, but I *rely* on PostgreSQL to test-or-reject stuff that
comes from untrustworthy languages, like PHP. If it's malformed in any
way, it won't get past the database.
My brain just broke. So-called "UTF-8" in MySQL only includes up to a
maximum of three-byte characters. There has *never* been a time where
UTF-8 excluded four-byte characters. What were the developers thinking,
arbitrarily cutting out support for 50% of UTF-8?

Steven, you punctuated that wrongly.

What, were the developers *thinking*? Arbitrarily etc?

It really is brain-breaking. I could understand a naive UTF-8 codec
being too permissive (allowing over-long encodings, allowing
codepoints above what's allocated (eg FA 80 80 80 80, which would
notionally represent U+2000000), etc), but why should it arbitrarily
stop short? There must have been some internal limitation - that,
perhaps, collation was defined only within the BMP.
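
(For what it's worth, CPython 3.x's UTF-8 codec is strict about this: 0xFA can never be a valid start byte, since UTF-8 as defined today tops out at four-byte sequences ending at U+10FFFF. A quick, illustrative check:)

# Python 3.x: the notional five-byte sequence FA 80 80 80 80 is rejected.
try:
    b'\xfa\x80\x80\x80\x80'.decode('utf-8')
except UnicodeDecodeError as exc:
    print('rejected:', exc)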

ChrisA
 

MRAB

Steven D'Aprano said:
[...]
OK, that leads to the next question. Is there any way I can (in
Python 2.7) detect when a string is not entirely in the BMP? If I
could find all the non-BMP characters, I could replace them with
U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).

Of course you can do this, but you should not. If your input data
includes character C, you should deal with character C and not just
throw it away unnecessarily. That would be rude, and in Python 3.3 it
should be unnecessary.

The import job isn't done yet, but so far we've processed 116 million
records and had to clean up four of them. I can live with that.
Sometimes practicality trumps correctness.

Well, true. It has to be said that few programming languages (and
databases) make it easy to do the right thing. On the other hand, you're
a programmer. Your job is to write correct code, not easy code.

It turns out, the problem is that the version of MySQL we're using

Well there you go. Why don't you use a real database?

http://www.postgresql.org/docs/9.2/static/multibyte.html

:)

Postgresql has supported non-broken UTF-8 since at least version 8.1.

doesn't support non-BMP characters. Newer versions do (but you have to
declare the column to use the utf8mb4 character set). I could upgrade
to a newer MySQL version, but it's just not worth it.

My brain just broke. So-called "UTF-8" in MySQL only includes up to a
maximum of three-byte characters. There has *never* been a time where
UTF-8 excluded four-byte characters. What were the developers thinking,
arbitrarily cutting out support for 50% of UTF-8?
[snip]
50%? The BMP is one of 17 planes, so wouldn't that be 94%?
 

jmfauth

---------


I'm not whining and I'm not complaining (and never did).
I have always presented facts.

I'm not especially interested in Python, I'm interested in
Unicode.

Usually, when I have posted examples, they have been confirmed.


What I see is this (standard downloadable Pythons on Windows 7 and
other Windows platforms/machines):

Py32
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
timeit.repeat("'a' * 1000 + 'z'")
[0.7105829560031083, 0.6904999426964764, 0.6938637184431968]

Py33
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
timeit.repeat("'a' * 1000 + 'z'")
[0.6640958193635527, 0.6469043692851528, 0.6458961423900007]

I see such behaviour systematically, in 99.99999% of my tests.
When there is something better, it is usually because something else
(3.2/3.3) has been modified.

I have my idea where this is coming from.

Question: when it is claimed that this has been tested,
do you mean stringbench.py, as proposed many times by Terry?
(Thanks for an answer.)

jmf
 

Chris Angelico

Py32
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
timeit.repeat("'a' * 1000 + 'z'")
[0.7105829560031083, 0.6904999426964764, 0.6938637184431968]

Py33
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
timeit.repeat("'a' * 1000 + 'z'")
[0.6640958193635527, 0.6469043692851528, 0.6458961423900007]

This is what's called a microbenchmark. Can you show me any instance
in production code where an operation like this is done repeatedly, in
a time-critical place? It's a contrived example, and it's usually
possible to find regressions in any system if you fiddle enough with
the example. Do you have, for instance, a web server that can handle
1000 tps on 3.2 and only 600 tps on 3.3, all other things being equal?

ChrisA
 

Mark Lawrence

---------


I'm not whining and I'm not complaining (and never did).
I have always presented facts.

The only fact I'm aware of is an edge case that is being addressed on
the Python bug tracker, sorry I'm too lazy to look up the number again.
I'm not especially interested in Python, I'm interested in
Unicode.

So why do you keep harping on about the same old edge case?
Usually, when I have posted examples, they have been confirmed.

The only things you've ever posted are the same old boring
microbenchmarks. You never, ever comment on the memory savings that are IIRC
extremely popular with the Django folks amongst others. Neither do you
comment on the fact that the unicode implementation in Python 3.3 is
correct. I can only assume that you prefer a fast but buggy
implementation to a correct but slow one. Except that in many cases the
3.3 implementation is actually faster, so you conveniently ignore this.
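
(For context on the memory point: under PEP 393, CPython 3.3 stores a string with 1, 2 or 4 bytes per character depending on the widest code point it contains, so pure-ASCII and Latin-1 text is far more compact than under the old wide-build layout. A quick, illustrative check; exact figures vary by platform and build:)

import sys

# Per-character storage grows only when a wider code point is present.
for label, s in [('ascii ', 'a' * 1000),
                 ('latin1', 'a' * 999 + '\xe9'),         # é
                 ('bmp   ', 'a' * 999 + '\u20ac'),       # €
                 ('astral', 'a' * 999 + '\U00020000')]:  # a CJK Ext-B char
    print(label, len(s), sys.getsizeof(s))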
What I see is this (standard downloadable Pythons on Windows 7 and
other Windows platforms/machines):

Py32
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
timeit.repeat("'a' * 1000 + 'z'")
[0.7105829560031083, 0.6904999426964764, 0.6938637184431968]

Py33
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
timeit.repeat("'a' * 1000 + 'z'")
[0.6640958193635527, 0.6469043692851528, 0.6458961423900007]

I see such behaviour systematically, in 99.99999% of my tests.

Always run on your microbenchmarks, never anything else.
When there is something better, it is usually because something else
(3.2/3.3) has been modified.

I have my idea where this is coming from.

I know where this is coming from as it's been stated umpteen times on
numerous threads. As usual you simply ignore any facts that you feel
like, particularly with respect to any real world use cases.
Question: when it is claimed that this has been tested,
do you mean stringbench.py, as proposed many times by Terry?
(Thanks for an answer.)

I find it amusing that you ask for an answer but refuse point blank to
provide answers yourself. I suspect that you've bitten off more than
you can chew.
 

jmfauth

Py32
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
timeit.repeat("'a' * 1000 + 'z'")
[0.7105829560031083, 0.6904999426964764, 0.6938637184431968]
Py33
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
timeit.repeat("'a' * 1000 + 'z'")
[0.6640958193635527, 0.6469043692851528, 0.6458961423900007]

This is what's called a microbenchmark. Can you show me any instance
in production code where an operation like this is done repeatedly, in
a time-critical place? It's a contrived example, and it's usually
possible to find regressions in any system if you fiddle enough with
the example. Do you have, for instance, a web server that can handle
1000 tps on 3.2 and only 600 tps on 3.3, all other things being equal?

ChrisA

-----

Of course this is an example, like many I have given. Examples like these
you may find in apps.

Can you point to and give at least a bunch of examples showing
there is no regression, at least to contradict me? The only
one I have succeeded in seeing (in months) is the one given by Steven, a
status quo.

I will happily accept them. The only thing I read is "this is faster",
"it has been tested", ...

jmf
 

Roy Smith

Well, true. It has to be said that few programming languages (and
databases) make it easy to do the right thing. On the other hand, you're
a programmer. Your job is to write correct code, not easy code.

This is really getting off topic, but fundamentally, I'm an engineer.
My job is to build stuff that make money for my company. That means
making judgement calls about what's not worth fixing, because the cost
to fix it exceeds the value.
 

Mark Lawrence

Py32
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
timeit.repeat("'a' * 1000 + 'z'")
[0.7105829560031083, 0.6904999426964764, 0.6938637184431968]
Py33
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
timeit.repeat("'a' * 1000 + 'z'")
[0.6640958193635527, 0.6469043692851528, 0.6458961423900007]

This is what's called a microbenchmark. Can you show me any instance
in production code where an operation like this is done repeatedly, in
a time-critical place? It's a contrived example, and it's usually
possible to find regressions in any system if you fiddle enough with
the example. Do you have, for instance, a web server that can handle
1000 tps on 3.2 and only 600 tps on 3.3, all other things being equal?

ChrisA

You've given many examples of the same type of micro benchmark, not many
examples of different types of benchmark.
Can you point to and give at least a bunch of examples showing
there is no regression, at least to contradict me? The only
one I have succeeded in seeing (in months) is the one given by Steven, a
status quo.

Once again you deliberately choose to ignore the memory savings and
correctness to concentrate on the performance slowdown in some cases.
I will happily accept them. The only thing I read is "this is faster",
"it has been tested", ...

I do not believe that you will ever accept any facts unless you yourself
provide them.
 

jmfauth

Mark Lawrence:
You've given many examples of the same type of micro benchmark, not many
examples of different types of benchmark.

    Trying to work out what jmfauth is on about I found what appears to
be a performance regression with '<' string comparisons on Windows
64-bit. It's around 30% slower on a 25-character string that differs in
the last character and 70-100% slower on a 100-character string that
differs at the end.

    Can someone else please try this to see if it's reproducible? Linux
doesn't show this problem.

 >c:\python32\python -u "charwidth.py"
3.2 (r32:88445, Feb 20 2011, 21:30:00) [MSC v.1500 64 bit (AMD64)]
a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']176
[0.7116295577956576, 0.7055591343157613, 0.7203483026429418]

a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']176
[0.7664397841378787, 0.7199902325464409, 0.713719289812504]

a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']176
[0.7341851791817691, 0.6994205901833599, 0.7106807593741005]

a=['C:/Users/Neil/Documents/𠀀','C:/Users/Neil/Documents/𠀁']180
[0.7346812372666784, 0.6995411113377914, 0.7064768417728411]

 >c:\python33\python -u "charwidth.py"
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit
(AMD64)]
a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']108
[0.9913326076446045, 0.9455845241056282, 0.9459076605341776]

a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']192
[1.0472289217234318, 1.0362342484091207, 1.0197109728048384]

a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']192
[1.0439643704533834, 0.9878581050301687, 0.9949265834034335]

a=['C:/Users/Neil/Documents/𠀀','C:/Users/Neil/Documents/𠀁']312
[1.0987483965446412, 1.0130257167690004, 1.024832248526499]

    Here is the code:

# encoding: utf-8
import os, sys, timeit

print(sys.version)
examples = [
    "a=['$b','$z']",
    "a=['$λ','$η']",
    "a=['$b','$η']",
    "a=['$\U00020000','$\U00020001']"]
baseDir = "C:/Users/Neil/Documents/"
#~ baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"
for t in examples:
    t = t.replace("$", baseDir)
    # Using os.write as a simple way to get UTF-8 to stdout
    os.write(sys.stdout.fileno(), t.encode("utf-8"))
    print(sys.getsizeof(t))
    print(timeit.repeat("a[0] < a[1]", t, number=5000000))
    print()

    For a more significant performance difference try replacing the
baseDir setting with (may be wrapped):
baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"

    Neil
--------

Hi,

c:\python32\pythonw -u "charwidth.py"
3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchz']168
[0.8343414906182101, 0.8336184057396241, 0.8330473419738562]

a=['D:\jm\jmpy\py3app\stringbenchλ','D:\jm\jmpy\py3app\stringbenchη']168
[0.818378092261062, 0.8180854713107406, 0.8192279926793571]

a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchη']168
[0.8131353330542339, 0.8126985677326912, 0.8122744051977042]

a=['D:\jm\jmpy\py3app\stringbench𠀀','D:\jm\jmpy\py3app\stringbench𠀁']172
[0.8271094603211102, 0.82704053883214, 0.8265781741004083]
Exit code: 0
c:\Python33\pythonw -u "charwidth.py"
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit
(Intel)]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchz']94
[1.3840254166697845, 1.3933888932429768, 1.391664674507438]

a=['D:\jm\jmpy\py3app\stringbenchλ','D:\jm\jmpy\py3app\stringbenchη']176
[1.6217970707185678, 1.6279369907932706, 1.6207041728220117]

a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchη']176
[1.5150522562729396, 1.5130369919353992, 1.5121890607025037]

a=['D:\jm\jmpy\py3app\stringbench𠀀','D:\jm\jmpy\py3app\stringbench𠀁']316
[1.6135375194801664, 1.6117739170366434, 1.6134331526540109]
Exit code: 0

- Win7 32-bit
- The file is in UTF-8
- Do not be afraid of this output; it is just a copy/paste from your
excellent editor, whose output pane is configured to use the locale
coding.
- Of course, and as expected, similar behaviour from a console. (Which,
btw, shows how good your application is.)

==========

Something different.

From a previous msg, on this thread.

---
Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1, Python will use one byte per character, while
UTF-8 will use two bytes per character.

I think you mean writing a lot of Latin-1 characters outside ASCII.
However, even people writing texts in, say, French will find that only a
small proportion of their text is outside ASCII and so the cost of UTF-8
is correspondingly small.

The counter-problem is that a French document that needs to include
one mathematical symbol (or emoji) outside Latin-1 will double in size
as a Python string.

---

I already explained this.
It is, how to say, a misunderstanding of Unicode. What counts
is not the amount of non-ASCII chars you have in a stream.
What is relevant is the fact that every char is handled with the "same
algorithm", in this case UTF-8.
Unicode takes you from the "char" up to the Unicode transformation
format. Then it is a question of implementation.

This is exactly what you are doing in Scintilla (maybe without
realizing it deeply).

An editor may reflect very well the example I gave. You enter a
thousand ASCII chars, then - boom - as you enter a non-ASCII
char, your editor (assuming it uses a mechanism like the FSR)
has to internally re-encode everything!

jmf
 

Chris Angelico

An editor may reflect very well the example I gave. You enter a
thousand ASCII chars, then - boom - as you enter a non-ASCII
char, your editor (assuming it uses a mechanism like the FSR)
has to internally re-encode everything!

That assumes that the editor stores the entire buffer as a single
Python string. Frankly, I think this unlikely; the nature of
insertions and deletions makes this impractical. (I've known editors
that do function this way. They're utterly unusable on large files.)

ChrisA
 

Steven D'Aprano

That assumes that the editor stores the entire buffer as a single Python
string. Frankly, I think this unlikely; the nature of insertions and
deletions makes this impractical. (I've known editors that do function
this way. They're utterly unusable on large files.)

Nevertheless, for *some* size of text block (a word? line? paragraph?) an
implementation may need to re-encode the block as characters are inserted
or deleted.

So what? Who cares if it takes 0.00002 second to insert a character
instead of 0.00001 second? That's still a hundred times faster than you
can type.
 

jmfauth

That assumes that the editor stores the entire buffer as a single
Python string. Frankly, I think this unlikely; the nature of
insertions and deletions makes this impractical. (I've known editors
that do function this way. They're utterly unusable on large files.)

ChrisA

--------

No, no, no, no, ... as we say in French (this is a kindly
form).

The length of a string may have its importance. This
bad behaviour may happen on every char. The most
complicated chars are the chars with diacritics and
ligatured chars [1, 2], e.g. chars used in Arabic script [2].

It is somewhat funny to see that the FSR "fails" precisely
on the problems Unicode is meant to solve/handle, e.g. normalization or
sorting [3].

Not really a problem for those who are endorsing the good
work Unicode does [5].


[1] A point which was not, in my mind, very well understood
when I read the PEP393 discussion.

[2] Take a Unicode "TeX"-compliant engine and toy with
the decomposed forms of these chars. A very good way to
understand what a char can really be, when you wish to
process text "seriously".

[3] I only test and tested these "chars" blindly with the help
of the doc I have. Btw, when I test complicated "Arabic chars",
I noticed Py33 "crashes"; it does not really crash, it gets stuck
in some kind of infinite loop (or is it due to "timeit"?).

[4] Am I the only one who test this kind of stuff?

[5] Unicode is a fascinating construction.

jmf
 

jmfauth

On Tue, 02 Apr 2013 19:03:17 +1100, Chris Angelico wrote:

So what? Who cares if it takes 0.00002 second to insert a character
instead of 0.00001 second? That's still a hundred times faster than you
can type.
---------

This is not the problem. The interesting point is that there
are good and "less good" Unicode implementations.

jmf
 

Mark Lawrence

---------

This is not the problem. The interesting point is that there
are good and "less good" Unicode implementations.

jmf

The interesting point is that the Python 3.3 unicode implementation is
correct, that of most other languages is buggy. Or have I fallen victim
to the vicious propaganda of the various Pythonistas who frequent this list?
 

Steve Simmons

The interesting point is that the Python 3.3 unicode implementation is
correct, that of most other languages is buggy. Or have I fallen
victim to the vicious propaganda of the various Pythonistas who
frequent this list?
Mark,

Thanks for asking this question.

It seems to me that jmf *might* be moving towards a vindicated
position. There is some interest now in duplicating, understanding and
(hopefully!) extending his test results, which can only be a Good Thing
- whatever the outcome and wherever the facepalm might land.

However, as you rightly point out, there is only value in following this
through if the functionality is (at least near) 100% correct. I am sure
there are some that will disagree but in most cases, functionality is
the primary requirement and poor performance can be managed initially
and fixed in due time.

Steve
 
