On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
I think you misunderstand the PEP then, because that is empirically
false.
Yes I did misunderstand. Thank you for the clarification.
"Steven D'Aprano" wrote in message
[...]If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:
http://bugs.python.org/
Don't forget to report your operating system.
This is an average slowdown by a factor of close to 2.3 on 3.3 when
compared with 3.2.
I am not posting this to perpetuate this thread but simply to ask
whether, as you suggest, I should report this as a possible problem with
the beta?
Running Python from a Windows command prompt, I got the following on
Python 3.2.3 and 3.3 beta 2:
python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','……')"
10000 loops, best of 3: 51.8 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','x…')"
10000 loops, best of 3: 52 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','œ…')"
10000 loops, best of 3: 50.3 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','€…')"
10000 loops, best of 3: 51.6 usec per loop
python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop
python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','……')"
10000 loops, best of 3: 24.7 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','x…')"
10000 loops, best of 3: 24.8 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','œ…')"
10000 loops, best of 3: 24 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','€…')"
10000 loops, best of 3: 24.1 usec per loop
python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop
How convenient.
Python has often copied or borrowed, with adjustments. This time it is the
first to do it. We will see how it goes, but it has been tested for nearly a
year already.
Chris Angelico said: Really, the only viable alternative to PEP 393 is a fixed 32-bit
representation - it's the only way that's guaranteed to provide
equivalent semantics. The new storage format is guaranteed to take no
more memory than that, and provide equivalent functionality.
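A quick way to see that claim in action is sys.getsizeof: under PEP 393 the
per-character width is 1, 2 or 4 bytes depending on the widest code point in
the string. This is only a rough sketch -- exact totals vary by CPython build
and version -- but the trend is visible:

import sys
# Equal-length strings whose widest code points fall in the ASCII,
# Latin-1, BMP and astral ranges respectively.
for ch in ('a', '\xe9', '\u20ac', '\U0001f600'):
    print(repr(ch), sys.getsizeof(ch * 1000))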
"Steven D'Aprano" wrote in message
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
[...]
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:
http://bugs.python.org/
Don't forget to report your operating system.
====================================================
For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz)
running Windows 7 x64.
Running Python from a Windows command prompt, I got the following on Python
3.2.3 and 3.3 beta 2:
"python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 51.8 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 52 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 50.3 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 51.6 usec per loop
"python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
"python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop
"python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 24.7 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 24.8 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 24 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 24.1 usec per loop
"python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
"python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop
This is an average slowdown by a factor of close to 2.3 on 3.3 when compared
with 3.2.
I am not posting this to perpetuate this thread but simply to ask whether,
as you suggest, I should report this as a possible problem with the beta?
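(For anyone who wants to re-run the comparison before deciding whether to file
a report, here is a minimal sketch that drives both interpreters from one
script. The python32\python and python33\python paths are the ones shown above
and are an assumption -- adjust them to your own install locations.)

import subprocess

interpreters = {'3.2': r'python32\python', '3.3': r'python33\python'}
statements = [
    "('abc' * 1000).replace('c', 'de')",
    "('XYZ' * 1000).replace('X', 'éç')",
]
for label, exe in interpreters.items():
    for stmt in statements:
        # "-m timeit" prints e.g. "10000 loops, best of 3: 24.5 usec per loop"
        out = subprocess.check_output([exe, '-m', 'timeit', stmt])
        print(label, stmt, out.decode().strip(), sep='  |  ')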
Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier.
Pike has a single string type, which can use the full Unicode range.
If all codepoints are <256, the string width is 8 (measured in bits);
if <65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but whose name I can't at present put my finger on), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)
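(The Python counterpart of that measurement can be sketched with
sys.getsizeof: subtracting the size of a shorter string of the same kind
cancels out the fixed header and leaves roughly the per-character width.
Header sizes are build-dependent, so treat the numbers as illustrative.)

import sys
for ch in ('a', '\u20ac', '\U0001f600'):
    # The fixed header cancels out; what remains is ~1000 * width.
    width = (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000
    print(repr(ch), 'about', width, 'byte(s) per character')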
However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a codepoint above 255 - or perhaps worse, when you
have one in the 128-255 range and the other end misinterprets it.
Python writes strings to file objects, including open sockets, without
creating a bytes object -- IF the file is opened in text mode, which always
has an associated encoding, even if the default 'ascii'. From what you say,
this is what Pike is missing.
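The distinction is easy to demonstrate in a few lines: a binary stream refuses
str outright, so forgetting to encode fails immediately, while a text stream
carries an encoding and does the conversion itself (a small sketch using
in-memory streams rather than a real socket):

import io

raw = io.BytesIO()
try:
    raw.write('café')      # str into a binary stream: refused outright
except TypeError as e:
    print('binary mode:', e)

text = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
text.write('cafe')          # fine: every character is ASCII
try:
    text.write('café')      # 'é' cannot be encoded as ASCII
except UnicodeEncodeError as e:
    print('text mode:', e)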
I am pretty sure that the obvious optimization has already been done. The
internal bytes of all-ascii text can safely be sent to a file with an ascii (or
ascii-compatible) encoding without an intermediate encoding step. I remember
several patches of that sort. If a string is internally ucs2 and the file is
declared with a ucs2 or utf-16 encoding, then again, pairs of bytes can go
directly (possibly with a byte swap).
In the primordial days of computing, using 8 bits to store a character
was a profligate waste of memory. What on earth did people need with
TWO cases of the alphabet
(not to mention all sorts of weird
punctuation)? Eventually, memory became cheap enough that the
convenience of using one character per byte (not to mention 8-bit bytes)
outweighed the costs. And crazy things like sixbit and rad-50 got swept
into the dustbin of history.
So it may be with utf-8 someday.
Five minutes after I closed my interactive interpreter windows, the day I
tested this stuff, I thought: "Too bad I did not note down the extremely bad
cases I found; I'm pretty sure this problem will arrive on the table."
Only if you believe that people's ability to generate data will remain
lower than people's ability to install more storage.
On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
Well, it seems some software producers know what they
are doing.
Traceback (most recent call last):
 File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)
Steven D'Aprano said: Paul Rubin already told you about his experience using OCR
to generate multiple terabytes of text, and how he would not be happy if that was
stored in UCS-4.
Pittance or not, I do not believe that people will widely abandon compact
storage formats like UTF-8 and Latin-1 for UCS-4 any time soon.
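For a concrete sense of the gap being argued about, comparing the encoded
sizes of one piece of text makes the point; the ratios of course depend on the
script being encoded (this sample is plain ASCII):

text = 'Hello, world! ' * 1000
for codec in ('latin-1', 'utf-8', 'utf-16', 'utf-32'):
    print(codec, len(text.encode(codec)), 'bytes')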
Steven D'Aprano said: Of course *if* k is constant, O(k) is constant too, but
k is not constant. In context we are talking about string indexing and slicing.
There is no value of k, say, k = 2, for which you can say "People will
sometimes ask for string[2] but never ask for string[3]". That is absurd.
The context was parsing, e.g. recognizing a token like "a" or "foo" in a
human-written chunk of text. Occasionally it might be "sesquipedalian"
or some even worse outlier, but one can reasonably put a fixed and
relatively small upper bound on the expected value of k. That makes the
amortized complexity O(1), I think.
Michael Torrie said: Python generally tries to follow Unicode encoding rules
to the letter. Thus if a piece of text cannot be represented in the character
set of the terminal, then Python will properly err out. Other languages you
have tried likely fudge it somehow.
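The difference between erring out and fudging is easy to reproduce: the strict
default raises exactly the error quoted earlier in the thread, while
errors='replace' silently degrades the text (a minimal sketch):

s = '€100'
try:
    s.encode('latin-1')                         # strict: raises UnicodeEncodeError
except UnicodeEncodeError as e:
    print(e)
print(s.encode('latin-1', errors='replace'))    # b'?100' -- the "fudge"
print(s.encode('utf-8'))                        # lossless alternative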
Oscar Benjamin said: No it doesn't. It is still O(k). The point of big O notation is to
understand the asymptotic behaviour of one variable as it becomes
large because of changes in other variables.
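In concrete terms: the cost of taking a k-character slice grows with k, not
with the length of the string being sliced, which is why a bounded k gives a
bounded (but still k-dependent) cost per operation. A rough timeit sketch:

import timeit

setup = "s = 'a' * 10**6"
for k in (10, 10000):
    # Cost grows with the slice length k...
    print('s[:%d]' % k, timeit.timeit('s[:%d]' % k, setup=setup, number=100000))
for n in (10**3, 10**6):
    # ...but not with the length of the source string.
    t = timeit.timeit('s[:10]', setup="s = 'a' * %d" % n, number=100000)
    print('s[:10] with len(s) =', n, t)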