sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764

sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit (Intel)]'
61.919225272152346
It is hard to take your results seriously when you have so obviously
edited your timing results, not just copied and pasted them.
Here are my results, on my laptop running Debian Linux. First, testing on
Python 3.2:
steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 49.7 usec per loop
As you can see, the timing results are all consistently around 50
microseconds per loop, regardless of which characters I use, whether they
are in Latin-1 or not. The differences between one test and another are
not meaningful.
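For anyone who wants to reproduce this comparison without shelling out, the same measurements can be driven from the timeit module directly. This is a minimal sketch, not the exact commands above; the numbers will of course depend on your machine and Python build:

```python
import timeit

# Each statement mirrors one of the command-line runs above.
statements = [
    "('abc' * 1000).replace('c', 'de')",
    "('XYZ' * 1000).replace('Y', 'de')",
]

for stmt in statements:
    # repeat() returns one total time per run; reporting the minimum
    # is the convention, matching -m timeit's "best of 3".
    best = min(timeit.repeat(stmt, number=10000, repeat=3))
    print("%-40s %.1f usec per loop" % (stmt, best / 10000 * 1e6))
```

Run it under each interpreter (python3.2, python3.3, ...) to compare the same statements across versions.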
Now I do them again using Python 3.3:
steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 66.9 usec per loop
The results are all consistently around 67 microseconds. So Python 3.3's
string handling is about 30% slower in the examples shown here.
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:
http://bugs.python.org/
Don't forget to report your operating system.
My take on the subject.
This is a typical Python disease: do not solve the problem, but find a
way, a workaround, which is expected to solve the problem and which
finally solves nothing. As far as I know, the tools to break the "BMP
limit" already exist. They are called UTF-8 and UCS-4/UTF-32.
The problem with UCS-4 is that every character requires four bytes.
Every. Single. One.
So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but
of course UCS-2 can only represent characters in the BMP. A pure ASCII
string would only take 11 bytes, but we're not going back to pure ASCII.
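The arithmetic is easy to check from the byte lengths of the corresponding encodings (object overhead aside; for a pure-BMP string, UTF-16 bytes are the same as UCS-2 bytes):

```python
s = "hello world"  # 11 characters, all ASCII

assert len(s.encode('ascii'))     == 11  # 1 byte per character
assert len(s.encode('utf-16-le')) == 22  # UCS-2: 2 bytes per character
assert len(s.encode('utf-32-le')) == 44  # UCS-4: 4 bytes per character
```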
(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two 16-bit code units, a surrogate pair. This is fragile and
doesn't work very well, because string-handling methods can break the
surrogate pairs apart, leaving you with an invalid Unicode string. Not
good.)
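That fragility is easy to demonstrate by cutting UTF-16 data at a code-unit boundary inside a surrogate pair. A small sketch, using MUSICAL SYMBOL G CLEF (U+1D11E), a non-BMP character:

```python
clef = '\U0001D11E'               # one character outside the BMP
data = clef.encode('utf-16-le')   # encoded as two 16-bit code units
assert len(data) == 4             # i.e. a surrogate pair

# Keeping just the first code unit leaves a lone surrogate, which is
# not a valid Unicode string on its own: decoding it fails.
try:
    data[:2].decode('utf-16-le')
except UnicodeDecodeError:
    print("lone surrogate: invalid Unicode")
```

A fixed-width encoding such as UCS-4 has no such boundary to get wrong, which is part of the appeal despite the memory cost.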
The difference between 44 bytes and 22 bytes for one little string is not
very important, but when you double the memory required for every single
string it becomes huge. Remember that every class, function and method
has a name, which is a string; every attribute and variable has a name,
all strings; functions and classes have doc strings, all strings. Strings
are used everywhere in Python, and doubling the memory needed by Python
means that it will perform worse.
With PEP 393, each Python string will be stored in the most efficient
format possible:
- if it only contains ASCII characters, it will be stored using 1 byte
per character;
- if it only contains characters in the BMP, it will be stored using
UCS-2 (2 bytes per character);
- if it contains non-BMP characters, the string will be stored using
UCS-4 (4 bytes per character).
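In CPython 3.3 or later you can observe these three storage kinds directly with sys.getsizeof(). This is a CPython-specific sketch (the fixed per-object overheads differ between kinds and builds, so it compares strings of the same kind to isolate the per-character cost):

```python
import sys

def bytes_per_char(ch):
    # Growing a string by 999 copies of one character isolates the
    # per-character storage cost from the fixed object overhead.
    return (sys.getsizeof(ch * 1000) - sys.getsizeof(ch)) // 999

assert bytes_per_char('a') == 1           # ASCII: 1 byte per character
assert bytes_per_char('\u2026') == 2      # BMP (…): 2 bytes per character
assert bytes_per_char('\U0001D11E') == 4  # non-BMP: 4 bytes per character
```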