I assure you that I fully understand my ignorance of unicode.
Robin, while I'm very happy to see that you have a good grasp of what you
don't know, I'm afraid that you're misrepresenting me. You deleted the
part of my post that made it clear that I was referring to our resident ...
Until recently I didn't even know that the unicode in python 2.x is
considered broken and that str in python 3.x is considered 'better'.
No need for scare quotes.
The unicode type in Python 2.x is less-good because:
- it is not the default string type (you have to prefix the string
with a u to get Unicode);
- it is missing some functionality, e.g. casefold;
- there are two distinct implementations, narrow builds and wide builds;
- wide builds take up to four times more memory per string than needed;
- narrow builds take up to two times more memory per string than needed;
- worse, narrow builds have very naive (possibly even "broken")
handling of code points in the Supplementary Multilingual Planes, as the
sketch below illustrates.
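Here is a minimal sketch of that last point; the character chosen and the
exact results are assumptions, and the surprises only show up on a Python 2
narrow build (such as the standard Windows installer):

    # Python 2.7, narrow build: an SMP code point is stored as a
    # UTF-16 surrogate pair, so basic string operations give the
    # wrong answers for it.
    s = u'\U0001F600'       # GRINNING FACE, U+1F600, outside the BMP
    print len(s)            # 2 on a narrow build, 1 on a wide build
    print repr(s[0])        # u'\ud83d' -- a lone surrogate, not a character
    print repr(s[::-1])     # reversing splits the pair and corrupts the text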
The unicode string type in Python 3 is better because:
- it is the default string type;
- it includes more functionality;
- starting in Python 3.3, it gets rid of the distinction between
narrow and wide builds;
- which reduces the memory overhead of strings by up to a factor
of four in many cases;
- and fixes the issue of SMP code points (the sketch below shows both
effects).
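And a rough sketch of the Python 3.3 side for contrast; the byte counts
reported by sys.getsizeof vary a little by platform and version, so treat
them as illustrative only:

    # Python 3.3+: PEP 393 flexible string representation.
    import sys

    s = '\U0001F600'
    print(len(s))                  # 1 -- SMP code points are whole characters
    print('Straße'.casefold())     # 'strasse' -- casefold exists now

    # Each string uses 1, 2 or 4 bytes per character, whichever is narrowest.
    for text in ('x' * 1000, '\u20ac' * 1000, '\U0001F600' * 1000):
        print(sys.getsizeof(text)) # roughly 1000, 2000 and 4000 plus a header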
I can say that, having made a lot of reportlab work in both 2.7 & 3.3, I
don't understand why the latter seems slower, especially since we try to
convert early to unicode/str as a desirable internal form.
*shrug*
Who knows? Is it slower or does it only *seem* slower? Is the performance
regression platform-specific? Have you traded correctness for speed, that
is, does the 2.7 version break when given astral characters on a narrow build?
Earlier in January, you commented in another thread that
"I'm not sure if we have any non-bmp characters in the tests."
If you don't, you should have some.
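Even a tiny test along these lines would catch the worst of the narrow-build
breakage; the names and the operations exercised here are made up, not taken
from the reportlab test suite:

    # Hypothetical test: make sure at least some test data leaves the BMP.
    import unittest

    ASTRAL = u'\U0001F600'   # GRINNING FACE, outside the BMP

    class AstralStringTests(unittest.TestCase):
        def test_length(self):
            # One code point should count as one character; a narrow
            # build reports 2 here and fails.
            self.assertEqual(len(ASTRAL), 1)

        def test_utf8_round_trip(self):
            self.assertEqual(ASTRAL.encode('utf-8').decode('utf-8'), ASTRAL)

    if __name__ == '__main__':
        unittest.main()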
There are all sorts of reasons why your code might be slower under 3.3,
including the possibility of a non-trivial performance regression. If you
can demonstrate a test case with a significant slowdown for real-world
code, I'm sure that a bug report will be treated seriously.
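Something like the following invented micro-benchmark would be enough to
attach to such a report; it assumes nothing about reportlab's internals, and
the text mix and the operations are placeholders. Run the identical file
under python2.7 and python3.3 and compare the two numbers:

    # Hypothetical micro-benchmark: the same source runs on 2.7 and 3.3.
    import sys
    import timeit

    def build_text():
        # A made-up mix of ASCII, BMP and SMP characters.
        return (u'Hello world ' + u'\u20ac' * 10 + u'\U0001F600') * 1000

    def work(text):
        text.upper()
        text.replace(u'o', u'0')
        u' '.join(text.split())

    elapsed = timeit.timeit(lambda: work(build_text()), number=200)
    print('%s: %.3f seconds' % (sys.version.split()[0], elapsed))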
Probably I have some horrible error going on (e.g. one of the C extensions
is working in 2.7 and not in 3.3).
Well, that might explain a slowdown.
But really, one should expect that moving from single byte strings to up
to four-byte strings will have *some* cost. It's exchanging functionality
for time. The same thing happened years ago: people used to be extremely
opposed to using floating-point doubles instead of singles because of the
performance cost. And I suppose it is true that back when 64K was considered
a lot of memory, using eight whole bytes per floating point number (let
alone ten like the IEEE Extended format) might have seemed the height of
extravagance. But today we use doubles by default, and even if singles would
be a tiny bit faster, who wants to go back to the bad old days of single
precision?
I believe the same applies to Unicode versus single-byte strings.