This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance.
Forget encodings! We're not talking about encodings. Encodings are used
for converting text to bytes for transmission over the wire or storage
on disk. PEP 393 talks about the internal representation of text within
Python, the C-level data structure.
In 3.2, that data structure depends on a compile-time switch. In a
"narrow build", text is stored using two bytes per character, so the
string "len" (as in the name of the built-in function) will be stored as
006c 0065 006e
(or possibly 6c00 6500 6e00, depending on whether your system is
little-endian or big-endian), plus object overhead, which I shall ignore.
Since most identifiers are ASCII, that's already using twice as much
memory as needed. This standard data structure is called UCS-2, and it
only handles characters in the Basic Multilingual Plane, the BMP (the
first 65,536 Unicode code points). I'll come back to that.
In a "wide build", text is stored as four-bytes per character, so "len"
is stored as either:
0000006c 00000065 0000006e
6c000000 65000000 6e000000
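(If you want to see those byte layouts for yourself, here's a small
sketch that cheats a little: it uses the UTF-16-LE and UTF-32-LE
*encodings*, which for these characters happen to produce the same byte
patterns a little-endian narrow or wide build uses internally. It's
purely illustrative; PEP 393 is about the internal representation, not
encodings.)
py> " ".join(format(b, "02x") for b in "len".encode("utf-16-le"))
'6c 00 65 00 6e 00'
py> " ".join(format(b, "02x") for b in "len".encode("utf-32-le"))
'6c 00 00 00 65 00 00 00 6e 00 00 00'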
Now memory is cheap, but it's not *that* cheap, and no matter how much
memory you have, you can always use more.
This system is called UCS-4, and it can handle the entire Unicode
character set, for now and forever. (If we ever need more than four
bytes' worth of characters, it won't be called Unicode.)
Remember I said that UCS-2 can only handle the 64K characters
[technically: code points] in the Basic Multilingual Plane? There's an
extension to UCS-2 called UTF-16 which extends it to the entire Unicode
range. Yes, that's the same name as the UTF-16 encoding, because it's
more or less the same system.
UTF-16 says "let's represent characters in the BMP by two bytes, but
characters outside the BMP by four bytes." There's a neat trick to this:
the BMP doesn't use the entire two-byte range, so there are some byte
pairs which are illegal in UCS-2 -- they don't correspond to *any*
character. UTF-16 uses those byte pairs to signal "this is half a
character, you need to look at the next pair for the rest of the
character".
Nifty, hey? These pairs of pseudo-characters are called "surrogate pairs".
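If you're curious how the trick works, here's the standard UTF-16
arithmetic (this isn't specific to Python): the value above U+FFFF is
split into a high surrogate in the range D800-DBFF and a low surrogate
in the range DC00-DFFF:
py> c = 0xFFFF + 1                          # first code point past the BMP
py> high = 0xD800 + ((c - 0x10000) >> 10)   # top ten bits
py> low = 0xDC00 + ((c - 0x10000) & 0x3FF)  # bottom ten bits
py> hex(high), hex(low)
('0xd800', '0xdc00')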
Except this comes at a big cost: you can no longer tell how long a
string is by counting the number of bytes (which is fast), because
sometimes four bytes is two characters and sometimes it's one, and you
can't tell which it will be until you actually inspect all four bytes.
Copying sub-strings now becomes either slow, or buggy. Say you want to
grab the 10th character (index 9) in a string. The fast way using UCS-2
is to simply grab bytes 18 and 19 (remember characters are pairs of
bytes and we start counting indexes at zero) and you're done. Fast and
safe if you're willing to give up the non-BMP characters.
It's also fast and safe if you use UCS-4, but then everything takes
twice as much space, so you end up spending so much time copying null
bytes that you're probably slower anyway. Especially when your OS
starts paging memory like mad.
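In both fixed-width cases, finding a character is just arithmetic, which
is why indexing is O(1). A minimal sketch (the helper name is mine,
purely for illustration):
py> def char_bytes(index, width):
...     # byte offsets occupied by the character at position `index`
...     # when every character takes exactly `width` bytes
...     start = index * width
...     return list(range(start, start + width))
...
py> char_bytes(9, 2)    # UCS-2
[18, 19]
py> char_bytes(9, 4)    # UCS-4
[36, 37, 38, 39]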
But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 18
and 19 are half of a surrogate pair, and you've now split the pair and
ended up with an invalid string. That's what Python 3.2 does: it fails
to handle surrogate pairs properly:
py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'
I've just split a single valid Unicode character into two invalid
characters. Python 3.2 will (probably) mindlessly process those two
non-characters, and the only sign I have that I did something wrong is
that my data is now junk.
Since any pair of bytes might be half of a surrogate pair, you have to
scan every pair of bytes in order to index a string, or work out its
length, or copy a substring. It's not enough to just check if the last
pair is a surrogate.
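Here's roughly what correct indexing over UTF-16 storage has to do (a
sketch in pure Python, not CPython's actual C code): walk the 16-bit
units from the start, consuming two units whenever you hit a high
surrogate, which is what makes it O(N):

def utf16_index(units, index):
    # `units` is a list of 16-bit values; return the code point of the
    # character at position `index`.  We must walk from the start,
    # because any unit might be the first half of a surrogate pair.
    i = 0
    for _ in range(index + 1):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:     # high surrogate: this character
            low = units[i + 1]           # occupies two units
            cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:                            # ordinary BMP character
            cp = unit
            i += 1
    return cp

# e.g. utf16_index([0x0041, 0xD800, 0xDC00, 0x0042], 2) == ord("B")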
When you don't, you have bugs like this from Python 3.2:
py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)
Which is now fixed in Python 3.3.
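For comparison, here's the same session run on Python 3.3 or any later
version:
py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
True
py> s[9], len(s)
('9', 10)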
So variable-width data structures like UTF-8 or UTF-16 are crap for the
internal representation of strings -- they are either fast or correct but
cannot be both.
But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4
is too because ASCII-only strings like identifiers end up being four
times as big as they need to be. 1-byte schemes like Latin-1 are
unspeakable because they only handle 256 characters, fewer if you don't
count the C0 and C1 control codes.
PEP 393 to the rescue! What if you could encode pure-ASCII strings like
"len" using one byte per character, and BMP strings using two bytes per
character (UCS-2), and fall back to four bytes (UCS-4) only when you
really need it?
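You can watch PEP 393 making that per-string choice with sys.getsizeof
on Python 3.3 or later; the exact figures depend on your platform,
version and the size of the object header, but the pattern of roughly
one, two and four bytes per character is easy to see:

import sys

# Same length, three different internal widths: the widest character in
# the string decides how much each character costs.
print(sys.getsizeof("a" * 1000))           # ASCII:   ~1000 bytes of data
print(sys.getsizeof("\uFFFF" * 1000))      # BMP:     ~2000 bytes of data
print(sys.getsizeof("\U0010FFFF" * 1000))  # non-BMP: ~4000 bytes of data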
The benefits are:
* Americans and English-Canadians and Australians and other barbarians of
that ilk who only use ASCII save a heap of memory;
* people who only occasionally use non-BMP characters pay the cost of
four bytes per character only for those strings that actually *need*
four bytes per character;
* people who use lots of non-BMP characters are no worse off.
The costs are:
* string routines need to be smarter -- they have to handle three
different data structures (ASCII, UCS-2, UCS-4) instead of just one;
* there's a certain amount of overhead when creating a string -- you have
to work out which in-memory format to use, and that's not necessarily
trivial (there's a sketch of that decision just after this list), but at
least it's a once-off cost when you create the string;
* people who misunderstand what's going on get all upset over micro-
benchmarks.
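Here's a toy version of that "which format does this string need?"
decision (the real logic lives in C inside CPython and is considerably
more involved; this is only a sketch of the idea):

def choose_width(text):
    # The widest character in the string decides the storage width for
    # the whole string.
    widest = max(map(ord, text)) if text else 0
    if widest < 0x100:
        return 1    # Latin-1 (including pure ASCII): one byte per character
    elif widest < 0x10000:
        return 2    # fits in the BMP: two bytes per character
    else:
        return 4    # needs the supplementary planes: four bytes per character

# choose_width("len") == 1, choose_width("\u03bb") == 2,
# choose_width(chr(0xFFFF + 1)) == 4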
> Strings are immutable so I don't understand why not use UTF-8 or
> UTF-16 for everything. UTF-8 is more efficient in Latin-based
> alphabets and UTF-16 may be more efficient for some other languages.
> I think even UCS-4 doesn't completely fix the surrogate pair issue if
> it means the only thing I can think of.
To recap:
* Variable-width formats like UTF-8 and UTF-16 mean that basic string
operations are not O(1) but are O(N). That means they are slow, or buggy,
pick one.
* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP.
That's better than it sounds: the BMP supports most character sets, but
not all. Still, there are people who need the supplementary planes, and
UCS-2 lets them down.
* Fixed width UCS-4 does handle the full Unicode range, without
surrogates, but at the cost of using 2-4 times more string memory for the
vast majority of users.
* PEP 393 doesn't use variable-width characters, but variable-width
strings. Instead of choosing between 1, 2 and 4 bytes per character, it
chooses *per string*. This keeps basic string operations O(1) instead
of O(N) and saves memory where possible, while still supporting the
full Unicode range without a compile-time option.