Terry Reedy
> Right. *Under the hood* Python uses UCS-2 (which is not exactly the
> same thing as UTF-16, by the way) to represent Unicode strings.
I know some people say that, but according to the definitions of the
Unicode Consortium, that is wrong! The earlier UCS-2 *cannot* represent
chars in the Supplementary Planes. The later (1996) UTF-16, which Python
uses, can. The standard has long considered 'UCS-2' obsolete. See
https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
or http://www.unicode.org/faq/basic_q.html#14
The latter says: "Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided."
It goes on: "Sometimes in the past an implementation has been labeled
"UCS-2" to indicate that it does not support supplementary characters
and doesn't interpret pairs of surrogate code points as characters. Such
an implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters."
I know that 16-bit Python *does* use surrogate pairs for supplementary
chars and at least some properties work for them. I am not sure exactly
what the rest means.
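
For what it is worth, the surrogate-pair arithmetic that UTF-16 defines
can be sketched in a few lines. This is only an illustration written for
Python 3; the function names to_surrogate_pair and from_surrogate_pair
are made up for this example and are not part of any Python API.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

A 16-bit implementation that stores both halves but never recombines
them is what the FAQ means by not interpreting surrogate pairs as
characters.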
> However, this is entirely transparent. To the Python programmer, a
> unicode string is just an abstraction of a sequence of code-points.
> You don't need to think about UCS-2 at all. The only times you need
> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters, or opening a
> stream in text mode; and in those cases the only encoding that matters
> is the external one.
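
The three places where an external encoding shows up (encoding, decoding,
and a text-mode stream) might look roughly like this. It is a minimal
sketch assuming Python 3; the sample string and the in-memory buffer are
chosen purely for illustration.

```python
import io

# Sample text: one non-ASCII BMP character and one supplementary one.
text = "na\u00efve \U0001D11E"

# Encoding: the str becomes bytes in an explicitly named encoding.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding: bytes become a str again, using the same encoding name.
round_tripped = utf8_bytes.decode("utf-8")

# Text-mode stream: the wrapper encodes on write using its own encoding.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8")
stream.write(text)
stream.flush()
stream_bytes = buffer.getvalue()

print(len(utf8_bytes), len(utf16_bytes), round_tripped == text)
```

In all three cases only the named external encoding matters; the
internal representation never appears in the code.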
If one uses Unicode chars in the Supplementary Planes above the BMP (the
first 2**16 code points), which require surrogate pairs in 16-bit Unicode
(UTF-16), then the abstraction leaks.
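
To make the leak concrete, here is a rough sketch that lists the UTF-16
code units a 16-bit build would hold for a string. It assumes Python 3.3
or later, where len() counts code points, so the code units are obtained
by encoding explicitly; the helper name utf16_units is invented for this
example.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


s = "a\U00010400b"            # U+10400 lies outside the BMP

code_points = len(s)          # counts code points on Python 3.3+
units = utf16_units(s)        # what a 16-bit representation stores

print(code_points, len(units), [hex(u) for u in units])
```

On a 16-bit build, len(s) and indexing operate on the code units, so the
string appears one item longer than its code-point count and s[1] is a
lone high surrogate rather than the character U+10400.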