Steven D'Aprano
Quite difficult. Even if we avoid having two or three separate
binaries, we would still have separate binary representations of the
string structs. It makes the maintainability of the software go down
instead of up.
In fairness, there are already multiple binary representations of strings
in Python 3.3:
- ASCII-only strings use a 1-byte format (PyASCIIObject);
- Compact Unicode objects (PyCompactUnicodeObject), which, if I'm
reading correctly, appear to use a non-fixed-width UTF-8 format, but
are only used when the string length and maximum character are known
ahead of time;
- Legacy string objects (PyUnicodeObject), which are not compact, and
which may use as their internal format:
* 1-byte characters for Latin1-compatible strings;
* 2-byte UCS-2 characters for strings in the Basic Multilingual Plane;
* 4-byte UCS-4 characters for strings with at least one non-BMP
character.
http://www.python.org/dev/peps/pep-0393/#specification
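You can actually observe PEP 393's width selection from pure Python on
CPython 3.3+, without touching the C structs: sys.getsizeof grows by
roughly 1, 2, or 4 bytes per additional character, depending on the
widest code point in the string. A rough sketch (per_char_size is just
an illustrative helper, not anything from the PEP):

```python
import sys

def per_char_size(ch):
    """Approximate bytes per character for strings built from `ch`,
    by differencing the sizes of two long strings."""
    return sys.getsizeof(ch * 1000) - sys.getsizeof(ch * 999)

print(per_char_size("a"))           # ASCII/Latin1 range: 1 byte per char
print(per_char_size("\u0394"))      # BMP char (GREEK CAPITAL DELTA): 2 bytes
print(per_char_size("\U0001F600"))  # non-BMP char: 4 bytes
```

This is CPython-specific behaviour, of course; other implementations
are free to lay strings out however they like.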
By my calculations, that makes *five* different internal formats for
strings, at least two of which are capable of representing all Unicode
characters. I don't think it would add that much additional complexity to
have a runtime option --always-wide-strings to always use the UCS-4
format. For, you know, crazy people with more memory than sense.
But I don't think there's any point in exposing further runtime options
to choose the string representation:
- neither the ASCII nor Latin1 representations can store arbitrary
Unicode chars, so they're out;
- the UTF-8 format is only used under restrictive circumstances, and so
is (probably?) unsuitable for all strings;
- the UCS-2 format can represent arbitrary Unicode characters by using
surrogate pairs, but that's troublesome to get right; some might even
say buggy.
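To spell out why surrogate pairs are troublesome: a single non-BMP code
point occupies *two* UTF-16 code units, so any code that naively counts
or indexes code units gets lengths and slices wrong. A small sketch
using the utf-16-le codec to expose the pair:

```python
s = "\U0001F600"                 # one code point outside the BMP
units = s.encode("utf-16-le")    # its UTF-16 code units, little-endian

print(len(s))                    # 1 code point (the PEP 393 view)
print(len(units) // 2)           # 2 code units: a surrogate pair

hi = int.from_bytes(units[0:2], "little")
lo = int.from_bytes(units[2:4], "little")
print(hex(hi), hex(lo))          # 0xd83d 0xde00 -- high and low surrogates
```

A UCS-2-based string type has to special-case those pairs in every
indexing, slicing, and length operation, or else silently report a
length of 2 for a one-character string.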
So instead of having just one test for my Unicode-handling code, I'll
now have to run that same test *three times* -- once for each possible
string engine option. Choice isn't always a good thing.
There is that too.