Steven D'Aprano
His most recent argument that Python should use UTF as a representation
is very strange to be honest.
He's not arguing for anything, he is just hating on anything that gives
even the tiniest benefit to ASCII users. This isn't about Python 3.3
hurting non-ASCII users, because that is demonstrably untrue: they are
*better off* in Python 3.3. This is about denying even a tiny benefit to
ASCII users.
In Python 3.3, non-ASCII users have these advantages compared to previous
versions:
- strings will usually take less memory, and aside from trivial changes
to the object header, they never take more memory than a wide build would
use;
- consequently nearly all objects will take less memory (especially
builtins and standard library objects, which are all ASCII), since
objects contain dozens of internal strings (attribute and method names in
__dict__, class name, etc.);
- consequently whole-application benchmarks show most applications will
use significantly less memory, which leads to faster speeds;
- you cannot break surrogate pairs apart by accident, which you can do in
narrow builds;
- in previous versions, code which works when run in a wide build may
fail in a narrow build, but that is no longer an issue since the
distinction between wide and narrow builds is gone;
- Latin1 users, including JMF himself, will likewise see memory
savings, since Latin1 strings will take half the space they did in narrow
builds and a quarter of the space they did in wide builds.
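The per-character storage described in the list above can be checked with a rough sketch like the one below. It assumes CPython 3.3 or later (other implementations may store strings differently), and `bytes_per_char` is a hypothetical helper written for this illustration, not part of the standard library. It subtracts the sizes of two strings of different lengths so that the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character; the fixed header cancels in the difference."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

ascii_width = bytes_per_char("a")            # ASCII only
latin1_width = bytes_per_char("\xe9")        # Latin-1 range (e with acute accent)
bmp_width = bytes_per_char("\u20ac")         # euro sign, Basic Multilingual Plane
astral_width = bytes_per_char("\U0001F600")  # code point outside the BMP

print(ascii_width, latin1_width, bmp_width, astral_width)
```

Under those assumptions the four values should come out as 1, 1, 2 and 4 bytes per character, against a constant 2 in an old narrow build and 4 in a wide build.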
The cost of all these benefits is a small overhead when creating a string
in the first place, and some purely internal added complication to the
string implementation.
I'm the first to argue against complication unless there is a
corresponding benefit. This is a case where the benefit has proven itself
doubly: Python 3.3's Unicode implementation is *more correct* than
before, and it uses less memory to do so.
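The "more correct" part can be illustrated with a short sketch, assuming Python 3.3 or later. The count of UTF-16 code units stands in for what a narrow build would have reported as the length; no narrow build is actually being run here.

```python
s = "a\U0001F600b"  # the middle character lies outside the Basic Multilingual Plane

# Python 3.3+: length counts code points, and indexing returns the whole character.
length = len(s)
middle = s[1]

# Stand-in for a narrow build: UTF-16 stores the middle character as a surrogate pair,
# so the same text occupies one more code unit than it has code points.
utf16_units = len(s.encode("utf-16-le")) // 2

print(length, utf16_units)
```

Here `length` is 3 while `utf16_units` is 4, and that extra unit is what made it possible to slice a surrogate pair in half on a narrow build.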
The cons of UTF are apparent and widely
known. The main con is that UTF strings are O(n) for indexing a
position within the string.
Not so for UTF-32.
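The difference in indexing cost can be sketched as follows. This is an illustration of the two encodings only, not a description of how any Python version stores strings internally. `utf8_index` and `utf32_index` are hypothetical helper names, and the input is assumed to be well-formed.

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return code point i from UTF-8 bytes by scanning from the start: O(n)."""
    if i < 0:
        raise IndexError(i)
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point starts here
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(i)
    return data[start:].decode("utf-8")  # the requested code point was the last one

def utf32_index(data: bytes, i: int) -> str:
    """Return code point i from UTF-32-LE bytes by computing its offset: O(1)."""
    if i < 0 or 4 * i + 4 > len(data):
        raise IndexError(i)
    return data[4 * i:4 * i + 4].decode("utf-32-le")

text = "na\xefve \u20acuro \U0001F600!"
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")
print(utf8_index(as_utf8, 6) == text[6], utf32_index(as_utf32, 6) == text[6])
```

The UTF-8 version has to walk past every earlier character because each one may occupy one to four bytes, whereas the UTF-32 version jumps straight to byte offset `4 * i`.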