You're posting to both comp.lang.python and python-list; are you aware
that that's redundant?
This flexible string representation is wrong by design.
Expecting to divide "Unicode" in chunks and to gain something
is an illusion.
It has been created by a computer scientist who thinks "bytes"
when on that field one has to think "bytes" and usage of the
characters at the same time.
There's another range of numbers that, in some languages, is divided
for efficiency's sake: Integers below 1<<[bit size]. In Python 2, such
numbers were an entirely different data type (int vs long); other
languages let you use the same data type for both, but "(1<<5)+1" will
be executed much faster than "(1<<500)+1". (And as far as I know, a
conforming Python 3 implementation should be allowed to do that; 3.2
on Windows doesn't seem to, though.) That's all PEP 393 is; it's a
performance improvement for a particular subset of values that happens
to fit conveniently into the underlying machine's data storage.
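
To make the integer comparison concrete, here is a rough sketch. It assumes CPython 3; sys.getsizeof() reports an implementation detail, so the exact byte counts will vary and other implementations may behave differently. The names small and big are mine, purely for illustration.

```python
import sys

small = (1 << 5) + 1     # fits comfortably in a machine word
big = (1 << 500) + 1     # needs many machine words

# In Python 3 both values share a single type...
print(type(small) is type(big))

# ...but the interpreter is free to store them differently. On CPython
# the larger value occupies more memory (an implementation detail).
print(sys.getsizeof(small), sys.getsizeof(big))
```

The language-level type is the same either way; only the storage underneath changes, which is the point of the analogy.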
If Python were implemented on a 9-bit computer, I wouldn't be
surprised if the PEP 393 optimizations were applied differently. It's
nothing to do with Latin-1, except insofar as the narrowest form of
string _happens_ to contain everything that's in Latin-1.
Go blame the Unicode consortium for picking that.
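
For anyone who wants to see the string side of this, the following is a minimal sketch, assuming CPython 3.3 or later (where PEP 393 is implemented). The helper bytes_per_char is a hypothetical name of my own, not anything from the standard library. It estimates per-character storage by comparing two strings of different lengths, so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a string made of one repeated
    character (hypothetical helper; relies on CPython's sys.getsizeof)."""
    longer = sys.getsizeof(ch * (2 * n))
    shorter = sys.getsizeof(ch * n)
    return (longer - shorter) / n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),         # U+00E9, fits in one byte
    ("BMP", "\u20ac"),           # U+20AC, needs two bytes
    ("astral", "\U0001F600"),    # U+1F600, needs four bytes
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

If I have this right, the widths should come out as one, two, or four bytes per character depending on the widest code point in the string, with ASCII and Latin-1 text both landing in the narrowest form.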
The latin-1 chunk illustrates this wonderfully.
Aside from replace(), as mentioned in this thread, are there any other
ways that this is so wonderfully illustrated? Or is it "wonderfully"
as in "I wonder if people will believe me if I keep spouting
unsubstantiated claims"?
ChrisA