On Saturday, 18 August 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
[...]
The problem with UCS-4 is that every character requires four bytes.
[...]
I'm aware of this (and all the blah blah blah you are explaining). This
is always the same song. Memory.
Exactly. The reason it is always the same song is because it is an
important song.
Let me ask. Is Python an "American" product for US users or is it a tool
for everybody [*]?
It is a product for everyone, which is exactly why PEP 393 is so
important. PEP 393 means that users who have only a few non-BMP
characters don't have to pay the cost of UCS-4 for every single string in
their application, only for the ones that actually require it. PEP 393
means that using Unicode strings is now cheaper for everybody.
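If you want to check this on your own machine, here is a rough sketch. It
assumes CPython 3.3 or later (sys.getsizeof is implementation-specific),
and the helper name bytes_per_char is just something I made up for
illustration:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage from the size difference between
    two strings of different lengths (the fixed header cancels out)."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u20ac"),
    ("non-BMP", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

Under PEP 393 only the string built from a non-BMP character should come
out at four bytes per character; the others should stay at one or two.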
You seem to be arguing that the way forward is not to make Unicode
cheaper for everyone, but to make ASCII strings more expensive so that
everyone suffers equally. I reject that idea.
Is there any reason why non-ASCII users are somehow penalized compared
to ASCII users?
Of course there is a reason.
If you want to represent all 1114112 Unicode code points (0 through
0x10FFFF) in a string, you can't use a single byte per character, or even
two bytes: one byte distinguishes only 256 values and two bytes only
65536. That is a fact of basic mathematics. Supporting 1114112 code points
must be more expensive than supporting 128 of them.
But why should you carry the cost of four bytes per character just because
someday you *might* need a non-BMP character?
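The arithmetic is easy to check for yourself. A minimal sketch, assuming
Python 3.3 or later, where sys.maxunicode is always 0x10FFFF (the variable
names are mine):

```python
import sys

max_code_point = sys.maxunicode           # 0x10FFFF on Python 3.3+
code_points = max_code_point + 1          # code points 0 through 0x10FFFF
bits_needed = max_code_point.bit_length() # bits for the largest code point
bytes_needed = (bits_needed + 7) // 8     # round up to whole bytes

print(code_points, bits_needed, bytes_needed)
for width in (1, 2, 3, 4):
    values = 2 ** (8 * width)
    print(width, "byte(s):", values, "values, enough:", values >= code_points)
```

Three bytes would be enough on paper, but fixed-width encodings such as
UCS-4 use four, presumably because a four-byte unit is far easier for the
hardware to index.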
This flexible string representation is a regression (ASCII users or
not).
No it is not. It is a great step forward to more efficient Unicode.
And it means that now Python can correctly deal with non-BMP characters
without the nonsense of UTF-16 surrogates:
steve@runes:~$ python3.3 -c "print(len(chr(1114000)))" # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))" # Wrong!
2
without doubling the storage of every string.
This is an important step towards making the full range of Unicode
available more widely.
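For anyone wondering where the 2 comes from: a non-BMP character needs two
16-bit code units (a surrogate pair) in UTF-16, and a build that stores
strings as UTF-16 counts code units rather than characters. A small sketch,
assuming Python 3.3 or later; the helper utf16_surrogates is my own name,
not a library function:

```python
import struct

def utf16_surrogates(code_point):
    """Split a non-BMP code point into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

ch = chr(1114000)
encoded = ch.encode("utf-16-le")
# Unpack the encoded bytes as little-endian unsigned 16-bit code units.
units = struct.unpack("<%dH" % (len(encoded) // 2), encoded)

print(len(ch))                    # code points in the string
print(len(units))                 # 16-bit code units in UTF-16
print([hex(u) for u in units])
print([hex(u) for u in utf16_surrogates(ord(ch))])
```

The last two lines should print the same pair, which is the encoder
agreeing with the hand-computed surrogates.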
I recognize that in practice the real impact is, for many users, close to zero
Then what's the problem?
(including me) but I have shown (I think) that this flexible
representation is, by design, not as optimal as it is supposed to be.
You have not shown any real problem at all.
You have shown untrustworthy, edited timing results that don't match what
other people are reporting.
Even if your timing results are genuine, you haven't shown that they make
any difference for real code that does useful work.