On Wednesday, 10 July 2013 11:00:23 UTC+2, Steven D'Aprano wrote:
That's by design. We don't want to make the same mistake as Perl, where
every problem is solved by a regular expression:
http://neilk.net/blog/2000/06/01/abigails-regex-to-test-for-prime-numbers/
so we deliberately make regexes as slow as possible so that programmers
will look for a better way to solve their problem. If you check the
source code for the re engine, you'll find that for certain regexes, it
busy-waits for anything up to 30 seconds at a time, deliberately wasting
cycles.
The same with Unicode. We hate French people, you see, and so in an
effort to drive everyone back to ASCII-only text, Python 3.3 introduces
some memory optimizations that ensure that Unicode strings work
correctly and are up to four times smaller than they used to be. You
should get together with jmfauth, who has discovered our dastardly plot
and keeps posting benchmarks showing how on carefully contrived micro-
benchmarks using a beta version of Python 3.3, non-ASCII string
operations can be marginally slower than in 3.2.
I cannot imagine why he would have done that.
This Flexible String Representation is a dream case study.
Attempting to optimize a subset of characters is nonsense.
If you are a non-ASCII user, such a mechanism is irrelevant,
because by definition you do not need it. Not only is it useless,
it is penalizing, just by the fact of its existence. [*]
Conversely (or identically), if you are an ASCII user, the situation
is the same: it is irrelevant, useless and penalizing.
Practically, and today, all coding schemes we have
(including the endorsed Unicode transformation formats) work
with a unique set of encoded code points. If you
wish to take the problem from the other side, it is
because one can only work properly with a unique set
of code points that so many coding schemes exist!
Question: does this FSR use three coding schemes
internally because it splits Unicode into three groups, or
does it split Unicode into three subsets to have the joy
of using three coding schemes?
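
For anyone who wants to look at the three internal widths this question is about, here is a minimal sketch. It assumes CPython 3.3 or later (a build where the Flexible String Representation of PEP 393 is in effect); the helper name bytes_per_char is made up for this illustration. It compares the sizes of a short and a long string of the same character, so the fixed object header cancels out and only the per-character storage remains.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the storage used per code point (hypothetical helper).

    Subtracting the size of a 1-character string from the size of an
    (n + 1)-character string cancels the fixed object header, leaving
    only the growth caused by the n extra characters.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

samples = [
    ("ASCII '$'", "$"),
    ("Latin-1 'é'", "é"),
    ("BMP '€'", "€"),
    ("astral U+1F600", "\U0001F600"),
]

for label, ch in samples:
    print(label, "->", bytes_per_char(ch), "byte(s) per character")
```

On such a build this should report 1, 1, 2 and 4 bytes per character: the width is chosen per string from its widest code point, which is the "three groups" split the question refers to.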
About "micro benchmarks": what to say? They appear
practically every time you use non-ASCII.
And do not forget memory. The €uro just became expensive.
40
I do not know. When a €uro char needs 14 bytes more than
a dollar, I belong to those who think there is a problem
somewhere.
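
A small sketch for reproducing this kind of measurement, assuming CPython 3.3 or later. The exact numbers depend on the interpreter version and on whether the build is 32-bit or 64-bit, so no expected output is given here; the variable names are only illustrative.

```python
import sys

# Total object size (header + character data) of two one-character strings.
dollar_size = sys.getsizeof("$")   # pure ASCII
euro_size = sys.getsizeof("€")     # U+20AC, outside Latin-1

print("'$' :", dollar_size, "bytes")
print("'€' :", euro_size, "bytes")
print("difference:", euro_size - dollar_size, "bytes")
```

For a single character, most of the difference comes from the fixed header of the string object and not from the character itself, so it does not grow with the length of the string.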
This FSR is a royal gift for those who wish to teach Unicode
and the coding of characters.
jmf