wxjmfauth
This is neither a complaint nor a question, just a comment.
In the previous discussion related to the flexible
string representation, Roy Smith added this comment:
http://groups.google.com/group/comp...read/thread/2645504f459bab50/eda342573381ff42
Not only do I agree with his sentence,
"Clearly, the world has moved to a 32-bit character set."
but he also used a very interesting word in his comment: "punctuation".
There is a point which is, in my mind, not very well understood,
"digested", underestimated or neglected by many developers:
the relationship between the encoding of characters and typography.
Unicode (the consortium) does not only deal with the encoding of
characters; it also worked on their *classification*.
A deliberately simplistic picture: "letters" sit at the bottom
of the table, at low code points/integers; "typographic characters"
like punctuation, common symbols, ... sit high in the table, at high
code points/integers.
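For what it is worth, this classification (and the code points
involved) can be inspected from Python itself with the standard
unicodedata module; a small sketch, the characters being my own
illustrative choice:

import unicodedata

for ch in "az\u2014\u2022\u20ac":
    print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))

# U+0061 Ll LATIN SMALL LETTER A
# U+007A Ll LATIN SMALL LETTER Z
# U+2014 Pd EM DASH
# U+2022 Po BULLET
# U+20AC Sc EURO SIGN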
The conclusion is inescapable: if one wishes to work in a "Unicode
mode", one is forced to use the whole palette of Unicode
code points; this is the *nature* of Unicode.
Technically, believing that it is possible to optimize only a subrange
of the Unicode code point range is simply an illusion. A lot of
work, probably quite complicated, which in the end solves nothing.
Python, in my mind, fell into this trap.
"Simple is better than complex."
-> hard to maintain
"Flat is better than nested."
-> the code point range
"Special cases aren't special enough to break the rules."
-> special Unicode code points?
"Although practicality beats purity."
-> or the opposite?
"In the face of ambiguity, refuse the temptation to guess."
-> guessing that a user will only work with the "optimized" char subrange.
....
A small illustration: take an A4 page containing 50 lines of 80 ASCII
characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
and you will see all the optimization effort destroyed.
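On a CPython 3.3+ build with the flexible string representation, this
is easy to observe with sys.getsizeof; a rough sketch (the exact byte
counts depend on the platform and build, only the jump matters):

import sys

page = ("x" * 80 + "\n") * 50          # 50 lines of 80 ASCII characters
print(sys.getsizeof(page))             # stored with 1 byte per character
print(sys.getsizeof(page + "\u2014"))  # one EM DASH -> 2 bytes per character,
                                       # roughly double the memory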
Just my 2 € (code point 0x20ac) cents.
jmf