Peter said:
More sense? I doubt that. What does make sense is an API that abstracts
away the encoding.
If the application knows which encoding it is dealing with so it can
convert at all, is 'big enough' to bother with converting back and forth,
and the encoding doesn't already provide what one needs such an abstraction for.
You can then reduce the points where data in limited (i.e. non-Unicode)
encodings is imported or exported as the adoption of Unicode grows,
without affecting the core of your app. IMHO chr(ord("a") - 32) is inferior
to "a".upper() even in an all-ASCII environment.
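To make that concrete, here is a minimal Python sketch; the helper name is mine, and it deliberately handles only the 26 ASCII letters:

    def ascii_upper(s):
        # The chr/ord trick: correct only for a-z, blind to everything else.
        return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)

    print(ascii_upper("blåbær"))   # 'BLåBæR' - non-ASCII letters are left alone
    print("blåbær".upper())        # 'BLÅBÆR' - str.upper() knows the repertoire

str.upper() keeps working when the data stops being all-ASCII; the arithmetic trick has to be found and fixed.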
If you mean 'limited' as in some character set other than Unicode, that's
not much use if the application is designed for something which has that
'limited' character set/encoding anyway.
I don't understand the question.
I explained that in the next paragraph:
If you disagree with that, is that because you think of Unicode as The
One True Character Set which everything can assume is in use if not
otherwise specified? That's a long way from the world I'm living in.
Besides, even if you have 'everything is Unicode', that still doesn't
necessarily mean UTF-8. It could be UCS-4, or whatever. Unicode or no,
displaying a character does involve telling the OS what encoding is in
use. Or not telling it and trusting the application to handle it, which
is again what's being done outside the Unicode world.
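For what it's worth, in Python 3 terms the same Unicode string becomes a different byte sequence under each encoding, so something still has to say which one is actually on the wire (the utf-32 output shown assumes a little-endian build):

    s = "ø"
    print(s.encode("utf-8"))     # b'\xc3\xb8'
    print(s.encode("utf-32"))    # b'\xff\xfe\x00\x00\xf8\x00\x00\x00' (BOM + code point)
    print(s.encode("latin-1"))   # b'\xf8' - only covers this small repertoire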
That was not the point. I was trying to say that the usefulness of a
standard grows with its adoption.
And the thing about standards is that there are so many of them to
choose from. Enforcing a standard somewhere in an environment where
that is not the standard is not useful. Try the standard of driving on
the right side of the road in a country where everyone else drives on
the left side. Standards are supposed to serve us; it's not we who are
supposed to serve standards.
I don't understand the question.
You claimed one non-global application where Unicode would have been
good, as an argument that there are no non-global applications where
Unicode would not be good.
Again, my contention is that once the use of Unicode has reached the tipping
point, you will encounter no cases where other encodings are more practical.
So because you are fond of Unicode, you want to force a quick transition
on everyone else and leave us to deal with the troubles of the
transition, even in cases where things worked perfectly fine without
Unicode.
But I'm pretty sure that "tipping point" where no non-Unicode case remains
practical is pretty close to 100% usage of Unicode around the world.
For example, if one uses character set ns_4551-1 (ASCII with {|}[\]
replaced by æøåÆØÅ), sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.
Why not sort depending on the locale instead of ordinal values of the
bytes/characters?
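For illustration, that is what the locale collation machinery is for; a minimal Python sketch, assuming a Norwegian locale such as nb_NO.UTF-8 is installed (the locale name is platform-dependent):

    import locale

    # Collate by the rules of the locale rather than by code point.
    locale.setlocale(locale.LC_COLLATE, "nb_NO.UTF-8")

    names = ["Ødegård", "Aas", "Ångström", "Berg"]
    print(sorted(names))                       # code-point order
    print(sorted(names, key=locale.strxfrm))   # Norwegian order: Ø and Å after Z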
I'm in Norway. Both Swedes and Germans are foreigners.
At some point you have to ask yourself whether the dirty tricks that depend
on the country you live in, its current orthography and the current state
of your favourite programming language really save you time in enough places
in your program, or whether one centralized API that does it right is more
efficient even today.
Just because you are fond of Unicode and think it's the Right Solution to
everything doesn't make other ways of doing things a dirty trick.
As for dirty tricks, that's exactly what such premature standardization
leads to, and one reason I don't like it. Like Perl and Emacs, which
have decided that if they don't know which character set is in use, then
it's the character set of the current locale (if they can deduce it),
even though they have no idea whether the data they are processing has
anything to do with the current locale. I wrote a long rant addressed
to the wrong person about that recently; please read article
All strings are Unicode by default. If you need byte sequences instead of
character sequences, you have to provide a b-prefixed string.
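A minimal sketch of that model in Python 3 notation:

    text = "blåbær"                    # str: a sequence of Unicode characters
    raw = b"bl\xc3\xa5b\xc3\xa6r"      # bytes: a b-prefixed sequence of raw octets

    # Crossing between the two always names an encoding explicitly.
    assert text.encode("utf-8") == raw
    assert raw.decode("utf-8") == text
    print(len(text), len(raw))         # 6 characters vs. 8 bytes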
I've been wondering about something like that myself, but it still
requires the program to be told which character set is in use so it can
convert back and forth between that and Unicode. To get that right,
Python would need to tag I/O streams and other stuff with their
character set/encoding. And either Python would have to guess when it
didn't know (like looking at the locale's name), or if it didn't,
programmers would guess to get rid of the annoyance of encoding
exceptions cropping up everywhere. Then at a later date we'd have to
clean up all the code with the bogus guesses, so the problem would
really just have been transformed to another problem...
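That tagging is roughly what explicit encoding arguments on text streams amount to; a sketch, with the filename and encoding chosen only for illustration, and locale.getpreferredencoding() standing in for the "guess from the locale":

    import locale

    # Explicit tag: the stream knows its encoding and converts to/from str.
    with open("names.txt", "w", encoding="iso-8859-1") as f:
        f.write("Ødegård\n")
    with open("names.txt", "r", encoding="iso-8859-1") as f:
        print(f.read())

    # Leave the tag off and Python guesses from the locale instead:
    print(locale.getpreferredencoding())   # what open() assumes by default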