Proposal: require 7-bit source str's

  • Thread starter Hallvard B Furuseth

Neil Hodgson

Martin v. Löwis:
For some source encodings (namely the CJK ones), conversion to UTF-8
is absolutely necessary even for proper lexical analysis, as the
byte that represents a backslash in ASCII might be the first byte
of a two-byte sequence.

Do you have a link to such an encoding? I understand 0x5c, '\', is often
displayed as a yen sign, but I haven't seen it as the start byte of a
multi-byte character.

Regarding the 's' string prefix in the proposal, adding more prefixes
damages ease of understanding particularly when used in combination. There
should be a very strong need before another is introduced: I'd really hate
to be trying to work out the meaning of:

r$tu"/Raw/ $interpolated, translated Unicode string"

Neil
 

Martin v. Löwis

Neil said:
Do you have a link to such an encoding? I understand 0x5c, '\', is often
displayed as a yen sign, but I haven't seen it as the start byte of a
multi-byte character.

The ISO-2022 ones: '\x1b$B\\_\\n\x1b(B'

ESC $ B and ESC ( B are the codeset switch sequences, and \_ \n are
the actual encodings of the characters.
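The point can be checked with a short sketch in modern Python, whose stdlib ships these codecs. (The particular kanji used below are illustrative choices, not characters from the thread.)

```python
# Sketch: the ASCII backslash byte 0x5C can occur *inside* a multi-byte
# character in CJK encodings, so a lexer scanning raw bytes for '\'
# would misfire.

# In Shift-JIS, the kanji U+8868 encodes as 0x95 0x5C -- its trailing
# byte is the backslash byte:
assert b"\\" in "\u8868".encode("shift_jis")

# In ISO-2022-JP, 0x5C can also appear as the *leading* byte of a
# two-byte JIS X 0208 code; search the CJK block for one example:
hit = None
for cp in range(0x4E00, 0x9FFF):
    try:
        encoded = chr(cp).encode("iso-2022-jp")
    except UnicodeEncodeError:
        continue  # not representable in JIS X 0208
    if 0x5C in encoded:
        hit = (chr(cp), encoded)
        break
print(hit)  # some kanji whose ISO-2022-JP bytes contain 0x5C
```

A byte-oriented lexer that treated every 0x5C as an escape character would therefore mis-parse such source unless it decoded first.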
Regarding the 's' string prefix in the proposal, adding more prefixes
damages ease of understanding particularly when used in combination. There
should be a very strong need before another is introduced: I'd really hate
to be trying to work out the meaning of:

r$tu"/Raw/ $interpolated, translated Unicode string"

Indeed. Perhaps some combinations can be ruled out, though.

Regards,
Martin
 

Hallvard B Furuseth

Martin said:
Correct. However, that it works "for a number of source encodings"
is insufficient - if it doesn't work for all of them, it only
unreasonably complicates the code.

For UTF-8 source, the complication might simply be to not call a charset
conversion routine. For some other character sets - well, fixing the
problem below would probably introduce that complication anyway.
For some source encodings (namely the CJK ones), conversion to UTF-8
is absolutely necessary even for proper lexical analysis, as the
byte that represents a backslash in ASCII might be the first byte
of a two-byte sequence.

No. It's necessary to convert the source file to logical characters
and feed those to the parser in some way, and conversion to UTF-8 is
a simple way to do that.

I think the 'right way', as far as source character set handling is
concerned, would be to have the source reader and the language parser
cooperate: The reader translates the source file to logical source
characters which it feeds to the parser (UTF-8 is fine for that), and
the parser notifies the reader when it sees the start and end of a
source character string which should be given to the parser in its
original form (by some other means than feeding it to the parser as if
it was charset-converted source code, of course).
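For reference, the reading half of this scheme is roughly what modern Python 3's tokenizer does: the reader detects the declared source encoding and the lexer only ever sees decoded characters. A sketch with the stdlib `tokenize` module (the ISO-8859-1 cookie and sample string are illustrative):

```python
import io
import tokenize

# The reader/parser split in miniature: detect_encoding() reads the
# PEP 263 coding cookie, then tokenize() lexes the *decoded* character
# stream, never the raw bytes.
source = b"# -*- coding: iso-8859-1 -*-\ns = '\xe6\xf8\xe5'\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # iso-8859-1

tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
strings = [t.string for t in tokens if t.type == tokenize.STRING]
print(strings)  # the string literal arrives as decoded characters
```

What the thread asks for beyond this - handing string literals back to the reader in their original bytes - is the part that has no stdlib hook.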

Now, that might conflict with Python's design goals, if it is supposed
to be possible to keep the reading and parsing steps separate. Or it
might just take more effort to rearrange the code than anyone is
interested in doing. But in either case it still looks like a bug to
me, even if it's at best a low-priority one.
That is by design. The only effect of such a bug report will be that
the documentation clearly states that.

OK, I'll make it a doc bug.
 

Peter Otten

More sense? I doubt that. What does make sense is an api that abstracts from
the encoding. You can then reduce the points where data in limited, i.e.
non-unicode, encodings is imported/exported as the adoption of unicode grows
without affecting the core of your app. IMHO chr(ord("a") - 32) is inferior
to "a".upper() even in an all-ascii environment.
Yes. What of it?

I don't understand the question.
Programs that show text still need to know which character set the
source text has, so it can pass the OS the text it expects, or send a
charset directive to the OS, or whatever.


If you want an OS that allows that, get an OS which allows that.

That was not the point. I was trying to say that the usefulness of a
standard grows with its adoption.
Yes, there are many non-'global' applications too where Unicode is
desirable. What of it?

I don't understand the question.
Just because you want Unicode, why shouldn't I be allowed to use
other charcater encodings in cases where they are more practical?

Again, my contention is that once the use of unicode has reached the tipping
point you will encounter no cases where other encodings are more practical.
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.
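The claim quoted above is checkable: ns_4551-1 (ISO 646-NO) stores æøå in the byte slots of {|}, i.e. 0x7B-0x7D, directly after 'z' (0x7A), which matches the Norwegian alphabet's ... z æ ø å ordering. A small sketch (the mapping table below is the standard ISO 646-NO assignment):

```python
# ISO 646-NO (ns_4551-1) byte values for the Norwegian letters that
# replace [\]{|} in ASCII; everything else keeps its ASCII code.
NS4551 = {"æ": 0x7B, "ø": 0x7C, "å": 0x7D,
          "Æ": 0x5B, "Ø": 0x5C, "Å": 0x5D}

def ns4551_key(word):
    # Map each character to its ns_4551-1 byte value.
    return [NS4551.get(c, ord(c)) for c in word]

words = ["ås", "zebra", "ære", "øy", "anne"]
print(sorted(words, key=ns4551_key))
# ['anne', 'zebra', 'ære', 'øy', 'ås'] -- correct Norwegian order
```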

Why not sort depending on the locale instead of ordinal values of the
bytes/characters?

At some point you have to ask yourself whether the dirty tricks that work
depending on the country you live in, its current orthography and the
current state of your favourite programming language do save you some time
at so many places in your program that one centralized api that does it
right is more efficient even today.
I don't know Perl 6, but Perl 5 is an excellent example of how not to do
this. So is Emacs' MULE, for that matter.

I recently had to downgrade to perl5.004 when perl5.8 broke my programs.
They worked fine until they were moved to a machine where someone had
set up the locale to use UTF-8. Then Perl decided that my data, which
has nothing at all to do with the locale, was Unicode data. I tried to
insert 'use bytes', but that didn't work. It does seem to work in newer
Perl versions, but it's not clear to me how many places I have to insert
some magic to prevent that. Nor am I interested in finding out: I just
don't trust the people who released such a piece of crap to leave my
non-Unicode strings alone. In particular since _most_ of the strings
are UTF-8, so I wonder if Perl might decide to do something 'friendly'
with them.

I see you know more Perl than me - well, my mentioning of the zipper was
rather a lightweight digression prompted by the ongoing decorator frenzy.
Meaning what?

All strings are unicode by default. If you need byte sequences instead of
character sequences you would have to provide a b-prefixed string.

Peter
 

Hallvard B Furuseth

Peter said:
More sense? I doubt that. What does make sense is an api that abstracts from
the encoding.

If the application knows which encoding it is so it can convert at all,
and is 'big enough' to bother with encoding back and forth, and the
encoding doesn't already provide what one needs such abstraction to do.
You can then reduce the points where data in limited, i.e.
non-unicode, encodings is imported/exported as the adoption of unicode grows
without affecting the core of your app. IMHO chr(ord("a") - 32) is inferior
to "a".upper() even in an all-ascii environment.

If you mean 'limited' to some other character set than Unicode, that's
not much use if the application is designed for something which has that
'limited' character set/encoding anyway.
I don't understand the question.

I explained that in the next paragraph:

If you disagree with that, is that because you think of Unicode as The
One True Character Set which everything can assume is in use if not
otherwise specified? That's a long way from the world I'm living in.
Besides, even if you have 'everything is Unicode', that still doesn't
necessarily mean UTF-8. It could be UCS-4, or whatever. Unicode or no,
displaying a character does involve telling the OS what encoding is in
use. Or not telling it and trusting the application to handle it, which
is again what's being done outside the Unicode world.
That was not the point. I was trying to say that the usefulness of a
standard grows with its adoption.

And the thing about standards is that there are so many of them to
choose from. Enforcing a standard somewhere in an environment where
that is not the standard is not useful. Try the standard of driving on
the right side of the road in a country where everyone else drives on
the left side. Standards are supposed to serve us; it's not we who are
supposed to serve standards.
I don't understand the question.

You claimed one non-global application where Unicode would have been
good, as an argument that there are no non-global applications where
Unicode would not be good.
Again, my contention is that once the use of unicode has reached the tipping
point you will encounter no cases where other encodings are more practical.

So because you are fond of Unicode, you want to force a quick transition
on everyone else and leave us to deal with the troubles of the
transition, even in cases where things worked perfectly fine without
Unicode.

But I'm pretty sure that the "tipping point" where no non-Unicode
encoding is more practical is pretty close to 100% usage of Unicode
around the world.
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.

Why not sort depending on the locale instead of ordinal values of the
bytes/characters?

I'm in Norway. Both Swedes and Germans are foreigners.
At some point you have to ask yourself whether the dirty tricks that work
depending on the country you live in, its current orthography and the
current state of your favourite programming language do save you some time
at so many places in your program that one centralized api that does it
right is more efficient even today.

Just that you are fond of Unicode and think that's the Right Solution to
everything, doesn't make other ways of doing things a dirty trick.

As for dirty tricks, that's exactly what such premature standardization
leads to, and one reason I don't like it. Like Perl and Emacs which
have decided that if they don't know which character set is in use, then
it's the character set of the current locale (if they can deduce it) -
even though they have no idea if the data they are processing have
anything to do with the current locale. I wrote a long rant addressed
to the wrong person about that recently; please read article
All strings are unicode by default. If you need byte sequences instead of
character sequences you would have to provide a b-prefixed string.

I've been wondering about something like that myself, but it still
requires the program to be told which character set is in use so it can
convert back and forth between that and Unicode. To get that right,
Python would need to tag I/O streams and other stuff with their
character set/encoding. And either Python would have to guess when it
didn't know (like looking at the locale's name), or, if it didn't,
programmers would guess to get rid of the annoyance of encoding
exceptions cropping up everywhere. Then at a later date we'd have to
clean up all the code with the bogus guesses, so the problem would
really just have been transformed to another problem...
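As it happens, this is close to what modern Python 3's I/O layer does: text streams carry an encoding tag and convert at the boundary, and the locale is consulted when no encoding is given - exactly the guessing worried about above. A sketch with `io.TextIOWrapper` (the ISO-8859-1 sample is illustrative):

```python
import io

# A text stream tagged with its encoding: TextIOWrapper converts
# between bytes and str at the boundary, so the program only ever
# sees characters.
raw = io.BytesIO("æøå\n".encode("iso-8859-1"))
text = io.TextIOWrapper(raw, encoding="iso-8859-1")
line = text.read()
print(repr(line))     # 'æøå\n' -- characters, not bytes
print(text.encoding)  # the stream remembers its own tag
```

Omitting `encoding=` makes the wrapper fall back to the locale's preferred encoding, which is the "bogus guess" scenario the post describes.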
 

Martin v. Löwis

Hallvard said:
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.

Why not sort depending on the locale instead of ordinal values of the
bytes/characters?


I'm in Norway. Both Swedes and Germans are foreigners.

I agree with many things you said, but this example is bogus. If I
(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
as you say, ö sorts with o in my language - yet the simple sorting of
ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.

Likewise, sorting *is* possible with Unicode if you take the locale into
account. The order of characters doesn't have to be the numerical one,
and, as you explain, it might even depend on the locale. So if you
want a Swedish collation, use a Swedish locale; if you want a German
collation, use a German locale.

Regards,
Martin
 

Hallvard B Furuseth

Martin said:
Hallvard said:
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.

Why not sort depending on the locale instead of ordinal values of the
bytes/characters?

I'm in Norway. Both Swedes and Germans are foreigners.

I agree with many things you said, but this example is bogus. If I
(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
as you say, ö sorts with o in my language - yet the simple sorting of
ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.

Sorry, I seem to have left out a vital point here: I thought the correct -
or rather, least incorrect - ns_4551-1 character for German ö was o, not
ø. Then it works out. Oh well, one learns something every day. Time
to check if there are other examples, or if I can forget it... Gotta
try an easy one - would you also translate German ä to æ rather than a?
Likewise, sorting *is* possible with Unicode if you take the locale
into account. The order of characters doesn't have to be the numerical
one, and, as you explain, it might even depend on the locale. So if
you want a Swedish collation, use a Swedish locale; if you want a
German collation, use a German locale.

And if I want to get both right, I need a sort_name field which is
distinct from the display_name field. There you would be lowis, while
the Swede Törnquist would be tørnquist. Or maybe lowis\tlöwis or
something; a kind of private implementation of strxfrm().
 

Martin v. Löwis

Ah, I missed the point that there is no ö in ns_4551-1. If so, then the
best way to represent the characters is to replace ö with "oe" and ä
with "ae"; replacing them merely with "o" and "a" would be considered
inadequate.
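This kind of transliteration is easy to sketch with `str.translate`; the table below covers only the German umlauts and ß and is illustrative, not a complete transliteration scheme:

```python
# Minimal German-to-ASCII transliteration table: ö -> oe, ä -> ae, etc.
# str.maketrans accepts multi-character replacement strings.
DE_TRANSLIT = str.maketrans({"ö": "oe", "ä": "ae", "ü": "ue", "ß": "ss",
                             "Ö": "Oe", "Ä": "Ae", "Ü": "Ue"})

print("Löwis".translate(DE_TRANSLIT))  # Loewis
print("Gruß".translate(DE_TRANSLIT))   # Gruss
```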
And if I want to get both right, I need a sort_name field which is
distinct from the display_name field. There you would be lowis, while
the Swede Törnquist would be tørnquist. Or maybe lowis\tlöwis or
something; a kind of private implementation of strxfrm().

But you can have a strxfrm for Unicode as well! There is nothing
inherent in Unicode that prevents using the same approach.
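Indeed, the stdlib exposes exactly this: `locale.strxfrm` produces locale-aware sort keys for Unicode strings. A sketch (the `nb_NO.UTF-8` locale name is an assumption; if it isn't installed, the default "C" collation is used instead):

```python
import locale

# Try to activate Norwegian collation rules; fall back silently if
# that locale isn't installed on this system.
try:
    locale.setlocale(locale.LC_COLLATE, "nb_NO.UTF-8")
except locale.Error:
    pass  # keep the default "C" collation

names = ["Tørnquist", "Törnquist", "Tornquist", "Løwis"]
ordered = sorted(names, key=locale.strxfrm)
print(ordered)  # order depends on the active locale's collation rules
```

The catch the thread identifies remains: one active locale means one set of rules, so mixed Norwegian/Swedish/German data still can't all be collated correctly at once.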

Of course, the question always is what result you *want*: If you
have text that contains simultaneously Latin and Greek characters,
how would you like to collate it? Neither the German nor the Greek
collation rules are likely to help, as they don't consider the issue
of additional alphabets. If possible, you should assign a language
tag to each entry, and then sort first by language, then according
to the language's collation rules.

Regards,
Martin
 

Hallvard B Furuseth

Martin said:
Ah, I missed the point that there is no ö in ns_4551-1. If so, then the
best way to represent the characters is to replace ö with "oe" and ä
with "ae"; replacing them merely with "o" and "a" would be considered
inadequate.

Duh. Of course. We usually did that too when we had to write Norwegian
in ASCII. It bites sometimes, though - like when it hits the common '1
character = 1 byte' assumption which someone (John Roth?) mentioned.
Maybe that's why we are getting ø->o in e-mail addresses and such
things nowadays, to keep things simple.

In a way, it is rather nice to notice that I'm forgetting that stuff.
Maybe someday I won't even be able to read texts with {|} for æøå
without slowing down. :)
But you can have a strxfrm for Unicode as well! There is nothing
inherent in Unicode that prevents using the same approach.

Not after you have discarded the information which says whether to sort
ö as ø or o.
Of course, the question always is what result you *want*: If you
have text that contains simultaneously Latin and Greek characters,
how would you like to collate it? Neither the German or Greek
collation rules are likely to help, as they don't consider the issue
of additional alphabets.

True enough. But when you mix entirely different scripts, you have
worse problems anyway; you'll often need to transliterate your name to
the local script - or to something close to English, I guess. A written
name in a script the locals can't read isn't particularly useful.
If possible, you should assign a language tag to each entry, and then
sort first by language, then according to the language's collation
rules.

That sounds very wrong for lists that are sorted for humans to search,
unless I misunderstand you. That would place all Swedes after all
Norwegians in the phone book, for example. And if you aren't sure of
the nationality of someone, you'd have to look through all foreign
languages that are present.
 

Peter Otten

Hallvard said:
If you disagree with that, is that because you think of Unicode as The
One True Character Set which everything can assume is in use if not
otherwise specified? That's a long way from the world I'm living in.

It's even worse. I think conceptually there is a "One True Character Set" of
which unicode is the closest approximation -- yes, I know that this
position is "idealism" by its philosophical definition.
And the thing about standards is that there are so many of them to
choose from. Enforcing a standard somewhere in an environment where
that is not the standard is not useful. Try the standard of driving on
the right side of the road in a country where everyone else drives on
the left side. Standards are supposed to serve us; it's not we who are
supposed to serve standards.

If you go to GB from the continent it is clear that you have to switch
lanes. You can still get it wrong, but either completely or not at all.

Now consider a road you can drive on in many directions, say 100, with two
or three directions allowed simultaneously in one country. The best
available method to find out the correct direction would be to drive a few
kilometers and then get out of the car and look for damages in the car's
body. If there are dents you had an accident, so either you or another car
took the wrong lane...
How is it that so many drive faithfully, then? The dominant car-make has a
preference built-in. When they drive on the internet, everyone ignores the
signs and just drives on the same lane as everybody else...

By the way, I'm not "fond" of unicode. There may even be problems that
cannot be solved in principle by a universal standard (like your sorting
across three locales). I just think unicode would make a better default
than what we have now and many apps that will break in the transition are
broken now - you just didn't realize it.

Peter
 
