Unicode question

  • Thread starter Gerhard Häring

Gerhard Häring

>>> u"äöü"
u'\x84\x94\x81'

(Python 2.2.3/2.3b2; sys.getdefaultencoding() == "ascii")

Why does this work?

Does Python guess which encoding I mean? I thought Python should refuse
to guess :)


-- Gerhard
 
Thomas Heller

Gerhard Häring said:
u'\x84\x94\x81'

(Python 2.2.3/2.3b2; sys.getdefaultencoding() == "ascii")

Why does this work?

Does Python guess which encoding I mean? I thought Python should
refuse to guess :)

I stumbled over this yesterday, and it seems it is (at least) partially
answered by PEP 263:

In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the programming
environment rather unfriendly to Python users who live and work in
non-Latin-1 locales such as many of the Asian countries. Programmers
can write their 8-bit strings using the favorite encoding, but are
bound to the "unicode-escape" encoding for Unicode literals.

I have the impression that this is undocumented on purpose, because you
should not write unescaped non-ansi characters into the source file
(with 'unknown' encoding).
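
To make the byte values in the question concrete, here is a minimal sketch in modern Python 3. It assumes the console that produced the output used a DOS code page such as cp850 (the thread does not say which one), and the function name `reinterpret_as_latin1` is made up for illustration. It shows that the cp850 bytes for "äöü", read as Latin-1, give exactly the code points in the question.

```python
def reinterpret_as_latin1(text, console_encoding="cp850"):
    """Encode text as the console would, then decode the bytes as Latin-1.

    console_encoding is an assumption; the thread does not name the code page.
    """
    raw = text.encode(console_encoding)  # bytes the terminal hands to Python
    return raw, raw.decode("latin-1")    # Latin-1 maps each byte to the same code point


raw, misread = reinterpret_as_latin1("äöü")
print(repr(raw))      # the three console bytes
print(repr(misread))  # the code points the old interpreter reported
```

So nothing is guessed from the characters themselves: each byte is simply taken as the code point with the same number.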

Thomas
 
Gerhard Häring

Thomas said:
I stumbled over this yesterday, and it seems it is (at least) partially
answered by PEP 263:

In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the programming
environment rather unfriendly to Python users who live and work in
non-Latin-1 locales such as many of the Asian countries. Programmers
can write their 8-bit strings using the favorite encoding, but are
bound to the "unicode-escape" encoding for Unicode literals.

I have the impression that this is undocumented on purpose, because you
should not write unescaped non-ansi characters into the source file
(with 'unknown' encoding).

I agree that using latin1 as default is bad. If there's an encoding
cookie in the 2.3+ source file then this encoding could be used.
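
As a rough illustration of what the encoding cookie changes, the sketch below (modern Python 3, where `compile()` accepts bytes and honours a PEP 263 declaration) builds the same literal bytes under two different cookies. The helper name `literal_value` and the choice of cp850 are assumptions made for the example, not something from this thread.

```python
def literal_value(cookie_encoding, literal_bytes):
    """Compile a tiny module whose only statement is s = u'<literal_bytes>'.

    The first source line is a PEP 263 cookie naming cookie_encoding.
    Returns the resulting string object.
    """
    source = (
        b"# -*- coding: " + cookie_encoding.encode("ascii") + b" -*-\n"
        + b"s = u'" + literal_bytes + b"'\n"
    )
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace["s"]


same_bytes = b"\x84\x94\x81"
print(repr(literal_value("cp850", same_bytes)))       # decoded via the declared code page
print(repr(literal_value("iso-8859-1", same_bytes)))  # decoded as Latin-1
```

The bytes in the file are identical in both calls; only the declared encoding decides which characters the literal ends up containing.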

I stumbled on this when giving another Python user on this list a
pointer to the relevant section in the Python tutorial
(http://www.python.org/doc/current/tut/node5.html#SECTION005130000000000000000)
where Guido uses u"äöü" in an example.

As this is BAD the tutorial should probably be changed. I'll file a bug
report.

-- Gerhard
 
Guest

Gerhard said:
Ricardo said:
u"äöü"

u'\x84\x94\x81'
[this works, but IMO shouldn't]
[...]
You'll get warnings if you don't define an encoding (either encoding
cookie or BOM) and use 8-Bit characters in your source files. These
warnings will become errors in later Python versions.

It's all in the PEP :)
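
For what it is worth, the "warnings become errors" part can be checked in a current Python 3 with a small sketch like the one below. The helper name `compiles` and the cp850 cookie are assumptions for the example. Python 3 treats undeclared source as UTF-8, so the raw 8-bit bytes are rejected unless a cookie is present.

```python
def compiles(source_bytes):
    """Return True if the bytes compile as Python source, False otherwise."""
    try:
        compile(source_bytes, "<example>", "exec")
        return True
    except (SyntaxError, ValueError):
        # Which exception type is raised is treated as an implementation
        # detail here; UnicodeDecodeError is a subclass of ValueError.
        return False


undeclared = b"s = '\x84\x94\x81'\n"                # 8-bit bytes, no cookie
declared = b"# -*- coding: cp850 -*-\n" + undeclared  # same bytes with a cookie

print(compiles(undeclared))
print(compiles(declared))
```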

I feel like an idiot now :-( I do get the warnings when I run a Python
script, but I do not get the warnings when I'm using the interactive
prompt. So it's all good (almost). Why not also produce warnings at the
interactive prompt?

-- Gerhard
 
