Unicode characters in byte-strings

Steven D'Aprano

I know this is wrong, but I'm not sure just how wrong it is, or why.
Using Python 2.x:
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']

Can somebody explain what happens when I put non-ASCII characters into a
non-unicode string? My guess is that the result will depend on the
current encoding of my terminal.

In this case, my terminal is set to UTF-8. If I change it to ISO 8859-1,
and repeat the above, I get this:
['\xe9', '\xe2', '\xc4']

If I do this:
'\xe9\xe2\xc4'

which at least explains why the bytes have the values which they do.


Thank you,
 
Robert Kern

Steven D'Aprano wrote:
> I know this is wrong, but I'm not sure just how wrong it is, or why.
> Using Python 2.x:
> ['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']
>
> Can somebody explain what happens when I put non-ASCII characters into a
> non-unicode string? My guess is that the result will depend on the
> current encoding of my terminal.

Exactly right.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Martin v. Loewis

>> Can somebody explain what happens when I put non-ASCII characters into a
>> non-unicode string?
>
> Exactly right.

To elaborate on the "what happens" part: the string that gets entered is
typically passed as a byte sequence, from the terminal (application) to
the OS kernel, from the OS kernel to Python's stdin, and from there to
the parser. Python recognizes the string delimiters, but (practically)
leaves the bytes between the delimiters as-is (*), creating a byte
string object with the very same bytes.
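
As a rough illustration of that point, here is a small sketch (not taken from the interpreter source) that plays the terminal's role by encoding the same three characters under the two encodings mentioned in this thread. The helper name `bytes_as_hex` is made up for the example, and the text is written with escapes so the encoding of the example file itself does not matter.

```python
# Sketch: the same three characters become different byte sequences
# depending on which encoding the terminal uses to send them.

# e-acute, a-circumflex, A-diaeresis, given as escapes (assumed test input)
TEXT = u"\xe9\xe2\xc4"


def bytes_as_hex(text, encoding):
    """Encode `text` and return its bytes as two-digit hex strings."""
    return ["%02x" % b for b in bytearray(text.encode(encoding))]


utf8_bytes = bytes_as_hex(TEXT, "utf-8")
latin1_bytes = bytes_as_hex(TEXT, "iso-8859-1")

print(utf8_bytes)    # six bytes: c3 a9 c3 a2 c3 84
print(latin1_bytes)  # three bytes: e9 e2 c4
```

The two lists match the two outputs in the original post: six bytes with a UTF-8 terminal, three with ISO 8859-1.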

The more interesting question is what happens when you do

py> s = u"éâÄ"

Here, Python needs to decode the bytes, according to some encoding.
Usually, it would want to use the source encoding (as given through
-*- Emacs -*- markers), but there are none. Various Python versions then
try different things; what they should do is to determine the terminal
encoding, and decode the bytes according to that one.
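
To make the decoding step concrete, here is a minimal sketch of what goes right and wrong, assuming the terminal sent UTF-8. It only shows the effect of choosing an encoding; it does not claim to reproduce what any particular Python version does internally. The `"ascii"` fallback is an assumption for the case where stdin reports no encoding.

```python
import sys

# What a UTF-8 terminal sends for the three characters in u"..." above.
RAW = b"\xc3\xa9\xc3\xa2\xc3\x84"

# Decoding with the encoding the terminal really used gives three characters.
right = RAW.decode("utf-8")

# Decoding the same bytes as ISO 8859-1 does not fail, but it yields
# six characters, none of which is the one that was typed.
wrong = RAW.decode("iso-8859-1")

# sys.stdin.encoding is the interpreter's idea of the terminal encoding.
# It may be missing or None when stdin is not a terminal, hence the
# fallback, which is an assumption made for this sketch.
terminal_encoding = getattr(sys.stdin, "encoding", None) or "ascii"

print(len(right))  # 3
print(len(wrong))  # 6
print(terminal_encoding)
```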

Regards,
Martin

(*) If a source encoding was given, the source is actually recoded to
UTF-8, parsed, and then re-encoded back into the original encoding.
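
The round trip described in the footnote can be mimicked by hand. This is only a sketch of the idea, assuming a file declared as ISO 8859-1; the variable names are invented, and the real work happens inside the tokenizer and parser, not in Python code like this.

```python
# Assumed source encoding, as it would be given in a coding declaration.
SOURCE_ENCODING = "iso-8859-1"

# The bytes between the quotes of a plain (non-u"") literal, as stored on disk.
literal_in_file = b"\xe9\xe2\xc4"

# Step 1: the source is recoded to UTF-8 before it is parsed.
as_utf8 = literal_in_file.decode(SOURCE_ENCODING).encode("utf-8")

# Step 2: after parsing, the literal is re-encoded into the source encoding,
# so the resulting byte string has the same bytes as the file.
byte_string = as_utf8.decode("utf-8").encode(SOURCE_ENCODING)

print(byte_string == literal_in_file)  # True
print(len(as_utf8), len(byte_string))  # 6 3
```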
 
Michael Rudolf

On 12.03.2010 21:56, Martin v. Loewis wrote:
> (*) If a source encoding was given, the source is actually recoded to
> UTF-8, parsed, and then re-encoded back into the original encoding.

Why is that? So "unicode"-strings (as in u"string") are not really
unicode-, but utf8-strings?

Need citation plz.

Thx,
Michael
 
John Bokma

Michael Rudolf said:
> Why is that? So "unicode"-strings (as in u"string") are not really
> unicode-, but utf8-strings?

utf8 is a Unicode *encoding*.
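
A short sketch of the distinction, using the three characters from the start of the thread: a unicode string is a sequence of code points, and UTF-8 is one of several ways to turn those code points into bytes. UTF-16 (little-endian) is used here purely as a second encoding for comparison.

```python
# e-acute, a-circumflex, A-diaeresis as a unicode string (escapes keep
# the example independent of the file's own encoding).
TEXT = u"\xe9\xe2\xc4"

# The string itself is made of code points, not bytes.
code_points = [ord(ch) for ch in TEXT]

# Two different encodings of the same code points.
utf8 = TEXT.encode("utf-8")
utf16 = TEXT.encode("utf-16-le")

print(code_points)    # [233, 226, 196]
print(utf8 != utf16)  # True: different bytes for the same text

# Both byte sequences decode back to the same unicode string.
print(utf8.decode("utf-8") == utf16.decode("utf-16-le") == TEXT)  # True
```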
 
Martin v. Loewis

Michael said:
> Why is that?

Why is what? That string literals get re-encoded into the source encoding?

> So "unicode"-strings (as in u"string") are not really
> unicode-, but utf8-strings?

No. String literals, in 2.x, are not written with u"", and are stored in
the source encoding. The above procedure applies to regular strings (see
where the "*" goes in my original article).

> Need citation plz.

You really want a link to the source code implementing that?

Regards,
Martin
 
