Unicode characters in byte-strings

Steven D'Aprano · Mar 12, 2010

I know this is wrong, but I'm not sure just how wrong it is, or why.
Using Python 2.x:
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']

Can somebody explain what happens when I put non-ASCII characters into a
non-unicode string? My guess is that the result will depend on the
current encoding of my terminal.

In this case, my terminal is set to UTF-8. If I change it to ISO 8859-1,
and repeat the above, I get this:
['\xe9', '\xe2', '\xc4']

If I do this:
'\xe9\xe2\xc4'

which at least explains why the bytes have the values which they do.
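
As a rough illustration of the two results above, here is a minimal sketch in Python 3 syntax (the thread itself concerns Python 2.x, where a plain "..." literal already holds bytes). It assumes the three characters typed were é, â and Ä, which is consistent with the byte values shown; the variable names are made up for the example.

```python
# The same three characters, written with escapes so this file's own
# encoding does not matter: e-acute, a-circumflex, A-diaeresis.
text = "\u00e9\u00e2\u00c4"

# What a UTF-8 terminal would send for those characters.
utf8_bytes = text.encode("utf-8")

# What an ISO 8859-1 terminal would send for the same characters.
latin1_bytes = text.encode("iso-8859-1")

print([hex(b) for b in utf8_bytes])    # six bytes, two per character
print([hex(b) for b in latin1_bytes])  # three bytes, one per character
```

The characters are the same in both cases; only the byte representation chosen by the terminal differs.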

Thank you,

Robert Kern · Mar 12, 2010

Steven D'Aprano said:
I know this is wrong, but I'm not sure just how wrong it is, or why.
Using Python 2.x:
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']

Can somebody explain what happens when I put non-ASCII characters into a
non-unicode string? My guess is that the result will depend on the
current encoding of my terminal.

Exactly right.

--
Robert Kern

Martin v. Loewis · Mar 12, 2010

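Steven D'Aprano said:
Can somebody explain what happens when I put non-ASCII characters into a
non-unicode string? My guess is that the result will depend on the
current encoding of my terminal.

Robert Kern said:
Exactly right.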

To elaborate on the "what happens" part: the string that gets entered is
typically passed as a byte sequence, from the terminal (application) to
the OS kernel, from the OS kernel to Python's stdin, and from there to
the parser. Python recognizes the string delimiters, but (practically)
leaves the bytes between the delimiters as-is (*), creating a byte
string object with the very same bytes.
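
The pass-through described above can be modelled roughly as follows. This is only a toy sketch in Python 3 syntax, not the real tokenizer: the in-memory stream stands in for stdin, a UTF-8 terminal is assumed, and the "parser" is just a search for the two quote characters.

```python
import io

# Stand-in for what a UTF-8 terminal sends when the user types a line
# containing a plain string literal with three non-ASCII characters.
typed_line = 's = "\u00e9\u00e2\u00c4"\n'.encode("utf-8")
stdin = io.BytesIO(typed_line)

line = stdin.readline()

# Toy "parser": locate the string delimiters and keep whatever bytes
# sit between them, without decoding or otherwise interpreting them.
start = line.index(b'"') + 1
end = line.rindex(b'"')
literal_bytes = line[start:end]

print(len(literal_bytes))                   # counts bytes, not characters
print([bytes([b]) for b in literal_bytes])  # one entry per byte
```

Because nothing between the delimiters is interpreted, the resulting object's length and contents follow the terminal's encoding rather than the number of characters typed.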

The more interesting question is what happens when you do

py> s = u"éâÄ"

Here, Python needs to decode the bytes, according to some encoding.
Usually, it would want to use the source encoding (as given through
-*- Emacs -*- markers), but there are none. Various Python versions then
try different things; what they should do is to determine the terminal
encoding, and decode the bytes according to that one.
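
To see why the choice of encoding matters here, consider a small sketch (Python 3 syntax; `decode_literal` is a hypothetical helper, not anything in the interpreter). It decodes the six bytes a UTF-8 terminal would send, once with the matching encoding and once with a mismatched one.

```python
def decode_literal(raw_bytes, encoding):
    """Hypothetical helper: turn the raw bytes of a u"..." literal into text."""
    return raw_bytes.decode(encoding)

# Bytes sent by a UTF-8 terminal for the three characters in the example.
raw = b"\xc3\xa9\xc3\xa2\xc3\x84"

# Decoding with the terminal's real encoding recovers the three characters.
matching = decode_literal(raw, "utf-8")

# Decoding with the wrong encoding "succeeds" but yields six wrong characters.
mismatched = decode_literal(raw, "iso-8859-1")

print(len(matching), len(mismatched))
```

The mismatched case raises no error, which is why a wrong guess about the terminal encoding tends to show up only later as garbled text.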

Regards,
Martin

(*) If a source encoding was given, the source is actually recoded to
UTF-8, parsed, and then re-encoded back into the original encoding.
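
The round trip in the footnote can be sketched like this. It is a simplified model in Python 3 syntax, assuming an ISO 8859-1 source encoding and using a quote search in place of the real parser; none of the names come from the interpreter's source.

```python
source_encoding = "iso-8859-1"  # assumed declared source encoding

# Source text as it sits on disk, in the declared encoding.
source_bytes = 's = "\u00e9\u00e2\u00c4"'.encode(source_encoding)

# Step 1: recode the whole source to UTF-8.
utf8_source = source_bytes.decode(source_encoding).encode("utf-8")

# Step 2: "parse" the UTF-8 source (here: just find the string delimiters).
start = utf8_source.index(b'"') + 1
end = utf8_source.rindex(b'"')
utf8_literal = utf8_source[start:end]

# Step 3: re-encode the literal's contents back into the source encoding.
literal_bytes = utf8_literal.decode("utf-8").encode(source_encoding)

print(literal_bytes)
```

In this model the byte string ends up with the same bytes that were in the source file, even though the parser only ever saw UTF-8.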

Michael Rudolf · Mar 12, 2010

On 12.03.2010 21:56, Martin v. Loewis wrote:

(*) If a source encoding was given, the source is actually recoded to
UTF-8, parsed, and then re-encoded back into the original encoding.

Why is that? So "unicode"-strings (as in u"string") are not really
unicode-, but utf8-strings?

Need citation plz.

Thx,
Michael

John Bokma · Mar 12, 2010

Michael Rudolf said:
On 12.03.2010 21:56, Martin v. Loewis wrote:

Why is that? So "unicode"-strings (as in u"string") are not really
unicode-, but utf8-strings?

utf8 is a Unicode *encoding*.
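
A short sketch of that distinction, in Python 3 syntax (where text strings are what 2.x calls unicode strings): a character has one code point, and UTF-8 is just one of several ways to turn that code point into bytes.

```python
char = "\u00e9"  # e-acute, written as an escape

# The code point is a property of the character itself.
code_point = ord(char)

# Each Unicode encoding maps that same code point to different bytes.
as_utf8 = char.encode("utf-8")
as_utf16 = char.encode("utf-16-le")
as_utf32 = char.encode("utf-32-le")

print(hex(code_point), as_utf8, as_utf16, as_utf32)
```

All three byte sequences decode back to the same single character, so none of them is "the" Unicode string; they are encodings of it.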

Martin v. Loewis · Mar 12, 2010

Michael said:
On 12.03.2010 21:56, Martin v. Loewis wrote:

Why is that?

Why is what? That string literals get reencoded into the source encoding?

So "unicode"-strings (as in u"string") are not really
unicode-, but utf8-strings?

No. Plain string literals, in 2.x, are the ones written without u"", and
they are stored in the source encoding. The above procedure applies to those
regular strings (see where the "*" goes in my original article).

Need citation plz.

You really want a link to the source code implementing that?

Regards,
Martin
