What encoding does u'...' syntax use?


Ron Garret

I would have thought that the answer would be: the default encoding
(duh!) But empirically this appears not to be the case:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)
µ

(That last character shows up as a micron sign despite the fact that my
default encoding is ascii, so it seems to me that that unicode string
must somehow have picked up a latin-1 encoding.)

rg
 

Stefan Behnel

Ron said:
I would have thought that the answer would be: the default encoding
(duh!) But empirically this appears not to be the case:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)
µ

(That last character shows up as a micron sign despite the fact that my
default encoding is ascii, so it seems to me that that unicode string
must somehow have picked up a latin-1 encoding.)

You are mixing up console output and internal data representation. What you
see in the last line is what the Python interpreter makes of your unicode
string when passing it into stdout, which in your case seems to use a
latin-1 encoding (check your environment settings for that).
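
A quick way to see that split between the unicode object and the bytes the console receives, as a sketch (the 'UTF-8' value shown for sys.stdout.encoding is only an example; yours may differ, or be None when output is redirected):

>>> import sys
>>> s = u'\xb5'
>>> repr(s)                 # the unicode object itself, independent of any console
"u'\\xb5'"
>>> sys.stdout.encoding     # whatever your terminal advertises
'UTF-8'
>>> print s                 # encoded with that codec on the way out to stdout
µ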

BTW, Unicode is not an encoding. Wikipedia will tell you more.

Stefan
 

Stefan Behnel

Stefan said:
What you
see in the last line is what the Python interpreter makes of your unicode
string when passing it into stdout, which in your case seems to use a
latin-1 encoding (check your environment settings for that).

The "seems to" is misleading. The example doesn't actually tell you
anything about the encoding used by your console, except that it can
display non-ASCII characters.
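
In other words, several console encodings produce byte sequences that render as the same character; a sketch of what gets written in each case:

>>> u'\xb5'.encode('latin-1')   # what a latin-1 terminal is sent
'\xb5'
>>> u'\xb5'.encode('utf-8')     # what a utf-8 terminal is sent; both render as the micro sign
'\xc2\xb5'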

Stefan
 

Ron Garret

Stefan Behnel said:
You are mixing up console output and internal data representation. What you
see in the last line is what the Python interpreter makes of your unicode
string when passing it into stdout, which in your case seems to use a
latin-1 encoding (check your environment settings for that).

BTW, Unicode is not an encoding. Wikipedia will tell you more.

Yes, I know that. But every concrete representation of a unicode string
has to have an encoding associated with it, including unicode strings
produced by the Python parser when it parses the ascii string "u'\xb5'"

My question is: what is that encoding? It can't be ascii. So what is
it?

Put this another way: I would have thought that when the Python parser
parses "u'\xb5'" it would produce the same result as calling
unicode('\xb5'), but it doesn't. Instead it seems to produce the same
result as calling unicode('\xb5', 'latin-1'). But my default encoding
is not latin-1, it's ascii. So where is the Python parser getting its
encoding from? Why does parsing "u'\xb5'" not produce the same error as
calling unicode('\xb5')?
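
Concretely, the behaviour I'm puzzling over looks like this (a sketch; my default encoding is ascii):

>>> u'\xb5'
u'\xb5'
>>> unicode('\xb5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)
>>> unicode('\xb5', 'latin-1')
u'\xb5'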

rg
 

Terry Reedy

Ron said:
I would have thought that the answer would be: the default encoding
(duh!) But empirically this appears not to be the case:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)

The unicode function is usually used to decode bytes read from *external
sources*, each of which can have its own encoding. So the function
(actually, developer crew) refuses to guess and uses the ascii common
subset.

Unicode literals are *in the source file*, which can only have one
encoding (for a given source file).
Ron also said:
(That last character shows up as a micron sign despite the fact that my
default encoding is ascii, so it seems to me that that unicode string
must somehow have picked up a latin-1 encoding.)

I think latin-1 was the default without a coding cookie line. (May be
utf-8 in 3.0).
 

Matthew Woodcraft

Ron Garret said:
Put this another way: I would have thought that when the Python parser
parses "u'\xb5'" it would produce the same result as calling
unicode('\xb5'), but it doesn't. Instead it seems to produce the same
result as calling unicode('\xb5', 'latin-1'). But my default encoding
is not latin-1, it's ascii. So where is the Python parser getting its
encoding from? Why does parsing "u'\xb5'" not produce the same error
as calling unicode('\xb5')?

There is no encoding involved other than ascii, only processing of a
backslash escape.

The backslash escape '\xb5' is converted to the unicode character whose
ordinal number is B5h. This gives the same result as
"\xb5".decode("latin-1") because the unicode numbering is the same as
the 'latin-1' numbering in that range.
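
A couple of quick checks along those lines (a sketch):

>>> ord(u'\xb5')                          # the escape directly names code point 0xB5
181
>>> u'\xb5' == unichr(0xb5)
True
>>> u'\xb5' == '\xb5'.decode('latin-1')   # equal only because latin-1 mirrors Unicode in 0..255
True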

-M-
 

Martin v. Löwis

Ron said:
Yes, I know that. But every concrete representation of a unicode string
has to have an encoding associated with it, including unicode strings
produced by the Python parser when it parses the ascii string "u'\xb5'"

My question is: what is that encoding?

The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).

Ron said:
Put this another way: I would have thought that when the Python parser
parses "u'\xb5'" it would produce the same result as calling
unicode('\xb5'), but it doesn't.

Right. In the former case, \xb5 denotes a Unicode character, namely
U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same
as u"\N{MICRO SIGN}". By "the same", I mean "the very same".

OTOH, unicode('\xb5') is something entirely different. '\xb5' is a
byte string with length 1, with a single byte with the numeric
value 0xb5, or 181. It does not, per se, denote any specific character.
It only gets a character meaning when you try to decode it to unicode,
which you do with unicode('\xb5'). This is short for

unicode('\xb5', sys.getdefaultencoding())

and sys.getdefaultencoding() is (or should be) "ascii". Now, in
ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote
a character at all), hence you get a UnicodeError.

Ron said:
Instead it seems to produce the same result as calling
unicode('\xb5', 'latin-1').

Sure. However, this is only by coincidence, because latin-1 has the same
code points as Unicode (for 0..255).

Ron said:
But my default encoding is not latin-1, it's ascii. So where is the
Python parser getting its encoding from? Why does parsing "u'\xb5'" not
produce the same error as calling unicode('\xb5')?

Because \xb5 *directly* refers to character U+00b5, with no
byte-oriented encoding in-between.
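
Spelled out in a session (a sketch; assumes the default encoding is ascii):

>>> import sys, unicodedata
>>> u'\xb5' == u'\u00b5' == u'\N{MICRO SIGN}'
True
>>> unicodedata.name(u'\xb5')
'MICRO SIGN'
>>> sys.getdefaultencoding()
'ascii'
>>> unicode('\xb5', 'latin-1')   # same value, but only because latin-1 matches Unicode for 0..255
u'\xb5'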

Regards,
Martin
 

Martin v. Löwis

Terry Reedy said:
Unicode literals are *in the source file*, which can only have one
encoding (for a given source file).

I think latin-1 was the default without a coding cookie line. (May be
utf-8 in 3.0).

It is, but that's irrelevant for the example. In the source

u'\xb5'

all characters are ASCII (i.e. all of "letter u", "single
quote", "backslash", "letter x", "letter b", "digit 5").
As a consequence, this source text has the same meaning in all
supported source encodings (as source encodings must be ASCII
supersets).

The Unicode literal shown here does not get its interpretation
from Latin-1. Instead, it directly gets its interpretation from
the Unicode coded character set. The string is a short-hand
for

u'\u00b5'

and this denotes character U+00B5 (just as u'\u20ac' denotes
U+20AC; the same holds for any other u'\uXXXX').

HTH,
Martin
 

Ron Garret

Martin v. Löwis said:
The Unicode literal shown here does not get its interpretation
from Latin-1. Instead, it directly gets its interpretation from
the Unicode coded character set. The string is a short-hand
for u'\u00b5'.

Ah, that makes sense. Thanks!

rg
 

Ron Garret

Martin v. Löwis said:
Because \xb5 *directly* refers to character U+00b5, with no
byte-oriented encoding in-between.

OK, I think I get it now. Thanks!

rg
 

Terry Reedy

Martin v. Löwis wrote:
It is, but that's irrelevant for the example. In the source

u'\xb5'

all characters are ASCII (i.e. all of "letter u", "single
quote", "backslash", "letter x", "letter b", "digit 5").
As a consequence, this source text has the same meaning in all
supported source encodings (as source encodings must be ASCII
supersets).

I think I understand now that the coding cookie only matters if I use an
editor that actually stores *non-ascii* bytes in the file for the Python
parser to interpret.
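
For instance (a sketch; the file name is made up), a file that really contains the raw byte 0xB5 needs the cookie so the parser knows how to decode it:

# mu_latin1.py -- saved in latin-1, so the literal below holds the single byte 0xB5
# -*- coding: latin-1 -*-
s = u'µ'
print repr(s)    # u'\xb5'

Saved as UTF-8 instead, the same character would be the two bytes 0xC2 0xB5 and the cookie would have to say utf-8; with no cookie at all, recent 2.x versions reject the non-ascii byte with a SyntaxError.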
 

Aahz

Martin v. Löwis said:
The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).

Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
countless threads about the distinction between UTF and UCS?
 

Thorsten Kampe

* "Martin v. Löwis" (Sat, 21 Feb 2009 00:15:08 +0100)
The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).

I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a
slight difference to UTF-16/UTF-32).

Thorsten
 

Denis Kasak

Thorsten Kampe said:
I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a
slight difference to UTF-16/UTF-32).

I wouldn't call the difference that slight, especially between UTF-16
and UCS-2, since the former can encode all Unicode code points, while
the latter can only encode those in the BMP.
 

Martin v. Löwis

Aahz said:
Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
countless threads about the distinction between UTF and UCS?

You are not misremembering. I personally never found them conclusive,
and, with PEP 261, I think, calling the 2-byte version "UCS-2" is
incorrect.

Regards,
Martin
 

Martin v. Löwis

Denis Kasak said:
I wouldn't call the difference that slight, especially between UTF-16
and UCS-2, since the former can encode all Unicode code points, while
the latter can only encode those in the BMP.

Indeed. As Python *can* encode all characters even in 2-byte mode
(since PEP 261), it seems clear that Python's Unicode representation
is *not* strictly UCS-2 anymore.
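
One way to see which variant a given interpreter was built with (a sketch; the outputs shown assume a narrow, 2-byte build):

>>> import sys
>>> sys.maxunicode        # 65535 on a narrow build, 1114111 on a wide build
65535
>>> len(u'\U00010000')    # a non-BMP character is stored as a surrogate pair on a narrow build
2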

Regards,
Martin
 

Denis Kasak

Martin v. Löwis said:
Indeed. As Python *can* encode all characters even in 2-byte mode
(since PEP 261), it seems clear that Python's Unicode representation
is *not* strictly UCS-2 anymore.

Since we're already discussing this, I'm curious - why was UCS-2
chosen over plain UTF-16 or UTF-8 in the first place for Python's
internal storage?
 

Adam Olsen

Aahz said:
Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
countless threads about the distinction between UTF and UCS?

Nope, that's partly mislabeling and partly a bug. UCS-2/UCS-4 refer
to Unicode 1.1 and earlier, with no surrogates. We target Unicode
5.1.

If you naively encode UCS-2 as UTF-8 you really end up with CESU-8.
You miss the step where you combine surrogate pairs (which only exist
in UTF-16) into a single supplementary character. Lo and behold,
that's actually what current python does in some places. It's not
pretty.

See bugs #3297 and #3672.
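
The combining step is plain arithmetic; a sketch of what a conforming UTF-16 decoder does with a surrogate pair (CESU-8 skips this and encodes the two surrogates separately):

>>> hi, lo = 0xD800, 0xDC00                        # first high/low surrogate pair
>>> hex(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
'0x10000'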
 

Martin v. Löwis

Denis Kasak said:
Since we're already discussing this, I'm curious - why was UCS-2
chosen over plain UTF-16 or UTF-8 in the first place for Python's
internal storage?

You mean, originally? Originally, the choice was only between UCS-2
and UCS-4; choice was in favor of UCS-2 because of size concerns.
UTF-8 was ruled out easily because it doesn't allow constant-size
indexing; UTF-16 essentially for the same reason (plus there was
no point to UTF-16, since there were no assigned characters outside
the BMP).
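
The indexing point in concrete terms (a sketch):

>>> s = u'\u00b5\u20ac'          # MICRO SIGN + EURO SIGN: two characters
>>> len(s.encode('utf-8'))       # 2 + 3 bytes, so character n has no fixed byte offset
5
>>> len(s.encode('utf-16-le'))   # 2 bytes per BMP character, so s[n] is a constant-time lookup
4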

Regards,
Martin
 

Denis Kasak

Martin v. Löwis said:
You mean, originally? Originally, the choice was only between UCS-2
and UCS-4; choice was in favor of UCS-2 because of size concerns.
UTF-8 was ruled out easily because it doesn't allow constant-size
indexing; UTF-16 essentially for the same reason (plus there was
no point to UTF-16, since there were no assigned characters outside
the BMP).

Yes, I failed to realise how long ago the unicode data type was
implemented originally. :)
Thanks for the explanation.
 
