unicode question

wolfgang haefelinger · Nov 20, 2004

Hi,

I wonder whether someone could explain to me a bit what's going on here:

import sys

# I'm running Mandrake 10 and Windows XP.
print sys.version

## 2.3.3 (#2, Feb 17 2004, 11:45:40) [GCC 3.3.2 (Mandrake Linux 10.0
3.3.2-6mdk)]
## 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)]

print "sys.getdefaultencoding = ",sys.getdefaultencoding()
# This prints always "ascii" ..

## just a class
class Y:
    def __str__(self):
        return self.c

## define unicode character (ie. string)
gamma = u"\N{GREEK CAPITAL LETTER GAMMA}"

y = Y()
y.c = gamma

## works fine: prints greek capital gamma on terminal on windows (chcp 437).
## On Mandrake 10 nothing gets printed, but at least no exception gets thrown.
print gamma # (1)

## same as before ..
print y.__str__() # (2)

## encoding error
print y # (3) ??????????????

## ascii encoding error ..
sys.stdout.write(gamma) # (4)

I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__() . But Y.__str__() can be printed?? So what is 'print' exactly
doing?

Thanks for any help,
Wolfgang.

Martin v. Löwis · Nov 20, 2004

wolfgang said:
I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__() . But Y.__str__() can be printed?? So what is 'print' exactly
doing?

It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.

Regards,
Martin

Kent Johnson · Nov 21, 2004

Martin said:
It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.

I hate to contradict an expert, but ISTM that it is
sys.getdefaultencoding() ('ascii') that is the problem, not
sys.stdout.encoding ('cp437')

gamma converts to cp437 just fine:
Γ
(prints a gamma)

Trying to encode gamma using the 'ascii' codec doesn't work:
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0393' in
position 0: ordinal not in range(128)
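
The two experiments above can be reproduced as a short script. This is only a sketch in Python 3 syntax (an assumption; the thread itself uses Python 2.3), where a plain string literal is already Unicode and plays the role of Python 2's unicode type:

```python
# Python 3 sketch: str here stands in for Python 2's unicode type.
gamma = "\N{GREEK CAPITAL LETTER GAMMA}"

# cp437 has a code point for capital gamma, so encoding succeeds.
cp437_bytes = gamma.encode("cp437")

# ASCII has no such character, so encoding raises UnicodeEncodeError.
try:
    gamma.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(repr(cp437_bytes))
print(ascii_error)
```

The failure comes from the codec, not from the terminal: nothing is written anywhere before the exception is raised.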

My guess is that internally, print keeps calling str() on its argument
until it gets a string object. So it calls y.__str__() yielding gamma,
then gamma.__str__() which raises the error.

If the default encoding is set to cp437 then it works fine:
Γ
(prints a gamma)
Γ
(prints a gamma)

Kent

Guest · Nov 21, 2004

Kent said:
I hate to contradict an expert, but ISTM that it is
sys.getdefaultencoding() ('ascii') that is the problem, not
sys.stdout.encoding ('cp437')

It seems we were answering different parts of the question. I answered
the part "What is 'print' exactly doing"; you answered the part as to
what the problem with str() conversion is (although I'm not sure whether
the OP has actually asked that question).

Also, the one case that is interesting here was not in your experiment:
try

print gamma

This should work, regardless of sys.getdefaultencoding(), as long as
sys.stdout.encoding supports the characters to be printed.

Regards,
Martin

wolfgang haefelinger · Nov 21, 2004

Hi Experts,

I'm actually not a Python expert so please bear with me and my naive
questions and remarks:

I was actually thinking that

print x

is just a kind of shortcut for writing (simplifying a bit):

import sys
if not (isinstance(x,str) or isinstance(x,unicode)) and x.__str__ :
    x = x.__str__()
sys.stdout.write(x)

Or in words: if x is not a string type but has method __str__ then

print x

behaves like

print x.__str__()

Given this assumption I'm wondering then why print x.__str__()
works but print x does not?

Is this a bug??

Cheers,
Wolfgang.

Martin v. Löwis · Nov 21, 2004

wolfgang said:
I was actually thinking that

print x

is just a kind of shortcut for writing (simplifying a bit):

import sys
if not (isinstance(x,str) or isinstance(x,unicode)) and x.__str__ :
    x = x.__str__()
sys.stdout.write(x)

This is too simplifying. For the context of this discussion,
it is rather

import sys
if isinstance(x, unicode) and sys.stdout.encoding:
    x = x.encode(sys.stdout.encoding)
x = str(x)
sys.stdout.write(x)

(this, of course, is still quite simplified. It ignores tp_print,
and it ignores softspaces).

Or in words: if x is not a string type but has method __str__ then

print x

behaves like

print x.__str__()

No. There are many types for which this is not true; in this specific
case, it isn't true for Unicode objects.

Is this a bug??

No. You are just misunderstanding it.

Regards,
Martin

wolfgang haefelinger · Nov 22, 2004

Hi Martin,

if print is implemented like this then I begin to understand the problem.

Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.

Anyway, thanks for answering
Wolfgang.

Bengt Richter · Nov 22, 2004

wolfgang said:
Hi Martin,

if print is implemented like this then I begin to understand the problem.

Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.

Anyway, thanks for answering
Wolfgang.

It's an old issue, and ISTM there is either a problem or it needs to be better explained.
My bet is on a problem ;-) ISTM the key is that a plain str type is a byte sequence but can
be interpreted as a byte-stream-encoded character sequence, and there are some seemingly
schizophrenic situations. E.g., start with a sequence of numbers, obviously just produced
by a polynomial formula having nothing to do with characters:

>>> numbers = [(lambda x: (-499*x**4 +4634*x**3 -13973*x**2 +13918*x +1824)/24)(x) for x in xrange(5)]
>>> numbers

[76, 246, 119, 105, 115]

Now if we convert those to str type characters with chr() and join them:

Then we have a sequence of bytes which could have had any numerical value in range(256). No character
encoding is assumed. Yet. If we now assume, say, a latin-1 encoding, we can decode the bytes into
unicode:
<type 'unicode'>

Now if we print that, sys.stdout.encoding should come into play:
Löwis

And we are ok, because we were explicit the whole way.
But if we don't decode s explicitly, it seems the system makes an assumption:
L÷wis

That is (if it survived) the 'cp437' character for byte '\xf6'. IOW, print seems
to assume that a plain str is encoded ready for output in sys.stdout.encoding in
a kind of reinterpret_cast of the str, or else a decode('cp437').encode('cp437')
optimized away.
'ascii'

If it were assuming s was encoded as ascii, it should really do s.decode('ascii').encode('cp437')
to get it printed, but for plain str literals it does not seem to do that. I.e.,
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)

doesn't work, so it can't be doing that. It seems to print s as s.decode('cp437').encode('cp437')
u'L\xf7wis'

but that is a wrong decoding, (though the system can't be expected to know).
Löwis
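
The byte-level steps above can be retraced in a short script. This is a sketch in Python 3 (an assumption; the session in this post was Python 2), where bytes and str are separate types and every decode has to be spelled out:

```python
# Python 3 sketch: bytes stands in for Python 2's plain str.
numbers = [(-499*x**4 + 4634*x**3 - 13973*x**2 + 13918*x + 1824) // 24
           for x in range(5)]

raw = bytes(numbers)               # five bytes, no encoding attached

as_latin1 = raw.decode("latin-1")  # the intended reading of the bytes
as_cp437 = raw.decode("cp437")     # how a cp437 console reads the same bytes

try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:  # byte 0xf6 is outside range(128)
    ascii_error = exc

# ascii() keeps the output printable on any console.
print(numbers, ascii(as_latin1), ascii(as_cp437))
```

The same five bytes give two different character strings depending on the codec, which is the ambiguity the post is pointing at.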

What other decoding should be attempted, lacking an indication? sys.getdefaultencoding()
might be reasonable, but it seems to be locked into 'ascii' (I don't know how to set it)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)

So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?
... def __str__(self): return self.c
...
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
Löwis

Maybe the output of __str__ should be ok as a type basestring subclass for print, so
y.c = u
print y
above has the same result as
print u

It seems to be trying to do u.encode('ascii').decode('ascii').encode('cp437')
instead of directly u.encode('cp437') when __str__ is involved.
Löwis

works, and
Löwis

works, and
Löwis

and
Löwis

works,

u'L\xf6wis'

but

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

and never mind print,
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

I guess it's that str.__mod__(self, other) can deal with a unicode other and get promoted, but
it must do str(other) instead of other.__str__(), or it would be able to promote the result in
the latter case too...

This seems like a possible change that could smooth things a bit, especially if print a,b,c
was then effectively the same as print ('%s'%a),('%s'%b),('%s'%c) with encoding promotion.

Regards,
Bengt Richter

Martin v. Löwis · Nov 22, 2004

wolfgang said:
Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Notice that this also fails

x=str(y)

So it is really the string conversion that fails. Roughly the same
happens with

class X:
    def __str__(self):
        return -1

Here, instances of X also cannot be printed: str() is really supposed
to return a byte string object - not a number, not a unicode object.
As a special exception, __str__ can return a Unicode object, as long
as that result can be converted with the system default encoding into
a byte string object. So we really have

def str(o):
    if isinstance(o, types.StringType): return o
    if isinstance(o, types.UnicodeType): return o.encode()  # no codec named: system default encoding
    return str(o.__str__())

This is why the first print succeeds (it calls __str__ directly,
printing the Unicode object afterwards), and the second print fails
(trying to str()-convert its argument, which already fails - it
didn't get so far as to actually trying to print something).
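
The rules described above can be modelled in a few lines. This is a rough sketch in Python 3, not the actual implementation: bytes plays the role of Python 2 str, str plays the role of Python 2 unicode, and `py2_str`, `py2_print_bytes` and `DEFAULT_ENCODING` are hypothetical names used only here. Like the sketch above, it ignores tp_print and softspace handling.

```python
# Python 3 model of the Python 2 rules described in this post.
# bytes ~ Python 2 str, str ~ Python 2 unicode. All names are hypothetical.

DEFAULT_ENCODING = "ascii"   # what sys.getdefaultencoding() reported in the thread


def py2_str(obj):
    """Model str(): always ends in a byte string, or raises."""
    if isinstance(obj, bytes):
        return obj
    if isinstance(obj, str):
        return obj.encode(DEFAULT_ENCODING)
    return py2_str(obj.__str__())


def py2_print_bytes(obj, stream_encoding):
    """Model print: return the bytes that would reach the stream."""
    if isinstance(obj, str) and stream_encoding:
        return obj.encode(stream_encoding)
    return py2_str(obj)


class Y:
    def __str__(self):
        return self.c


y = Y()
y.c = "\N{GREEK CAPITAL LETTER GAMMA}"

# Case (2): the unicode result goes straight to the stream codec.
direct = py2_print_bytes(y.__str__(), "cp437")

# Case (3): the object is str()-converted first, using the default codec.
try:
    py2_print_bytes(y, "cp437")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc
```

In this model the stream encoding is never consulted in case (3), because the conversion with the default encoding fails before any output is attempted.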

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.

Yes, that should happen in P3k. But even then, there will be a
distinction between byte (plain) strings, and character (unicode)
strings.

Regards,
Martin

Martin v. Löwis · Nov 22, 2004

Bengt said:
So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?

[See my other posting:]
Because print invokes str() on its argument, unless the argument is
already a byte string (in which case it prints it directly), or a
Unicode string (in which case it encodes it with the stream encoding).
It is str(y) that fails, not the printing.
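
That the conversion fails on its own, with no printing involved, can be checked directly. A minimal sketch in Python 3 syntax (an assumption; the thread is about Python 2.3), using a `__str__` that returns a number rather than a string:

```python
class X:
    def __str__(self):
        return -1          # not a string of any kind


try:
    str(X())               # conversion only, nothing is printed here
    conversion_error = None
except TypeError as exc:
    conversion_error = exc

print(type(conversion_error).__name__)
```

The exception is raised by `str()` before any stream is touched, which is the distinction being drawn between converting and printing.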

Regards,
Martin

Bengt Richter · Nov 23, 2004

Martin said:
Bengt said:

So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?

[See my other posting:]
Because print invokes str() on its argument, unless the argument is
already a byte string (in which case it prints it directly), or a

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-- effectively an assumption that
bytestring.decode('some_unknown_encoding').encode(sys.stdout.encoding)
has already been done, it seems (I'm not arguing against).

Unicode string (in which case it encodes it with the stream encoding).
It is str(y) that fails, not the printing.

Yes, I think my turgid post did demonstrate that, among other things ;-)

So how about changing print so that it doesn't blindly use str(y), but instead
first tries to get y.__str__() in case the latter returns unicode?
Then print y can succeed the way print y.__str__() does now.

The same goes for str.__mod__ -- it apparently knows how to deal with '%s'% unicode(y)
so why shouldn't '%s'%y benefit when y.__str__ returns unicode?

I.e., str doesn't know that printing and '%s' can use unicode to good effect
if it is available, so for print and str.__mod__ blindly to use str() intermediately
throws away an opportunity to do better ISTM.

Regards,
Bengt Richter

Steve Holden · Nov 23, 2004

Bengt said:
Bengt said:

So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?

[See my other posting:]
Because print invokes str() on its argument, unless the argument is
already a byte string (in which case it prints it directly), or a

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-- effectively an assumption that
bytestring.decode('some_unknown_encoding').encode(sys.stdout.encoding)
has already been done, it seems (I'm not arguing against).

Unicode string (in which case it encodes it with the stream encoding).
It is str(y) that fails, not the printing.

Yes, I think my turgid post did demonstrate that, among other things ;-)

So how about changing print so that it doesn't blindly use str(y), but instead
first tries to get y.__str__() in case the latter returns unicode?
Then print y can succeed the way print y.__str__() does now.

The same goes for str.__mod__ -- it apparently knows how to deal with '%s'% unicode(y)
so why shouldn't '%s'%y benefit when y.__str__ returns unicode?

I.e., str doesn't know that printing and '%s' can use unicode to good effect
if it is available, so for print and str.__mod__ blindly to use str() intermediately
throws away an opportunity to do better ISTM.

Regards,
Bengt Richter

Am I the only person who found it scary that Bengt could apparently
casually drop on a polynomial that would decode to " Löwis"?

feel-dumb-just-being-in-the-same-newsgroup-ly y'rs - steve

Martin v. Löwis · Nov 23, 2004

Bengt said:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-- effectively an assumption that
bytestring.decode('some_unknown_encoding').encode(sys.stdout.encoding)
has already been done, it seems (I'm not arguing against).

Not really. sys.stdout really is a byte stream, which may or may
not *have* an encoding. Python tries to guess, and refuses to
in the face of ambiguity: e.g. if sys.stdout is a file, resulting
from

python mkimage.py > image.gif

then sys.stdout really does not *have* an encoding - but it still
is a byte stream. So copying the bytes to stdout is a
straight-forward thing to do.

Of course, "print" should only be used if the stream is meant to
transmit characters, and then the bytes written to the stream should
use the stream's encoding. This is indeed the assumption - but one
that the application author needs to make.
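
One way to picture a byte stream with a text encoding layered on top is Python 3's `io.TextIOWrapper`. This is an illustration under that assumption only; Python 2.3's `sys.stdout` is a plain file object and is not built this way. The helper name `write_text` is hypothetical.

```python
import io


def write_text(text, encoding):
    """Write text through an encoding layer and return the raw bytes produced."""
    raw = io.BytesIO()                                  # the underlying byte stream
    stream = io.TextIOWrapper(raw, encoding=encoding)   # the character layer
    stream.write(text)
    stream.flush()
    return raw.getvalue()


gamma = "\N{GREEK CAPITAL LETTER GAMMA}"

# A stream declared as cp437 receives the single cp437 byte for gamma.
cp437_output = write_text(gamma, "cp437")

# A stream declared as ASCII cannot carry the character at all,
# comparable to case (4) in the original post.
try:
    write_text(gamma, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

The byte stream itself accepts any bytes; it is the declared encoding that decides which characters can be written through it.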

So how about changing print so that it doesn't blindly use str(y)

On the C level, this is already possible, through tp_print. Whether or
not this should be exposed to the Python level (or whether doing so
would just add to the confusion), I don't know.

but instead
first tries to get y.__str__() in case the latter returns unicode?
Then print y can succeed the way print y.__str__() does now.

As yet another alternative, print could invoke unicode(), if
there is a stream encoding. This would try __unicode__ first,
then fall back to call __str__. Patches in this direction would
be welcome - but the code implementing print is already quite
involved, so a redesign (with a PEP and everything) might also
be in order.

In P3k, this part of the issue will go away, as str() then will
return Unicode strings.
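
For comparison, here is how the original example behaves once str() returns Unicode strings. This is a sketch run under Python 3 (what the thread calls P3k), with the output captured in an `io.StringIO` buffer instead of a real terminal so that no console encoding is involved:

```python
import io


class Y:
    def __str__(self):
        return self.c


y = Y()
y.c = "\N{GREEK CAPITAL LETTER GAMMA}"

buf = io.StringIO()
print(y, file=buf)                 # str(y) is Unicode, so no ASCII step occurs
text = buf.getvalue()

# Encoding to the stream's codec is a separate, final step.
encoded = text.encode("cp437")
```

Here `print(y)` and `print(y.__str__())` produce the same text, so the inconsistency discussed in this thread does not arise.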

I.e., str doesn't know that printing and '%s' can use unicode to good effect
if it is available, so for print and str.__mod__ blindly to use str() intermediately
throws away an opportunity to do better ISTM.

That is true. Of course, there is already so much backwards
compatibility in this that any change to behaviour (such as
trying unicode() before trying str()) might break things.

Regards,
Martin

Martin v. Löwis · Nov 23, 2004

Steve said:
Am I the only person who found it scary that Bengt could apparently
casually drop on a polynomial that would decode to " Löwis"?

I'm not scared, but honored, of course.

Regards,
Martin

Bengt Richter · Nov 29, 2004

Well, don't give me too much credit, though I admit enjoying a little unearned
flattered-ego buzz ;-) But it's not a big deal if you had recently implemented
an automatic lambda-printer-outer to solve for a polynomial function f such that
f(0)==k0, f(1)==k1, .. f(n)==kn. For a single number k0 that will be lambda x: k0
and for two numbers k0, k1 will be lambda x: k0 + x*(k1-k0) etc. It's a matter of
solving some simultaneous equations for the coefficient values, which I had done
in response to a previous thread. For that, I happened to have had some experience
from the '60s writing variations on an equation solver (back when we congratulated
ourselves on getting all (software-implemented) floating point ops other than divide
to execute in under a millisecond ;-) Here I was using an exact decimal module I happened
to have (also built in response to previous thread discussion ;-), so I didn't even have
to look for maximum abs pivot elements in the matrix for this one. And it didn't have to be fast.
So it was kind of a fun exercise. But anyway, it was all ready to go at this point, so
all I had to do was run coeffsx.py with the character ord values as args on the command line.
The opportunity to use it in a fun way to fake casual wizardry was just dumb luck ;-)
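
The coeffsx.py script itself is not shown in this thread, so the following is only a guess at the general approach: a Python 3 sketch that solves the simultaneous equations exactly with `fractions.Fraction` (swapped in for the exact decimal module mentioned above). The names `fit_polynomial` and `evaluate` are hypothetical.

```python
from fractions import Fraction


def fit_polynomial(values):
    """Return exact coefficients c with sum(c[j] * i**j) == values[i] for each i."""
    n = len(values)
    # Augmented matrix: row i is [i**0, i**1, ..., i**(n-1) | values[i]].
    rows = [[Fraction(i) ** j for j in range(n)] + [Fraction(v)]
            for i, v in enumerate(values)]
    for col in range(n):
        # With exact arithmetic any nonzero pivot will do.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [entry / scale for entry in rows[col]]
        # Clear this column in every other row (Gauss-Jordan elimination).
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** j for j, c in enumerate(coeffs))


codes = [76, 246, 119, 105, 115]          # the byte values from earlier in the thread
coeffs = fit_polynomial(codes)
recovered = [int(evaluate(coeffs, x)) for x in range(len(codes))]
```

Multiplying the resulting coefficients by 24 gives the integer coefficients of the five-point polynomial quoted earlier in the thread, since the polynomial of degree at most four through five points is unique.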

I'm not scared, but honored, of course.

A bit late responding, but I couldn't think of a clever followup to that ;-)
But just to play fair,

print ''.join([chr((lambda x: (
-6244372133*x**31 +3013910052086*x**30 -695396351572920*x**29
+102105752307741620*x**28 -10715303804974659632*x**27 +855734314951919397204*x**26
-54067713339116101354860*x**25 +2774121296568607137441900*x**24
-117725625258165396333623970*x**23 +4187405270602160539007125440*x**22
-126060225187601954901807327900*x**21 +3234908736910295469078183101700*x**20
-71121878980966418114205095297640*x**19 +1344268902923717571167117226451980*x**18
-21886601404074660751245403749948900*x**17 +307180698948793841846368910776059300*x**16
-3714719218772170154406066269371644945*x**15 +38641327091060849304069885597725238090*x**14
-344757809926306996671359721670334393500*x**13 +2627069115710241704477921121071756668600*x**12
-16998869426095431823754237370045113150352*x**11 +92697362475995606001274610327169882407584*x**10
-421837211162827653880286870838716820642880*x**9 +1581695033356657201434736494281105646218880*x**8
-4805817748883837636614530805204695373091328*x**7 +11572394080794032785251889126742747327087616*x**6
-21417820944419013080374525134500006003159040*x**5 +29141767437911436346798089144038222112768000*x**4
-27186086428826094346108431447644781404160000*x**3 +15339943556592952236643053124047771402240000*x**2
-3882253738078295379102517100266822041600000*x +230239482316981838896315760640000000)
/2740946218059307605908520960000000
)(x)) for x in xrange(32)])

Not-ready-to-be-mythologized-though-plenty-flatterable-ly y'rs

Regards,
Bengt Richter
