Can I make unicode in a repr() print readably?


Terry Hancock

I still run into my own ignorance a lot with unicode in Python.

Is it possible to define some combination of __repr__, __str__,
and/or __unicode__ so that the unicode() wrapper isn't necessary
in this statement:
>>> print unicode(jp.concepts['adjectives']['BLUE'][0])
<GLOSS: 青い, cl=None, {'wd': u'\u9752\u3044'}>

(i.e. can I make it so that the object that print gets is already
unicode, so that the label '青い' will print readably?)



Or, put another way, what exactly does 'print' do when it gets
a class instance to print? It seems to do the right thing if
given a unicode or string object, but I can't figure out how to
make it do the same thing for a class instance.

I guess it would've seemed more intuitive to me if print attempted
to use __unicode__() first, then __str__(), and then __repr__(). But
it apparently skips straight to __str__(), unless the object is already
a unicode object. (?)



The following doesn't bother me:
>>> jp.concepts['adjectives']['BLUE'][0]
<GLOSS: \u9752\u3044, cl=None, {'wd': u'\u9752\u3044'}>

And I understand that I might want that if I'm working in
an ASCII-only terminal. But it's a big help to be able to
read/recognize the labels when I'm working with localized
encodings, and I'd like to save the extra typing if I'm
going to be looking at a lot of these.

So far, I've tried overriding the __unicode__ method to return
the unicode representation (doesn't seem like print calls it,
though), and I've tried returning the same thing from __repr__,
but the latter causes this unpleasant result:
>>> print jp.concepts['adjectives']['BLUE'][0]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)

so I don't think I want to do that.

Advice?

Terry
 

Guest

Terry said:
Is it possible to define some combination of __repr__, __str__,
and/or __unicode__ so that the unicode() wrapper isn't necessary
in this statement:

I'm not aware of a way of doing so.

Terry said:
Or, put another way, what exactly does 'print' do when it gets
a class instance to print? It seems to do the right thing if
given a unicode or string object, but I can't figure out how to
make it do the same thing for a class instance.

It won't. PyFile_WriteObject checks for Unicode objects, and whether
the file has an encoding attribute set, and if so, encodes the
Unicode object.

If it is not a Unicode object, it falls through to PyObject_Print,
which first checks for the tp_print slot (which can't be set in
Python), then uses PyObject_Str (which requires that the __str__
result is a true byte string), or PyObject_Repr (if the RAW
flag isn't set - it is when printing). PyObject_Str first checks
for tp_str; if that isn't set, it falls back to PyObject_Repr.
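
The lookup order described above can be sketched in pure Python. This is a simplified model under stated assumptions, not CPython's actual code: `model_print_bytes`, its `system_encoding` parameter, and the `Gloss` class are hypothetical names, the tp_print slot is ignored, and the `text_type` shim is only there so the sketch also runs on Python 3.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3 shim, so the sketch runs there too


def model_print_bytes(obj, stream_encoding=None, system_encoding="ascii"):
    """Return the bytes that Python 2's `print obj` would roughly try to write.

    Simplified model of the order described above; tp_print is ignored.
    """
    if isinstance(obj, text_type):
        # Unicode objects are encoded with the stream's encoding, if it has one.
        return obj.encode(stream_encoding or system_encoding)
    # Everything else goes through str(): __str__ if the class defines it,
    # otherwise the inherited default, which falls back to __repr__.
    # __unicode__ is never consulted on this path.
    result = type(obj).__str__(obj)
    if isinstance(result, text_type):
        # A unicode result is converted with the system encoding,
        # not with the stream's encoding.
        return result.encode(system_encoding)
    return result


class Gloss(object):
    """Hypothetical stand-in for the GLOSS objects in the question."""

    def __init__(self, label):
        self.label = label

    def __unicode__(self):
        return u"<GLOSS: %s>" % self.label

    def __repr__(self):
        # ASCII-only: escape non-ASCII characters in the label.
        escaped = self.label.encode("ascii", "backslashreplace").decode("ascii")
        return "<GLOSS: %s>" % escaped
```

In this model an instance takes the `__str__`/`__repr__` route and is never asked for `__unicode__`, while an object that is already unicode hits the first branch and is encoded for the stream. A `__repr__` that returns non-ASCII unicode fails in the system-encoding step, which is consistent with the traceback in the question.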

Terry said:
And I understand that I might want that if I'm working in
an ASCII-only terminal. But it's a big help to be able to
read/recognize the labels when I'm working with localized
encodings, and I'd like to save the extra typing if I'm
going to be looking at a lot of these.

You can save some typing, of course, with a helper function:

def p(o):
    print unicode(o)
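
A slightly more defensive variant of that helper, as a sketch only: it encodes with the stream's declared encoding and escapes whatever the terminal cannot represent instead of raising. The `render` function and the reworked `p` are illustrative names, not an existing API, and the `text_type` shim is only there so the sketch also runs on Python 3.

```python
import sys

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3 shim


def render(obj, encoding=None):
    """Return obj's unicode form as bytes in `encoding` (default: stdout's)."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Prefer __unicode__ when the object defines it; otherwise use the
    # text constructor (unicode() on Python 2, str() on Python 3).
    to_text = getattr(obj, "__unicode__", None)
    text = to_text() if to_text is not None else text_type(obj)
    # backslashreplace keeps ASCII-only terminals from raising
    # UnicodeEncodeError; unencodable characters show up as \uXXXX.
    return text.encode(encoding, "backslashreplace")


def p(obj):
    """Write obj's readable form to stdout, followed by a newline."""
    data = render(obj) + b"\n"
    # Python 3 exposes the underlying byte stream as sys.stdout.buffer.
    out = getattr(sys.stdout, "buffer", sys.stdout)
    out.write(data)
```

On a UTF-8 terminal `p(gloss)` would show the label itself; on an ASCII-only terminal it degrades to the escaped form rather than failing.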

I agree that this is not optimal; contributions are welcome.
It would probably be easiest to drop the guarantee that
PyObject_Str returns a true string, or use _PyObject_Str
(which does not make this guarantee) in PyObject_Print.
One would have to think what the effect on backwards
compatibility is of such a change.

Regards,
Martin
 

Terry Hancock

Martin said:
It won't. PyFile_WriteObject checks for Unicode objects, and whether
the file has an encoding attribute set, and if so, encodes the
Unicode object.

If it is not a Unicode object, it falls through to PyObject_Print,
which first checks for the tp_print slot (which can't be set in
Python), then uses PyObject_Str (which requires that the __str__
result is a true byte string), or PyObject_Repr (if the RAW flag
isn't set - it is when printing). PyObject_Str first checks for
tp_str; if that isn't set, it falls back to PyObject_Repr.

You can save some typing, of course, with a helper function:

def p(o): print unicode(o)

Yeah, that's what I've done as it stands. I think it's actually fewer
keystrokes that way, but it is still inconsistent* with other objects,
of course.

Martin said:
I agree that this is not optimal; contributions are welcome. It would
probably be easiest to drop the guarantee that PyObject_Str returns a
true string, or use _PyObject_Str (which does not make this
guarantee) in PyObject_Print. One would have to think what the effect
on backwards compatibility is of such a change.

Ah, contribute to Python itself. I'll have to think about it -- I don't do
a lot of C programming these days, but it sounds like an idea.

I don't know about the backwards compatibility issue. I'm not sure
what would be affected. But "print" frequently generates encoded
Unicode output if the stream supports it, so there is no guarantee
whether it produces unicode or string output now. I think it's clear
that str() *must* return an ordinary Python string.

I think what would make sense is for the "print" statement to attempt
to call __unicode__ on an instance before attempting to call __str__
(just as it currently falls back from __str__ to __repr__). That seems like
it would be pretty consistent, right?

Cheers,
Terry

*Okay, actually it is perfectly consistent in a technical sense, but not in
the utility, "this is what you do to examine the object", sense.
 

Martin v. Löwis

Terry said:
I don't know about the backwards compatibility issue. I'm not sure
what would be affected. But "print" frequently generates encoded
Unicode output if the stream supports it, so there is no guarantee
whether it produces unicode or string output now.

I'm not worried about the code path that print takes - it is obvious
that Unicode objects are allowed to show up, and will cause
UnicodeErrors if encoding them with the stream encoding fails.

I'm (slightly) worried about other code paths that may be affected.

Terry said:
I think it's clear that str() *must* return an ordinary Python string.

Notice, however, that __str__ may return Unicode objects; those
get silently converted with the system encoding.

Terry said:
I think what would make sense is for the "print" statement to attempt
to call __unicode__ on an instance before attempting to call __str__
(just as it currently falls back from __str__ to __repr__). That seems
like it would be pretty consistent, right?

This is one option; the other option is that print does not
convert unicode strings returned from __str__ with the system
encoding, but with the stream's encoding. But yes; your approach
might work as well (with the then-incompatibility that __unicode__
will get called in contexts where it wasn't called before).
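
Until something along those lines is decided, one possible workaround on the class side is to do the encoding explicitly in __str__, so that print never receives a unicode result to convert with the system encoding. This is only a sketch: `EncodedStrMixin`, its `encoding` attribute, and the `Gloss` class are hypothetical, and hard-coding UTF-8 assumes the terminal really uses it.

```python
class EncodedStrMixin(object):
    """Derive __str__ from __unicode__ by encoding explicitly."""

    encoding = "utf-8"  # assumption: should match the terminal's encoding

    def __str__(self):
        text = self.__unicode__()
        if str is bytes:
            # Python 2: str() must yield a byte string, so encode here
            # rather than relying on the ASCII system encoding.
            return text.encode(self.encoding)
        # Python 3: str is already a text type; return it unchanged.
        return text


class Gloss(EncodedStrMixin):
    """Hypothetical stand-in for the GLOSS objects in the question."""

    def __init__(self, label):
        self.label = label

    def __unicode__(self):
        return u"<GLOSS: %s>" % self.label
```

The trade-off is that str() on such an object is then tied to one encoding regardless of where the output ends up, which is part of why a change in print itself may be the cleaner fix.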

It will probably be necessary to collect a third and fourth
opinion from python-dev; the actual implementation of whatever
approach gets chosen should be easy. And there should be
documentation changes, of course.

Regards,
Martin
 
