sys.stdout, urllib and unicode... I don't understand.

T

Thierry

Hello fellow pythonists,

I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)

I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normalize(), as urllib.quote() cannot work on unicode
After that, an urllib request is sent with this encoded string to the
web service
First problem, how to determine the encoding of the return ?
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:I end up with an UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.

Second problem, if I try aninto the loop, I too get the same error. I though that unicode() would
force python to consider the given text as unicode, not to try to
convert it to unicode.
Here again, trying several normalize/decode combination did not helped
at all.

Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/python-list/2007-October/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
But what is strange, is that since I did that, even without using this
self.out writer, the unicode translation are working as I was
expecting them to. Except on the for loop, where a concatenation still
triggers the UnicodeDecodeErro exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.

Does the fact that defining that writer object does a initialization
of the standard sys.stdout object ?
Does it is related to an internal usage of it, maybe in urllib ?
I tried to find more on the subject, but felt short.
Can someone explain to me what is happening ?
The full script source can be found at http://www.webalis.com/translator/translator.pyw
 
T

Tino Wildenhain

Thierry said:
Hello fellow pythonists,

I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)

I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normalize(), as urllib.quote() cannot work on unicode

urllib.quote() operates on byte streams. If your web service is UTF-8
it would make sense to use UTF-8 as input encoding not latin1,
wouldn't it? unicodeinput.encode("utf-8")
After that, an urllib request is sent with this encoded string to the
web service

First problem, how to determine the encoding of the return ?

It is sent as part of the headers. e.g. content-type: text/html;
charset=utf-8
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:
I end up with an UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.

web server answer is encoded byte stream too (usually utf-8 but you
can check the headers) so

line.decoce("utf-8") should give you unicode to operate on (always
do string operations on canonized form)
Second problem, if I try an
into the loop, I too get the same error. I though that unicode() would
force python to consider the given text as unicode, not to try to
convert it to unicode.

But it is what it does. Basically unicode() is a constructor for
unicode objects.
Here again, trying several normalize/decode combination did not helped
at all.

Its not too complicated, you just need to keep unicode and byte strings
separate and draw a clean line between the two. (the line is decode()
and encode() )
Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/python-list/2007-October/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:

This is fancy but not needed if you take care like above.

HTH
Tino
 
M

Marc 'BlackJack' Rintsch

I have realized an wxPython simple application, that takes the input of
a user, send it to a web service, and get back translations in several
languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normalize(), as urllib.quote() cannot work on unicode

If the service uses UTF-8 why don't you just encode the data you send as
UTF-8 but Latin-1 with potentially throwing away data because of the
'ignore' argument!? Make that ``src_text = unicodedata.encode('utf-8')``
First problem, how to determine the encoding of the return ? If I
inspect a request from firefox, I see that the server return header
specify UTF-8
But if I use this code:
I end up with an UnicodeDecodeError.

Because `line` contains bytes and `ret` is a `unicode` object. If you
add a `unicode` object and a `str` object, Python tries to convert the
`str` to `unicode` using the default == ASCII encoding. And this fails
if there are byte value >127. *You* have to decode `line` from a bunch
of bytes to a bunch of (unicode)characters before you concatenate the
strings.

BTW: ``line.strip()`` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore. And functions in the
`string` module that are also implemented as method on `str` or `unicode`
are deprecated.

Ciao,
Marc 'BlackJack' Rintsch
 
T

Thierry

Thank you to both of you (Marc and Tino).

I feel a bit stupid right now, because as both of you said, encoding
my source string to utf-8 do not produce an exception when I pass it
to urllib.quote() and is what it should be.
I was certain that this created an error sooner, and id not tried it
again.
The result of 2 days making random changes and hoping it works. I
know, reflection should have primed. My bad...

The same goes for my treatment in the iteration over the request
result.
I now have anand no errors (as long as I don't try to print this to stdout, which I
understand).
So, I'm now really getting back an unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is now correspond to what I have in mind originally.
Not exactly...
It's that I receive a string, with 2 literal characters in it: "\" and
"n".
What I (want to) do here is that I replace those 2 characters with 1
chr(10).
I actually had read that, but not modified my code.
Thank to point it out

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.
 
S

Steve Holden

Thierry said:
Thank you to both of you (Marc and Tino).

I feel a bit stupid right now, because as both of you said, encoding
my source string to utf-8 do not produce an exception when I pass it
to urllib.quote() and is what it should be.
I was certain that this created an error sooner, and id not tried it
again.
The result of 2 days making random changes and hoping it works. I
know, reflection should have primed. My bad...

The same goes for my treatment in the iteration over the request
result.
I now have an
and no errors (as long as I don't try to print this to stdout, which I
understand).
So, I'm now really getting back an unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is now correspond to what I have in mind originally.

Not exactly...
It's that I receive a string, with 2 literal characters in it: "\" and
"n".
What I (want to) do here is that I replace those 2 characters with 1
chr(10).
In that case you would need the following code:

ret=U''
for line in req:
ret=ret+string.replace(line.strip(),'\\n', '\n')

Otherwise you just replace chr(10)'s with chr(10)'s, which won't help you.

Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?
I actually had read that, but not modified my code.
Thank to point it out

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.

regards
Steve
 
T

Thierry

Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?

Yes, I am sure. Because I dumped the ord() values to check them.
But again, I'm stumped on how complicated I have made this.
I should not try to code anymore at 2am.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,981
Messages
2,570,188
Members
46,733
Latest member
LonaMonzon

Latest Threads

Top