logging of strings with broken encoding

  • Thread starter Thomas Guettler

Thomas Guettler

Hi,

I have a bug in my code, which results in the same error as this one:

https://bugs.launchpad.net/bzr/+bug/295653
{{{
Traceback (most recent call last):
  File "/usr/lib/python2.6/logging/__init__.py", line 765, in emit
    self.stream.write(fs % msg.encode("UTF-8"))
..
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 8: ordinal not in range(128)
}}}

I run Python 2.6. In SVN the code is the same (StreamHandler ... def emit...):
http://svn.python.org/view/python/b...ogging/__init__.py?revision=72507&view=markup

I think msg.encode("UTF-8", 'backslashreplace') would be better here.

What do you think?

Should I file a bug report?

Thomas
 

David Smith

I think you have to decode it first using the string's original encoding,
whether that be cp1252 or mac-roman or any of the other 8-bit encodings.
Once that's done, you can encode it in UTF-8.
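
A minimal sketch of that two-step conversion, written in Python 3 syntax, where byte strings are `bytes` objects. The sample bytes and the choice of cp1252 are assumptions for illustration; the real source encoding has to be known from context.

```python
# Assumed sample: the word "grüße" stored as cp1252 bytes (0xfc = ü, 0xdf = ß).
raw = b"gr\xfc\xdfe"

# Step 1: decode with the encoding the bytes were actually written in.
text = raw.decode("cp1252")

# Step 2: only now encode the resulting text as UTF-8.
utf8_bytes = text.encode("utf-8")

# ascii() keeps the output printable on any terminal.
print(ascii(text), utf8_bytes)
```

Decoding with the wrong 8-bit codec usually does not raise, it just yields the wrong characters, so the codec name has to come from whoever produced the bytes.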

--David
 

Peter Otten

Thomas said:
I have a bug in my code, which results in the same error as this one:

https://bugs.launchpad.net/bzr/+bug/295653
{{{
Traceback (most recent call last):
  File "/usr/lib/python2.6/logging/__init__.py", line 765, in emit
    self.stream.write(fs % msg.encode("UTF-8"))
..
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 8: ordinal not in range(128)
}}}

I run Python 2.6. In SVN the code is the same (StreamHandler ... def emit...):
http://svn.python.org/view/python/branches/release26-maint/Lib/logging/__init__.py?revision=72507&view=markup

I think msg.encode("UTF-8", 'backslashreplace') would be better here.

What do you think?

That won't help. It's a *decoding* error. You are feeding it a non-ascii
byte string.
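
To spell out why the error handler passed to encode() never comes into play: in Python 2, calling .encode() on a byte string first decodes it with the default ASCII codec, and that implicit step is what raises. The sketch below makes the hidden step explicit so that it also runs on Python 3. The helper name `py2_style_encode` and the sample bytes are assumptions for illustration, not part of any library.

```python
def py2_style_encode(byte_string, errors="strict"):
    """Mimic what Python 2 does for a byte string's .encode("UTF-8", errors)."""
    # Implicit step: decode with the default 'ascii' codec, always strict.
    text = byte_string.decode("ascii")
    # Explicit step: the `errors` argument only applies here.
    return text.encode("utf-8", errors)

raw = b"message \xe4"   # assumed latin-1 bytes, 0xe4 at position 8
error = None
try:
    py2_style_encode(raw, "backslashreplace")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The exception is raised in the decode step, before 'backslashreplace' is ever consulted.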

Peter
 

Lie Ryan

Thomas said:
Hi,

I have a bug in my code, which results in the same error as this one:

https://bugs.launchpad.net/bzr/+bug/295653
{{{
Traceback (most recent call last):
  File "/usr/lib/python2.6/logging/__init__.py", line 765, in emit
    self.stream.write(fs % msg.encode("UTF-8"))
..
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 8: ordinal not in range(128)
}}}

What's the encoding of self.stream? Is it sys.stdout/sys.stderr or a
file object?
 

Thomas Guettler

My quick fix is this:

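import logging

class MyFormatter(logging.Formatter):
    def format(self, record):
        msg = logging.Formatter.format(self, record)
        # Python 2: a byte string (str) is decoded leniently, so invalid
        # bytes become U+FFFD instead of raising UnicodeDecodeError.
        if isinstance(msg, str):
            msg = msg.decode('utf8', 'replace')
        return msg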

But I still think handling of non-ascii byte strings should be better.
A broken logging message is better than none.

And, if there is a UnicodeError, handleError() should not send the message
to sys.stderr, but it should use emit() of the current handler.

In my case sys.stderr gets discarded. It's very hard to debug if you don't
see any logging messages.
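
One way to get that behaviour today might be to override handleError() in a handler subclass. The following is only a sketch in Python 3 syntax (it relies on the ascii() builtin): the class name `FallbackStreamHandler`, the "unloggable message" prefix and the ASCII-only demo stream are assumptions for illustration.

```python
import io
import logging

class FallbackStreamHandler(logging.StreamHandler):
    """Hypothetical handler: on failure, log an escaped copy to the same stream."""

    def handleError(self, record):
        try:
            # ascii() escapes every non-ASCII character, so this write
            # cannot fail for encoding reasons.
            safe = ascii(record.getMessage())
            self.stream.write("unloggable message: " + safe + self.terminator)
            self.flush()
        except Exception:
            # Fall back to the stock behaviour (report on sys.stderr).
            logging.StreamHandler.handleError(self, record)

# Assumed setup: a stream that only accepts ASCII, to provoke the error.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="ascii")

logger = logging.getLogger("fallback_demo")
logger.propagate = False
logger.addHandler(FallbackStreamHandler(stream))

logger.error("gr\xe4\xdf")   # UnicodeEncodeError inside emit()
stream.flush()
print(buffer.getvalue())
```

The message still reaches the log in escaped form, so it stays visible even when sys.stderr is discarded.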

Thomas
 

Stefan Behnel

Thomas said:
My quick fix is this:

class MyFormatter(logging.Formatter):
    def format(self, record):
        msg = logging.Formatter.format(self, record)
        if isinstance(msg, str):
            msg = msg.decode('utf8', 'replace')
        return msg

But I still think handling of non-ascii byte strings should be better.
A broken logging message is better than none.

Erm, may I note that this is not a problem in the logging library but in
the code that uses it? How should the logging library know what you meant
by passing that byte string in the first place? And where is the difference
between accidentally passing a byte string and accidentally passing another
non-printable object? Handling this "better" may simply hide the bugs in
your code; I don't find that any "better" at all.

Anyway, this has been fixed in Py3.

Stefan
 

Lie Ryan

Thomas said:
My quick fix is this:

class MyFormatter(logging.Formatter):
    def format(self, record):
        msg = logging.Formatter.format(self, record)
        if isinstance(msg, str):
            msg = msg.decode('utf8', 'replace')
        return msg

But I still think handling of non-ascii byte strings should be better.
A broken logging message is better than none.

The problem is that Python 2.x assumes a default encoding of `ascii`
whenever you don't explicitly specify one, and your code apparently
breaks that assumption. I haven't looked at your code, but others have
suggested that you've fed the logging module non-ascii byte strings.
The logging module can only work with 1) unicode strings or
2) ascii-encoded byte strings.

If you want a quick fix, you may be able to get away with repr()-ing
your log texts. A proper fix, however, is to pass a unicode string to
the logging module instead.

Traceback (most recent call last):
  File "/usr/lib64/python2.6/logging/__init__.py", line 773, in emit
    stream.write(fs % msg.encode("UTF-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 13: ordinal not in range(128)
WARNING:root:ы
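
The two suggestions above (repr() as a quick fix, a unicode string as the proper fix) can be compared side by side. This is a sketch in Python 3 syntax; the logger name, the sample bytes and the assumption that they are latin-1 are made up for illustration.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("repr_demo")
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

raw = b"caf\xe9"   # assumed latin-1 bytes from some external source

# Quick fix: repr() turns any byte string into pure-ASCII text.
logger.warning("got %s", repr(raw))

# Proper fix: decode to a text (unicode) string before logging.
logger.warning("got %s", raw.decode("latin-1"))

lines = stream.getvalue().splitlines()
print([ascii(line) for line in lines])
```

The repr() line is ugly but lossless; the decoded line is readable but only correct if the assumed encoding is right.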
 

Thomas Guettler

Stefan said:
Erm, may I note that this is not a problem in the logging library but in
the code that uses it?

I know that my code passes the broken string to the logging module. But maybe
I get the non-ascii byte string from a third party (psycopg2 sometimes passes
latin1 byte strings from postgres in error messages).

I like Python very much because it "refuses to guess". But in this case, "best effort"
is a better approach.
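
A best-effort conversion along these lines could be applied to third-party messages before they reach the logging module. This is only a sketch in Python 3 syntax: the helper name `to_text` and the order of encodings tried are assumptions, and falling back to latin-1 can silently produce the wrong characters.

```python
def to_text(value, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: best-effort conversion of bytes to text."""
    if isinstance(value, str):
        return value
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return value.decode("ascii", "replace")

print(ascii(to_text(b"gr\xc3\xbc\xc3\x9fe")))   # valid UTF-8
print(ascii(to_text(b"gr\xfc\xdfe")))           # not UTF-8, latin-1 fallback
```

Since latin-1 maps every byte to a character, the last-resort branch only runs when the caller passes a list of encodings without such a catch-all.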

It worked in 2.5 and will in py3k. I think it is a bug that it does not in 2.6.

Thomas
 

Lie Ryan

Thomas said:
I know that my code passes the broken string to the logging module. But maybe
I get the non-ascii byte string from a third party (psycopg2 sometimes passes
latin1 byte strings from postgres in error messages).

If the database contains non-ascii byte strings, then you could repr()
them before logging (repr also adds some niceties such as quotes). I
think that's the best solution, unless you want to decode the byte
string (which might be overkill, depending on the situation).

Thomas said:
I like Python very much because it "refuses to guess". But in this case, "best effort"
is a better approach.

One time it refuses to guess, the next time it tries best effort. I
don't think Guido would like such inconsistency.

Thomas said:
It worked in 2.5 and will in py3k. I think it is a bug that it does not in 2.6.

In Python 3.x, the default string type is the unicode string. If it works
in Python 2.5, then that is a bug in 2.5.
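
For comparison, a short Python 3 sketch: there, a byte string passed as the log message is converted with str(), so it is logged as its escaped repr rather than raising. The logger name and the sample bytes are assumptions for illustration.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("py3_bytes_demo")
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

# A bytes object as the message: no implicit ASCII decode takes place.
logger.warning(b"\xe4 is not ASCII")

logged = stream.getvalue()
print(logged)
```

The output is not pretty, but the message is neither lost nor misdecoded.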
 
