Long way around UnicodeDecodeError, or 'ascii' codec can't decode byte

Oleg Parashchenko · Mar 29, 2007

Hello,

I'm working on a Unicode-aware application. I like to use "print" to
debug programs, but in this case it was a nightmare. The most common
result of "print" was:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
0: ordinal not in range(128)

I spent two hours fixing it, and I hope it's done. The solution is one
of the ugliest hacks I have ever written, but it relieves the pain. The
full story and the code are on my blog:

http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/

Paul Boddie · Mar 29, 2007

Hello,

I'm working on a Unicode-aware application. I like to use "print" to
debug programs, but in this case it was a nightmare. The most common
result of "print" was:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
0: ordinal not in range(128)

What does sys.stdout.encoding say?

I spent two hours fixing it, and I hope it's done. The solution is one
of the ugliest hacks I have ever written, but it relieves the pain. The
full story and the code are on my blog:

http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/

Calling sys.setdefaultencoding might not even help in this case, and
the consensus is that it may be harmful to your code's portability
[1]. Writing output to a terminal may be influenced by your locale,
but I'm not convinced that going through all the locale settings and
setting the character set is the best approach (or even the right
one).

What do you get if you do this...?

import locale
locale.setlocale(locale.LC_ALL, "")
print locale.getlocale()

What is your terminal encoding?

Usually, if I'm wanting to print Unicode objects, I explicitly encode
them into something I know the terminal will support. The codecs
module can help with writing Unicode to streams in different
encodings, too.
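
A minimal sketch of the explicit-encoding approach, written for Python 3, where the Unicode objects discussed in this thread are `str` and byte strings are `bytes`. The helper name `encode_for_terminal` and the choice of `errors="replace"` are assumptions for illustration, not something taken from the posts here.

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Encode text for a terminal, replacing characters it cannot show.

    Hypothetical helper: the name and the "replace" policy are assumptions.
    """
    # Fall back to ASCII when the stream reports no encoding (e.g. a pipe).
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(target, errors="replace")


sample = "\u00e6\u00f8\u00e5"  # the text "æøå"

# KOI8-R has no slots for these letters, so each becomes "?".
koi8_bytes = encode_for_terminal(sample, "koi8-r")

# ISO-8859-15 can represent all three letters directly.
latin9_bytes = encode_for_terminal(sample, "iso-8859-15")
```

With `errors="replace"` the output degrades to question marks instead of raising UnicodeEncodeError, which may or may not be what you want when debugging.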

Paul

[1] http://groups.google.com/group/comp.lang.python/msg/431017a4cb4bb8ea

Oleg Parashchenko · Mar 31, 2007

Hello,

I'm working on a Unicode-aware application. I like to use "print" to
debug programs, but in this case it was a nightmare. The most common
result of "print" was:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
0: ordinal not in range(128)

What does sys.stdout.encoding say?
'KOI8-R'

I spent two hours fixing it, and I hope it's done. The solution is one
of the ugliest hacks I have ever written, but it relieves the pain. The
full story and the code are on my blog:

http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/

Calling sys.setdefaultencoding might not even help in this case, and
the consensus is that it may be harmful to your code's portability
[1].

Yes, but I think UTF-8 is now everywhere.

Writing output to a terminal may be influenced by your locale,
but I'm not convinced that going through all the locale settings and
setting the character set is the best approach (or even the right
one).

What do you get if you do this...?

import locale
locale.setlocale(locale.LC_ALL, "")
print locale.getlocale()

('ru_RU', 'koi8-r')

What is your terminal encoding?
koi8-r

Usually, if I'm wanting to print Unicode objects, I explicitly encode
them into something I know the terminal will support. The codecs
module can help with writing Unicode to streams in different
encodings, too.

As long as input/output is the only place where this is needed, it's OK
to encode explicitly. But I also had problems with, for example, the md5
module, and I don't know the full list of potentially problematic
places. Therefore, I'd rather go with my brutal utf8ization.

Paul

[1] http://groups.google.com/group/comp.lang.python/msg/431017a4cb4bb8ea

Jarek Zgoda · Mar 31, 2007

Oleg Parashchenko wrote:

I spent two hours fixing it, and I hope it's done. The solution is one
of the ugliest hacks I have ever written, but it relieves the pain. The
full story and the code are on my blog:
http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/

Calling sys.setdefaultencoding might not even help in this case, and
the consensus is that it may be harmful to your code's portability
[1].

Yes, but I think UTF-8 is now everywhere.

No, it is not. Your own system is "not ready for UTF-8", as you stated
somewhere in this blog entry. How can you expect everybody else's system
to be UTF-8 while "you are not ready for transition"?

It would be better to write your programs in an encoding-agnostic way,
using byte streams only for input and output (yes, printing a debug
statement on a terminal *is* a kind of output). And, oh, you can't
encode or decode text without knowing the encoding...
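
To illustrate that last point, here is a small Python 3 sketch (the thread itself uses Python 2). It decodes the same three bytes with the two locale encodings mentioned in this thread; the byte values are an assumed example, not data from anyone's program.

```python
# The same raw bytes, with nothing recording which encoding produced them.
raw = b"\xe6\xf8\xe5"

# Decoded as ISO-8859-15 they are Latin letters...
as_latin9 = raw.decode("iso-8859-15")

# ...while decoded as KOI8-R they are Cyrillic letters.
as_koi8r = raw.decode("koi8-r")

# Both calls succeed, so no exception warns you about a wrong guess.
same_text = (as_latin9 == as_koi8r)
```

Neither decode raises an error, which is why the encoding has to be known from somewhere else (locale, file format, protocol header) rather than guessed from the bytes.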

Paul Boddie · Mar 31, 2007

I think I've found the actual source of this, and it isn't the print
statement. UnicodeDecodeError relates to the construction of Unicode
objects, not the encoding of such objects as byte strings. The
terminology is explained using this simple diagram (which hopefully
won't be ruined in transmission):

  byte string in XYZ encoding
          |
  (decode from XYZ) --> possible UnicodeDecodeError
          |
          V
    Unicode object
          |
  (encode to ABC) --> possible UnicodeEncodeError
          |
          V
  byte string in ABC encoding
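
The two stages of the diagram can be sketched as one function. This is Python 3 syntax, where the "Unicode object" is `str` and the byte strings are `bytes`; the function name `transcode` is a hypothetical label for this illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes to text, then encode the text in another encoding."""
    text = data.decode(source_encoding)   # may raise UnicodeDecodeError
    return text.encode(target_encoding)   # may raise UnicodeEncodeError


latin9_bytes = b"\xe6\xf8\xe5"  # "æøå" as ISO-8859-15 bytes

# Both stages succeed when each encoding matches the data.
utf8_bytes = transcode(latin9_bytes, "iso-8859-15", "utf-8")


def failure_kind(source_encoding, target_encoding):
    """Return the name of the exception raised, or None on success."""
    try:
        transcode(latin9_bytes, source_encoding, target_encoding)
    except UnicodeError as exc:
        return type(exc).__name__
    return None
```

Calling `failure_kind("ascii", "utf-8")` fails at the first stage, and `failure_kind("iso-8859-15", "ascii")` fails at the second, which matches the two arrows in the diagram.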

What does sys.stdout.encoding say?
'KOI8-R'

[...]

What do you get if you do this...?

import locale
locale.setlocale(locale.LC_ALL, "")
print locale.getlocale()

('ru_RU', 'koi8-r')

What is your terminal encoding?

koi8-r

Here's a transcript on my system answering the same questions:

Python 2.4.1 (#2, Oct 4 2006, 16:53:35)
[GCC 3.3.5 (Debian 1:3.3.5-8ubuntu2.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
('en_US', 'iso-8859-15')

So Python knows about the locale. Note that neither of us uses UTF-8 as
a system encoding.
'ISO-8859-15'

This tells us that Python could know things about writing Unicode
objects out in the appropriate encoding. I wasn't sure whether Python
was so smart about this, so let's see what happens...
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position
0: ordinal not in range(128)

Now this isn't anything to do with the print operation: what's
happening here is that I'm explicitly making a Unicode object but
haven't said what the encoding of my byte string is. The default
encoding is 'ascii' as stated in the error message. None of the
characters provided belong to the ASCII character set.

We can check this by not printing anything out:
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position
0: ordinal not in range(128)

So, let's try again and provide an encoding...
æøå

Here, we've mentioned the encoding and even though the print statement
is acting on a Unicode object, it seems to be happy to work out the
resulting encoding.
æøå

Here, we've skipped the explicit Unicode object construction by using
a Unicode literal, which works in this simple case.
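
The same failure and fix can be reproduced in Python 3 by decoding explicitly. This is a sketch under the assumption that the bytes are "æøå" as an ISO-8859-15 terminal would send them; it is not the Python 2 session above.

```python
raw = b"\xe6\xf8\xe5"  # assumed input: "æøå" encoded as ISO-8859-15

# Decoding with the 'ascii' codec fails on the very first byte.
try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Naming the real encoding makes the same bytes decode cleanly.
text = raw.decode("iso-8859-15")
```

The error text stored in `message` names the codec, the offending byte and its position, so it points at the decoding step rather than at whatever later tries to print the result.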

Of course, if your system encoding (along with the terminal) isn't
capable of displaying every Unicode character, you'll experience
problems doing the above. Frequently, it's interesting to encode
things as UTF-8 and look at them in applications that are capable of
displaying the text. Thus, you'd do something like this:

import unicodedata

(This gets an interesting function to help us look up characters in
the Unicode database.)

somefile = open("somefile.txt", "wb")
print >>somefile, unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR").encode("utf-8")

Or even this:

import codecs
somefile = codecs.open("somefile.txt", "wb", encoding="utf-8")
print >>somefile, unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR")

Here, we only specified the encoding once when opening the file. The
file object accepts Unicode objects thereafter.
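
For comparison, a sketch of the same idea in Python 3, where the built-in `open` takes an `encoding` argument directly. The temporary directory is an assumption made so that the example does not touch the current directory.

```python
import os
import tempfile
import unicodedata

# Look up the character by its Unicode name, as in the examples above.
char = unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR")

# Assumed location: a throwaway directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "somefile.txt")

# The encoding is stated once; write() then accepts text, not bytes.
with open(path, "w", encoding="utf-8", newline="\n") as somefile:
    somefile.write(char + "\n")

# Reading back in binary mode shows the UTF-8 bytes that reached the disk.
with open(path, "rb") as somefile:
    raw = somefile.read()
```

Opening the file in binary mode afterwards is only there to show what was written; normal code would read it back with the same `encoding` argument.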

As long as input/output is the only place where this is needed, it's OK
to encode explicitly. But I also had problems with, for example, the md5
module, and I don't know the full list of potentially problematic
places. Therefore, I'd rather go with my brutal utf8ization.

It's best to decode (i.e. construct Unicode objects) upon receiving
data as input, and to encode (i.e. convert Unicode objects to byte
strings) upon producing output. You'd have to post example code for us
to help you out, but the problem with the md5 module may be that it
assumes byte strings and doesn't work properly with Unicode objects. I
can't say for sure, because I usually present byte strings to md5
module functions on the rare occasions I do anything with them. Note
that one would usually calculate MD5 checksums on raw data, although I
can imagine a hypothetical (if perhaps unrealistic) need to do so on
Unicode text, so it doesn't necessarily make much sense to present
those functions with Unicode data.
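
As an illustration only, here is how the byte-string requirement looks with `hashlib`, the module that replaced `md5` in later Python versions. This is Python 3 and an assumed example; it may not match the problem Oleg actually hit.

```python
import hashlib

text = "\u00e6\u00f8\u00e5"  # the text "æøå"

# Passing text directly is rejected: the hash works on bytes only.
try:
    hashlib.md5(text)
    type_error = None
except TypeError as exc:
    type_error = exc

# Encoding first works, but the checksum depends on the chosen encoding.
digest_utf8 = hashlib.md5(text.encode("utf-8")).hexdigest()
digest_latin9 = hashlib.md5(text.encode("iso-8859-15")).hexdigest()
```

The two digests differ because the underlying bytes differ, so anyone comparing checksums of text has to agree on the encoding first.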

Paul
