Printing UTF-8

sheldon.regular · Sep 21, 2006

I am new to Unicode so please bear with my stupidity.

I am doing the following in a Python IDE called Wing with Python 2.3.
äöü

Why can't I get äöü printed from utf-8 and I can from latin1? How
can I use utf-8 exclusively and be able to print the characters?

I also did the same thing on the same machine in a command window...
ActivePython 2.3.2 Build 230 (ActiveState Corp.) based on
Python 2.3.2 (#49, Oct 24 2003, 13:37:57) [MSC v.1200 32 bit (Intel)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0: unexpected code byte
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0: unexpected code byte

Why such a difference from the IDE to the command window in what it can
do and the internal representation of the Unicode?

Thanks,
Shel

John Machin · Sep 21, 2006

I am new to Unicode so please bear with my stupidity.

I am doing the following in a Python IDE called Wing with Python 2.3.

From later evidence, this string is encoded as utf-8. Looks like Wing
must be using an implicit "# coding: utf-8" for interactive input ...

Ã¤Ã¶Ã¼

... but uses some other encoding for output. Try doing this, and see
what you get:
import sys
print sys.stdout.encoding

'\xc3\xa4\xc3\xb6\xc3\xbc'

Yup, looks like utf-8 ...

u'\xe4\xf6\xfc'

Yup, decodes from utf-8 without error

u'\xe4\xf6\xfc'

and those Unicode characters actually look like what you started with:

| >>> import unicodedata as ucd
| >>> [ucd.name(x) for x in u'\xe4\xf6\xfc']
| ['LATIN SMALL LETTER A WITH DIAERESIS',
|  'LATIN SMALL LETTER O WITH DIAERESIS',
|  'LATIN SMALL LETTER U WITH DIAERESIS']
| >>>

So, 3 yups, it must be utf-8.
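The three checks above can be replayed in one go. This is only a sketch, and it assumes Python 3 syntax rather than the Python 2.3 used in this thread, so the byte string needs a `b` prefix and the decoded result is a plain `str`. The names `raw`, `text` and `garbled` are mine, not anything from the original session.

```python
import unicodedata

# The six bytes shown in the repr above (what Wing stored for the input).
raw = b'\xc3\xa4\xc3\xb6\xc3\xbc'

# Decoding as UTF-8 succeeds and yields three code points.
text = raw.decode('utf-8')

# Look up the official Unicode name of each code point.
names = [unicodedata.name(ch) for ch in text]
for ch, name in zip(text, names):
    print('U+%04X %s' % (ord(ch), name))

# Decoding the same bytes as latin1 never raises, but it produces six
# characters instead of three: the garbled form seen in the Wing output.
garbled = raw.decode('latin-1')
print(len(text), len(garbled))
```

The last line is the point: the bytes are identical in both cases, and only the table used to interpret them changes.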

äöü

Why can't I get äöü printed from utf-8 and I can from latin1?

Because str objects are just strings of anonymous bytes. They don't
have an attribute that says what encoding their creator had in mind.
Consequently output channels like stdout have an encoding which is
applied to all output. On Windows, in a GUI, this encoding depends on
your locale, and in your case is probably cp1252. cp1252 is very
similar to latin1 but has extra symbols in it. Try repeating the above
exercise, but this time include a trademark symbol in your s string,
and add
print u.encode("cp1252")
at the end of the exercise.
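For anyone who would rather see the cp1252 versus latin1 difference without touching the console, here is a rough sketch. It assumes Python 3 (not the 2.3 interpreter in this thread), and the variable names are made up for the illustration.

```python
# TRADE MARK SIGN, U+2122: present in cp1252, absent from latin1.
tm = '\u2122'

# cp1252 has a slot for it.
cp1252_bytes = tm.encode('cp1252')

# latin1 does not, so encoding raises UnicodeEncodeError.
try:
    tm.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# For the three umlauts the two encodings agree byte for byte.
umlauts = '\xe4\xf6\xfc'
same_for_umlauts = umlauts.encode('cp1252') == umlauts.encode('latin-1')

print(cp1252_bytes, latin1_ok, same_for_umlauts)
```

So output that happens to contain only characters like äöü cannot tell you which of the two encodings the output channel is using; a character from the "extra symbols" range can.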

How can I use utf-8 exclusively and be able to print the characters?

print exclusiveutf8.decode('utf-8').encode(whateverittakes)

Why do you want to use utf-8 exclusively? Use it for what?

Basic principle when working with non-ASCII data: decode 8-bit input
into Unicode; process using Unicode-aware software (in Python's case,
the built-in unicode type); if 8-bit output is required, encode your
Unicode data with whatever encoding is required.
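A minimal sketch of that decode / process / encode pattern, assuming Python 3 (where the text type is `str` rather than the `unicode` type of Python 2.3). The helper `transcode` and the uppercase step are hypothetical, chosen only to give the "process" stage something to do.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode 8-bit input, process it as text, encode for the destination."""
    text = data.decode(source_encoding)    # 8-bit input -> Unicode text
    text = text.upper()                    # example processing on text, not bytes
    return text.encode(target_encoding)    # Unicode text -> 8-bit output


# UTF-8 input, as stored by the IDE in this thread.
utf8_input = b'\xc3\xa4\xc3\xb6\xc3\xbc'

# Same text, encoded for two different output channels.
for_gui = transcode(utf8_input, 'utf-8', 'cp1252')
for_console = transcode(utf8_input, 'utf-8', 'cp850')

print(for_gui, for_console)
```

The two results are different byte strings for the same three characters, which is why the target encoding has to be chosen per output channel rather than fixed once.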

I also did the same thing on the same machine in a command window...
ActivePython 2.3.2 Build 230 (ActiveState Corp.) based on
Python 2.3.2 (#49, Oct 24 2003, 13:37:57) [MSC v.1200 32 bit (Intel)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0: unexpected code byte
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0: unexpected code byte

Why such a difference from the IDE to the command window in what it can
do

Because the command window is the child of MS-DOS, which was the child
of CP/M, and maintains the ancient traditions (like ctrl-Z being taken
as EOF, for example).

and the internal representation of the unicode?

Unicode? There's no Unicode involved here. In each case you are sending
a string of bytes (0 <= ordinal <= 255) to an output device, each to be
rendered as a bitmap on the screen. Wing evidently causes the renderer
to reach for the latin1 or cp1252 table; the command window is probably
(in your case) using cp850 (or something similar).

On my box, in a command window:
| >>> sys.stdout.encoding
| 'cp850'
| >>> '\x84\x94\x81'.decode('cp850')
| u'\xe4\xf6\xfc'
... which is what you had before.
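To tie this back to the traceback: the bytes the command window produced are valid cp850 but not valid utf-8, which is why the decode blew up at position 0. A sketch, again assuming Python 3 syntax and cp850 as the console code page (as on my box; yours may differ):

```python
# The bytes a cp850 console produces for the three umlauts.
console_bytes = b'\x84\x94\x81'

# 0x84 cannot start a UTF-8 sequence, so decoding fails immediately.
try:
    console_bytes.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the console's own code page recovers the characters ...
text = console_bytes.decode('cp850')

# ... which can then be encoded as UTF-8 if that is what is wanted.
utf8_bytes = text.encode('utf-8')

print(error.start if error else None, utf8_bytes)
```

The utf-8 bytes that come out at the end are the same six bytes Wing held for the string, so both environments were dealing with the same three characters all along.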

HTH,
John
