Printing UTF-8

sheldon.regular

I am new to unicode so please bear with my stupidity.

I am doing the following in a Python IDE called Wing with Python 2.3.
äöü

Why can't I get äöü printed from utf-8 and I can from latin1? How
can I use utf-8 exclusively and be able to print the characters?

I also did the same thing on the same machine in a command window...
ActivePython 2.3.2 Build 230 (ActiveState Corp.) based on
Python 2.3.2 (#49, Oct 24 2003, 13:37:57) [MSC v.1200 32 bit (Intel)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0:
unexpected code byte
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0:
unexpected code byte
Why such a difference from the IDE to the command window in what it can
do and the internal representation of the unicode?

Thanks,
Shel
 
John Machin

I am new to unicode so please bear with my stupidity.

I am doing the following in a Python IDE called Wing with Python 2.3.
From later evidence, this string is encoded as utf-8. Looks like Wing
must be using an implicit "# coding: utf-8" for interactive input ...
äöü

.... but uses some other encoding for output. Try doing this, and see
what you get:
import sys
print sys.stdout.encoding
'\xc3\xa4\xc3\xb6\xc3\xbc'

Yup, looks like utf-8 ...
u'\xe4\xf6\xfc'

Yup, decodes from utf-8 without error
u'\xe4\xf6\xfc'

and those Unicode characters actually look like what you started with:

| >>> import unicodedata as ucd
| >>> [ucd.name(x) for x in u'\xe4\xf6\xfc']
| ['LATIN SMALL LETTER A WITH DIAERESIS', 'LATIN SMALL LETTER O WITH DIAERESIS',
| 'LATIN SMALL LETTER U WITH DIAERESIS']
| >>>

So, 3 yups, it must be utf-8.
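The three checks above can be repeated in one go. This is only a sketch: the variable names are mine, and it uses b'...' literals so that it also runs on Python 3 (on Python 2.3, drop the b prefix, since a plain str is already a byte string there).

```python
import unicodedata

# The six bytes that repr(s) showed in Wing.
raw = b'\xc3\xa4\xc3\xb6\xc3\xbc'

# Decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode('utf-8')

# Look up the official Unicode name of each decoded character.
names = [unicodedata.name(ch) for ch in text]

print(len(raw), len(text))   # byte count versus character count
print(names)
```

Six bytes decode to three characters, which is what you would expect from UTF-8 for these letters: each of them takes two bytes.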

äöü

Why can't I get äöü printed from utf-8 and I can from latin1?

Because str objects are just strings of anonymous bytes. They don't
have an attribute that says what encoding their creator had in mind.
Consequently output channels like stdout have an encoding which is
applied to all output. On Windows, in a GUI, this encoding depends on
your locale, and in your case is probably cp1252. cp1252 is very
similar to latin1 but has extra symbols in it. Try repeating the above
exercise, but this time include a trademark symbol in your s string,
and add
print u.encode("cp1252")
at the end of the exercise.
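To see the cp1252 versus latin1 difference without a keyboard that has a trademark key, something like the following should do. It is a sketch with made-up variable names, written to run on Python 3 as well as later 2.x releases; it only uses the standard codecs.

```python
umlauts = u'\xe4\xf6\xfc'      # the three letters from the original post
trademark = u'\u2122'          # TRADE MARK SIGN

# For the umlauts the two encodings produce identical bytes.
same_bytes = umlauts.encode('latin1') == umlauts.encode('cp1252')
print(same_bytes)

# cp1252 has a slot for the trademark sign ...
tm_cp1252 = trademark.encode('cp1252')
print(repr(tm_cp1252))

# ... but latin1 does not, so encoding fails.
try:
    trademark.encode('latin1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```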
How can I use utf-8 exclusively and be able to print the characters?

print exclusiveutf8.decode('utf-8').encode(whateverittakes)

Why do you want to use utf-8 exclusively? Use it for what?

Basic principle when working with non-ASCII data: decode 8-bit input
into Unicode; process using Unicode-aware software (in Python's case,
the built-in unicode type); if 8-bit output is required, encode your
Unicode data with whatever encoding is required.
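As a rough illustration of that decode / process / encode principle, here is a small helper. The function name transcode_upper and the choice of upper-casing as the "processing" step are mine, purely for demonstration; the code is written to run on Python 3 too, so the byte strings carry a b prefix.

```python
def transcode_upper(raw, src_encoding, dst_encoding):
    """Decode bytes, work on Unicode text, then encode for the output device."""
    text = raw.decode(src_encoding)      # 8-bit input -> Unicode
    text = text.upper()                  # processing happens on Unicode only
    return text.encode(dst_encoding)     # Unicode -> 8-bit output


utf8_input = b'\xc3\xa4\xc3\xb6\xc3\xbc'            # utf-8 bytes for the umlauts
for_console = transcode_upper(utf8_input, 'utf-8', 'cp850')
for_gui = transcode_upper(utf8_input, 'utf-8', 'cp1252')

print(repr(for_console))
print(repr(for_gui))
```

The point of the design is that the encoding names appear only at the edges; everything in the middle neither knows nor cares where the bytes came from or where they are going.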
I also did the same thing on the same machine in a command window...
ActivePython 2.3.2 Build 230 (ActiveState Corp.) based on
Python 2.3.2 (#49, Oct 24 2003, 13:37:57) [MSC v.1200 32 bit (Intel)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0:
unexpected code byte
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0:
unexpected code byte
Why such a difference from the IDE to the command window in what it can
do

Because the command window is the child of MS-DOS, which was the child
of CP/M, and maintains the ancient traditions (like ctrl-Z being taken
as EOF, for example).
and the internal representation of the unicode?

Unicode? There's no Unicode involved here. In each case you are sending
a string of bytes (0 <= ordinal <= 255) to an output device, each to be
rendered as a bitmap on the screen. Wing evidently causes the renderer
to reach for the latin1 or cp1252 table; the command window is probably
(in your case) using cp850 (or something similar).

On my box, in a command window:
| >>> sys.stdout.encoding
| 'cp850'
| >>> '\x84\x94\x81'.decode('cp850')
| u'\xe4\xf6\xfc'
.... which is what you had before.
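If it helps, the whole IDE-versus-console difference can be laid out side by side. This sketch assumes the console really is cp850, as it is on my box; yours may differ, so check sys.stdout.encoding first. Names are illustrative, and the code is written to run on Python 3 as well.

```python
text = u'\xe4\xf6\xfc'     # the three umlauts as Unicode

# The same three characters come out as different bytes per encoding.
encoded = {}
for enc in ('utf-8', 'latin1', 'cp1252', 'cp850'):
    encoded[enc] = text.encode(enc)
    print('%-7s %r' % (enc, encoded[enc]))

# What the command window handed to Python when the umlauts were typed.
console_bytes = b'\x84\x94\x81'

# Treating those bytes as utf-8 fails at the very first byte, 0x84 ...
try:
    console_bytes.decode('utf-8')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start
print(failed_at)

# ... while decoding them with the console's own code page works.
recovered = console_bytes.decode('cp850')
print(recovered == text)
```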

HTH,
John
 
