How to display unicode with the CGI module?

coldpizza

Hi!

I am using the built-in Python web server (CGIHTTPServer) to serve
pages via CGI.
The problem I am having is that I get an error while trying to display
Unicode UTF-8 characters via a Python CGI script.

The error goes like this: "UnicodeEncodeError: 'ascii' codec can't
encode character u'\u026a' in position 12: ordinal not in range(128)".

My question is: (1) how and (2) where do I set the encoding for the
page?

I have tried adding
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
but this does not seem to help, as this is an instruction for the
browser, not for the webserver and/or CGI script.

Do I have to set the encoding in the server script? Or in the Python
CGI script?

The data that I want to display comes from a sqlite3 database and is
already in Unicode format.

The webserver script looks like this:

Code:
#
import CGIHTTPServer, BaseHTTPServer
httpd=BaseHTTPServer.HTTPServer(('',8080),
CGIHTTPServer.CGIHTTPRequestHandler)
httpd.serve_forever()
#

A simplified version of my Python CGI script would be:
Code:
import cgi

print "Content-Type: text/html"
print

print "<html>"
print " <body>"
print   "my UTF8 string: Français 日本語 Español Português Română"
print " </body>"
print "</html>"

Where and what do I need to add to these scripts to get proper display
of UTF8 content?
 
Marc 'BlackJack' Rintsch

coldpizza said:
The problem I am having is that I get an error while trying to display
Unicode UTF-8 characters via a Python CGI script.

The error goes like this: "UnicodeEncodeError: 'ascii' codec can't
encode character u'\u026a' in position 12: ordinal not in range(128)".

Unicode != UTF-8. You are not trying to send a UTF-8 encoded byte string
but a *unicode string*. That's not possible. If unicode strings are to
"leave" your program they must be encoded into byte strings. If you don't
do this explicitly, Python tries to encode them as ASCII and fails if there's
anything non-ASCII in the string. The `encode()` method is your friend.

Ciao,
Marc 'BlackJack' Rintsch
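
A minimal sketch of what that advice might look like in a CGI script, assuming the page is meant to be served as UTF-8. The helper names `build_response` and `write_response` are made up for illustration, and the `io.BytesIO` buffer only stands in for the script's standard output so that the snippet runs on its own. The header is written as a full `Content-Type:` line that names the charset, so the browser knows how to decode the bytes it receives.

```python
import io


def build_response(body_text, encoding="utf-8"):
    """Return a complete CGI response (header + page) as encoded bytes."""
    header = u"Content-Type: text/html; charset=%s\n\n" % encoding
    page = u"<html>\n <body>\n  %s\n </body>\n</html>\n" % body_text
    # Encode exactly once, at the point where the text leaves the program.
    return (header + page).encode(encoding)


def write_response(stream, body_text):
    """Write the encoded response to a binary stream."""
    stream.write(build_response(body_text))


# Stand-in for standard output; the text contains the u'\u026a' character
# from the error message in the question.
buf = io.BytesIO()
write_response(buf, u"my UTF8 string: Fran\u00e7ais \u026a")
```

In a real script the stream would be `sys.stdout` on Python 2 or `sys.stdout.buffer` on Python 3.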
 
paul

Marc said:
Unicode != UTF-8. You are not trying to send a UTF-8 encoded byte string
but a *unicode string*.

Just to expand on this... It helps to think of "unicode objects" and
"strings" as separate types (which they are). So there is no such thing
as a "unicode string", and you always need to think about when to
encode() your unicode objects. However, this will change in py3k...
What's the new rule of thumb?

cheers
Paul
 
coldpizza

Marc said:
Unicode != UTF-8.
....
`encode()` method is your friend.

Thanks a lot for help!

I am always confused as to which one to use: encode() or decode(); I
have initially tried decode() and it did not work.

It is funny that encode() and decode() omit the name of the other
encoding (Unicode ucs2?), which makes it far less readable than a
s.recode('ucs2','utf8').

Another weird thing is that by default Python converts internal
Unicode to ascii. Will it be the same in Py3k?
 
Jan Claeys

On Sun, 25 Nov 2007 13:02:26 -0800, coldpizza wrote:
It is funny that encode() and decode() omit the name of the other
encoding (Unicode ucs2?), which makes it far less readable than a
s.recode('ucs2','utf8').

The internal encoding/representation of a "string" of Unicode characters
is considered an implementation detail and is in fact not always the same
(e.g. a CPython build parameter selects UCS2 or UCS4, and it might be
something else in other implementations).

See the 'Py_UNICODE' paragraph in:
<http://docs.python.org/api/unicodeObjects.html>
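
One way to see which storage a given interpreter uses is `sys.maxunicode`. The sketch below is only illustrative: on the Python 2 builds discussed here the value distinguishes narrow (UCS2) from wide (UCS4) builds, while CPython 3.3 and later always report 0x10FFFF because the storage became flexible. Either way, the encoded result does not depend on it.

```python
import sys

# sys.maxunicode is the largest code point a text object can hold.
# 0xFFFF indicates a narrow (UCS2) build; 0x10FFFF a wide (UCS4) build
# or, on CPython 3.3+, the flexible internal representation.
if sys.maxunicode == 0xFFFF:
    build = "narrow (UCS2)"
else:
    build = "wide (UCS4) or flexible storage"
print("sys.maxunicode = %#x -> %s" % (sys.maxunicode, build))

# The internal storage does not affect what encode() produces.
encoded = u"\u026a".encode("utf-8")
```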
 
greg

paul said:
However, this will change in py3k...,
what's the new rule of thumb?

In py3k, the str type will be what unicode is now, and there
will be a new type called bytes for holding binary data --
including text in some external encoding. These two types
will not be compatible.

At the lowest level, reading a file will return bytes, which
then have to be decoded to produce a (unicode) str, and a str
will have to be encoded into bytes before being written to a
file.

There will be wrappers for text files that perform the
decoding and encoding automatically, but they will need to
be set up to use a specified encoding if you're dealing
with anything other than ascii. (It may be possible to
set up a system-wide default, I'm not sure.)
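
As a rough illustration of such a wrapper, here is a sketch using `io.TextIOWrapper`, which is what Python 3 ended up providing (the `io` module is also available from Python 2.6). The in-memory `io.BytesIO` objects stand in for real binary files, and the variable names are chosen for this example only.

```python
import io

# Writing: text goes in, the wrapper encodes it on the way out.
raw = io.BytesIO()                       # stands in for a binary file
writer = io.TextIOWrapper(raw, encoding="utf-8")
writer.write(u"Espa\u00f1ol")
writer.flush()                           # push buffered text through
stored = raw.getvalue()                  # the bytes that reached the "file"

# Reading: bytes come in, the wrapper decodes them on the way back.
reader = io.TextIOWrapper(io.BytesIO(stored), encoding="utf-8")
loaded = reader.read()
```

On Python 3, `open(path, encoding="utf-8")` sets up the same kind of wrapper around a real file.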

So you won't be able to get away with ignoring encoding
issues in py3k. On the plus side, it should all be handled
in a much more consistent and less error-prone way. If
you mistakenly try to use encoded data as though it were
decoded data or vice versa, you'll get a type error.
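
A small sketch of that type error, assuming a released Python 3 interpreter. On Python 2 the same line would instead attempt an implicit ASCII decode, which is the behaviour this thread is about.

```python
text = u"Rom\u00e2n\u0103"        # decoded text (str on Python 3)
data = text.encode("utf-8")       # encoded bytes

try:
    combined = data + text        # mixing encoded and decoded data
except TypeError as exc:
    error_name = type(exc).__name__
else:
    error_name = None             # not reached on Python 3
```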
 
greg

coldpizza said:
I am always confused as to which one to use: encode() or decode();

In unicode land, an "encoding" is a method of representing
unicode data in an external format. So you encode unicode
data in order to send it into the outside world, and you
decode it in order to turn it back into unicode data.

It'll be easier to get right in py3k, because bytes will only have
a decode() method and str will only have an encode() method.
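
The direction can be sketched as a round trip. This assumes Python 3, where the two `hasattr` checks come out as described above; on Python 2 both types still carry both methods, which is the source of the confusion.

```python
text = u"\u65e5\u672c\u8a9e"          # three CJK characters as text

data = text.encode("utf-8")           # text -> bytes, for the outside world
round_trip = data.decode("utf-8")     # bytes -> text, coming back in

# On Python 3, text has no decode() and bytes has no encode().
str_has_decode = hasattr(text, "decode")
bytes_has_encode = hasattr(data, "encode")
```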

coldpizza said:
It is funny that encode() and decode() omit the name of the other
encoding (Unicode ucs2?),

Unicode objects don't *have* an encoding. UCS2 is not an encoding,
it's an internal storage format. You're not supposed to need to know
or care about it, and it could be different between different
Python builds.

coldpizza said:
Another weird thing is that by default Python converts internal
Unicode to ascii.

It's the safest assumption. Python is refusing the temptation
to guess the encoding of anything outside the range 0-127 if you
don't tell it.
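
The error from the original question can be reproduced by asking for ASCII explicitly, which is in effect what Python 2 does when no encoding is given. The sample word is made up for this sketch. Naming an encoding, or an error handler such as `xmlcharrefreplace` (handy for HTML output), is how you tell Python what you want instead.

```python
text = u"sh\u026ap"                   # contains the character from the error

try:
    text.encode("ascii")              # ASCII cannot represent u'\u026a'
except UnicodeEncodeError as exc:
    message = str(exc)

# Explicit choices instead of the ASCII default:
as_utf8 = text.encode("utf-8")
as_refs = text.encode("ascii", "xmlcharrefreplace")
```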
 
paul

greg said:
paul said:
However, this will change in py3k...,
what's the new rule of thumb?
[snip]

So you won't be able to get away with ignoring encoding
issues in py3k. On the plus side, it should all be handled
in a much more consistent and less error-prone way. If
you mistakenly try to use encoded data as though it were
decoded data or vice versa, you'll get a type error.

Thanks for your detailed answer. In fact, having encode() only for <str>
and decode() only for <bytes> will simplify things a lot. I guess the implicit
encode() of <str> when using print() will stay, but having utf-8 as the
new default encoding will reduce the number of UnicodeErrors. You'll get
weird characters instead ;)

cheers
Paul
 
