How to display unicode with the CGI module?

coldpizza · Nov 24, 2007

Hi!

I am using the built-in Python web server (CGIHTTPServer) to serve
pages via CGI.
The problem I am having is that I get an error while trying to display
Unicode UTF-8 characters via a Python CGI script.

The error goes like this: "UnicodeEncodeError: 'ascii' codec can't
encode character u'\u026a' in position 12: ordinal not in range(128)".

My question is: (1) how and (2) where do I set the encoding for the
page?

I have tried adding <meta http-equiv="content-type" content="text/html; charset=utf-8" />
but this does not seem to help, as this is an instruction for the browser,
not for the webserver and/or CGI script.

Do I have to set the encoding in the server script? Or in the Python
CGI script?

The data that I want to display comes from a sqlite3 database and is
already in Unicode format.

The webserver script looks like this:

Code:

# Serve the current directory on port 8080 and run CGI scripts (Python 2)
import BaseHTTPServer
import CGIHTTPServer

httpd = BaseHTTPServer.HTTPServer(('', 8080), CGIHTTPServer.CGIHTTPRequestHandler)
httpd.serve_forever()

A simplified version of my Python CGI script would be:

Code:

import cgi

print "Content-Type: text/html"
print

print "<html>"
print " <body>"
print   "my UTF8 string: Français 日本語 Español Português Română"
print " </body>"
print "</html>"

Where and what do I need to add to these scripts to get proper display
of UTF8 content?

Marc 'BlackJack' Rintsch · Nov 25, 2007

coldpizza said:
The problem I am having is that I get an error while trying to display
Unicode UTF-8 characters via a Python CGI script.

The error goes like this: "UnicodeEncodeError: 'ascii' codec can't
encode character u'\u026a' in position 12: ordinal not in range(128)".

Unicode != UTF-8. You are not trying to send a UTF-8 encoded byte string
but a *unicode string*. That's not possible. If unicode strings should
"leave" your program they must be encoded into byte strings. If you don't
do this explicitly Python tries to encode as ASCII and fails if there's
anything non-ASCII in the string. The `encode()` method is your friend.
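As a rough illustration of the advice above, here is a minimal sketch in Python 3 syntax. The helper name `build_response` is hypothetical and not part of the `cgi` module; the point is only that the text is encoded to UTF-8 bytes explicitly, and that the `Content-Type` header names the same charset. In the Python 2 of this thread the equivalent step would be `print text.encode('utf-8')`.

```python
# Python 3 sketch. build_response is a hypothetical helper used for illustration.

def build_response(text):
    """Return a complete CGI response (headers plus body) as UTF-8 bytes."""
    body = "<html>\n <body>\n  my string: {}\n </body>\n</html>\n".format(text)
    # Tell the browser which encoding the bytes below are in.
    headers = "Content-Type: text/html; charset=utf-8\r\n\r\n"
    # Encode explicitly: only bytes can leave the program.
    return (headers + body).encode("utf-8")


response = build_response("Français 日本語 Español")
print(response.split(b"\r\n")[0])
```

In an actual Python 3 CGI script the bytes would then be written with `sys.stdout.buffer.write(response)`, which bypasses any implicit encoding of `sys.stdout`.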

Ciao,
Marc 'BlackJack' Rintsch

paul · Nov 25, 2007

Marc said:
Unicode != UTF-8. You are not trying to send a UTF-8 encoded byte string
but a *unicode string*.

Just to expand on this... It helps to think of "unicode objects" and
"strings" as separate types (which they are). So there is no such thing
as a "unicode string" and you always need to think about when to
encode() your unicode objects. However, this will change in py3k...,
what's the new rule of thumb?

cheers
Paul

coldpizza · Nov 25, 2007

Marc said:
Unicode != UTF-8.
....

`encode()` method is your friend.

Thanks a lot for help!

I am always confused as to which one to use: encode() or decode(); I
have initially tried decode() and it did not work.

It is funny that encode() and decode() omit the name of the other
encoding (Unicode ucs2?), which makes it far less readable than a
s.recode('ucs2','utf8').

Another weird thing is that by default Python converts internal
Unicode to ascii. Will it be the same in Py3k?

Jan Claeys · Nov 26, 2007

On Sun, 25 Nov 2007 13:02:26 -0800, coldpizza wrote:

It is funny that encode() and decode() omit the name of the other
encoding (Unicode ucs2?), which makes it far less readable than a
s.recode('ucs2','utf8').

The internal encoding/representation of a "string" of Unicode characters
is considered an implementation detail and is in fact not always the same
(e.g. a CPython build parameter selects UCS2 or UCS4, and it might be
something else in other implementations).

See the 'Py_UNICODE' paragraph in:
<http://docs.python.org/api/unicodeObjects.html>

greg · Nov 26, 2007

paul said:
However, this will change in py3k...,
what's the new rule of thumb?

In py3k, the str type will be what unicode is now, and there
will be a new type called bytes for holding binary data --
including text in some external encoding. These two types
will not be compatible.

At the lowest level, reading a file will return bytes, which
then have to be decoded to produce a (unicode) str, and a str
will have to be encoded into bytes before being written to a
file.

There will be wrappers for text files that perform the
decoding and encoding automatically, but they will need to
be set up to use a specified encoding if you're dealing
with anything other than ascii. (It may be possible to
set up a system-wide default, I'm not sure.)

So you won't be able to get away with ignoring encoding
issues in py3k. On the plus side, it should all be handled
in a much more consistent and less error-prone way. If
you mistakenly try to use encoded data as though it were
decoded data or vice versa, you'll get a type error.
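The behaviour described above can be sketched in Python 3 as it was eventually released. This is only an illustration with made-up sample strings: it decodes bytes to `str`, shows that mixing the two types raises `TypeError`, and uses `io.TextIOWrapper` as one example of a text wrapper configured with an explicit encoding.

```python
import io

# Bytes, as they would arrive from a file opened in binary mode.
raw = "Español".encode("utf-8")
# Decoding produces a (unicode) str.
text = raw.decode("utf-8")

# Mixing encoded and decoded data is a type error, not a silent conversion.
mixing_error = None
try:
    text + raw
except TypeError as exc:
    mixing_error = exc

# A text wrapper performs the encoding automatically when given an encoding.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="utf-8")
wrapper.write("日本語")
wrapper.flush()
written = buffer.getvalue()

print(type(raw).__name__, type(text).__name__, type(mixing_error).__name__)
```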

greg · Nov 26, 2007

coldpizza said:
I am always confused as to which one to use: encode() or decode();

In unicode land, an "encoding" is a method of representing
unicode data in an external format. So you encode unicode
data in order to send it into the outside world, and you
decode it in order to turn it back into unicode data.

It'll be easier to get right in py3k, because bytes will only have
a decode() method and str will only have an encode() method.
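A small Python 3 sketch of that direction rule, using an arbitrary sample string: `str` is encoded on the way out, `bytes` is decoded on the way in, and each type offers only the method that makes sense for it.

```python
text = "Română"

# str -> bytes: encode when the data leaves the program.
data = text.encode("utf-8")
# bytes -> str: decode when the data comes back in.
round_trip = data.decode("utf-8")

# In Python 3 each type has only one of the two methods.
str_has_decode = hasattr(text, "decode")
bytes_has_encode = hasattr(data, "encode")

print(round_trip == text, str_has_decode, bytes_has_encode)
```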

It is funny that encode() and decode() omit the name of the other
encoding (Unicode ucs2?),

Unicode objects don't *have* an encoding. UCS2 is not an encoding,
it's an internal storage format. You're not supposed to need to know
or care about it, and it could be different between different
Python builds.
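One way to see that the text itself carries no encoding is to encode the same character several ways. The sketch below (Python 3 syntax) uses the character from the error message in the original post; the choice of the three encodings is just for illustration.

```python
# U+026A, the character named in the UnicodeEncodeError above.
text = "\u026a"

# The same abstract character maps to different byte sequences
# depending on which external encoding is requested.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    print(name, data.hex())

# Every byte sequence decodes back to the same one-character string.
all_round_trip = all(data.decode(name) == text for name, data in encoded.items())
```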

Another weird thing is that by default Python converts internal
Unicode to ascii.

It's the safest assumption. Python is refusing the temptation
to guess the encoding of anything outside the range 0-127 if you
don't tell it.
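The refusal can be reproduced directly by asking for ASCII explicitly. This Python 3 sketch triggers the same kind of error as in the original post and also shows the `errors="replace"` option, which substitutes a placeholder instead of raising.

```python
# U+026A is outside the ASCII range 0-127.
text = "\u026a"

error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and why.
print(error.encoding, "-", error.reason)

# Asking for replacement trades the exception for a lossy '?'.
replaced = text.encode("ascii", errors="replace")
```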

paul · Nov 26, 2007

greg said:
paul said:

However, this will change in py3k...,
what's the new rule of thumb?

[snipp]

So you won't be able to get away with ignoring encoding
issues in py3k. On the plus side, it should all be handled
in a much more consistent and less error-prone way. If
you mistakenly try to use encoded data as though it were
decoded data or vice versa, you'll get a type error.

Thanks for your detailed answer. In fact, having encode() only for <str>
and decode() only for <bytes> will simplify things a lot. I guess implicit
encode() of <str> when using print() will stay but having utf-8 as the
new default encoding will reduce the number of UnicodeErrors. You'll get
weird characters instead.
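The "weird characters" case can be sketched as follows, assuming the receiving side wrongly interprets UTF-8 bytes as Windows-1252 (`cp1252`); that choice of wrong encoding is an assumption made for the example. No exception is raised, the text is just garbled, and it can be repaired only if the wrong guess is known.

```python
original = "Français"

# Encode as UTF-8, then decode with the wrong codec: no error, just mojibake.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the wrong step recovers the text, if the wrong codec is known.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```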

cheers
Paul
