Problem with national characters


Leif B. Kristensen

I'm developing a routine that will parse user input. For simplicity, I'm
converting the entire input string to upper case. One of the words that
will have special meaning for the parser is the word "før", (before in
English). However, this word is not recognized. A test in the
interactive shell reveals this:

leif@balapapa leif $ python
Python 2.3.4 (#1, Feb 7 2005, 21:31:38)
[GCC 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
In Windows, the result is slightly different, but no better:

C:\Python23>python
ActivePython 2.3.2 Build 232 (ActiveState Corp.) based on
Python 2.3.2 (#49, Nov 13 2003, 10:34:54) [MSC v.1200 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
Is there a way around this problem? My character set in Linux is
ISO-8859-1. In Windows 2000 it should be the equivalent Latin-1, though
I'm not sure which character set the command shell is using.
 

Leif B. Kristensen

put

import sys
sys.setdefaultencoding('UTF-8')

into sitecustomize.py in the top level of your PYTHONPATH.

Uh ... it doesn't seem like I've got PYTHONPATH defined on my system in
the first place:

leif@balapapa leif $ env |grep -i python
PYTHONDOCS=/usr/share/doc/python-docs-2.3.4/html

I ran this small snippet that I found after a search on Gentoo-forums:
....

/usr/lib/python23.zip
/usr/lib/python2.3
/usr/lib/python2.3/plat-linux2
/usr/lib/python2.3/lib-tk
/usr/lib/python2.3/lib-dynload
/usr/lib/portage/pym
/usr/lib/python2.3/site-packages
/usr/lib/python2.3/site-packages/gtk-2.0

What should my PYTHONPATH look like, and where do you suggest to put the
sitecustomize.py file?
 

Leif B. Kristensen

I figured it out, sort of. Now I've got a PYTHONPATH that points to my
home directory, and I followed your instructions. The first time I got an
error message due to a typo. I corrected it, and now Python starts
without an error message. But it didn't solve my problem with the
uppercase Ø at all. Is there something else I have to do?
 

Leif B. Kristensen

Leif B. Kristensen wrote:
Is there something else I have to do?

Please forgive me for talking with myself here :) I should have looked
up Unicode in "Learning Python" before I asked. This seems to work:
'F\xd8R'

So far, so good. Note that the Unicode representation of the uppercase
version is identical to the default. But when I try the builtin
function unicode(), weird things happen:
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2:
invalid data
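
A likely reading of this traceback: with the default encoding switched to UTF-8, the bytes typed at an ISO-8859-1 terminal get decoded as UTF-8, and the Latin-1 byte for "ø" (0xF8) is not valid there. The sketch below reproduces that mismatch. It assumes Python 3, where bytes and str play the roles of Python 2's str and unicode; the helper name try_decode is made up for illustration.

```python
# Assumption: Python 3. The bytes below are "før" as an ISO-8859-1
# terminal would deliver it (0xF8 is "ø" in Latin-1).
latin1_bytes = b"f\xf8r"


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding as UTF-8 fails because 0xF8 cannot start a UTF-8 sequence.
print(ascii(try_decode(latin1_bytes, "utf-8")))
# Decoding with the encoding the bytes were actually written in succeeds.
print(ascii(try_decode(latin1_bytes, "latin-1")))
```

Naming the encoding explicitly at the point of decoding avoids depending on the interpreter-wide default.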

The ActivePython 2.3.2 doesn't even seem to understand the 'u' prefix.
So even if I can get this to work on my own Linux machine, it hardly
looks like a portable solution.

Seems like the "solution" is to keep away from letters above ASCII-127,
like we've done since the dawn of computing ...
 

Max M

Leif said:
Is there a way around this problem? My character set in Linux is
ISO-8859-1. In Windows 2000 it should be the equivalent Latin-1, though
I'm not sure about which character set the command shell is using.

The unicode methods seem to do it correctly. So you can decode your
strings to unicode, do the transform, and encode the result back as latin1.

print repr('før'.decode('latin-1').upper().encode('latin-1'))
'F\xd8R'


print repr('FØR'.decode('latin-1').encode('latin-1'))
'F\xd8R'
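
The decode, upper-case, encode round trip above can be wrapped in a small helper. This is only a sketch, assuming Python 3 (bytes in, bytes out); the function name upper_latin1 is hypothetical and not part of the thread.

```python
def upper_latin1(raw):
    """Upper-case Latin-1 encoded bytes by going through Unicode text.

    Caveat: a few Latin-1 letters (for example 0xFF, "ÿ") have uppercase
    forms outside Latin-1, so the final encode step would raise
    UnicodeEncodeError for them.
    """
    text = raw.decode("latin-1")      # bytes -> Unicode text
    return text.upper().encode("latin-1")  # upper-case, then back to bytes


# Byte-wise upper() only touches ASCII letters, so 0xF8 ("ø") is unchanged.
print(b"f\xf8r".upper())
# Going through Unicode maps 0xF8 ("ø") to 0xD8 ("Ø").
print(upper_latin1(b"f\xf8r"))
```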

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
 

Martin v. Löwis

Leif said:
Is there a way around this problem? My character set in Linux is
ISO-8859-1. In Windows 2000 it should be the equivalent Latin-1, though
I'm not sure about which character set the command shell is using.

You need to do locale.setlocale(locale.LC_ALL, "") to get
locale-specific upper-casing.
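
A minimal sketch of that call with basic error handling. The wrapper name activate_user_locale is an assumption for illustration. Note also that this matters for Python 2 byte strings, as used in this thread; in Python 3, str.upper() follows Unicode rules and does not consult the locale.

```python
import locale


def activate_user_locale():
    """Switch from the default "C" locale to the one named by the environment.

    Returns the locale string now in effect, or None if the environment
    (LANG, LC_ALL, ...) names a locale that is not installed.
    """
    try:
        # An empty string means "use the user's default locale settings".
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None


print(activate_user_locale())
```

The call has to run once at program start, before any locale-dependent string operations.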

Notice that things are more difficult in the Windows terminal window,
as this uses an encoding different from the one that the system's
locale functions expect.

Regards,
Martin
 
Leif B. Kristensen

"Martin v. Löwis" wrote:
You need to do locale.setlocale(locale.LC_ALL, "") to get
locale-specific upper-casing.

That makes a lot of sense. Thank you.
'F\xd8R'

I must make a note of the LC_ALL variable in the installation README.
I for one have been running Gentoo Linux for two years without ever
setting the locale, but now I've finally gotten around to writing my
own /etc/env.d/02locale file :)
Notice that things are more difficult in the Windows terminal window,
as this uses an encoding different from the one that the system's
locale functions expect.

The real input will come from a GUI or a browser interface, so the
Windows terminal problem isn't really an issue.
 
