i18n: looking for expertise

klappnase

Hello all,

I am trying to internationalize my Tkinter program using gettext and
encountered various problems, so it looks like it's not a trivial
task.
After some "research" I made up a few rules for a concept that I hope
lets me avoid further encoding trouble, but I would feel more
confident if some of the experts here would have a look at the
thoughts I made so far and told me if I'm still going wrong somewhere
(BTW, the program is supposed to run on linux only). So here is what I
have so far:

1. use unicode instead of byte strings wherever possible. This can be
a little tricky, because in some situations I cannot know in advance
if a certain string is unicode or byte string; I wrote a helper module
for this which defines convenience methods for fail-safe
decoding/encoding of strings and a Tkinter.UnicodeVar class which I
use to convert user input to unicode on the fly (see the code below).

2. so I will have to call gettext.install() with unicode=1

3. make sure to NEVER mix unicode and byte strings within one
expression

4. in order to maintain code readability it's better to risk excess
decode/encode cycles than having one too few.

5. file operations seem to be delicate; at least I got an error when I
passed a filename that contains special characters as unicode to
os.access(), so I guess that whenever I do file operations
(os.remove(), shutil.copy() ...) the filename should be encoded back
into system encoding before; The filename manipulations by the os.path
methods seem to be simply string manipulations so encoding the
filenames doesn't seem to be necessary.

6. messages that are printed to stdout should be encoded first, too;
the same with strings I use to call external shell commands.

############ file UnicodeHandler.py ##################################
# -*- coding: iso-8859-1 -*-
import Tkinter
import sys
import locale
import codecs

def _find_codec(encoding):
    # return True if the requested codec is available, else return False
    try:
        codecs.lookup(encoding)
        return 1
    except LookupError:
        print 'Warning: codec %s not found' % encoding
        return 0

def _sysencoding():
    # try to guess the system default encoding
    try:
        enc = locale.getpreferredencoding().lower()
        if _find_codec(enc):
            print 'Setting locale to %s' % enc
            return enc
    except AttributeError:
        # our python is too old, try something else
        pass
    enc = locale.getdefaultlocale()[1].lower()
    if _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # the last try
    enc = sys.stdin.encoding.lower()
    if _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # aargh, nothing good found, fall back to latin1 and hope for the best
    print 'Warning: cannot find usable locale, using latin-1'
    return 'iso-8859-1'

sysencoding = _sysencoding()

def fsdecode(input, errors='strict'):
    '''Fail-safe decodes a string into unicode.'''
    if not isinstance(input, unicode):
        return unicode(input, sysencoding, errors)
    return input

def fsencode(input, errors='strict'):
    '''Fail-safe encodes a unicode string into system default encoding.'''
    if isinstance(input, unicode):
        return input.encode(sysencoding, errors)
    return input


class UnicodeVar(Tkinter.StringVar):
    def __init__(self, master=None, errors='strict'):
        Tkinter.StringVar.__init__(self, master)
        self.errors = errors
        self.trace('w', self._str2unicode)

    def _str2unicode(self, *args):
        old = self.get()
        if not isinstance(old, unicode):
            new = fsdecode(old, self.errors)
            self.set(new)
#######################################################################

So before I start to mess up all of my code, maybe someone can give me
a hint if I still forgot something I should keep in mind or if I am
completely wrong somewhere.

Thanks in advance

Michael
 
Neil Hodgson

Michael:
5. file operations seem to be delicate; at least I got an error when I
passed a filename that contains special characters as unicode to
os.access(), so I guess that whenever I do file operations
(os.remove(), shutil.copy() ...) the filename should be encoded back
into system encoding before;

This can lead to failure on Windows when the true Unicode file name can
not be encoded in the current system encoding.

Neil
 
klappnase

Neil Hodgson said:
Michael:


This can lead to failure on Windows when the true Unicode file name can
not be encoded in the current system encoding.

Neil

Like I said, it's only supposed to run on linux; anyway, is it likely
that problems will arise when filenames I have to handle have
basically three sources:

1. already existing files

2. automatically generated filenames, which result from adding an
ascii-only suffix to an existing filename (like xy --> xy_bak2)

3. filenames created by user input

?
If yes, how to avoid these?

Any hints are appreciated

Michael
 
Neil Hodgson

Michael:
Like I said, it's only supposed to run on linux; anyway, is it likely
that problems will arise when filenames I have to handle have
basically three sources:
...
3. filenames created by user input

Have you worked out how you want to handle user input that is not
representable in the encoding? It is easy for users to input any characters
into a Unicode enabled UI either through invoking an input method or by
copying and pasting from another application or character chooser applet.

Neil
 
klappnase

Neil Hodgson said:
Michael:


Have you worked out how you want to handle user input that is not
representable in the encoding? It is easy for users to input any characters
into a Unicode enabled UI either through invoking an input method or by
copying and pasting from another application or character chooser applet.

Neil

I must admit, no. I just couldn't imagine that someone would really do this.

I guess I could add a test like (pseudo code):

try:
    test = fsdecode(input)  # convert to unicode
    test.encode(sysencoding)
except UnicodeError:
    # show a message box with something like "Invalid file name"
    pass
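A minimal runnable sketch of such a check, assuming the input has already been converted to unicode. The function name `is_representable` is hypothetical, and the encodings used below are only examples; a real program would pass its detected system encoding.

```python
def is_representable(name, encoding):
    """Return True if the unicode string name can be encoded with encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeError:
        # Raised for characters the target encoding cannot express.
        return False


# A latin-1 system can store an accented name but not a Chinese one.
latin_ok = is_representable(u'caf\xe9', 'iso-8859-1')
cjk_ok = is_representable(u'\u4e2d\u6587', 'iso-8859-1')
```

The caller can then show the "Invalid file name" dialog whenever the function returns False, instead of wrapping every file operation in its own try/except.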

Please tell me if you find any other possible gotchas.

Thanks so far

Michael
 
Serge Orlov

klappnase said:
Hello all,

I am trying to internationalize my Tkinter program using gettext and
encountered various problems, so it looks like it's not a trivial
task.

Considering that you decided to support old python versions, that's
true. Unicode support has improved gradually, so if you choose to
target an old python version, you are basically dealing with
years-old unicode support.

After some "research" I made up a few rules for a concept that I hope
lets me avoid further encoding trouble, but I would feel more
confident if some of the experts here would have a look at the
thoughts I made so far and told me if I'm still going wrong somewhere
(BTW, the program is supposed to run on linux only). So here is what
I have so far:

1. use unicode instead of byte strings wherever possible. This can be
a little tricky, because in some situations I cannot know in advance
if a certain string is unicode or byte string; I wrote a helper
module for this which defines convenience methods for fail-safe
decoding/encoding of strings and a Tkinter.UnicodeVar class which I
use to convert user input to unicode on the fly (see the code below).

I've never used tkinter, but I have heard good things about it. Are you
sure it isn't your own code that makes it return byte strings sometimes?
Anyway, your idea is right: make IO libraries always return unicode.

3. make sure to NEVER mix unicode and byte strings within one
expression

As a rule of thumb you should convert byte strings into unicode
strings at input and back to byte strings at output. This way
the core of your program will have to deal only with unicode
strings.
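A minimal sketch of that rule of thumb: decode once at the input boundary and encode once at the output boundary. The helper names `to_text` and `to_bytes` and the fixed `ENCODING` value are hypothetical, and the try/except around `unicode` is only there so that the same sketch runs on both Python 2 and Python 3.

```python
try:
    text_type = unicode      # Python 2
except NameError:
    text_type = str          # Python 3: str is already unicode

# Assumption: a real program would use its detected system encoding here.
ENCODING = 'utf-8'


def to_text(value, encoding=ENCODING, errors='strict'):
    """Return value as a unicode string, decoding byte strings only."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding, errors)


def to_bytes(value, encoding=ENCODING, errors='strict'):
    """Return value as a byte string, encoding unicode strings only."""
    if isinstance(value, text_type):
        return value.encode(encoding, errors)
    return value


raw = b'caf\xc3\xa9'          # bytes arriving from a file name or a pipe
name = to_text(raw)           # input boundary: bytes -> unicode
backup = name + u'_bak2'      # the core logic sees unicode only
on_disk = to_bytes(backup)    # output boundary: unicode -> bytes
```

Because both helpers pass through values that already have the right type, calling them once too often is harmless, which is the property the "fail-safe" helpers in the original post rely on.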
4. in order to maintain code readability it's better to risk excess
decode/encode cycles than having one too few.

I don't think so. Either you need decode/encode or you don't.

5. file operations seem to be delicate;

You should be ready to handle unicode errors in file operations, just
as you handle, for example, the ENAMETOOLONG error. Any file system
call that takes a path argument can raise it; I don't think anything
changed here with the introduction of unicode. For example, access can
return 11 error codes (on my linux system); consider the unicode error
to be the twelfth.

at least I got an error when I
passed a filename that contains special characters as unicode to
os.access(), so I guess that whenever I do file operations
(os.remove(), shutil.copy() ...) the filename should be encoded back
into system encoding before;

I think python 2.3 handles that for you. (I'm not sure about the
version)
If you have to support older versions, you have to do it yourself.

6. messages that are printed to stdout should be encoded first, too;
the same with strings I use to call external shell commands.

If you use stdout as dump device just install the encoder in the
beginning of your program, something like

sys.stdout = codecs.getwriter(...) ...
sys.stderr = codecs.getwriter(...) ...
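A small sketch of what such a wrapper does, using an in-memory byte stream as a stand-in for a byte-oriented stdout. The 'utf-8' encoding and the `errors='replace'` policy are assumptions for the example; a real program would pass its detected system encoding.

```python
import codecs
import io

# Stand-in for a byte-oriented stream such as Python 2's sys.stdout.
raw_stream = io.BytesIO()

# codecs.getwriter() returns a StreamWriter class; instantiating it wraps
# the stream so that unicode text is encoded on every write.
writer = codecs.getwriter('utf-8')(raw_stream, errors='replace')
writer.write(u'Datei nicht gefunden: caf\xe9\n')

encoded = raw_stream.getvalue()
```

With the real streams the same pattern reads `sys.stdout = codecs.getwriter(encoding)(sys.stdout)`, after which the rest of the program can print unicode without encoding each message by hand.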


Serge.
 
klappnase

I've never used tkinter, but I heard good things about it. Are you
sure it's not you who made it to return byte string sometimes?

Yes, I used a Tkinter.StringVar to keep track of the contents of an
Entry widget; as long as I entered only ascii characters get() returns
a byte string, as soon as a special character is entered it returns
unicode.
Anyway, my UnicodeVar() class seems to be a handy way to avoid
problems here.

I don't think so. Either you need decode/encode or you don't.

I use a bunch of modules that contain helper functions for frequently
repeated tasks. So it sometimes happens for example that I call one of
my module functions to convert user input into unicode and then call
the next module function to convert it back to byte string to start
some file operation; that's what I meant with "excess decode/encode
cycles". However, trying to avoid these ended in totally messing up
the code.

You should be ready to handle unicode errors at file operations as
well as for example ENAMETOOLONG error. Any file system with path
argument can throw it, I don't think anything changed here with
introduction of unicode. For example access can return 11 (on
my linux system) error codes, consider unicode error to be twelfth.


I think python 2.3 handles that for you. (I'm not sure about the
version)
If you have to support older versions, you have to do it yourself.

I am using python-2.3.4 and get unicode errors:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)

Thanks for the feedback

Michael
 
Martin v. Löwis

klappnase said:
I am using python-2.3.4 and get unicode errors:



Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)

That's apparently a bug in os.access, which doesn't support Unicode file
names. As a work around, do

import os
import sys

def access(name, mode, orig=os.access):
    try:
        return orig(name, mode)
    except UnicodeError:
        # retry with the name encoded as a byte string
        return orig(name.encode(sys.getfilesystemencoding()), mode)

os.access = access

Apparently, access is used so rarely that nobody has noticed yet (or
didn't bother to report). os.path.isfile() builds on os.stat(), which
does support Unicode file names.
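The same retry idea can be sketched as a generic wrapper for any call that takes a path as its first argument. The names `with_encoded_fallback` and `picky_access` are hypothetical; `picky_access` only simulates a function that rejects unicode names, so that the fallback path can be exercised without relying on the os.access behaviour of one particular Python version.

```python
import sys


def with_encoded_fallback(func, encoding=None):
    """Wrap func(name, *args) so that a UnicodeError triggers one retry
    with the name encoded as a byte string."""
    if encoding is None:
        # getfilesystemencoding() may return None on old Pythons.
        encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()

    def wrapper(name, *args):
        try:
            return func(name, *args)
        except UnicodeError:
            return func(name.encode(encoding), *args)
    return wrapper


def picky_access(name, mode):
    """Hypothetical stand-in for a call that only accepts byte strings."""
    if not isinstance(name, bytes):
        raise UnicodeError('byte string required')
    return (name, mode)


access = with_encoded_fallback(picky_access, encoding='utf-8')
result = access(u'caf\xe9.txt', 2)
```

Note that the retry itself can still raise a UnicodeError when the name is not representable in the chosen encoding, so callers should be prepared for that case.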

Regards,
Martin
 
klappnase

Martin v. Löwis said:
That's apparently a bug in os.access, which doesn't support Unicode file
names. As a work around, do

def access(name, mode, orig=os.access):
    try:
        return orig(name, mode)
    except UnicodeError:
        return orig(name.encode(sys.getfilesystemencoding()), mode)

os.access = access

Apparently, access is used so rarely that nobody has noticed yet (or
didn't bother to report). os.path.isfile() builds on os.stat(), which
does support Unicode file names.

Regards,
Martin

Ah, thanks!

Now another question arises: you use sys.getfilesystemencoding() to
encode the file name, which looks like the preferred method. However,
when I tried to find out how this works I got a little confused again
(from the library reference):

getfilesystemencoding()

Return the name of the encoding used to convert Unicode filenames into
system file names, or None if the system default encoding is used. The
result value depends on the operating system:
(...)
* On Unix, the encoding is the user's preference according to the
result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET)
failed.


Anyway, my app currently runs with python-2.2 and I would like to keep
it that way if possible, so I wonder which is the preferred
replacement for sys.getfilesystemencoding() on versions < 2.3 , or in
particular, will the method I use to determine "sysencoding" I
described in my original post be safe or are there any traps I missed
(it's supposed to run on linux only)?

Thanks and best regards

Michael
 
stewart.midwinter

Michael:

on my box, (winXP SP2), sys.getfilesystemencoding() returns 'mbcs'.

If you post your revised solution to this unicode problem, I'd be
delighted to test it on Windows. I'm working on a Tkinter front-end
for Vivian deSmedt's rsync.py and would like to address the issue of
accented characters in folder names.

thanks
Stewart
stewart dot midwinter at gmail dot com
 
klappnase

Michael:

on my box, (winXP SP2), sys.getfilesystemencoding() returns 'mbcs'.

Oh, from reading the docs I had thought XP would use unicode:

* On Windows 9x, the encoding is ``mbcs''.
* On Mac OS X, the encoding is ``utf-8''.
* On Unix, the encoding is the user's preference according to the
result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET)
failed.
* On Windows NT+, file names are Unicode natively, so no conversion is
performed.

Maybe that's for compatibility between different Windows flavors.

If you post your revised solution to this unicode problem, I'd be
delighted to test it on Windows. I'm working on a Tkinter front-end
for Vivian deSmedt's rsync.py and would like to address the issue of
accented characters in folder names.

thanks
Stewart
stewart dot midwinter at gmail dot com

I wrote it for use with linux only, and it looks like using the system
encoding as I try to guess it in my UnicodeHandler module (see the
first post) is fine there.

If the filesystemencoding on windows differs from what I get in
UnicodeHandler.sysencoding, I guess I would have to define separate
convenience methods for decoding/encoding filenames, with sysencoding
replaced by sys.getfilesystemencoding(). (I found the need for these
convenience methods when I discovered that some strings I used were
sometimes unicode and sometimes not; I have a lot of interactions
between several modules, which makes it hard to keep track of which
kind I have at any point.)

Tk seems to be pretty smart on handling unicode, so using unicode for
everything that's displayed on tk widgets should be ok (I hope).

So filling a listbox with the contents of a directory "pathname" looks
like this:

pathname = fsencode(pathname)  # make sure it's a byte string, for python2.2 compatibility
flist = map(fsdecode, os.listdir(pathname))
flist.sort()
for item in flist:
    listbox.insert('end', item)
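The listing step can be tried without Tkinter. This is a self-contained sketch, assuming a 'utf-8' file system encoding and a throwaway temporary directory; `list_decoded` is a hypothetical name that stands in for the fsencode/fsdecode pair above.

```python
import os
import tempfile


def list_decoded(pathname, encoding='utf-8'):
    """List a directory via a byte-string path and return sorted unicode names."""
    if not isinstance(pathname, bytes):
        pathname = pathname.encode(encoding)
    # A byte-string argument makes os.listdir() return byte strings.
    names = [entry.decode(encoding) for entry in os.listdir(pathname)]
    names.sort()
    return names


# Try it on a throwaway directory containing two files.
demo_dir = tempfile.mkdtemp()
for filename in (u'b.txt', u'a.txt'):
    open(os.path.join(demo_dir, filename), 'w').close()

flist = list_decoded(demo_dir)
```

A directory entry whose bytes are not valid in the assumed encoding would make the strict decode fail, so a real listing may want to pass an error policy such as 'replace' for display purposes.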

For file operations I have written a separate module which defines
convenience methods like these:

##########################################

def remove_ok(self, filename, verbose=1):
    b, u = fsencode(filename), fsdecode(filename)
    if not os.path.exists(b):
        if verbose:
            # popup a dialog box, similar to tkMessageBox
            MsgBox.showerror(parent=self.parent,
                             message=_('File not found:\n"%s"') % u)
        return 0
    elif os.path.isdir(b):
        if verbose:
            MsgBox.showerror(parent=self.parent,
                             message=_('Cannot delete "%s":\nis a directory') % u)
        return 0
    if not os.access(os.path.dirname(b), os.W_OK):
        if verbose:
            MsgBox.showerror(parent=self.parent,
                             message=_('Cannot delete "%s":\npermission denied.') % u)
        return 0
    return 1

def remove(self, filename, verbose=1):
    b, u = fsencode(filename), fsdecode(filename)
    if self.remove_ok(filename, verbose=verbose):
        try:
            os.remove(b)
            return 1
        except:
            if verbose:
                MsgBox.showerror(parent=self.parent,
                                 message=_('Cannot delete "%s":\npermission denied.') % u)
    return 0

###################################

It looks like you don't need to do any encoding of filenames however,
if you use python2.3 (at least as long as you don't have to call
os.access() ), but I want my code to run with python2.2 ,too.

I hope this answers your question. Unfortunately I cannot post all of
my code here, because it's quite a lot of files, but the basic concept
is still the same as I wrote in the first post.

Best regards

Michael
 
Martin v. Löwis

klappnase said:
'ANSI_X3.4-1968'

In the locale API, you have to do

locale.setlocale(locale.LC_ALL, "")

to activate the user's preferences. Python does that on startup,
but then restores it to the "C" locale, since that is the specified
locale for the beginning of the (Python) program.

Try that before invoking nl_langinfo.

Anyway, my app currently runs with python-2.2 and I would like to keep
it that way if possible, so I wonder which is the preferred
replacement for sys.getfilesystemencoding() on versions < 2.3 , or in
particular, will the method I use to determine "sysencoding" I
described in my original post be safe or are there any traps I missed
(it's supposed to run on linux only)?

I would put an nl_langinfo call in-between, since this is more reliable
than getdefaultlocale (which tries to process environment variables
themselves and happens to crash if they are not in an expected form).

See idlelib/IOBinding.py for the algorithm that I use in IDLE to
determine the "user's" encoding. On most systems, this encoding is
good for usage on the file system API, except for MacOS X, which
uses UTF-8 to encode file names regardless of user or system
settings.

Regards,
Martin
 
Martin v. Löwis

klappnase said:
Oh, from the reading docs I had thought XP would use unicode:

It depends on the API that the application uses. Windows has the
"ANSI" (*) API (e.g. CreateFileA) and the "Unicode" API
(CreateFileW). The ANSI API uses what Python calls the "mbcs"
encoding; Windows calls it the ANSI code page (CP_ACP). The
Unicode API expects WCHAR pointers.

Python uses the *W APIs since Python 2.3 (I believe), except that
maybe os.access was overlooked in 2.3 as well, so it uses the *W
API for access only in 2.4. At run-time, it dynamically decides
which API to use, and uses *W on NT+ (i.e. NT, W2k, WXP, W2k3, ...).

* On Windows 9x, the encoding is ``mbcs''.
Correct.

* On Mac OS X, the encoding is ``utf-8''.
Correct.

* On Unix, the encoding is the user's preference according to the
result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET)
failed.

Correct. In the latter case, it falls back to sys.getdefaultencoding().

When on windows the filesystemencoding differs from what I get in
UnicodeHandler.sysencoding

That could happen on OS X.

Tk seems to be pretty smart on handling unicode, so using unicode for
everything that's displayed on tk widgets should be ok (I hope).

So do I.

Regards,
Martin
 
klappnase

Martin v. Löwis said:
In the locale API, you have to do

locale.setlocale(locale.LC_ALL, "")

to activate the user's preferences. Python does that on startup,
but then restores it to the "C" locale, since that is the specified
locale for the beginning of the (Python) program.

Try that before invoking nl_langinfo.


I would put an nl_langinfo call in-between, since this is more reliable
than getdefaultlocale (which tries to process environment variables
themselves and happens to crash if they are not in an expected form).

Thanks!!

Actually, today I tried my code on another box which still runs
python2.2 and found that my original code crashed, because neither
locale.getpreferredencoding() nor sys.stdin.encoding exists there and
locale.getdefaultlocale()[1] returned 'de'. So I changed my
_sysencoding() function to this:

def _sysencoding():
    # try to guess the system default encoding
    try:
        enc = locale.getpreferredencoding().lower()
        if _find_codec(enc):
            print 'Setting locale to %s' % enc
            return enc
    except AttributeError:
        # our python is too old, try something else
        pass
    locale.setlocale(locale.LC_ALL, "")
    enc = locale.nl_langinfo(locale.CODESET).lower()
    if _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # the last try
    enc = locale.getdefaultlocale()[1].lower()
    if _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # aargh, nothing good found, fall back to latin1 and hope for the best
    print 'Warning: cannot find usable locale, using latin-1'
    return 'iso-8859-1'

See idlelib/IOBinding.py for the algorithm that I use in IDLE to
determine the "user's" encoding.

I guess I should have done so from the beginning.

Thanks again and best regards

Michael
 
Martin v. Löwis

klappnase said:
enc = locale.nl_langinfo(locale.CODESET).lower()

Notice that this may fail on systems which don't provide the
CODESET information. Recent Linux systems (glibc 2, i.e. libc6) have it,
and so do recent Solaris systems, but if you happen to use
an HPUX9 or some such, you find that locale.CODESET raises
an AttributeError.

Regards,
Martin
 
klappnase

Martin v. Löwis said:
Notice that this may fail on systems which don't provide the
CODESET information. Recent Linux systems (glibc 2, i.e. libc6) have it,
and so do recent Solaris systems, but if you happen to use
an HPUX9 or some such, you find that locale.CODESET raises
an AttributeError.

Regards,
Martin

Thanks again,

Things are really tricky and my hair begins to turn gray ;-)
So it seems like I'll have to add another try/except condition (and
now it finally looks pretty much like I had directly copied your code
from IDLE).

Best regards

Michael
 
