Unicode entries on sys.path

Thomas Heller

I was trying to track down a bug in py2exe where the executable did
not work when it was in a directory containing Japanese characters.

Then, I discovered that part of the problem is in the zipimporter that
py2exe uses, and finally I found that it didn't even work in Python
itself.

If the entry in sys.path contains normal Western characters, umlauts for
example, it works fine. But when I copied some Japanese characters from
a random web page and named a directory after them, it no longer
worked.

The Windows command prompt is not able to print these characters,
although Windows Explorer has no problem showing them.

Here's the script; the subdirectory contains the file 'somemodule.py',
but importing it fails:

import sys
sys.path = [u'\u5b66\u6821\u30c7xx']
print sys.path

import somemodule

It seems that Python itself converts Unicode entries in sys.path to
byte strings using Windows' default conversion rules - is this a
problem that I can fix by changing some regional setting on my machine?

Hm, maybe more a Windows question than a Python question...

Thanks,
Thomas
 
"Martin v. Löwis"

Thomas said:
It seems that Python itself converts Unicode entries in sys.path to
byte strings using Windows' default conversion rules - is this a
problem that I can fix by changing some regional setting on my machine?

You can set the system code page on the third tab on the XP
regional settings (character set for non-unicode applications).
This, of course, assumes that there is a character set that supports
all directories in sys.path. If you have Japanese characters on
sys.path, you certainly need to set the system locale to Japanese
(is that CP932?).

Changing this setting requires a reboot.
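That ANSI codepage is also visible from within Python; a quick sketch to inspect it (locale.getpreferredencoding() reflects it on Windows, and the locale's charset elsewhere, so the printed value depends on the machine):

```python
import locale
import sys

# On Windows, this reflects the "system locale" ANSI codepage that the
# regional setting above selects (e.g. cp1252, or cp932 for Japanese);
# on Unix it is derived from the locale environment instead.
ansi = locale.getpreferredencoding()
print(ansi)

# The encoding Python uses for file system names may differ:
print(sys.getfilesystemencoding())
```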

Hm, maybe more a Windows question than a Python question...

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

Regards,
Martin
 
Just

Hm, maybe more a Windows question than a Python question...

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

Works for me on OSX 10.3.6, as it should: prior to using the sys.path
entry, a unicode string is encoded with Py_FileSystemDefaultEncoding.
I'm not sure how well it works together with zipimport, though.
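Py_FileSystemDefaultEncoding is exposed at the Python level as sys.getfilesystemencoding(), so the conversion described above can be sketched directly (the value is platform-dependent: "utf-8" on OS X, "mbcs" on Windows; the path below is just an illustrative entry):

```python
import sys

enc = sys.getfilesystemencoding()
print(enc)

# What happens to a unicode sys.path entry before the file system sees it:
entry = u"/tmp/some_dir"
encoded = entry.encode(enc)
print(repr(encoded))
```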

Just
 
vincent wehren

Just said:
The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.


Works for me on OSX 10.3.6, as it should: prior to using the sys.path
entry, a unicode string is encoded with Py_FileSystemDefaultEncoding.

For this conversion "mbcs" will be used on Windows machines, implying
that such conversions are made using the current system ANSI codepage.
(As a matter of interest: what is this on OSX?) This conversion is
likely to be useless for Unicode directory names containing characters
that do not have a mapping to a character in this particular codepage.
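The failure mode is easy to demonstrate; a minimal sketch, using cp1252 as a stand-in for a Western ANSI codepage and cp932 for the Japanese one (so it runs on any platform, not just Windows):

```python
s = u"\u5b66\u6821"  # Japanese characters from the original example

# A Western ANSI codepage has no mapping for these characters, so the
# implicit conversion simply fails:
try:
    s.encode("cp1252")
except UnicodeEncodeError as exc:
    print("no mapping:", exc)

# The Japanese system codepage (cp932, i.e. Shift-JIS) can represent them:
print(repr(s.encode("cp932")))
```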

The technique described by Martin may solve the problem for what in this
case are Japanese characters, but what if I have directory names from
another language group, such as simplified Chinese, as well?

The only way to get around this is to allow - as Martin suggests -
arbitrary Unicode strings in sys.path on those platforms that may have
Unicode file names.
 
Just

vincent wehren said:
For this conversion "mbcs" will be used on Windows machines, implying
that such conversions are made using the current system ANSI codepage.
(As a matter of interest: what is this on OSX?)

UTF-8.

Just
 
"Martin v. Löwis"

Just said:
Works for me on OSX 10.3.6, as it should: prior to using the sys.path
entry, a unicode string is encoded with Py_FileSystemDefaultEncoding.
I'm not sure how well it works together with zipimport, though.

As Vincent's message already implies, I'm asking for Windows patches.
In a Windows system, there are path names which just *don't have*
a representation in the file system default encoding. So you just
can't use the standard file system API (open, read, write) to access
those files - instead, you have to use specific Unicode variants
of the file system API.

The only operating system in active use that can reliably represent
all file names in the standard API is OS X. Unix can do that as
long as the locale is UTF-8; for all other systems, there are
restrictions when you try to use the file system API to access
files with "funny" characters.
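The narrow/wide split is visible from Python through os.listdir(): given a unicode path it returns unicode names, using the wide API on NT+ (and, as noted above, this also works on OS X or a UTF-8 Unix). A minimal sketch against a temp directory:

```python
import os
import shutil
import tempfile

# A directory whose name has no representation in a Western ANSI codepage:
base = tempfile.mkdtemp()
name = u"\u5b66\u6821"
os.mkdir(os.path.join(base, name))

# A unicode argument asks for unicode names back; on NT+ this goes
# through the wide (*W) file system API rather than the ANSI one:
entries = os.listdir(os.path.join(base, u""))
print(entries)

shutil.rmtree(base)
```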

Regards,
Martin
 
Bengt Richter

Martin v. Löwis said:
You can set the system code page on the third tab on the XP
regional settings (character set for non-unicode applications).
This, of course, assumes that there is a character set that supports
all directories in sys.path. If you have Japanese characters on
sys.path, you certainly need to set the system locale to Japanese
(is that CP932?).

Changing this setting requires a reboot.


The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

What about removable drives? And mountable multiple file system types?
Maybe some collections of potentially homogeneous file system references
such as sys.path need to be virtualized to carry relevant file system
encoding and protocol info etc. That could cover synthetic or compressed
info sources too, IWT. Homogeneous package representation could be a similar
problem, I guess.

Regards,
Bengt Richter
 
Thomas Heller

Martin v. Löwis said:
You can set the system code page on the third tab on the XP
regional settings (character set for non-unicode applications).
This, of course, assumes that there is a character set that supports
all directories in sys.path. If you have Japanese characters on
sys.path, you certainly need to set the system locale to Japanese
(is that CP932?).

Changing this setting requires a reboot.


The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

How should these patches be approached? On Windows, it would probably
be easiest to use the MS generic text routines: _tcslen instead of
strlen, for example, and to rely on the _UNICODE preprocessor symbol to
map this function to strlen or wcslen. Is there a similar thing in the
non-Windows world?

Thomas
 
"Martin v. Löwis"

Bengt said:
What about removable drives? And mountable multiple file system types?

I'm not sure I understand the question. What about them?

On Windows, a removable drive will typically have its file names encoded
in UCS-2LE (i.e. "Unicode proper"), through the vfat, ntfs, or joliet
file systems. So if a Unicode file name in sys.path refers to them, and
a proper patch to use wide APIs is incorporated in Python, Python will
transparently find the files on these media.

Maybe some collections of potentially homogeneous file system references
such as sys.path need to be virtualized to carry relevant file system
encoding and protocol info etc.

No no no. sys.path contains path names on the local system, nothing
virtualized (unless one of the existing hook mechanisms is used, which
would be OT for this thread).

Regards,
Martin
 
"Martin v. Löwis"

Thomas said:
How should these patches be approached?

Please have a look at how posixmodule.c and fileobject.c deal with
this issue.

On Windows, it would probably
be easiest to use the MS generic text routines: _tcslen instead of
strlen, for example, and to rely on the _UNICODE preprocessor symbol to
map this function to strlen or wcslen.

No. This fails for two reasons:
1. We don't compile Python with _UNICODE, and never will do so. This
macro is only a mechanism to simplify porting code from ANSI APIs
to Unicode APIs, so you don't have to reformulate all the API calls.
For new code, it is better to use the Unicode APIs directly if you
plan to use them.
2. On Win9x, the Unicode APIs don't work (*). So you need to choose at
run-time whether you want to use wide or narrow API. Unless
a) we ship two binaries in the future, one for W9x, one for NT+
(I hope this won't happen), or
b) we drop support for W9x. I'm in favour of doing so sooner or
later, but perhaps not for Python 2.5.

Regards,
Martin

(*) Can somebody please report whether the *W file APIs fail on W9x
because the entry points are not there (so you can't even run the
binary), or because they fail with an error when called?
 
Thomas Heller

Martin v. Löwis said:
Please have a look at how posixmodule.c and fileobject.c deal with
this issue.


No. This fails for two reasons:
1. We don't compile Python with _UNICODE, and never will do so. This
macro is only a mechanism to simplify porting code from ANSI APIs
to Unicode APIs, so you don't have to reformulate all the API calls.
For new code, it is better to use the Unicode APIs directly if you
plan to use them.
2. On Win9x, the Unicode APIs don't work (*). So you need to choose at
run-time whether you want to use wide or narrow API. Unless
a) we ship two binaries in the future, one for W9x, one for NT+
(I hope this won't happen), or
b) we drop support for W9x. I'm in favour of doing so sooner or
later, but perhaps not for Python 2.5.

I wasn't asking about the *W functions, I was asking about string/unicode
handling in the Python source files. Looking into Python/import.c, wouldn't
it be required to change the signatures of a lot of functions to receive
PyObject* arguments instead of char*?
For example, find_module should change from
static struct filedescr *find_module(char *, char *, PyObject *,
char *, size_t, FILE **, PyObject **);

to

static struct filedescr *find_module(char *, char *, PyObject *,
PyObject **, FILE **, PyObject **);

where the fourth argument would now be either a PyString or PyUnicode
object pointer?
(*) Can somebody please report whether the *W file APIs fail on W9x
because the entry points are not there (so you can't even run the
binary), or because they fail with an error when called?

I always thought that the *W APIs would not be there in Win98, but it
seems that is wrong. Fortunately so - how could Python, which links to
the exported FindFirstFileW function, for example, run on Win98 otherwise...

Thomas
 
"Martin v. Löwis"

Thomas said:
I wasn't asking about the *W functions, I was asking about string/unicode
handling in the Python source files. Looking into Python/import.c, wouldn't
it be required to change the signatures of a lot of functions to receive
PyObject* arguments instead of char*?

Yes, that would be one solution. Another solution would be to provide an
additional Py_UNICODE*, and to allow that pointer to be NULL. Most
systems would ignore that pointer (and it would be NULL most of the
time), except on NT+, which would use the Py_UNICODE* if available,
and the char* otherwise.
I always thought that the *W APIs would not be there in Win98, but it
seems that is wrong. Fortunately so - how could Python, which links to
the exported FindFirstFileW function, for example, run on Win98 otherwise...

Thanks, that is convincing.

Regards,
Martin
 
vincent wehren

Thomas said:
I wasn't asking about the *W functions, I was asking about string/unicode
handling in the Python source files. Looking into Python/import.c, wouldn't
it be required to change the signatures of a lot of functions to receive
PyObject* arguments instead of char*?
For example, find_module should change from
static struct filedescr *find_module(char *, char *, PyObject *,
char *, size_t, FILE **, PyObject **);

to

static struct filedescr *find_module(char *, char *, PyObject *,
PyObject **, FILE **, PyObject **);

where the fourth argument would now be either a PyString or PyUnicode
object pointer?

I always thought that the *W APIs would not be there in Win98, but it
seems that is wrong. Fortunately so - how could Python, which links to
the exported FindFirstFileW function, for example, run on Win98 otherwise...

Normally I would have thought this would require using the Microsoft
Layer for Unicode (unicows.dll).

According to MSDN, Win9x already has a handful of Unicode APIs.

FindFirstFile does not seem to be one of them - unless the list at

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp

is bogus (?).
 
"Martin v. Löwis"

vincent said:
FindFirstFile does not seem to be one of them - unless the list at

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp

is bogus (?).

It might perhaps be misleading: I think the entry points are there, but
calling the functions will always fail.

Regards,
Martin
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

No members online now.

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,810
Latest member
Kassie0918

Latest Threads

Top