LC_ALL and os.listdir()

Kenneth Pronovici

I have some confusion regarding the relationship between locale,
os.listdir() and unicode pathnames. I'm running Python 2.3.5 on a
Debian system. If it matters, all of the files I'm dealing with are on
an ext3 filesystem.

The real code this problem comes from takes a configured set of
directories to deal with and walks through each of those directories
using os.listdir().

Today, I accidentally ran across a directory containing three "normal"
files (with ASCII filenames) and one file with a two-character unicode
filename. My code, which was doing something like this:

for entry in os.listdir(path):  # path is <type 'unicode'>
    entrypath = os.path.join(path, entry)

suddenly started blowing up with the dreaded unicode error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
position 1: ordinal not in range(128)

To add insult to injury, it only happened for one of my test users, not
the others.

I ultimately traced the difference in behavior to the LC_ALL setting in
the environment. One user had LC_ALL set to en_US, and the other didn't
have it set at all.

For the user with LC_ALL set, the os.listdir() call returned this, and
the os.path.join() call succeeded:

[u'README.strange-name', u'\xe2\x99\xaa\xe2\x99\xac',
u'utflist.long.gz', u'utflist.cp437.gz', u'utflist.short.gz']

For the other user without LC_ALL set, the os.listdir() call returned
this, and the os.path.join() call failed with the UnicodeDecodeError
exception:

[u'README.strange-name', '\xe2\x99\xaa\xe2\x99\xac',
u'utflist.long.gz', u'utflist.cp437.gz', u'utflist.short.gz']

Note that in this second result, element [1] is not a unicode string
while the other three elements are.
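
A minimal sketch of why the join fails, assuming the stray name is UTF-8. In Python 2, mixing a unicode string with a byte string makes Python decode the bytes with the ASCII codec implicitly; the sketch below makes that decode explicit so it also runs on current Python. The helper name try_decode is hypothetical.

```python
# The raw name as listdir() returned it for the user without LC_ALL.
raw_name = b'\xe2\x99\xaa\xe2\x99\xac'


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes do not fit the encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII cannot represent byte 0xe2, which is what the traceback complains about.
ascii_result = try_decode(raw_name, 'ascii')

# Assumption: the name is really UTF-8, in which case it decodes cleanly.
utf8_result = try_decode(raw_name, 'utf-8')

print(ascii_result is None, utf8_result is not None)
```

With a UTF-8 locale the same bytes decode without complaint, which matches the difference seen between the two users.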

Can anyone explain:

1) Why LC_ALL has any effect on the os.listdir() result?
2) Why only 3 of the 4 files come back as unicode strings?
3) The proper "general" way to deal with this situation?

My goal is to build generalized code that consistently works with all
kinds of filenames. Ultimately, all I'm trying to do is copy some files
around. I'd really prefer to find a programmatic way to make this work
that was independent of the user's configured locale, if possible.

Thanks for the help,

KEN
 
Martin v. Löwis

Kenneth said:
1) Why LC_ALL has any effect on the os.listdir() result?

The operating system (POSIX) does not have the inherent notion
that file names are character strings. Instead, in POSIX, file
names are primarily byte strings. There are some bytes which
are interpreted as characters (e.g. '\x2e', which is '.',
or '\x2f', which is '/'), but apart from that, most OS layers
think these are just bytes.

Now, most *people* think that file names are character strings.
To interpret a file name as a character string, you need to know
what the encoding is to interpret the file names (which are byte
strings) as character strings.

There is, unfortunately, no operating system API to carry
the notion of a file system encoding. By convention, the locale
settings should be used to establish this encoding, in particular
the LC_CTYPE facet of the locale. This is defined in the
environment variables LC_ALL, LC_CTYPE, and LANG (searched
in this order).
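
The lookup can be sketched roughly as follows. Under POSIX, LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG, and an empty value counts as unset. The function name effective_ctype_locale is hypothetical; Python's own locale module does the real work.

```python
import os


def effective_ctype_locale(environ=None):
    """Return the locale name that governs LC_CTYPE, or "C" if none is set."""
    if environ is None:
        environ = os.environ
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = environ.get(name)
        if value:  # an empty value counts as unset
            return value
    return 'C'


print(effective_ctype_locale({'LC_ALL': 'en_US'}))
print(effective_ctype_locale({}))
```

The first call mirrors the user with LC_ALL=en_US; the second mirrors the user with nothing set, who ends up in the "C" locale.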

2) Why only 3 of the 4 files come back as unicode strings?

If none of these variables is set, the "C" locale is assumed,
which uses ASCII as its file system encoding. In this locale,
'\xe2\x99\xaa\xe2\x99\xac' is not a valid file name (at least
it cannot be interpreted as characters, and hence cannot
be converted to Unicode).

Now, your Python script has requested that all file names
*should* be returned as character (i.e. Unicode) strings, but
Python cannot comply, since there is no way to find out what
this byte string means, in terms of characters.

So we have three options:
1. skip this string, only return the ones that can be
converted to Unicode. Give the user the impression
the file does not exist.
2. return the string as a byte string
3. refuse to listdir altogether, raising an exception
(i.e. return nothing)

Python has chosen alternative 2, allowing the application
to implement 1 or 3 on top of that if it wants to (or
come up with other strategies, such as user feedback).
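
A rough sketch of the three options, applied to names that are already byte strings. The function decode_names and its strategy values are hypothetical names for illustration, not anything os.listdir() exposes.

```python
def decode_names(raw_names, encoding, strategy='keep'):
    """Decode byte-string names, handling undecodable ones per strategy.

    strategy is 'skip' (option 1), 'keep' (option 2) or 'raise' (option 3).
    """
    if strategy not in ('skip', 'keep', 'raise'):
        raise ValueError('unknown strategy: %r' % (strategy,))
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            if strategy == 'raise':
                raise
            if strategy == 'keep':
                result.append(raw)  # leave it as a byte string
            # 'skip': drop the name silently
    return result


names = [b'README.strange-name', b'\xe2\x99\xaa\xe2\x99\xac']
print(decode_names(names, 'ascii', 'keep'))
```

With 'keep' the result is a mixed list, as in the listing from the user without LC_ALL; an application that prefers option 1 or 3 can pass 'skip' or 'raise' instead.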

3) The proper "general" way to deal with this situation?

You can choose option 1 or 3; you could tell the user
about it and then ignore the file, or you could try to
guess the encoding (UTF-8 would be a reasonable guess).
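
Guessing might look like the sketch below: try a short list of candidate encodings and report which one worked. The name guess_decode and the default candidate list are assumptions for illustration; a successful decode does not prove the guess was right, only that the bytes are valid in that encoding.

```python
def guess_decode(raw, candidates=('utf-8',)):
    """Try each candidate encoding in turn.

    Returns (text, encoding) on success, or (None, None) if none fits.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


text, used = guess_decode(b'\xe2\x99\xaa\xe2\x99\xac')
print(used)
```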

My goal is to build generalized code that consistently works with all
kinds of filenames.

Then it is best to drop the notion that file names are
character strings (because some file names aren't). You
do so by converting your path variable into a byte
string. To do that, you could try

path = path.encode(sys.getfilesystemencoding())

This should work in most cases; Python will try to
determine the file system encoding from the environment,
and encode the path with it. Notice, however:

- on some systems, getfilesystemencoding() may return None
  if the encoding could not be determined. Fall back
  to sys.getdefaultencoding() in this case.
- depending on where you got path from, this may
  raise a UnicodeError if the user has entered a
  path name which cannot be encoded in the file system
  encoding (the user may well believe that she has
  such a file on disk).

So your code would read

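import sys

# Encode the unicode path as a byte string, using the file system
# encoding or, if that is unknown, the default encoding.
try:
    path = path.encode(sys.getfilesystemencoding() or
                       sys.getdefaultencoding())
except UnicodeError:
    print >>sys.stderr, "Invalid path name", repr(path)
    sys.exit(1)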

Ultimately, all I'm trying to do is copy some files
around. I'd really prefer to find a programmatic way to make this work
that was independent of the user's configured locale, if possible.

As long as you manage to get a byte string from the path
entered, all should be fine.
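
A sketch of the directory walk done entirely in byte strings, under the assumption that the path can be encoded as described above. The helper names to_bytes_path and list_entry_paths are hypothetical, and the demo uses a scratch directory rather than the real one. It is written so that it also runs on current Python, where os.listdir() likewise returns byte strings when given a byte-string path.

```python
import os
import shutil
import sys
import tempfile


def to_bytes_path(path):
    """Encode a text path as a byte string; byte strings pass through."""
    if isinstance(path, bytes):
        return path
    encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
    return path.encode(encoding)  # may raise UnicodeError


def list_entry_paths(path):
    """Return the full paths of the entries in path, all as byte strings."""
    path = to_bytes_path(path)
    # Given a byte-string argument, os.listdir() returns byte strings,
    # so os.path.join() never has to mix the two string types.
    return [os.path.join(path, entry) for entry in os.listdir(path)]


# Demo in a scratch directory with one plain and one non-ASCII name.
demo_dir = tempfile.mkdtemp()
try:
    for name in (b'plain.txt', b'\xe2\x99\xaa\xe2\x99\xac'):
        open(os.path.join(to_bytes_path(demo_dir), name), 'wb').close()
    paths = list_entry_paths(demo_dir)
finally:
    shutil.rmtree(demo_dir)

print(sorted(os.path.basename(p) for p in paths))
```

Because nothing is ever decoded, the result does not depend on LC_ALL; the locale only matters for the one-time encoding of the configured path.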

Regards,
Martin
 
Serge Orlov

Martin v. Löwis said:
Then it is best to drop the notion that file names are
character strings (because some file names aren't). You
do so by converting your path variable into a byte
string. To do that, you could try

path = path.encode(sys.getfilesystemencoding())

Shouldn't os.path.join do that? If you pass a unicode string
and a byte string it currently tries to convert bytes to characters
but it makes more sense to convert the unicode string to bytes
and return two byte strings concatenated.

Serge.
 
Martin v. Löwis

Serge said:
Shouldn't os.path.join do that? If you pass a unicode string
and a byte string it currently tries to convert bytes to characters
but it makes more sense to convert the unicode string to bytes
and return two byte strings concatenated.

Sounds reasonable. OTOH, this would be the only (one of a very
few?) occasion where Python combines byte+unicode => byte.
Furthermore, it might be that the conversion of the Unicode
string to a file name fails as well.

That said, I still think it is a good idea, so contributions
are welcome.

Regards,
Martin
 
Kenneth Pronovici

So we have three options:
1. skip this string, only return the ones that can be
converted to Unicode. Give the user the impression
the file does not exist.
2. return the string as a byte string
3. refuse to listdir altogether, raising an exception
(i.e. return nothing)

Python has chosen alternative 2, allowing the application
to implement 1 or 3 on top of that if it wants to (or
come up with other strategies, such as user feedback).

Understood. This appears to be the most flexible solution among the
three.

3) The proper "general" way to deal with this situation?

You can choose option 1 or 3; you could tell the user
about it and then ignore the file, or you could try to
guess the encoding (UTF-8 would be a reasonable guess).

Ok.

My goal is to build generalized code that consistently works with all
kinds of filenames.

Then it is best to drop the notion that file names are
character strings (because some file names aren't). You
do so by converting your path variable into a byte
string. To do that, you could try [snip]

So your code would read

try:
    path = path.encode(sys.getfilesystemencoding() or
                       sys.getdefaultencoding())
except UnicodeError:
    print >>sys.stderr, "Invalid path name", repr(path)
    sys.exit(1)

This makes sense to me. I'll work on implementing it that way.

Thanks for the in-depth explanation!

KEN

--
Kenneth J. Pronovici <[email protected]>
Personal Homepage: http://www.skyjammer.com/~pronovic/
"They that can give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
- Benjamin Franklin, Historical Review of Pennsylvania, 1759
 
Duncan Booth

Martin said:
Sounds reasonable. OTOH, this would be the only (one of a very
few?) occasion where Python combines byte+unicode => byte.
Furthermore, it might be that the conversion of the Unicode
string to a file name fails as well.

That said, I still think it is a good idea, so contributions
are welcome.

It would probably mess up those systems where filenames really are unicode
strings and not byte sequences.

Windows (when using NTFS) stores all the filenames in Unicode, and Python
uses the Unicode API to implement listdir (when given a unicode path). This
means that the filename never gets encoded to a byte string either by the
OS or Python. If you use a byte string path, then the filename gets encoded
by Windows and Python just returns what it is given.
 
Serge Orlov

Duncan said:
It would probably mess up those systems where filenames really are
unicode strings and not byte sequences.

Windows (when using NTFS) stores all the filenames in unicode, and
Python uses the unicode api to implement listdir (when given a
unicode path). This means that the filename never gets encoded to
a byte string either by the OS or Python. If you use a byte string
path then the filename gets encoded by Windows and Python just
returns what it is given.

Sorry for not being clear, but I meant posixpath.join, since the whole
discussion is about POSIX systems.

Serge.
 
Martin v. Löwis

Duncan said:
Windows (when using NTFS) stores all the filenames in unicode, and Python
uses the unicode api to implement listdir (when given a unicode path). This
means that the filename never gets encoded to a byte string either by the
OS or Python. If you use a byte string path then the filename gets encoded
by Windows and Python just returns what it is given.

Serge's answer is good: you might only want to apply this algorithm to
posixpath. OTOH, in the specific case, it would not have caused problems
if it were applied to ntpath as well: the path was a Unicode string, so
listdir would have returned only Unicode strings (on Windows), and the
code in path.join dealing with mixed string types would not have been
triggered.

Again, I think the algorithm should be this:
- if both are the same kind of string, just concatenate them
- if not, try to coerce the byte string to a Unicode string, using
  sys.getfilesystemencoding()
- if that fails, try the other way 'round
- if that fails, let join fail.
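
The steps above could be sketched like this. It is only an illustration of the proposed algorithm, not how os.path.join actually behaves; the name join_mixed and its optional encoding parameter are hypothetical, and the special case for an "undefined" default encoding is left out.

```python
import os.path
import sys


def join_mixed(a, b, encoding=None):
    """Join two path components that may differ in string type."""
    if encoding is None:
        encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
    a_is_bytes = isinstance(a, bytes)
    b_is_bytes = isinstance(b, bytes)
    if a_is_bytes == b_is_bytes:
        return os.path.join(a, b)  # same kind of string: plain join
    try:
        # First attempt: coerce the byte string to a text string.
        if a_is_bytes:
            return os.path.join(a.decode(encoding), b)
        return os.path.join(a, b.decode(encoding))
    except UnicodeDecodeError:
        pass
    # Second attempt: the other way round, text string to byte string.
    # A UnicodeEncodeError raised here propagates, so the join fails.
    if a_is_bytes:
        return os.path.join(a, b.encode(encoding))
    return os.path.join(a.encode(encoding), b)


print(repr(join_mixed(u'dir', b'\xe2\x99\xaa\xe2\x99\xac', 'ascii')))
```

In the case from this thread (text directory, undecodable byte-string entry, ASCII encoding), the first attempt fails and the result is a byte string, so the original loop would have kept working.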

The only drawback I can see with that approach is that it would "break"
environments where the default encoding is "undefined", i.e. implicit
string/unicode coercions are turned off. In such an environment, it
is probably desirable that os.path.join performs no coercion as well,
so this might need to get special-cased.

Regards,
Martin
 
