G
gabor
hi,
from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:
"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."
i'm on Unix. (linux, ubuntu edgy)
so it seems that it does not always return unicode filenames.
it seems that it tries to interpret the filenames using the filesystem's
encoding, and if that fails, it simply returns the filename as byte-string.
so you get back let's say an array of 21 filenames, from which 3 are
byte-strings, and the rest unicode strings.
after digging around, i found this in the source code:
so if the to-unicode-conversion fails, it falls back to the original
byte-string. i went and have read the patch-discussion.
and now i'm not sure what to do.
i know that:
1. the documentation is completely wrong. it does not always return
unicode filenames
2. it's true that the documentation does not specify what happens if the
filename is not in the filesystem-encoding, but i simply expected that i
get an Unicode-exception, as everywhere else. you see, exceptions are
ok, i can deal with them. but this is just plain wrong. from now on,
EVERYWHERE where i use os.listdir, i will have to go through all the
filenames in it, and check if they are unicode-strings or not.
so basically i'd like to ask here: am i reading something incorrectly?
or am i using os.listdir the "wrong way"? how do other people deal with
this?
p.s: one additional note. if you code expects os.listdir to return
unicode, that usually means that all your code uses unicode strings.
which in turn means, that those filenames will somehow later interact
with unicode strings. which means that that byte-string-filename will
probably get auto-converted to unicode at a later point, and that
auto-conversion will VERY probably fail, because the auto-convert only
happens using 'ascii' as the encoding, and if it was not possible to
decode the filename inside listdir, it's quite probable that it also
will not work using 'ascii' as the charset.
gabor
from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:
"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."
i'm on Unix. (linux, ubuntu edgy)
so it seems that it does not always return unicode filenames.
it seems that it tries to interpret the filenames using the filesystem's
encoding, and if that fails, it simply returns the filename as byte-string.
so you get back let's say an array of 21 filenames, from which 3 are
byte-strings, and the rest unicode strings.
after digging around, i found this in the source code:
#ifdef Py_USING_UNICODE
if (arg_is_unicode) {
PyObject *w;
w = PyUnicode_FromEncodedObject(v,
Py_FileSystemDefaultEncoding,
"strict");
if (w != NULL) {
Py_DECREF(v);
v = w;
}
else {
/* fall back to the original byte string, as
discussed in patch #683592 */
PyErr_Clear();
}
}
#endif
so if the to-unicode-conversion fails, it falls back to the original
byte-string. i went and have read the patch-discussion.
and now i'm not sure what to do.
i know that:
1. the documentation is completely wrong. it does not always return
unicode filenames
2. it's true that the documentation does not specify what happens if the
filename is not in the filesystem-encoding, but i simply expected that i
get an Unicode-exception, as everywhere else. you see, exceptions are
ok, i can deal with them. but this is just plain wrong. from now on,
EVERYWHERE where i use os.listdir, i will have to go through all the
filenames in it, and check if they are unicode-strings or not.
so basically i'd like to ask here: am i reading something incorrectly?
or am i using os.listdir the "wrong way"? how do other people deal with
this?
p.s: one additional note. if you code expects os.listdir to return
unicode, that usually means that all your code uses unicode strings.
which in turn means, that those filenames will somehow later interact
with unicode strings. which means that that byte-string-filename will
probably get auto-converted to unicode at a later point, and that
auto-conversion will VERY probably fail, because the auto-convert only
happens using 'ascii' as the encoding, and if it was not possible to
decode the filename inside listdir, it's quite probable that it also
will not work using 'ascii' as the charset.
gabor