Encoding of file names

utabintarbo · Dec 8, 2005

Here is my situation:

I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.

Help me, before my thin veneer of genius is torn from my boss's eyes!
;-)

Peter Hansen · Dec 8, 2005

utabintarbo said:
I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.

I'm not sure of the answer, but note that .isfile() is not just checking
whether the filename is valid, it's checking that something *exists*
with that name, and that it is a file. Big difference... at least in
telling you where to look for the solution. In this case, checking
which of the two tests in ntpath.isfile() is actually failing might be a
first step if you don't have some other lead. (ntpath is what os.path
translates into on Windows, so look for ntpath.py in the Python lib folder.)
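
To make the "which of the two tests is failing" check concrete, here is a minimal sketch in modern Python 3 syntax (the code elsewhere in this thread is Python 2). It assumes the classic two-step shape of isfile(), an os.stat() call followed by a stat.S_ISREG() test; the helper name explain_isfile is made up for illustration and is not part of the standard library.

```python
import os
import stat
import tempfile


def explain_isfile(path):
    """Report which step of an isfile()-style check fails for path.

    Hypothetical helper: it repeats the two steps (stat the path, then
    test for a regular file) instead of calling os.path.isfile() itself.
    """
    try:
        st = os.stat(path)
    except OSError as exc:
        # Step 1 failed: the OS could not find or access the name at all.
        return "stat failed: %s" % exc.__class__.__name__
    if not stat.S_ISREG(st.st_mode):
        # Step 2 failed: the name exists but is not a regular file.
        return "exists, but is not a regular file"
    return "regular file"


with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "present.model")
    open(real, "w").close()
    results = {
        "file": explain_isfile(real),
        "dir": explain_isfile(tmp),
        "missing": explain_isfile(os.path.join(tmp, "absent.model")),
    }
```

If the first step is the one that fails for a name that listdir() just returned, the name being passed back is probably not the name the file server knows.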

If you're really seeing what you're seeing, I suspect a bug, since if
os.listdir() can find it (and it's really a file), os.path.isfile()
should report it as a file, I would think.

-Peter

Peter Otten · Dec 8, 2005

utabintarbo said:
I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.

Does the problem persist if you feed os.listdir() a unicode path?
This will cause listdir() to return unicode filenames which are less prone
to encoding confusion.
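
For readers on Python 3, where every str is already Unicode, the same idea can be sketched as follows. As far as the os.listdir() documentation describes, the type of the argument decides the type of the returned names; the temporary directory and the file name plain.model are placeholders for illustration only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.model"), "w").close()

    # str path in, str (Unicode) names out: the analogue of passing a
    # unicode path in Python 2.
    text_names = os.listdir(tmp)

    # bytes path in, bytes names out: the analogue of the plain 8-bit
    # strings that caused the confusion in this thread.
    byte_names = os.listdir(os.fsencode(tmp))
```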

Peter

Kent Johnson · Dec 8, 2005

utabintarbo said:
Here is my situation:

I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.

Just to eliminate the obvious, you are calling os.path.join() with the
parent name before calling isfile(), yes? Something like

    for f in os.listdir(someDir):
        fp = os.path.join(someDir, f)
        if os.path.isfile(fp):
            ...
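
A small self-contained sketch (Python 3 syntax) of why the join matters: listdir() returns bare names, and a bare name is resolved against the current working directory rather than the directory that was listed. The directory and the file name below are placeholders invented for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as some_dir:
    # Placeholder file; the odd name makes a clash with the cwd unlikely.
    open(os.path.join(some_dir, "only_in_temp_dir_8c1f.model"), "w").close()
    names = os.listdir(some_dir)

    # Bare names are looked up relative to the current working directory.
    bare = [os.path.isfile(name) for name in names]

    # Joined names point into the directory that was actually listed.
    joined = [os.path.isfile(os.path.join(some_dir, name)) for name in names]
```

Here bare ends up False and joined True for the same file, which is the "obvious" failure mode worth ruling out first.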

Kent

Fredrik Lundh · Dec 8, 2005

utabintarbo said:
I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')

how did you print that name? "\xa6" is a "broken vertical bar", which, as
far as I know, is a valid filename character under both Unix and Windows.

if DIR is a variable that points to the remote directory, what does this
print:

import os
files = os.listdir(DIR)
file = files[0]
print file
print repr(file)
fullname = os.path.join(DIR, file)
print os.path.isfile(fullname)
print os.path.isdir(fullname)

(if necessary, replace [0] with an index that corresponds to one of
the problematic filenames)

when you've tried that, try this variation (only the listdir line has
changed):

import os
files = os.listdir(unicode(DIR)) # <-- this line has changed
file = files[0]
print file
print repr(file)
fullname = os.path.join(DIR, file)
print os.path.isfile(fullname)
print os.path.isdir(fullname)

</F>

utabintarbo · Dec 8, 2005

Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>

I believe that may do the trick. Here are the results of running your
code:

>>> DIR = os.getcwd()
>>> files = os.listdir(DIR)
>>> file = files[-1]
>>> file
'L07JS41C.04389525AA.QTR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model'
>>> print file
L07JS41C.04389525AA.QTRªINR.EªC-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
False
>>> print os.path.isdir(fullname)
False
>>> files = os.listdir(unicode(DIR))
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR¦INR.E¦C-P.D11.081305.P2.KPF.model
>>> print repr(file)
u'L07JS41C.04389525AA.QTR\u2592INR.E\u2524C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True   <--- Success!
>>> print os.path.isdir(fullname)
False

Thanks to all who posted.

Martin v. Löwis · Dec 8, 2005

utabintarbo said:
Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>

I believe that may do the trick. Here are the results of running your
code:

For all those who followed this thread, here is some more explanation:

Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT,
a vertical line in the middle, plus a line from that going left) into
a file name. How he managed to do that, I can only guess: most likely,
the Samba installation assumes that the file system encoding on
the Solaris box is some IBM code page (say, CP 437 or CP 850). If so,
the bytes on disk would be \xb1 and \xb4. Where these came from, I have
to guess further: perhaps they are PLUS-MINUS SIGN and ACUTE ACCENT from
ISO-8859-1.

Anyway, when he used listdir() to get the contents of the directory,
Windows applies the CP_ACP encoding (known as "mbcs" in Python).
For reasons unknown to me, the US and several European versions
of XP map both characters to \xa6, BROKEN BAR (I can somewhat see that
as meaningful for U+2524, but not for U+2592).

So when he then applies isfile to that file name, \xa6 is mapped
to U+00A6, which then isn't found on the Samba side.

So while Unicode here is the solution, the problem is elsewhere;
most likely in a misconfiguration of the Samba server (which assumes
some encoding for the files on disk, yet the AIX application
uses a different encoding).
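
The chain described above can be replayed in a few lines of Python 3. This is only a sketch under stated assumptions: the on-disk bytes \xb1 and \xb4 are what CP 437 gives for the two characters, and the OBSERVED_FALLBACK table is a stand-in for the lossy ANSI conversion Windows performed in this thread. Python's own cp1252 codec does not substitute like that; it raises an error instead.

```python
# Assumed bytes on the Solaris disk (CP 437 values of the two characters).
on_disk = b"QTR\xb1INR.E\xb4C-P"

# Samba, assuming an IBM code page, presents them as U+2592 and U+2524.
as_seen_by_windows = on_disk.decode("cp437")

# Stand-in for the lossy conversion to the ANSI code page. The table is an
# assumption that reproduces the \xa6 bytes reported in this thread.
OBSERVED_FALLBACK = {0x2592: 0xA6, 0x2524: 0xA6}
ansi_name = as_seen_by_windows.translate(OBSERVED_FALLBACK).encode("cp1252")

# Handing the 8-bit name back makes Windows decode it again, to U+00A6.
requested = ansi_name.decode("cp1252")

# The name asked for is no longer the name that was listed.
round_trips = requested == as_seen_by_windows
```

round_trips comes out False: the name requested contains U+00A6 twice, while the server only knows a name containing U+2592 and U+2524.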

Regards,
Martin

Tom Anderson · Dec 9, 2005

For all those who followed this thread, here is some more explanation:

Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT, a
vertical line in the middle, plus a line from that going left) into a
file name. How he managed to do that, I can only guess: most likely, the
Samba installation assumes that the file system encoding on the Solaris
box is some IBM code page (say, CP 437 or CP 850). If so, the bytes on
disk would be \xb1 and \xb4. Where these came from, I have to guess
further: perhaps they are PLUS-MINUS SIGN and ACUTE ACCENT from ISO-8859-1.

Anyway, when he used listdir() to get the contents of the directory,
Windows applies the CP_ACP encoding (known as "mbcs" in Python). For
reasons unknown to me, the US and several European versions of XP map
both characters to \xa6, BROKEN BAR (I can somewhat see that as meaningful
for U+2524, but not for U+2592).

So when he then applies isfile to that file name, \xa6 is mapped to
U+00A6, which then isn't found on the Samba side.

So while Unicode here is the solution, the problem is elsewhere; most
likely in a misconfiguration of the Samba server (which assumes some
encoding for the files on disk, yet the AIX application uses a different
encoding).

Isn't the key thing that Windows is applying a non-roundtrippable
character encoding? If i've understood this right, Samba and Windows are
talking in unicode, with these (probably quite spurious, but never mind)
U+25xx characters, and Samba is presenting a quite consistent view of the
world: there's a file called "double bucky backslash grey box" in the
directory listing, and if you ask for a file called "double bucky backslash
grey box", you get it. Windows, however, maps that name to the 8-bit
string "double bucky backslash vertical bar", but when you pass *that*
back to it, it gets decoded as the unicode string "double bucky backslash
vertical bar", which Samba then doesn't recognise.

I don't know what Windows *should* do here. I know it shouldn't do this -
this leads to breaking of some very basic invariants about files and
directories, and so the kind of confusion utabintarbo suffered. The
solution is either to apply an information-preserving encoding (UTF-8,
say), or to refuse to do it at all (ie, raise an error if there are
unencodable characters), neither of which are particularly beautiful
solutions. I think Windows is in a bit of a rock/hard place situation
here, poor thing.
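
The two alternatives, plus the silent lossy behaviour they would replace, can be compared with Python 3's own codecs. This is a sketch of the trade-off only; cp1252 stands in for "the system's 8-bit encoding", and the shortened file name is taken from the listing earlier in the thread.

```python
name = "QTR\u2592INR.E\u2524C-P"

# Option 1: an information-preserving encoding round-trips exactly.
utf8_round_trip = name.encode("utf-8").decode("utf-8")

# Option 2: refuse to guess and raise on unencodable characters.
try:
    name.encode("cp1252")  # strict error handling is the default
    strict_result = "encoded"
except UnicodeEncodeError:
    strict_result = "refused"

# For contrast: a lossy conversion succeeds silently but cannot be undone.
lossy_round_trip = name.encode("cp1252", errors="replace").decode("cp1252")
```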

Incidentally, for those who haven't come across CP_ACP before, it's not
yet another character encoding, it's a pseudovalue which means 'the
system's current default character set'.

tom

utabintarbo · Dec 9, 2005

Part of the reason (I think) is that our CAD/Data Management system
(which produces the aforementioned .MODEL files) substitutes (stupidly,
IMNSHO) non-printable characters for embedded spaces in file names.
This is part of what leads to my consternation here.

And yeah, Windows isn't helping matters much. No surprise there.

Just for s&g's, I ran this on python 2.3 on knoppix:

>>> DIR = os.getcwd()
>>> files = os.listdir(DIR)
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR±INR.E´C-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xb1INR.E\xb4C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True   <--- It works fine here
>>> print os.path.isdir(fullname)
False
>>> files = os.listdir(unicode(DIR))
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR±INR.E´C-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xb1INR.E\xb4C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True   <--- It works fine here too!
This is when mounting the same Samba share in Linux. This tends to
support Tom's point re: the "non-roundtrippability" thing.

Thanks again to all.

Martin v. Löwis · Dec 9, 2005

Tom said:
Isn't the key thing that Windows is applying a non-roundtrippable
character encoding?

This is a fact, but it is not a key thing. Of course Windows is
applying a non-roundtrippable character encoding. What else could it
do?

Windows, however, maps that name to the
8-bit string "double bucky backslash vertical bar"

Only if you ask it to. There are two sets of APIs: one to apply
if you ask for byte strings (FindFirstFileA), and one to apply when you
ask for Unicode strings (FindFirstFileW).

In one case it has to convert; in the other, it doesn't.

I don't know what Windows *should* do here. I know it shouldn't do this
- this leads to breaking of some very basic invariants about files and
directories, and so the kind of confusion utabintarbo suffered.

It always did this, and always will. Applications should stop using the
*A versions of the API. If they continue to do so, they will continue
to get bogus results in border cases.

The real issue here really is that there was a border case, when there
shouldn't be one.

Regards,
Martin

Tom Anderson · Dec 10, 2005

This is a fact, but it is not a key thing. Of course Windows is applying
a non-roundtrippable character encoding. What else could it do?

Well, i'm no great thinker, but i'd say that errors should never pass
silently, and that in the face of ambiguity, one should refuse the
temptation to guess. So, as i said in my post, if the name couldn't be
translated losslessly, an error should be raised.

It always did this, and always will. Applications should stop using the
*A versions of the API.

Absolutely true.

If they continue to do so, they will continue to get bogus results in
border cases.

No. The availability of a better alternative is not an excuse for
gratuitous breakage of the worse alternative.

tom

Martin v. Löwis · Dec 10, 2005

Tom said:
Well, i'm no great thinker, but i'd say that errors should never pass
silently, and that in the face of ambiguity, one should refuse the
temptation to guess. So, as i said in my post, if the name couldn't be
translated losslessly, an error should be raised.

I believe this would not work, the way the API is structured. You
first call FindFirstFile, getting a file name and a handle. Then you call
FindNextFile repeatedly, passing the handle. An error in FindFirstFile
is indicated by returning an invalid handle.

So if you wanted FindFirstFile to return an error for unencodable file
names, it would not be possible to get a listing of the other files
in the directory.

FindFirstFile also gives the 8.3 file name (if present), and that is
valid without problems.
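
As a toy model of that argument (plain Python 3, not the Win32 API, with made-up names throughout): if the only way to signal a problem is to fail the listing as a whole, one unencodable name takes the encodable ones down with it.

```python
def list_all_or_nothing(names, encoding="cp1252"):
    """Hypothetical listing that raises on the first unencodable name."""
    return [name.encode(encoding) for name in names]


# Invented directory contents: two harmless names around a problematic one.
directory = ["README.TXT", "QTR\u2592INR.model", "NOTES.TXT"]

try:
    listing = list_all_or_nothing(directory)
except UnicodeEncodeError:
    # The two perfectly encodable names are lost along with the bad one.
    listing = None
```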

Regards,
Martin
