zipfile module: problems with filenames containing non-ASCII characters

vincent_delft

I have a simple Python script that reads a directory and puts the files into a
zip file.

I'm using os.walk to get the directory contents, and I create ZipInfo
objects and set "filename", ... to what os.walk gives me.
....
And it works!!!!

BUT!!

When I open the created zip file with WinZip (or any other zip tool), the
filenames are not always what they should be.
In fact, filenames with characters like "é", "è", "ç" are not stored correctly
in the zip file.

Does anyone know what must be done?
Is this a "unicode" problem?
Is this a known bug in the zipfile module?



Thanks

Vincent
 
Jeff Epler

Zip files don't have a way to define the encoding of filenames---names
are just byte strings, and different utilities may interpret them in
different ways. The only thing that seems to be defined is that '/' is
the directory separator, and possibly that the filename can't contain
'\0'.
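
To illustrate the '/' rule, here is a minimal sketch of turning a relative path produced on Windows into a zip member name. The helper name `to_arcname` and its `windows` parameter are made up for this example and are not part of the zipfile module.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_arcname(relative_path, windows=True):
    """Return a zip member name that uses '/' as the separator.

    `to_arcname` is a hypothetical helper. With windows=True the input
    is parsed with Windows rules, where both '\\' and '/' separate
    path components.
    """
    path_type = PureWindowsPath if windows else PurePosixPath
    return path_type(relative_path).as_posix()


if __name__ == "__main__":
    print(to_arcname("photos\\été\\plage.jpg"))
```

The non-ASCII characters pass through untouched here; only the separator changes. How the resulting name is encoded into bytes is a separate question.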

You can probably find the encoding that winzip uses with a little
trial-and-error, and convert your filenames in your encoding to
filenames in that encoding. This may depend on the language or region
of the installed Windows, though.
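
The trial-and-error could look roughly like this: take the raw bytes of a member name and decode them with a few candidate code pages, then see which result matches what the zip tool displays. The function name `decode_candidates` and the list of encodings are assumptions chosen for the sketch, not a list of what any particular tool uses.

```python
def decode_candidates(raw_name, encodings=("cp437", "cp850", "cp1252", "latin-1")):
    """Decode the raw bytes of a zip member name with several code pages."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw_name.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None  # bytes are not valid in this encoding
    return results


if __name__ == "__main__":
    # Hypothetical raw name as it might be stored in an archive.
    raw = b"caf\x82.txt"
    for encoding, text in decode_candidates(raw).items():
        print(encoding, repr(text))
```

The same byte decodes to different characters depending on the code page, which is why the name looks right in one tool and wrong in another.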

Jeff

 
vincent_delft

Jeff said:
Zip files don't have a way to define the encoding of filenames---names
are just byte strings, and different utilities may interpret them in
different ways. The only thing that seems to be defined is that '/' is
the directory separator, and possibly that the filename can't contain
'\0'.

Thanks, I see the problem; I've replaced all "\" with "/".

You can probably find the encoding that winzip uses with a little
trial-and-error, and convert your filenames in your encoding to
filenames in that encoding. This may depend on the language or region
of the installed Windows, though.

Thanks for the explanation.

Is that limitation specific to zip files?
Is there another "compression tool" that doesn't have this limitation
(tgz? bz2? ???)
 
Martin v. Löwis

That limitation is only valid for zip files ?

It appears that WinZip and other tools interpret the file names in a
zipfile in CP437. So to properly put non-ASCII file names into a
zipfile, you need to convert them into CP437. If the file name
contains a character which is not available in CP437, you cannot
save the file in a zipfile (without renaming it).
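
A small sketch of the conversion described above, assuming the file names are available as Python text strings. The helper name `encode_zip_name` is hypothetical; it only encodes the name and reports the characters CP437 cannot represent, it does not write an archive.

```python
def encode_zip_name(name):
    """Encode a file name as CP437 bytes, or raise ValueError if it cannot be."""
    try:
        return name.encode("cp437")
    except UnicodeEncodeError as exc:
        # exc.start/exc.end delimit the first run of unencodable characters.
        bad = name[exc.start:exc.end]
        raise ValueError(f"{bad!r} has no CP437 equivalent in {name!r}") from exc


if __name__ == "__main__":
    for candidate in ("é.txt", "façade.txt", "€uro.txt"):
        try:
            print(candidate, "->", encode_zip_name(candidate))
        except ValueError as err:
            print(candidate, "-> rename needed:", err)
```

Names such as "é.txt" encode cleanly, while a name containing "€" raises, which corresponds to the "cannot save without renaming" case.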

Not really a Unicode problem, but rather a problem that Unicode
tries to solve.

Is there another "compression tool" that doesn't have this limitation
(tgz? bz2? ???)

tar, traditionally, is also unaware of character sets. Single Unix 3
(and I believe also earlier) ended the tar wars with the introduction
of the pax utility, which does allow for specification of a character
set in a pax file; among the supported character sets are ISO-8859-n,
and UTF-8.
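
For what it is worth, Python's tarfile module can write the pax format, so a round trip with non-ASCII names can be sketched as below. The function name `roundtrip_pax_names` is made up for the example, and the members are empty placeholders written to an in-memory buffer rather than real files.

```python
import io
import tarfile


def roundtrip_pax_names(names):
    """Write empty members to an in-memory pax archive and read the names back."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.PAX_FORMAT) as archive:
        for name in names:
            info = tarfile.TarInfo(name=name)  # size defaults to 0, so no data is needed
            archive.addfile(info)
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r") as archive:
        return archive.getnames()


if __name__ == "__main__":
    print(roundtrip_pax_names(["été/façade.txt", "plain.txt"]))
```

In pax archives the non-ASCII names travel in extended headers encoded as UTF-8, so they should not depend on the code page of the machine that reads them.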

Jörg Schilling's star(1) also uses UTF-8 for file names.

On the non-tar side of the world, WinRAR supports Unicode in archives.
For compatibility, they also put a non-Unicode name into the archive,
but the Unicode name, if present, is meant to take precedence.

Regards,
Martin
 
vincent_delft

Martin v. Löwis said:
It appears that WinZip and other tools interpret the file names in a
zipfile in CP437. So to properly put non-ASCII file names into a
zipfile, you need to convert them into CP437. If the file name
contains a character which is not available in CP437, you cannot
save the file in a zipfile (without renaming it).

Thanks, with cp437 it rocks!!!!

Not really a Unicode problem, but rather a problem that Unicode
tries to solve.


tar, traditionally, is also unaware of character sets. Single Unix 3
(and I believe also earlier) ended the tar wars with the introduction
of the pax utility, which does allow for specification of a character
set in a pax file; among the supported character sets are ISO-8859-n,
and UTF-8.

Thanks for the info.

Jörg Schilling's star(1) also uses UTF-8 for file names.

On the non-tar side of the world, WinRAR supports Unicode in archives.
For compatibility, they also put a non-Unicode name into the archive,
but the Unicode name, if present, is meant to take precedence.

So that would be the most "portable" compression tool.

Thanks for those valuable remarks.

Vincent
 
