distinction between unzipping bytes and unzipping a file

webcomm · Jan 9, 2009

Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?

The following code is, I think, an example of writing bytes to a file
and then unzipping...

decoded = base64.b64decode(datum)
#datum is a base64 encoded string of data downloaded from a web
service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.

I find that I am able to unzip the resulting data.zip using the unix
unzip command, but the file inside contains some FFFD characters, as
described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
I don't know if the unwanted characters might be the result of my
trying to write and unzip a file, rather than unzipping the bytes.
The file does contain a semblance of what I ultimately want -- it's
not all garbage.

Apologies if it's not appropriate to start a new thread for this. It
just seems like a different topic than how to deal with the resulting
FFFD characters.

Thanks for your help,
Ryan

webcomm · Jan 9, 2009

decoded = base64.b64decode(datum)
#datum is a base64 encoded string of data downloaded from a web
service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

Sorry, that code is not what I mean to paste. This is what I
intended...

decoded = base64.b64decode(datum)
#datum is a base64 encoded string of data downloaded from a web
service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = popen("unzip data.zip")

Steve Holden · Jan 9, 2009

webcomm said:
Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?

The following code is, I think, an example of writing bytes to a file
and then unzipping...

decoded = base64.b64decode(datum)
#datum is a base64 encoded string of data downloaded from a web
service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

Not terribly useful advice, but one presumes he she or it was trying to
be helpful.

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.

Well, what you have done appears pretty wrong to me, but let's take a
look. What's datum? You appear to be treating it as base64-encoded data;
is that correct? Have you examined it?

f = open('data.zip', 'wb')

opens the file data.zip for writing in binary. Not as a zip file, you
understand, just as a regular file. I suspect here you really needed

f = zipfile.ZipFile('data.zip', 'w')

Now, of course, you need to remember what zipfiles contain. Which is
other files. So the data you *write* tot he zipfile has to be associated
with a filename in the archive. Of course you don't have the data in a
file, you have it in a string, so you would use

f.writestr("somefile.dat", decoded)
f.close()

You have now written a zip file containing a single "somefile.dat" file
with the decoded base64 data in it. Open it with Winzip or one of its
buddies and see if anyone barfs.

I find that I am able to unzip the resulting data.zip using the unix
unzip command, but the file inside contains some FFFD characters, as
described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
I don't know if the unwanted characters might be the result of my
trying to write and unzip a file, rather than unzipping the bytes.
The file does contain a semblance of what I ultimately want -- it's
not all garbage.

But it's certainly not a zip file.

Apologies if it's not appropriate to start a new thread for this. It
just seems like a different topic than how to deal with the resulting
FFFD characters.

Don't worry about it.

regards
Steve

MRAB · Jan 9, 2009

webcomm said:
Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?

Python's zipfile module can only read and write zip files; it can't
compress or decompress data as a bytestring.

The following code is, I think, an example of writing bytes to a file
and then unzipping...

decoded = base64.b64decode(datum)
#datum is a base64 encoded string of data downloaded from a web
service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.

If what you've been given is data which has been zipped and then base-64
encoded, then I can't see that you might be doing wrong.

webcomm · Jan 9, 2009

Not terribly useful advice, but one presumes he she or it was trying to
be helpful.

Well, what you have done appears pretty wrong to me, but let's take a
look. What's datum? You appear to be treating it as base64-encoded data;
is that correct? Have you examined it?

It's data that has been compressed then base64 encoded by the web
service. I'm supposed to download it, then decode, then unzip. They
provide a C# example of how to do this on page 13 of
http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

If you have a minute, see also this thread...
http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4

Chris Mellon · Jan 9, 2009

It's data that has been compressed then base64 encoded by the web
service. I'm supposed to download it, then decode, then unzip. They
provide a C# example of how to do this on page 13 of
http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

If you have a minute, see also this thread...
http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4

When they say "zip", they're talking about a zlib compressed stream of
bytes, not a zip archive.

You want to base64 decode the data, then zlib decompress it, then
finally interpret it as (I think) UTF-16, as that's what Windows
usually means when it says "Unicode".

decoded = base64.b64decode(datum)
decompressed = zlib.decompress(decoded)
result = decompressed.decode('utf-16')

Chris Mellon · Jan 9, 2009

When they say "zip", they're talking about a zlib compressed stream of
bytes, not a zip archive.

You want to base64 decode the data, then zlib decompress it, then
finally interpret it as (I think) UTF-16, as that's what Windows
usually means when it says "Unicode".

decoded = base64.b64decode(datum)
decompressed = zlib.decompress(decoded)
result = decompressed.decode('utf-16')

And of course as *soon* as I write that, I read the appendix on the
documentation in full and turn out to be wrong. Ignore me *sigh*.

It would really help if you could post a sample file somewhere.

webcomm · Jan 9, 2009

It would really help if you could post a sample file somewhere.

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character

If I unzip it like this...
popen("unzip data.zip")
....then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#

If I unzip it like this...
getzip('data.zip', ignoreable=30000)
....using the function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
....then the bad characters are \x00 characters.

John Machin · Jan 9, 2009

Here's a sample with some dummy data from the web service:http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

Your original problem is identical to that already reported by Chris
Mellon (gratuitous \0 bytes appended to the real archive contents).
Here's the output of the diagnostic gadget that I posted a few minutes
ago:
...........................................................
C:\downloads>python zip_susser_v2.py data.zip
archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
signature : 'PK\x05\x06'
this_disk_num : 0
central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
central_dir_overall_num_entries : 1
central_dir_size : 50
central_dir_offset : 844
comment_size : 0

expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'
....................................

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML.

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:

buff = open('data', 'rb').read()
buff[:100]

Click to expand...

Click to expand...

'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>
\x00<\x00B\x0
0a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>
\x000\x00.\x000\x000\x000\x000\x0
0<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<
\x00S\x00t\x0
0a\x00t\x00'

buff[:100].decode('utf_16_le')

Click to expand...

Click to expand...

But if I return it to my browser with python+django,
there are bad characters every other character

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

Yup, you've somehow pushed your utf_16_le-encoded data through some
decoder that doesn't like '\x00' and is replacing it with U+FFFD whose
name is (funnily enough) REPLACEMENT CHARACTER and whose meaning is
"big fat Unicode version of the question mark".

If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using the function at...http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.

Hmmm ... shouldn't make a difference how you extracted 'data' from
'data.zip'.

Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

Cheers,
John

webcomm · Jan 10, 2009

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:

buff = open('data', 'rb').read()
buff[:100]

Click to expand...

Click to expand...

'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>
\x00<\x00B\x0
0a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>
\x000\x00.\x000\x000\x000\x000\x0
0<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<
\x00S\x00t\x0
0a\x00t\x00'>>> buff[:100].decode('utf_16_le')

There it is. Thanks.

u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

I did stop and consider what code to show. I tried to show only the
code that seemed relevant, as there are sometimes complaints on this
and other groups when someone shows more than the relevant code. You
solved my problem with decode('utf_16_le'). I can't find any
description of that encoding on the WWW... and I thought *everything*
was on the WWW.

I didn't know the data was utf_16_le-encoded because I'm getting it
from a service. I don't even know if *they* know what encoding they
used. I'm not sure how you knew what the encoding was.

Please consider reading the Unicode HOWTO athttp://docs.python.org/howto/unicode.html

Probably wouldn't hurt, though reading that HOWTO wouldn't have given
me the encoding, I don't think.

-Ryan

John Machin · Jan 10, 2009

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:

buff = open('data', 'rb').read()
buff[:100]

Click to expand...

'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>
\x00<\x00B\x0
0a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>
\x000\x00.\x000\x000\x000\x000\x0
0<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<
\x00S\x00t\x0
0a\x00t\x00'
buff[:100].decode('utf_16_le')

Click to expand...

Click to expand...

There it is. Thanks.

u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'

Click to expand...

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

Click to expand...

I did stop and consider what code to show. I tried to show only the
code that seemed relevant, as there are sometimes complaints on this
and other groups when someone shows more than the relevant code. You
solved my problem with decode('utf_16_le'). I can't find any
description of that encoding on the WWW... and I thought *everything*
was on the WWW.

Try searching using the official name UTF-16LE ... looks like a blind
spot in the approximate matching algorithm(s) used by the search engine
(s) that you tried :-(

I didn't know the data was utf_16_le-encoded because I'm getting it
from a service. I don't even know if *they* know what encoding they
used. I'm not sure how you knew what the encoding was.

Actually looked at the raw data. Pattern appeared to be an alternation
of 1 "meaningful" byte and one zero ('\x00') byte: => UTF16*. No BOM
('\xFE\xFF' or '\xFF\xFE') at start of file: => UTF16-?E. First byte
is meaningful: => UTF16-LE.

Probably wouldn't hurt,

Definitely won't hurt. Could even help.

though reading that HOWTO wouldn't have given
me the encoding, I don't think.

It wasn't intended to give you the encoding. Just read it.

Cheers,
John

The distinction between a java applet and an application	1	Jan 4, 2023
unzipping a zipx file	4	Apr 19, 2013
BadZipfile "file is not a zip file"	33	Jan 9, 2009
The type/object distinction and possible synthesis of OOP andimperative programming languages	39	Apr 15, 2013
Frustrating circular bytes issue	1	Jun 26, 2012
unzipping a zip file in aso.net	2	Jan 15, 2008
Importing from a Jar file WITHOUT unzipping it	9	Oct 27, 2008
Help me with a bytes decoding problem in python 3	0	Jun 21, 2012

distinction between unzipping bytes and unzipping a file

webcomm

webcomm

Steve Holden

MRAB

webcomm

Chris Mellon

Chris Mellon

webcomm

John Machin

webcomm

John Machin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads