ZipEntry.getSize

Roedy Green

I can create zip files that Winzip says are valid.
All the lengths are there. They pass the test.

I can read zip files that Winzip creates.

However, I can't read zip files that I create in Java with
ZipOutputStream. When I read them back, the length of each entry is reported as 0.
Is there some trick to making this work?

Here is a slightly simplified version of what I am doing to create the
zip:

String elementName = "adir/afile.txt";
ZipEntry entry = new ZipEntry( elementName );

File elementFile = new File ( "adir/afile.txt" );
entry.setTime( elementFile.getLastModified() );
int fileLength = (int) elementFile.length();

entry.setSize( fileLength );

FileInputStream fis = new FileInputStream ( elementFile );
byte[] wholeFile = new byte [ fileLength ];
int bytesRead = fis.read( wholeFile, 0 /* offset */, fileLength
);
fis.close();

// no need to setCRC, computed automatically.
zip.putNextEntry( entry );

zip.write( wholeFile, 0, fileLength );
zip.closeEntry();

I am getting the horrible feeling that ZipOutputStream is stupidly
designed so that it is up to you to fill in fields like size, CRC, and
compressed size by ESP, otherwise only the summary at the end of the
file is accurate.
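
If it really is up to me, I suppose the only way to get real numbers into the local header is to fill in everything myself and give up compression by using the STORED method, so ZipOutputStream has nothing left to compute after the fact. A rough sketch of that idea (untested here; the CRC32 pass over the data is my own bookkeeping, not something the API does for you):

import java.io.*;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class StoredZipSketch {
    public static void main( String[] args ) throws IOException {
        File elementFile = new File( "adir/afile.txt" );
        byte[] wholeFile = new byte[ (int) elementFile.length() ];
        DataInputStream dis = new DataInputStream( new FileInputStream( elementFile ) );
        dis.readFully( wholeFile );          // a plain read() may come up short
        dis.close();

        CRC32 crc = new CRC32();
        crc.update( wholeFile );

        ZipEntry entry = new ZipEntry( "adir/afile.txt" );
        entry.setMethod( ZipEntry.STORED );  // no deflation, so sizes are knowable up front
        entry.setSize( wholeFile.length );
        entry.setCompressedSize( wholeFile.length );
        entry.setCrc( crc.getValue() );
        entry.setTime( elementFile.lastModified() );

        ZipOutputStream zip = new ZipOutputStream( new FileOutputStream( "stored.zip" ) );
        zip.putNextEntry( entry );           // local header should now carry real size and CRC
        zip.write( wholeFile );
        zip.closeEntry();
        zip.close();
    }
}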
 
Roedy Green

I am getting the horrible feeling that ZipOutputStream is stupidly
designed so that it is up to you to fill in fields like size, CRC, and
compressed size by ESP, otherwise only the summary at the end of the
file is accurate.

I sniffed the files ZipOutputStream generates.
The local headers have 0 in the CRC-32,
uncompressed size, and compressed size fields.

I have a sneaky feeling somebody should be shot, or at least fired. I
hope I am just using the ZipOutputStream class incorrectly.
 
Roedy Green

I sniffed the files ZipOutputStream generates.
The local headers have 0 in the CRC-32,
uncompressed size, and compressed size fields.

I discovered, though, that bit 3 of the general purpose flags is on. This is
used for streams that are not seekable. ZipOutputStream works as a pure
stream, which could even be, for example, a socket.

There is a lame way in the PKZip format to put the lengths AFTER the
data.

However, that just postpones the problem. When you go to read,
the fool ZipInputStream can't scan ahead to find the lengths, because
that too is a stream. So you have to read the stream not knowing the
size of the element. It won't even leave the length YOU set in the
header intact.

The alternative may be to use ZipFile to read, which uses the index
at the end and allows random access.
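
A quick sketch of reading back with ZipFile instead (the archive name is just for illustration):

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ReadWithZipFile {
    public static void main( String[] args ) throws Exception {
        ZipFile zf = new ZipFile( "test.zip" );
        for ( Enumeration e = zf.entries(); e.hasMoreElements(); ) {
            ZipEntry entry = (ZipEntry) e.nextElement();
            // These sizes come from the central directory at the end,
            // so they are real even when the local headers say 0.
            System.out.println( entry.getName() + "  " + entry.getSize()
                    + " bytes, compressed " + entry.getCompressedSize() );
        }
        zf.close();
    }
}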
 
Steve Claflin

Roedy said:
I can create zip files that Winzip says are valid.
All the lengths are there. They pass the test.

I can read zip files that Winzip creates.

However, I can't read zip files that I create in Java with
ZipOutputStream. When I read them back, the length of each entry is reported as 0.
Is there some trick to making this work?

Here is a slightly simplified version of what I am doing to create the
zip:

String elementName = "adir/afile.txt";
ZipEntry entry = new ZipEntry( elementName );

File elementFile = new File ( "adir/afile.txt" );
entry.setTime( elementFile.getLastModified() );
int fileLength = (int) elementFile.length();

entry.setSize( fileLength );

FileInputStream fis = new FileInputStream ( elementFile );
byte[] wholeFile = new byte [ fileLength ];
int bytesRead = fis.read( wholeFile, 0 /* offset */, fileLength
);
fis.close();

// no need to setCRC, computed automatically.
zip.putNextEntry( entry );

zip.write( wholeFile, 0, fileLength );
zip.closeEntry();

I am getting the horrible feeling that ZipOutputStream is stupidly
designed so that it is up to you to fill in fields like size, CRC, and
compressed size by ESP, otherwise only the summary at the end of the
file is accurate.

ZipEntry and related classes aren't very well documented. The size
seems to get written automatically (and maybe trying to write it
yourself mungs it somehow), and the lastModified works if set after
"loading" the zip entry. The following worked for me on Windows 98
under JDK 1.3:

import java.util.zip.*;
import java.io.*;

public class TestZipMultiFile {
    public static void main(String[] args) {
        ZipOutputStream zo;
        ZipEntry ze;

        FileInputStream fis;
        BufferedInputStream bis;
        byte[] data = new byte[1024];
        int byteCount;

        try {
            FileOutputStream fos = new FileOutputStream("test.zip");
            zo = new ZipOutputStream(fos);
            fis = new FileInputStream("TestFile1.java");
            bis = new BufferedInputStream(fis);

            ze = new ZipEntry("TestFile1.java");
            System.out.print("TestFile1.java");
            zo.putNextEntry(ze);
            while ((byteCount = bis.read(data, 0, 1024)) > -1) {
                zo.write(data, 0, byteCount);
                System.out.print("*");
            }
            System.out.println("*");
            bis.close();
            ze.setTime( new File("TestFile1.java").lastModified() );

            fis = new FileInputStream("TestFile2.java");
            bis = new BufferedInputStream(fis);

            System.out.print("TestFile2.java");
            ze = new ZipEntry("TestFile2.java");
            zo.putNextEntry(ze);
            while ((byteCount = bis.read(data, 0, 1024)) > -1) {
                zo.write(data, 0, byteCount);
                System.out.print("*");
            }
            System.out.println("*");
            bis.close();
            ze.setTime( new File("TestFile2.java").lastModified() );

            zo.flush();
            zo.close();
            fos.close();
        }
        catch ( Exception e) { e.printStackTrace(); }
    }
}
 
Roedy Green

ZipEntry and related classes aren't very well documented. The size
seems to get written automatically (and maybe trying to write it
yourself mungs it somehow), and the lastModified works if set after
"loading" the zip entry. The following worked for me on Windows 98
under JDK 1.3:

I had the same problem whether I set the length myself or not. I
explain what is going on at http://mindprod.com/jgloss/zip.html

Basically the format ZipOutputStream produces is not compatible with
ZipInputStream, though it is technically legal.
 
Luke Tulkas

Roedy Green said:
I can create zip files that Winzip says are valid.
All the lengths are there. They pass the test.

I can read zip files that Winzip creates.

However, I can't read zip files that I create in Java with
ZipOutputStream. When I read them back, the length of each entry is reported as 0.
Is there some trick to making this work?

zis = new ZipInputStream(...);
while ((entry = zis.getNextEntry()) != null) {
    // If you ask this entry for its size here, you get -1, so... just ignore it.
    // Read from zis until you get -1.
    // If you haven't kept track of the number of bytes you read from zis,
    // you can ask the entry for its size now & be surprised. ;-)
}

Nothing to it, really. If you want I can mail you the code.
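
Spelled out a little more fully, it might look like this (a rough sketch; demo.zip and the printout are only there to show the counting):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ReadZipStream {
    public static void main( String[] args ) throws IOException {
        ZipInputStream zis = new ZipInputStream( new FileInputStream( "demo.zip" ) );
        byte[] buffer = new byte[ 1024 ];
        ZipEntry entry;
        while ( ( entry = zis.getNextEntry() ) != null ) {
            // Asking getSize() here typically gives -1: the local header has no length.
            long counted = 0;
            int n;
            while ( ( n = zis.read( buffer, 0, buffer.length ) ) > -1 ) {
                counted += n;   // keep track of the bytes yourself
            }
            // After the data (and its trailing descriptor) has been read,
            // the entry may finally report a real size.
            System.out.println( entry.getName() + ": counted " + counted
                    + ", getSize() now says " + entry.getSize() );
            zis.closeEntry();
        }
        zis.close();
    }
}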
 
Roedy Green

// Read from zis until you get -1.
// If you haven't kept track of the number of bytes you read from zis,
// you can ask the entry for its size now & be surprised. ;-)

The documentation on this really stinks. They don't explain which
fields you have to set yourself. They don't tell you the order you
are supposed to use the methods. They don't tell you about the
getSize problem. They don't document the / \ problem. They don't
warn you about trying to read zips created with Winzip or Pkzip
because of unsupported compression algorithms.
 
Harald Hein

Roedy Green said:
The documentation on this really stinks. They don't explain which
fields you have to set yourself. They don't tell you the order
you are supposed to use the methods. They don't tell you about
the getSize problem. They don't document the / \ problem. They
don't warn you about trying to read zips created with Winzip or
Pkzip because of unsupported compression algorithms.

The whole API is a stupid hack done in a hurry to use the info-zip
library from within Java. It was just hacked together to add JARs to Java.
The API only contains the rudimentary stuff. It was for sure never
intended to be published. I guess Sun just had to publish it when
they recognized that people might want to play with JARs and ZIPs
themselves.

If you want to see some strange things, have a look at the jar tool
and weep. It has interesting problems. E.g. the
MANIFEST file must be the first file in a jar, but it contains
per-file data. So you can't just walk through your list of files and
add them to the jar while at the same time completing the data in the
MANIFEST file. Instead you have to run two passes over the input
files: the first to build the MANIFEST file, the second to add the
individual files. This creates the risk that a file might change
between the first and the second pass, leading to completely broken
JARs. The same goes for the new index data in a JAR. Here Sun didn't
even bother to update the JAR/ZIP API. Instead they do everything
hidden in the jar tool.

The jardiff tool in the WebStart framework is also interesting. It has
to compensate for different orders of entries in the input jars.
 
Roedy Green

I guess Sun just had to publish it when
they recognized that people might want to play with JARs and ZIPs
themselves.

The other thing that is odd about them is they use native methods that
use long handles.

They are also at least an order of magnitude slower than Winzip/PkZip
for compressing.

At any rate, my Replicator is now replicating, now that I understand their
limitations.
 
Roedy Green

The whole API is a stupid hack done in a hurry to use the info-zip
library from within Java. It was just hacked together to add JARs to Java.
The API only contains the rudimentary stuff. It was for sure never
intended to be published. I guess Sun just had to publish it when
they recognized that people might want to play with JARs and ZIPs
themselves.

Part of the problem was they wanted to make ZipOutputStream a true
OutputStream even though the file structure properly requires random
access and buffering to create.

If I were re-inventing jar files, they would have an alphabetical
index at the HEAD of the file with absolute offsets into the file
where to find the data. There might be a little indexing added to
speed searching for a particular name, e.g. class file loading. There
would be no embedded headers. That index itself would be optionally
compressed too. The names of the elements would be in UTF-8 encoding.

You could open a ZIP, add elements, delete elements, merge other zips,
and when you closed it, it would do a flurry of copying to create
the new zip. There would be no need to uncompress and recompress to
merge two zip files.

We have no way to update a ZIP now, only create a new one from
scratch.


I'd also like to add convenience methods so you could just say which
files you wanted added, and it would fetch them, with dates etc., and when
it unpacked them it would automatically do the necessary mkdirs( f.getParent() ).
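
Something along these lines, say (a sketch only; the helper name unzipTo is made up, and a real version would also guard against entry names that climb out of the target directory):

import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class Unzipper {
    /** Unpack a zip stream into targetDir, creating parent directories as needed. */
    public static void unzipTo( InputStream in, File targetDir ) throws IOException {
        ZipInputStream zis = new ZipInputStream( in );
        byte[] buffer = new byte[ 4096 ];
        ZipEntry entry;
        while ( ( entry = zis.getNextEntry() ) != null ) {
            File f = new File( targetDir, entry.getName() );
            if ( entry.isDirectory() ) {
                f.mkdirs();
                continue;
            }
            f.getParentFile().mkdirs();   // the mkdirs( f.getParent() ) step
            FileOutputStream out = new FileOutputStream( f );
            int n;
            while ( ( n = zis.read( buffer ) ) > -1 ) {
                out.write( buffer, 0, n );
            }
            out.close();
            if ( entry.getTime() != -1 ) {
                f.setLastModified( entry.getTime() );   // restore the date
            }
        }
        zis.close();
    }
}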
 
Harald Hein

Roedy Green said:
The other thing that is odd about them is they use native methods
that use long handles.

This is because of the underlying info-zip library. At some places in
the Java jar/zip API the layer around the library is very thin and you
can see the library implementation shining through. If you grab the library
from the net you see the similarities.

It gets even better when you try to figure out stuff like the dictionary
in the Inflater/Deflater. That magic byte[] goes directly into the
corresponding calls of the info-zip library.

And for the record, if someone googles for the dictionary stuff: that
array of bytes is supposed to contain a sequence of C-style
null-terminated strings. Don't ask about the encoding, we are back in
C "a char is a byte" land. Disgusting.
 
Eric Sosman

Roedy said:
If I were re-inventing jar files, they would have an alphabetical
index at the HEAD of the file with absolute offsets into the file
where to find the data. There might be a little indexing added to
speed searching for a particular name, e.g. class file loading. There
would be no embedded headers. That index itself would be optionally
compressed too. The names of the elements would be in UTF-8 encoding.

Wouldn't compression of the index just exacerbate the
problem Harald Hein mentioned concerning the MANIFEST file?
Actually, I think it makes the problem insoluble: You don't
know the file offsets until you know the size of the compressed
index, but you can't compress the index until you know the offset
values it contains, and if the offset values change the index may
compress to a different size, ... I imagine many .rgjar files
would settle down to a steady state after one or two passes,
but there's the nagging possibility of an eternal oscillation.
 
Roedy Green

Wouldn't compression of the index just exacerbate the
problem Harald Hein mentioned concerning the MANIFEST file?
Actually, I think it makes the problem insoluble:

For simplicity, you would put the length of the index, stored uncompressed,
followed by the index itself.

Aren't the decompressors capable of detecting the end of a stream just
from the compressed bytes? The PKZip format with the length AFTER the
data implies that.
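
One quick way to check with java.util.zip (a sketch; the trailing padding is there only to show that Inflater stops by itself):

import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SelfTerminatingStream {
    public static void main( String[] args ) throws Exception {
        byte[] original = "the index could be compressed like this".getBytes( "ISO-8859-1" );

        Deflater deflater = new Deflater();
        deflater.setInput( original );
        deflater.finish();
        byte[] compressed = new byte[ 256 ];
        int compressedLength = deflater.deflate( compressed );

        // Hand the Inflater the compressed bytes plus some trailing junk;
        // it should stop at the end-of-stream marker on its own.
        byte[] padded = new byte[ compressedLength + 16 ];
        System.arraycopy( compressed, 0, padded, 0, compressedLength );

        Inflater inflater = new Inflater();
        inflater.setInput( padded );
        byte[] output = new byte[ 256 ];
        int n = inflater.inflate( output );
        System.out.println( "finished = " + inflater.finished()
                + ", recovered " + n + " of " + original.length + " bytes" );
    }
}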
 
Eric Sosman

Roedy said:
For simplicity, you would put the length of the index, stored uncompressed,
followed by the index itself.

Perhaps I didn't explain the problem clearly (or perhaps
I've just imagined the whole thing ...).

Your suggestion, if I understood correctly, was to put a
compressed index at the beginning of the .rgjar file. The index
would contain (among other things) the offsets of the various
content files. The offset of any particular content file is
the sum of the sizes of all things that appear before it, and
one of these things is the index. Thus, the values recorded in
the index depend on the size of the compressed index. But the
values also (potentially) influence the size of the compressed
index; change the values and you get a different compressed size.
Looks like a feedback loop to me.

You could avoid the loop by storing just the file sizes
instead of their offsets, along with a sequence number (or other
ordering information) to allow the offsets to be computed from
the decompressed index. But this is exactly Harald Hein's
problem: You'd now need to compress all the files *before*
creating the index, then write the index at the beginning of
the .rgjar file, then write all the compressed files. Byte code
isn't too voluminous and could probably be kept around in memory
between compression time and writing time, but if the .rgjar
archive also carries images, sounds, video clips, and the entire
database of RIAA lawsuits you're probably stuck with two complete
compression passes.
 
Roedy Green

Your suggestion, if I understood correctly, was to put a
compressed index at the beginning of the .rgjar file. The index
would contain (among other things) the offsets of the various
content files. The offset of any particular content file is
the sum of the sizes of all things that appear before it, and
one of these things is the index. Thus, the values recorded in
the index depend on the size of the compressed index. But the
values also (potentially) influence the size of the compressed
index; change the values and you get a different compressed size.

You have to build the index and the data separately, then glue them
together at the last minute. The offsets in the compressed index are
relative to the end of the index, as if the index and the data were
two separate files.

If you tried to make them absolute offsets, you would get into your chicken
and egg loop.

I notice now we are going for directories nested 10 deep with great
long names containing spaces. The NAMES of the files themselves are
sometimes just as big as the contents. There is plenty of opportunity
there for compressing.

On rethinking, it may make more sense to tack the index on the end, so
long as the very last bytes of the file hold a pointer to the
beginning of the index. The PKZip format lacks this; you must find the
start by wending your way back field by field.

This way you can append to the file more efficiently. You can tack
new data on the end, and then write a new index on the end, without
necessarily copying the entire front section. This is a more dangerous
way to live, but putting the index on the end would at least leave
that option.

Putting it at the front, however, makes it easier to sample a zip
without downloading the whole thing.
 
Hello :)
Is there any way to create a zip file using ZipOutputStream which sets the sizes correctly? Using ZipFile at decompression is not an option for me because we have a lot of clients in the field that do not use ZipFile and cannot be changed.
I have tried several libraries, but have yet to find a satisfying one...
 
