odigity
I'm writing a script that needs to run in as fast a time as possible.
Every minute counts. The script crawls a tree of gzipped files
totalling about 30GB. Originally I was calling open() with "gzip
$file |", but I hate making external calls - it requires a fork, and
you have very limited communication with the process for catching
errors and such. I always like using perl functions and modules when
possible over external calls. However, I wanted to make sure I
wouldn't take a performance hit before switching to Compress::Zlib.
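(To illustrate what I mean about limited error communication: with a piped open, gzip's own diagnostics go straight to STDERR, and about the only thing the script sees is the exit status in $? after close() - and open() itself succeeds as long as the fork does, even if gzip can't be exec'd. A hypothetical sketch of the old approach:)

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sketch of the piped-open approach: the only error
# signal Perl gets is the exit status in $? once the pipe is closed.
# Note that open() on a pipe returns true as long as the fork
# succeeds, even if gzip itself cannot be exec'd in the child.
sub count_lines_ext {
    my ($file) = @_;
    open( my $fh, "gzip -cd $file |" )
        or die( "could not fork gzip: $!" );
    my $lines = 0;
    $lines++ while <$fh>;
    close( $fh )
        or die( $! ? "error closing gzip pipe: $!"
                   : "gzip exited with status " . ( $? >> 8 ) );
    return $lines;
}
```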
I picked one of the bigger files (75MB) and ran some benchmarking on
it, comparing Compress::Zlib to an external call to the gzip utility.
Here's the code:
#!/usr/bin/perl -w
use strict;
use Benchmark qw( cmpthese );
use Compress::Zlib;
use IO::File;
my $file = 'sample.gz';
print "warming up the file...\n";
system( "zcat $file > /dev/null" );
print "starting comparison...\n";
cmpthese( 3, {
    'ext_gzip'      => \&ext_gzip,
    'compress_zlib' => \&compress_zlib,
});
sub ext_gzip
{
    my $fh = IO::File->new( "gzip -cd $file |" )
        or die( "could not gzip -cd '$file' for reading: $!" );
    my $lines = 0;
    while ( defined( my $line = $fh->getline() ) ) {
        $lines++;
    }
    $fh->close();
    print "ext_gzip: $lines lines\n";
}
sub compress_zlib
{
    my $gz = gzopen( $file, 'rb' )
        or die( "could not gzopen '$file' for reading: $gzerrno" );
    my $line;
    my $lines = 0;
    my $bytes;
    while ( ( $bytes = $gz->gzreadline( $line ) ) > 0 ) {
        $lines++;
    }
    # gzreadline returns -1 on error and 0 at EOF; the > 0 loop
    # condition ends the loop on either, so check for errors here
    # rather than inside the loop, where the check could never fire.
    die( $gz->gzerror ) if ( $bytes < 0 );
    $gz->gzclose();
    print "compress_zlib: $lines lines\n";
}
Here's the output:
warming up the file...
starting comparison...
compress_zlib: 15185003 lines
compress_zlib: 15185003 lines
compress_zlib: 15185003 lines
(warning: too few iterations for a reliable count)
ext_gzip: 15185003 lines
ext_gzip: 15185003 lines
ext_gzip: 15185003 lines
(warning: too few iterations for a reliable count)
               s/iter compress_zlib ext_gzip
compress_zlib    68.6            --     -23%
ext_gzip         52.8           30%       --
Now, this wasn't the best possible benchmarking test, but I still
think I am justified in being concerned.
Any help with a) interpreting these results, b) suggesting better
benchmarking methods, c) explaining why Compress::Zlib is slower than
gzip, or, most importantly, d) improving performance, would be
appreciated.
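One thing I've been meaning to try on point (d), in case the per-line overhead of gzreadline is the bottleneck: reading fixed-size blocks with gzread and counting newlines with tr///. A sketch - untested against the benchmark above, and the 64KB block size is just a guess:

```perl
#!/usr/bin/perl -w
use strict;
use Compress::Zlib;

# Sketch: read the decompressed stream in 64KB blocks and count
# newlines with tr///, avoiding one method call per line.
sub count_lines_blockwise {
    my ($file) = @_;
    my $gz = gzopen( $file, 'rb' )
        or die( "could not gzopen '$file' for reading: $gzerrno" );
    my ( $buf, $bytes );
    my $lines = 0;
    while ( ( $bytes = $gz->gzread( $buf, 65536 ) ) > 0 ) {
        $lines += ( $buf =~ tr/\n// );
    }
    # gzread returns -1 on error and 0 at EOF
    die( "gzread failed: " . $gz->gzerror ) if $bytes < 0;
    $gz->gzclose();
    return $lines;
}
```

Of course this only helps if the real per-line work can be restructured to run per-block instead.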
-ofer