Speed gap between zcat and zlib's GzipReader

David G. Andersen

I'm still in 1.8.1-land, so this may be old news, but
GzipReader is (painfully) slow compared to using zcat
to accomplish the same thing:

The code:

#!/scratch/ruby/bin/ruby

require 'zlib'

f = ARGV[0]

s = Time.new
infile = Zlib::GzipReader.new(File.new(f, "r"))
#infile = IO.popen("zcat #{f}", "r")
linecount = 0
infile.each_line { |l|
  linecount += 1
}
e = Time.new
print "Read #{linecount} lines in #{e - s} seconds\n"

------------------------------

Tested on:
FreeBSD port-installed ruby 1.8.1
Freshly compiled 1.8.1
Freshly compiled 1.8.1 with CFLAGS=-O2
CVS version, CFLAGS=-O2

             FBSD 1.8.1   1.8.1, -O0   1.8.1, -O2   CVS, -O2
popen zcat:  2.3          2.3          2.3          2.3
GzipReader:  5.8          9.2          5.8          5.9

Yowza. Before I poke more, is this expected, or a known
slowness issue?

-Dave
 
David G. Andersen

|I had a similar problem which was discussed here at length a year or
|so ago. If you avoid the block setup and use a fixed-length read, it's
|quite a bit quicker. Still nowhere near as fast as Perl though :-(.

Ahh, thanks. So the problem is really in GzipReader's each_line
handling. It's actually pretty close to as fast as it can go
when doing fixed-length reads. Counting bytes only, with fixed-length
reads, popen and GzipReader both take 1.4 seconds on my test file.
A zcat to /dev/null takes 1.18 seconds, and piping to 'wc' takes
1.83 seconds. No complaints there.
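For reference, the fixed-length-read pattern looks something like this. This is a minimal sketch: the in-memory gzip payload and the 64 KB chunk size are my own stand-ins for the real test file.

```ruby
require 'zlib'
require 'stringio'

# Build a small gzipped payload in memory as a stand-in for the test file.
out = StringIO.new
gz = Zlib::GzipWriter.new(out)
1000.times { |i| gz.write("line #{i}\n") }
gz.finish  # finalize the gzip stream without closing the StringIO

# Byte-counting with fixed-length reads: pulling large chunks bypasses
# the slow per-line delimiter scan in GzipReader's gets/each_line path.
reader = Zlib::GzipReader.new(StringIO.new(out.string))
bytes = 0
while (chunk = reader.read(65536))
  bytes += chunk.size
end
reader.close
puts bytes
```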

gzfile_read is fast.
gzfile_read_more is fast (used by gzfile_read).
But gzreader_gets... is a dog. It does a memcmp()
on each byte of the input string to test it against
the delimiter - yow! So it looks like zlib's "gets"
needs the equivalent of rb_io_getline_fast. It would
be nice if that were easily reused, but the FILE *
access is buried pretty deep inside it.
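A Ruby-level sketch of the same idea (hypothetical class name; String#index stands in for the C-level memchr jump): keep a read buffer, locate the separator with a single optimized substring search, and refill only when it is not found.

```ruby
require 'stringio'

# Hypothetical sketch of the memchr-style scan in Ruby: instead of testing
# the separator at every byte offset, String#index jumps straight to the
# next candidate, and the buffer is refilled only when no separator is found.
class BufferedGets
  def initialize(io, sep = "\n", chunk = 4096)
    @io, @sep, @chunk, @buf = io, sep, chunk, ""
  end

  def gets
    until (i = @buf.index(@sep))
      data = @io.read(@chunk)
      if data.nil?                  # EOF: flush whatever is left
        return nil if @buf.empty?
        last, @buf = @buf, ""
        return last
      end
      @buf << data
    end
    @buf.slice!(0, i + @sep.size)   # line including the separator
  end
end

reader = BufferedGets.new(StringIO.new("alpha\nbeta\ngamma"))
puts reader.gets   # returns "alpha\n"
puts reader.gets   # returns "beta\n"
puts reader.gets   # returns "gamma" (no trailing separator at EOF)
```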

Guess I'll have to dig up some spare time next week. :)

-Dave
 
David G. Andersen

|Ahh, thanks. So the problem is really in GzipReader's each_line
|handling.
|[...]
|But gzreader_gets... is a dog. It does a memcmp()
|on each byte of the input string to test it against

I've attached a patch that reduces some of the overhead
for files with longer lines (but doesn't fix all of the
slowdowns). Some benchmarks, w/1.8.1 on FreeBSD,
grabbing data out of the gzipped file with file.gets():

"tarfile" - compressed JDK. Line length is long (random data...)
"words" - /usr/share/dict/words gzipped. Lines are very short.
"logfile" - logfile from one of my experiments. Lines are
between 15 and 120 bytes long.

           popen   GzReader-orig   GzReader-patched
           -----   -------------   ----------------
tarfile    2.06    5.65            2.95
words      0.914   2.4             2.22
logfile    1.18    3.65            2.27

The patch is tiny and non-intrusive, which is a bonus, though its
performance improvement is not spectacular for short lines. Helps
with gzipped logfiles, at least, but someone with more {time,
knowledge of ruby's internals} might want to go in and overhaul
things for real.

-Dave


--- orig-zlib.c	Mon Oct 25 22:01:18 2004
+++ zlib.c	Mon Oct 25 22:33:26 2004
@@ -2470,7 +2470,7 @@
 {
     struct gzfile *gz = get_gzfile(obj);
     VALUE rs, dst;
-    char *rsptr, *p;
+    char *rsptr, *p, *res;
     long rslen, n;
     int rspara;
 
@@ -2520,8 +2520,15 @@
 	    gzfile_read_more(gz);
 	    p = RSTRING(gz->z.buf)->ptr + n - rslen;
 	}
-	if (memcmp(p, rsptr, rslen) == 0) break;
-	p++, n++;
+	res = memchr(p, rsptr[0], (gz->z.buf_filled - n + 1));
+	if (!res) {
+	    n = gz->z.buf_filled + 1;
+	} else {
+	    n += (long)(res - p);
+	    p = res;
+	    if (rslen == 1 || memcmp(p, rsptr, rslen) == 0) break;
+	    p++, n++;
+	}
     }
 
     gz->lineno++;
 
Yukihiro Matsumoto

Hi,

In message "Re: Speed gap between zcat and zlib's GzipReader"

|I've attached a patch that reduces some of the overhead
|for files with longer lines (but doesn't fix all of the
|slowdowns). Some benchmarks, w/1.8.1 on FreeBSD,
|grabbing data out of the gzipped file with file.gets():

I'm impressed. I will merge your patch.

matz.
 
