Finding the number of lines in a file from its gzip'ed format

sopan.shewale

Hi,

I am not sure whether this is the right group for this question; I am
sorry if it is not.

Problem: Say we have a file called "myfile.txt", and it is huge. The
file is gzipped as "myfile.txt.gz". I want to find the number of lines
in myfile.txt from myfile.txt.gz without gunzipping it.

I know that if gunzipping is allowed, "gunzip -c myfile.txt.gz | wc -l"
gives the number of lines.

My problem is that gunzipping takes a long time because the file is very large.

Is there any way to count the lines with a Perl script or any other
method, that is, to count the "\n" characters hidden inside the
compressed file by using something from the gzip algorithm?

I appreciate you taking the time to read this problem.

Please help me with a solution or with pointers to further reading (I am
already reading http://www.gzip.org/algorithm.txt).


--sopan
 
xhoster

> Hi,
>
> I am not sure if this is the right group to ask this question - i am
> sorry if this is not the right place.
>
> Problem: Let us say we have file called "myfile.txt". The size of the
> file is huge. The file is gziped - the gziped filename is
> "myfile.txt.gz". I am interested to find the number of lines of
> myfile.txt from myfile.txt.gz without gunziping it.
>
> I know if it is allowed to gunzip then just use "gunzip -c
> myfile.txt.gz | wc -l" this can give the number of lines.
>
> My problem is time taken to gunzip is huge file is very large.

That is about as good as it is going to get.

> Is there any way to count the number of lines using Perl script/Any
> other method - just to figure out number of "\n" chars hidden inside
> the file-use something from the algorithm of gzip?

You *might* be able to come up with a shortcut that is integrated in
with the very guts of the Lempel-Ziv 77 algorithm that would allow you
to count the "\n" without actually doing the unzip, but I'm skeptical
that you could make it meaningfully faster than just gunzipping (or gzcat).
And if you try to do so in Perl rather than C, then I'm rather confident
that it would be a lot slower.
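
For reference, here is a minimal sketch of the baseline that the thread treats as the floor: stream the decompressed data and count newline bytes, the same work as "gunzip -c myfile.txt.gz | wc -l". It is written in Python rather than Perl purely for illustration; the function name count_lines_gz and the 1 MiB chunk size are assumptions, not anything from the posts.

```python
import gzip

def count_lines_gz(path, chunk_size=1 << 20):
    """Count newline bytes in a gzip file by streaming the decompressed data."""
    total = 0
    with gzip.open(path, "rb") as fh:
        # Fixed-size chunks keep memory use flat no matter how large the file is.
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

This still decompresses every byte, so it is unlikely to beat the shell pipeline; it only shows that the count needs no temporary uncompressed copy on disk.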

Perhaps you should pre-compute and then cache the number of lines
someplace.

Xho

 
Ted Zlatanov

ssc> Hi,
ssc> I am not sure if this is the right group to ask this question - i am
ssc> sorry if this is not the right place.

ssc> Problem: Let us say we have file called "myfile.txt". The size of the
ssc> file is huge. The file is gziped - the gziped filename is
ssc> "myfile.txt.gz". I am interested to find the number of lines of
ssc> myfile.txt from myfile.txt.gz without gunziping it.

ssc> I know if it is allowed to gunzip then just use "gunzip -c
ssc> myfile.txt.gz | wc -l" this can give the number of lines.

ssc> My problem is time taken to gunzip is huge file is very large.

ssc> Is there any way to count the number of lines using Perl script/Any
ssc> other method - just to figure out number of "\n" chars hidden inside
ssc> the file-use something from the algorithm of gzip?

ssc> Appreciate your time efforts to read the problem and thank you so much
ssc> for investing time to read this problem.

ssc> Please help me with solution or pointers to read (already reading
ssc> http://www.gzip.org/algorithm.txt).

If the only metadata you'll need is the number of lines, just rename to
myfile.N.txt.gz where N is the number of lines. So, if there is no N
you have to count (you can't avoid that cost, because newlines are just
content), but if N is already calculated you're done. Obviously if you
modify the file you recalculate N, but a compressed file is unlikely to
be modified in place.
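
The renaming idea could be sketched roughly as follows, in Python for illustration. The helper names, the regular expression, and the restriction to names ending in ".txt.gz" are assumptions made for this sketch. One caveat: a file that already has a numeric component in its name (say "log.2024.txt.gz") would be misread as carrying a line count.

```python
import gzip
import os
import re

# Naming scheme from the post: myfile.N.txt.gz, where N is the line count.
_COUNTED = re.compile(r"^(?P<stem>.+)\.(?P<n>\d+)\.txt\.gz$")

def count_lines_gz(path):
    """Count newline bytes by streaming the decompressed data."""
    total = 0
    with gzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total

def line_count_cached(path):
    """Return (line_count, path), renaming the file to embed N if it is missing."""
    directory, name = os.path.split(path)
    match = _COUNTED.match(name)
    if match:
        # N is already in the name, so no decompression is needed.
        return int(match.group("n")), path
    if not name.endswith(".txt.gz"):
        raise ValueError("expected a name ending in .txt.gz")
    n = count_lines_gz(path)  # pay the full counting cost once
    stem = name[: -len(".txt.gz")]
    new_path = os.path.join(directory, f"{stem}.{n}.txt.gz")
    os.rename(path, new_path)
    return n, new_path
```

The first call on myfile.txt.gz pays for the full count and renames the file; later calls on the renamed file only parse the name.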

It's not a long-term solution and it will only work for this one piece
of data, but it's easy to implement.

The question is, why do you need to count newlines? If you specifically
need to show exact statistics about how many lines are in the file,
you're stuck. But you can at least approximate from the file size and
average bytes per line over the first 5000 lines.
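
One way to sketch that approximation in Python: take the uncompressed size from the gzip trailer and divide it by the average line length of a sample. This rests on assumptions the post does not state. The trailer's ISIZE field holds the uncompressed size modulo 2^32, so the estimate is only meaningful for a single-member gzip file whose original is under 4 GiB, and it assumes the first 5000 lines are representative. The function name is made up for illustration.

```python
import gzip
import os
import struct

def estimate_lines_gz(path, sample_lines=5000):
    """Estimate the line count from the gzip trailer and a sample of lines."""
    # The last 4 bytes of a gzip member are ISIZE: uncompressed size mod 2**32,
    # stored little-endian. This is wrong for originals of 4 GiB or more.
    with open(path, "rb") as raw:
        raw.seek(-4, os.SEEK_END)
        (isize,) = struct.unpack("<I", raw.read(4))

    sampled = 0
    sampled_bytes = 0
    with gzip.open(path, "rb") as fh:
        for line in fh:  # decompresses only as far as the sample needs
            sampled += 1
            sampled_bytes += len(line)
            if sampled >= sample_lines:
                break

    if sampled == 0:
        return 0
    average = sampled_bytes / sampled  # average bytes per line, newline included
    return round(isize / average)
```

For a file beyond the 4 GiB limit, the uncompressed size would have to come from somewhere else, for example a size recorded before the file was compressed.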

Users really appreciate interactive applications. If instead of doing
the wc -l and THEN displaying it, you maintain a running counter of the
number of lines and update the screen with the new value periodically, I
guarantee you that users won't mind it much.
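
A running counter of this kind might look like the Python sketch below. The report callback, its name, and the default reporting interval (every 64 MiB of decompressed data) are assumptions chosen for illustration.

```python
import gzip

def count_lines_with_updates(path, report, every_bytes=64 << 20, chunk_size=1 << 20):
    """Count lines, calling report(lines_so_far) periodically and once at the end."""
    total = 0
    since_report = 0
    with gzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            total += chunk.count(b"\n")
            since_report += len(chunk)
            if since_report >= every_bytes:
                report(total)  # intermediate value for the display
                since_report = 0
    report(total)  # final, exact value
    return total

# Example call (assumes myfile.txt.gz exists):
# count_lines_with_updates(
#     "myfile.txt.gz",
#     lambda n: print(f"\r{n} lines so far", end="", flush=True),
# )
```

The total work is the same as a plain count; the only change is that the user sees the number grow instead of waiting on a blank screen.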

You could even show a progress bar using the average bytes per line so
far, or the much easier Zeno's paradox progress bar (every update adds
50% of the remainder, so you do 50%, 75%, 87.5%, etc.).
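
The Zeno's paradox bar is simple enough to sketch in a few lines of Python; the function names and the bar width are made up for illustration. The bar never reaches 100% by itself, so the caller has to jump to 100% once the count actually finishes.

```python
def zeno_progress(updates):
    """Yield the fraction shown after each update: 0.5, 0.75, 0.875, ..."""
    shown = 0.0
    for _ in range(updates):
        shown += (1.0 - shown) * 0.5  # add half of whatever remains
        yield shown

def render_bar(fraction, width=40):
    """Render a text progress bar for a fraction between 0 and 1."""
    filled = int(fraction * width)
    return "[" + "#" * filled + "-" * (width - filled) + "] " + f"{fraction * 100:.1f}%"
```

Calling render_bar on each value from zeno_progress gives a bar that always moves but carries no real information about the time remaining, which is the trade-off the post is joking about.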

Ted
 
