Finding number of lines of a file from gzip'ed format

sopan.shewale · Feb 29, 2008

Hi,

I am not sure if this is the right group for this question; I am
sorry if it is not.

Problem: Say we have a file called "myfile.txt". The file is huge and
is gzipped; the gzipped filename is "myfile.txt.gz". I want to find
the number of lines in myfile.txt from myfile.txt.gz without
gunzipping it.

I know that if gunzipping is allowed, "gunzip -c myfile.txt.gz | wc -l"
gives the number of lines.

My problem is that gunzipping takes a long time because the file is
very large.

Is there any way to count the lines with a Perl script or any other
method, just to figure out the number of "\n" characters hidden inside
the file, perhaps using something from the gzip algorithm?

I appreciate your time and effort in reading this problem.

Please help me with a solution or pointers to read (I am already
reading http://www.gzip.org/algorithm.txt).

--sopan

xhoster · Feb 29, 2008

sopan.shewale said:
> Hi,
>
> I am not sure if this is the right group to ask this question - i am
> sorry if this is not the right place.
>
> Problem: Let us say we have file called "myfile.txt". The size of the
> file is huge. The file is gziped - the gziped filename is
> "myfile.txt.gz". I am interested to find the number of lines of
> myfile.txt from myfile.txt.gz without gunziping it.
>
> I know if it is allowed to gunzip then just use "gunzip -c
> myfile.txt.gz | wc -l" this can give the number of lines.
>
> My problem is time taken to gunzip is huge file is very large.

That is about as good as it is going to get.

> Is there any way to count the number of lines using Perl script/Any
> other method - just to figure out number of "\n" chars hidden inside
> the file-use something from the algorithm of gzip?

You *might* be able to come up with a shortcut that is integrated in
with the very guts of the Lempel-Ziv 77 algorithm that would allow you
to count the "\n" without actually doing the unzip, but I'm skeptical
that you could make it meaningfully faster than just gunzipping (or gzcat).
And if you try to do so in Perl rather than C, then I'm rather confident
that it would be a lot slower.
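
For comparison, here is a minimal sketch of the plain streaming count, written in Python rather than Perl purely for illustration. It still decompresses every byte, so it should not be expected to beat "gunzip -c myfile.txt.gz | wc -l"; it only shows that the count needs constant memory. The function name and chunk size are assumptions, not anything from this thread.

```python
import gzip


def count_lines_gz(path, chunk_size=1 << 20):
    """Count newline bytes in a gzip file by streaming decompressed chunks."""
    count = 0
    with gzip.open(path, "rb") as f:
        # Fixed-size reads keep memory use flat no matter how large the file is.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count
```

Like wc -l, this counts newline characters, so a final line without a trailing newline is not counted.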

Perhaps you should pre-compute and then cache the number of lines
someplace.
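
A rough sketch of that caching idea, again in Python for illustration. The sidecar file name (the archive name plus ".lines") and the function name are assumptions made up for this example; the count is recomputed only when the cache is missing or older than the archive.

```python
import gzip
import os


def cached_line_count(gz_path):
    """Return the line count of a gzip file, caching it in a sidecar file."""
    cache_path = gz_path + ".lines"  # hypothetical sidecar name
    # Trust the cache only if it is at least as new as the archive.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(gz_path)):
        with open(cache_path) as f:
            return int(f.read().strip())

    # Cache miss: pay the full decompression cost once.
    count = 0
    with gzip.open(gz_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")

    with open(cache_path, "w") as f:
        f.write(str(count))
    return count
```

The first call costs as much as a full gunzip; later calls only read a few bytes from the sidecar file.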

Xho


Ted Zlatanov · Mar 3, 2008

ssc> Hi,
ssc> I am not sure if this is the right group to ask this question - i am
ssc> sorry if this is not the right place.

ssc> Problem: Let us say we have file called "myfile.txt". The size of the
ssc> file is huge. The file is gziped - the gziped filename is
ssc> "myfile.txt.gz". I am interested to find the number of lines of
ssc> myfile.txt from myfile.txt.gz without gunziping it.

ssc> I know if it is allowed to gunzip then just use "gunzip -c
ssc> myfile.txt.gz | wc -l" this can give the number of lines.

ssc> My problem is time taken to gunzip is huge file is very large.

ssc> Is there any way to count the number of lines using Perl script/Any
ssc> other method - just to figure out number of "\n" chars hidden inside
ssc> the file-use something from the algorithm of gzip?

ssc> Appreciate your time efforts to read the problem and thank you so much
ssc> for investing time to read this problem.

ssc> Please help me with solution or pointers to read (already reading
ssc> http://www.gzip.org/algorithm.txt).

If the only metadata you'll need is the number of lines, just rename to
myfile.N.txt.gz where N is the number of lines. So, if there is no N
you have to count (you can't avoid that cost, because newlines are just
content), but if N is already calculated you're done. Obviously if you
modify the file you recalculate N, but a compressed file is unlikely to
be modified in place.

It's not a long-term solution and it will only work for this one piece
of data, but it's easy to implement.
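
The renaming scheme could be sketched like this (Python, for illustration only). The helper names are hypothetical, and the pattern assumes files always end in ".txt.gz" as in the example above.

```python
import re

# Matches "myfile.N.txt.gz", where N is a run of digits.
_COUNT_RE = re.compile(r"^(?P<stem>.+)\.(?P<n>\d+)\.txt\.gz$")


def lines_from_name(filename):
    """Return N from "stem.N.txt.gz", or None if no count is embedded."""
    m = _COUNT_RE.match(filename)
    return int(m.group("n")) if m else None


def name_with_lines(filename, n):
    """Build "stem.N.txt.gz", replacing any count already in the name."""
    m = _COUNT_RE.match(filename)
    if m:
        stem = m.group("stem")
    elif filename.endswith(".txt.gz"):
        stem = filename[: -len(".txt.gz")]
    else:
        raise ValueError("expected a name ending in .txt.gz")
    return f"{stem}.{n}.txt.gz"
```

One caveat: a name such as "report.2008.txt.gz" would be misread as carrying a count of 2008, so this only works if the original names never end in a numeric component.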

The question is, why do you need to count newlines? If you specifically
need to show exact statistics about how many lines are in the file,
you're stuck. But you can at least approximate from the file size and
average bytes per line over the first 5000 lines.
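
One possible reading of that approximation, sketched in Python: decompress only until 5000 lines have been seen, note how many compressed bytes that took, and scale up by the compressed file size. This assumes the compression ratio and line length are roughly uniform across the file and that the file is a single gzip member; the function name and parameters are hypothetical.

```python
import os
import zlib


def estimate_lines_gz(path, sample_lines=5000, chunk_size=64 * 1024):
    """Estimate the line count from a sample at the start of a gzip file."""
    total = os.path.getsize(path)  # compressed size on disk
    # Adding 16 to the window bits tells zlib to expect a gzip header.
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
    lines = 0
    fed = 0  # compressed bytes consumed so far
    with open(path, "rb") as f:
        while lines < sample_lines:
            chunk = f.read(chunk_size)
            if not chunk:
                return lines  # reached the end: the count is exact
            fed += len(chunk)
            lines += d.decompress(chunk).count(b"\n")
    # Scale the sampled line count by the fraction of the file consumed.
    return round(lines * total / fed)
```

The result is only as good as the uniformity assumption; a file whose line lengths change partway through will be estimated poorly.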

Users really appreciate interactive applications. If instead of doing
the wc -l and THEN displaying it, you maintain a running counter of the
number of lines and update the screen with the new value periodically, I
guarantee you that users won't mind it much.
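
A minimal sketch of such a running counter in Python, assuming a terminal where "\r" returns the cursor to the start of the line. The function name, the reporting interval, and the message wording are all made up for illustration.

```python
import gzip
import sys


def count_with_progress(path, report_every=1_000_000, out=sys.stderr):
    """Count lines in a gzip file, printing the running total as it grows."""
    count = 0
    next_report = report_every
    with gzip.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")
            if count >= next_report:
                # Overwrite the same terminal line with the latest total.
                print(f"\r{count} lines so far...", end="", file=out, flush=True)
                next_report = count + report_every
    print(f"\r{count} lines total", file=out)
    return count
```

The total time is the same as a plain count; the only difference is that the user sees the number climbing instead of a blank screen.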

You could even show a progress bar using the average bytes per line so
far, or the much easier Zeno's paradox progress bar (every update adds
50% of the remainder, so you do 50%, 75%, 87.5%, etc.).
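
The Zeno's paradox bar is simple enough to sketch in a few lines of Python (the generator name is made up for this example). It knows nothing about the real amount of work left; each update just covers half of whatever remains.

```python
def zeno_progress(updates):
    """Yield a completion fraction that halves the remainder on each update."""
    done = 0.0
    for _ in range(updates):
        done += (1.0 - done) * 0.5
        yield done


# The first three updates give 50%, 75% and 87.5%, as described above.
for fraction in zeno_progress(3):
    print(f"{fraction:.1%}")
```

The bar never reaches 100% on its own, so the caller has to jump it to 100% when the count actually finishes.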

Ted
