suggestions for printing out a few records of a lengthy file


ccc31807

The input is a flat file (pipe separated) with thousands of records
and tens of columns, similar to this. The first column is a unique
key.

42546|First|Middle|Last|Street|City|State|Zip|Country|Attr1|Attr2|Attr3 ...

The input is processed and the output consists of multi-page PDF
documents that combine the input file with other files. The other
files reference the unique key. I build a hash from the input file,
like this:

my %records;
while(<IN>)
{
    chomp;
    my ($key, $first, $middle, $last, ...) = split /\|/;
    $records{$key} = {
        first  => $first,
        middle => $middle,
        last   => $last,
        ...
    };
}

Running this script results in thousands of PDF files. The client has
a need for individual documents, so I modified the script to accept a
unique key as a command-line argument; it reads the input file only
until it matches the key, creates one hash element for that key, and
exits, like this:

#in the while loop
if ($key == $command_line_argument) {
#create hash element as above
last;
}

The client now has a need to create a small number of documents. I
capture the unique keys in @ARGV, but I don't know the best way to
select just those records. I can pre-create the hash like this:

foreach my $key (@ARGV)
{
    $records{$key} = 1;
}

and in the while loop, doing this:

if(exists $records{$key})
{
#create hash element as above
}

but this still reads through the entire input file.

Is there a better way?

Thanks, CC.
 

Uri Guttman

c> The client now has a need to create a small number of documents. I
c> capture the unique keys in @ARGV, but I don't know the best way to
c> select just those records. I can pre-create the hash like this:

c> foreach my $key (@ARGV)
c> {
c>     $records{$key} = 1;
c> }

c> and in the while loop, doing this:

c> if(exists $records{$key})
c> {
c> #create hash element as above
c> }

c> but this still reads through the entire input file.

use a real database or even a DBD for a csv (pipe separated is ok)
file.

or you could save a lot of space by reading in each row and only saving
that text line in a hash with the key (you extract only the key). then
you can locate the rows of interest, parse out the fields and do the
usual stuff.
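
roughly like this (just a sketch; the variable and field names are made
up, based on the sample record above, and IN is the already-opened input
filehandle):

my %line_for;
while (my $line = <IN>) {
    my ($key) = split /\|/, $line, 2;   # pull out only the key
    $line_for{$key} = $line;            # keep the raw, unparsed line
}

# later, parse fields only for the rows of interest
for my $key (@wanted_keys) {            # e.g. the keys from @ARGV
    next unless exists $line_for{$key};
    chomp(my $row = $line_for{$key});
    my ($id, $first, $middle, $last, @rest) = split /\|/, $row;
    # ... generate the pdf for this record ...
}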

uri
 

ccc31807

use a real database or even a DBD for a csv (pipe separated is ok)
file.

We get the input file dumped on us every other month or so, as an
ASCII file, and use it just once to create the PDFs. We never do any
update, delete, or insert queries, and only a few select queries, so
putting it into a RDB just to print maybe two dozen documents out of
thousands seems like a lot of effort for very little benefit.

or you could save a lot of space by reading in each row and only saving
that text line in a hash with the key (you extract only the key). then
you can locate the rows of interest, parse out the fields and do the
usual stuff.

This is what I thought I was doing. However, it occurs to me that I
can use a counter initially set to the size of @ARGV, decrement it for
every match, and exit when the counter reaches zero.
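
Roughly something like this (just a sketch, with made-up names; IN is
the input filehandle as in the original script):

my %ids  = map { $_ => 1 } @ARGV;
my $left = keys %ids;                 # number of distinct keys still wanted

while (<IN>) {
    my ($key) = split /\|/, $_, 2;    # only the key for now
    next unless $ids{$key};
    # ... create the hash element / generate the PDF as above ...
    last if --$left == 0;             # stop once every requested key is seen
}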

Thanks for your response, CC.
 

Martijn Lievaart

The input is a flat file (pipe separated) with thousands of records and
tens of columns, similar to this. The first column is a unique key.
(snip)

The client now has a need to create a small number of documents. I
capture the unique keys in @ARGV, but I don't know the best way to
select just those records. I can pre-create the hash like this:

foreach my $key (@ARGV)
{
    $records{$key} = 1;
}

and in the while loop, doing this:

if(exists $records{$key})
{
#create hash element as above
}

but this still reads through the entire input file.

Is there a better way?

Better first ask yourself if there really is a problem. "Thousands" of
records sounds to me like peanuts, and very small peanuts at that.

[martijn@cow t]$ time perl -ne '($x, $y, $z) = split; $h{$x}{y}=$y; $h{$x}{z}=$z' t.log

real 0m2.804s
user 0m2.750s
sys 0m0.043s
[martijn@cow t]$ wc -l t.log
670365 t.log
[martijn@cow t]$

YMMV, and the more you do in the loop the longer it takes. But still, the
seconds (at most!) you might shave off aren't worth your programmer time.

That said, there isn't a really good way to optimize it either. Only if
you run dozens of runs with the same input file might it make sense to
either create an index file, put it in a database, or read it once and
store the hash with the Storable module for fast rereading. (And all
these solutions amount to the same thing: use an index on disk.)
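
For the Storable variant, a rough sketch (the file names and
build_records() are made up):

use Storable qw(store retrieve);

my $input = 'input.txt';       # the pipe-separated dump
my $cache = 'records.stor';    # cached copy of the parsed hash

my $records;
if (-e $cache && -M $cache < -M $input) {   # cache exists and is newer than the input
    $records = retrieve($cache);            # fast binary reread
}
else {
    $records = build_records($input);       # the existing while(<IN>) parsing loop
    store($records, $cache);
}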

M4
 

J. Gleixner

ccc31807 said:
We get the input file dumped on us every other month or so, as an
ASCII file, and use it just once to create the PDFs. We never do any
update, delete, or insert queries, and only a few select queries, so
putting it into a RDB just to print maybe two dozen documents out of
thousands seems like a lot of effort for very little benefit.


This is what I thought I was doing. However, it occurs to me that I
can use a counter initially set to the size of @ARGV, decrement it for
every match, and exit when the counter reaches zero.

No need for all that. You could create a hash of the keys passed in
via ARGV.

my %ids = map { $_ => 1 } @ARGV;

Then test if the key is one you're interested in:

while(<IN>)
{
my ($key, $first, $middle, $last ...) = split /\|/;
next unless $ids{ $key };
...
 

Dr.Ruud

J. Gleixner said:
You could create a hash of the keys passed in
via ARGV.

my %ids = map { $_ => 1 } @ARGV;

Then test if the key is one you're interested in:

while(<IN>)
{
my ($key, $first, $middle, $last ...) = split /\|/;
next unless $ids{ $key };
...

You could also first test whether the line starts with something
interesting. If the key is, for example, at least 3 characters long,
like: C<next unless $short{ substr $_, 0, 3 };>.

You can also (pre)process the file with a grep command.
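
A rough sketch of that pre-filter, assuming keys of at least 3
characters (%ids is the hash from the previous post; %short is made up):

my %ids   = map { $_ => 1 } @ARGV;
my %short = map { substr($_, 0, 3) => 1 } @ARGV;

while (<IN>) {
    next unless $short{ substr $_, 0, 3 };   # cheap reject on the first 3 characters
    my ($key) = split /\|/, $_, 2;
    next unless $ids{ $key };
    # ... split out the rest of the fields and build the record ...
}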
 

Uri Guttman

c> We get the input file dumped on us every other month or so, as an
c> ASCII file, and use it just once to create the PDFs. We never do any
c> update, delete, or insert queries, and only a few select queries, so
c> putting it into a RDB just to print maybe two dozen documents out of
c> thousands seems like a lot of effort for very little benefit.

c> This is what I thought I was doing. However, it occurs to me that I
c> can use a counter initially set to the size of @ARGV, decrement it for
c> every match, and exit when the counter reaches zero.

that would save some time but an unknown amount, as you don't know
where in the file the needed keys are and one could be the last one. if
you want to do it that way, even simpler is to make a hash of the needed
keys from @ARGV. then when you see a line with one of those keys,
process it to a pdf and delete that entry from the hash. when the hash
is empty, exit.

this also could run to the end of the file but it won't ever store more
than one line at a time so it is ram efficient.
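
something like this (just a sketch, names are mine):

my %need = map { $_ => 1 } @ARGV;        # keys still to be processed

while (my $line = <IN>) {
    my ($key) = split /\|/, $line, 2;    # just the key for now
    next unless delete $need{$key};      # skip unwanted keys, forget found ones
    # ... do the full split, build the record, emit the pdf ...
    last unless %need;                   # all requested keys handled, stop reading
}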

uri
 

Uri Guttman

JG> No need for all that. You could create a hash of the keys passed in
JG> via ARGV.

JG> my %ids = map { $_ => 1 } @ARGV;

JG> Then test if the key is one you're interested in:

JG> while(<IN>)
JG> {
JG> my ($key, $first, $middle, $last ...) = split /\|/;
JG> next unless $ids{ $key };
JG> ...

same idea i had but you didn't add in deleting found keys so you can
exit early.

also no need to do a full split on the line unless you know it was in
the hash. only split after you find a needed line. you can easily grab
the key from the front of each line as it comes in.

uri
 

ccc31807

Better first ask yourself if there really is a problem. "Thousands" of
records sounds to me like peanuts, and very small peanuts at that.

You are right about that. Printing the PDFs takes far more time than
creating the hash in memory, and even if creating the full hash took
as much as a second it would be acceptable. My concern was really more
theoretical: why create a hash of some 50K elements when you only need
three?

YMMV, and the more you do in the loop the longer it takes. But still, the
seconds (at most!) you might shave off aren't worth your programmer time.

That's worth a smiley! I could just create the individual documents as
I have a modified script that will do just that. Again, it offends my
sense of frugality.

That said, there isn't a really good way to optimize it either. Only if
you run dozens of runs with the same input file might it make sense to
either create an index file, put it in a database, or read it once and
store the hash with the Storable module for fast rereading. (And all
these solutions amount to the same thing: use an index on disk.)

Agreed. I don't have that much experience in development, and there
isn't a real functional need for optimization.

Again, thanks for your comments, CC.
 

Dr.Ruud

ccc31807 said:
Printing the PDFs takes far more time than
creating the hash in memory

On the related subject of creating nice PDFs:
we have been using webkit for that for the last few years,
we create many, many thousands a day,
and we are very happy with the results.

Webkit interprets HTML with decent support for CSS,
which makes it really easy to generate the source
from which the PDF will be created.
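
For what it's worth, a minimal sketch of driving such a conversion from
Perl, assuming the wkhtmltopdf command-line tool mentioned later in the
thread (the file names are made up):

use strict;
use warnings;

my $html = 'record_42546.html';   # generated from the record, e.g. via a template
my $pdf  = 'record_42546.pdf';

# shell out to wkhtmltopdf to render the HTML page as a PDF
system('wkhtmltopdf', $html, $pdf) == 0
    or die "wkhtmltopdf failed: $?";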
 

C.DeRykus

  JG> ccc31807 wrote:
  >>> use a real database or even a DBD for a csv (pipe separated is ok)
  >>> file.
  >>
  >> We get the input file dumped on us every other month or so, as an
  >> ASCII file, and use it just once to create the PDFs. We never do any
  >> update, delete, or insert queries, and only a few select queries, so
  >> putting it into a RDB just to print maybe two dozen documents out of
  >> thousands seems like a lot of effort for very little benefit.
  >>
  >>> or you could save a lot of space by reading in each row and only saving
  >>> that text line in a hash with the key (you extract only the key). then
  >>> you can locate the rows of interest, parse out the fields and do the
  >>> usual stuff.
  >>
  >> This is what I thought I was doing. However, it occurs to me that I
  >> can use a counter initially set to the size of @ARGV, decrement it for
  >> every match, and exit when the counter reaches zero.

  JG> No need for all that. You could create a hash of the keys passed in
  JG> via ARGV.

  JG> my %ids = map { $_ => 1 } @ARGV;

  JG> Then test if the key is one you're interested in:

  JG> while(<IN>)
  JG> {
  JG>   my ($key, $first, $middle, $last ...) = split /\|/;
  JG>   next unless $ids{ $key };
  JG>   ...

same idea i had but you didn't add in deleting found keys so you can
exit early.

also no need to do a full split on the line unless you know it was in
the hash. only split after you find a needed line. you can easily grab
the key from the front of each line as it comes in.

A match with \G and /gc is an alternative
to avoid a split() of the whole line:

while (<IN>)
{
my ($key) = m{\G(\d+)\|}gc;
next unless defined $key and $ids{ $key };
my @rest = m{\|\G([^|]+)}gc;
...
}

A regex and split() may not be as
efficient, but it is much easier:

while (<IN>)
{
my ($key) = /^(\d+)/;
next unless...
my( undef, @rest ) = split (/\|/, $_ );
...
}
 

Peter J. Holzer

On the related subject of creating nice PDFs:
we are using webkit for that for the last few years,
we create many-many thousands a day,
and we are very happy with the results.

Sounds interesting. Which perl module do you use (there are several on
CPAN, but the descriptions don't look promising)?

hp
 

Peter J. Holzer

Not a module, per se, but I've had success with wkhtmltopdf. See
http://code.google.com/p/wkhtmltopdf/ for more info.


Thanks, but after playing with it for a bit I found two problems:

1) It pretends to be a screen device, not a printing device (so for a
stylesheet which contains both @media print and @media screen sections
it chooses the wrong ones).
2) It sometimes makes a pagebreak in the middle of a line (so the upper
half of the line is on page 1 and the lower half of the line is on
page 2).

It looks like the tool renders the page the same way as a browser on
screen and then cuts the result into pages.

hp
 

Dr.Ruud

Peter said:
Thanks, but after playing with it for a bit I found two problems:

1) It pretends to be a screen device, not a printing device (so for a
stylesheet which contains both @media print and @media screen sections
it chooses the wrong ones).
2) It sometimes makes a pagebreak in the middle of a line (so the upper
half of the line is on page 1 and the lower half of the line is on
page 2).

It looks like the tool renders the page the same way as a browser on
screen and then cuts the result into pages.

This should help:
--print-media-type
"page-break-inside: avoid;"
http://www.smashingmagazine.com/2007/02/21/printing-the-web-solutions-and-techniques/
http://code.google.com/p/wkhtmltopdf/issues/detail?id=9
http://code.google.com/p/wkhtmltopdf/issues/detail?id=57
http://search.cpan.org/~tbr/WKHTMLTOPDF-0.02/lib/WKHTMLTOPDF.pm
 

Peter J. Holzer

This should help:
--print-media-type

That was the option I was looking for. I guess I didn't expect an
option which I consider extremely important (in fact, I think it
should be the default) to be hidden under "less common command
switches".

"page-break-inside: avoid;"

I see that I wasn't clear enough about what I meant by "a pagebreak in
the middle of a line", so some screenshots may help:

http://www.hjp.at/junk/ss-wkhtmltopdf1.png
http://www.hjp.at/junk/ss-wkhtmltopdf2.png

As you can see, the last line of the page is split *horizontally*
slightly above the baseline in both cases - the descenders appear at the
top of the next page. That's clearly a bug and not something
"page-break-inside: avoid;" is supposed to fix. "page-break-inside:
avoid;" avoids pagebreaks within an element, e.g. a paragraph, but that
isn't the problem here.


Nice collection of links, although I'm not sure why you mention them.

Yup, my problem number 2 is mentioned in comment 4 here. I already found
that before posting.

Different problem.

Ouch! My eyes! Couldn't he have named the thing WkHTMLtoPDF or
WkHtmlToPdf, or something? ;-).

hp
 

Dr.Ruud

Peter said:
That was the option I was looking for. I guess I didn't expect to find
an option which I consider extremely important (in fact, I think it
should be the default) to be hidden under "less common command
switches".

Yes, I also don't understand why "they" did it like that; it makes it
all unnecessarily harder to understand.
But it still all works reasonably well; we create many thousands of
unique PDFs daily with it.

"page-break-inside: avoid;"

I see that I wasn't clear enough about what I meant by "a pagebreak in
the middle of a line" [...]
the last line of the page is split *horizontally*
slightly above the baseline

That's what I understood, and I assumed that you could prevent that by
giving the element that attribute. BTW, the default page size is A4.


The manual says:

<quote>
Page Breaking

The current page breaking algorithm of WebKit leaves much to be
desired. Basically webkit will render everything into one long page,
and then cut it up into pages. This means that if you have two columns
of text where one is vertically shifted by half a line, then webkit
will cut a line into to pieces display the top half on one page, and
the bottom half on another page. It will also break image in two and so
on. If you are using the patched version of QT you can use the CSS
page-break-inside property to remedy this somewhat. There is no easy
solution to this problem, until this is solved try organising your HTML
documents such that it contains many lines on which pages can be cut
cleanly.

See also:
<http://code.google.com/p/wkhtmltopdf/issues/detail?id=9>,
<http://code.google.com/p/wkhtmltopdf/issues/detail?id=33> and
<http://code.google.com/p/wkhtmltopdf/issues/detail?id=57>.
</quote>

Fonts (and Qt's QPrinter::ScreenResolution) also can cause issues:
http://code.google.com/p/wkhtmltopdf/issues/detail?id=72
 
