Problem parsing from a pipe

January Weiner

Hello there,

I have a small problem, which I can actually solve quite easily, but I am
looking for a more elegant solution.

I am reading from a pipe
open( my $if, "$cmd|" ) or die "Cannot run $cmd: $!\n" ;

The output from this command is a series of records. Each record starts
with a recognizable line (e.g. I could catch it with /^Name=(\S+)/), but
does not have a "record end" marker. That is, I can only tell that a record
ended when a new record starts or when EOF is reached.

Now, I want to process the records one by one in one subroutine, calling
another one ("parse_record") to read exactly one record and return it:

while( my $record = parse_record( $if ) ) {

# ... do something with $record

}

Normally, when reading a regular file, I do something like this in the
parse_record() function (see the example code below[1]): whenever I read a
line from $if, I store the position in the file returned by tell(); if I
find the beginning of the next record, I seek back to the position prior to
the current line and return the current record.

Unfortunately, I can't seek in a pipe. What I do instead is return two
records (the "current" one, completely read, and the "next" one, a stub
from parsing the "record start" line), and then pass the "next record"
information back to the parse_record() subroutine.

This is not elegant, as I would like the while() loop above to be
completely unaware of the details of the parsing (e.g. I want to use it
with different parsers and file types).

One other solution that I was thinking of is to store this "next record"
line in a static variable. For example, I could make the whole parser OO
and store this "buffer" in a private variable (a sketch along these lines
follows the example code below).

Any other thing that I could do?

j.

[1] Example code 1:

sub parse_record {
    my ( $if ) = @_ ;

    my $cur_record ;
    my $fpos ;

    while( <$if> ) {
        if( /^Name=(\S+)/ ) {
            if( $cur_record ) {
                # We already have a record: rewind so that the next call
                # to parse_record() re-reads this "Name=" line. We know
                # $fpos is defined, because we have a previous record.
                seek $if, $fpos, 0 ;
                return $cur_record ;
            }

            $cur_record = { name => $1 } ;
            $fpos = tell $if ;
            next ;
        }

        next unless $cur_record ;
        $fpos = tell $if ;    # remember the position after this line

        # read the rest of the record here...
    }

    return $cur_record ;
}
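
A pipe-friendly variant of parse_record() along the lines of the "private
buffer" idea above might look like the following. This is a minimal sketch,
not the poster's actual code; the class name, next_record(), and the field
names are illustrative. Instead of seek()ing back, the parser keeps the
lookahead line in a one-line pushback buffer, so it works on pipes as well
as on files:

    package RecordParser;

    sub new {
        my ( $class, $fh ) = @_;
        # keep the filehandle and a one-line pushback buffer privately
        return bless { fh => $fh, pending => undef }, $class;
    }

    sub next_record {
        my ( $self ) = @_;
        my $cur_record;

        # take the buffered lookahead line first, then read normally
        while ( defined( my $line = delete( $self->{pending} )
                                  // readline( $self->{fh} ) ) ) {
            if ( $line =~ /^Name=(\S+)/ ) {
                if ( $cur_record ) {
                    # a new record starts: stash this line for the next
                    # call instead of seek()ing back, and return the
                    # finished record
                    $self->{pending} = $line;
                    return $cur_record;
                }
                $cur_record = { name => $1 };
                next;
            }
            next unless $cur_record;
            push @{ $cur_record->{lines} }, $line;   # rest of the record
        }

        return $cur_record;    # the last record, or undef at EOF
    }

The driving while() loop then stays unaware of the parsing details:

    my $parser = RecordParser->new( $if );
    while ( my $record = $parser->next_record ) {
        # ... do something with $record
    }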
 
Eric Pozharski

January Weiner said:
I have a small problem, which I can actually solve quite easily, but I am
looking for a more elegant solution.

I am reading from a pipe
open( my $if, "$cmd|" ) or die "Cannot run $cmd: $!\n" ;

Think 3-args B<open>, it's safer.

January Weiner said:
The output from this command is a series of records. Each record starts
with a recognizable line (e.g. I could catch it with /^Name=(\S+)/), but
does not have a "record end" marker. That is, I can only tell that a record
ended when a new record starts or when EOF is reached.

F<perlsyn>, "Loop Control", 3rd paragraph -- B<redo>. The section has a
clear example of usage.
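
For illustration, the B<redo> trick from that section of F<perlsyn>,
applied to this record format, might look like the sketch below. Note that
it folds the processing into the read loop rather than keeping a separate
parse_record(); process_record() is a hypothetical stand-in for the real
handling:

    my $cur_record;
    while ( <$if> ) {
        if ( /^Name=(\S+)/ ) {
            if ( $cur_record ) {
                process_record( $cur_record );   # hypothetical handler
                undef $cur_record;
                redo;   # re-run the body with the same line still in $_
            }
            $cur_record = { name => $1 };
            next;
        }
        next unless $cur_record;
        push @{ $cur_record->{lines} }, $_;   # rest of the record
    }
    process_record( $cur_record ) if $cur_record;   # the final record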

"Loop Control" said:
Now, I want to process the records one by one in one subroutine, calling
another one ("parse_record") to read exactly one record and return it:

If you could weaken that requirement

use File::Slurp;

and then C<m{^Name=(.+?)(?=^Name=)}smg>.
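
Spelled out, that slurp-and-match might look like the sketch below ($file
is a hypothetical path variable). Note the added C<|\z> in the lookahead,
without which the last record, having no following C<Name=> line, would
never match:

    use File::Slurp;

    my $text = read_file( $file );
    # /m: ^ matches at line starts; /s: . crosses newlines; /g: iterate
    while ( $text =~ m{^Name=(\S+)(.*?)(?=^Name=|\z)}smg ) {
        my ( $name, $body ) = ( $1, $2 );
        # ... one complete record per iteration ...
    }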

p.s. And, obviously, what Ben suggested.

*CUT*
 
January Weiner

Eric Pozharski said:
Think 3-args B<open>, it's safer.

Why?

Eric Pozharski said:
F<perlsyn>, "Loop Control", 3rd paragraph -- B<redo>. The section has a
clear example of usage.

Yes, that is very interesting, thanks.

Eric Pozharski said:
If you could weaken that requirement

use File::Slurp;

No. Firstly, the files are quite large. Secondly, I want to monitor the
progress of the parser.

Thanks.

j.
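
As an aside on the progress-monitoring requirement: when the input is a
seekable file rather than a pipe, tell() against the file size gives a
cheap progress gauge. A minimal sketch, with $file as a hypothetical path:

    my $size = -s $file or die "Cannot stat $file\n";
    open( my $if, '<', $file ) or die "Cannot open $file: $!\n";
    while ( my $record = parse_record( $if ) ) {
        printf STDERR "\r%5.1f%%", 100 * tell( $if ) / $size;
        # ... do something with $record
    }
    print STDERR "\n";

On a pipe there is no total size to compare against, but printing a plain
record counter every N records serves the same purpose.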
 
Eric Pozharski


January Weiner said:
Why?

perl -wle '$cmd = q|rm -rf /; true|; open $fh, "$cmd|" or die $!'
Name "main::fh" used only once: possible typo at -e line 1.
rm: cannot remove root directory `/'

*SKIP*

January Weiner said:
No. Firstly, the files are quite large. Secondly, I want to monitor the
progress of the parser.

Define "quite large". As of second, I think, it's possible to go
through pattern one match per time (but not at 3AM).

And a piece of advice. If you're going to stay here, anytime think of
C<use File::Slurp;>, and find a good reason against. Because sooner or
later, you'll be adviced of it anyway.
 
January Weiner

Eric Pozharski said:
perl -wle '$cmd = q|rm -rf /; true|; open $fh, "$cmd|" or die $!'
Name "main::fh" used only once: possible typo at -e line 1.
rm: cannot remove root directory `/'

Maybe I'm not too bright, but I don't get it :( Would you mind being a
little more verbose? I mean, you can do the same with 3-arg open:

perl -wle '$cmd = q|rm -rf /; true|; open( $fh, "-|", "$cmd" ) or die $!'

Where is the difference? I understand that using open( $fh, $file )
instead of open( $fh, "<$file" ) can in some cases lead to problems (if
$file becomes ">something"), but in this particular case we are reading
from a pipe anyway, and if $cmd has been manipulated (and we were
careless and haven't checked it) then the three-arg version will not be of
any help.

And anyway, I have always thought that preventing malicious input from the
users should happen on an altogether different level, starting with at
least using taint mode -- am I wrong?

Eric Pozharski said:
Define "quite large". As for the second, I think it's possible to go
through the pattern one match at a time (but not at 3AM).

$ du -hs human_est.out
1.2G human_est.out
$ du -hs nr
3.5G nr
Eric Pozharski said:
And a piece of advice. If you're going to stay here, any time you think of
C<use File::Slurp;>, find a good reason against it. Because sooner or
later, you'll be advised of it anyway.

Maybe, but this is not going to happen. I want to stop reading a huge file
after I have collected all the information that I need from it - why should
I slurp 3.5 GB if I have what I need after reading 10k?

j.
 
Eric Pozharski

January Weiner said:
Maybe I'm not too bright, but I don't get it :( Would you mind being a
little more verbose? I mean, you can do the same with 3-arg open:

perl -wle '$cmd = q|rm -rf /; true|; open( $fh, "-|", "$cmd" ) or die $!'

Where is the difference? I understand that using open( $fh, $file )
instead of open( $fh, "<$file" ) can in some cases lead to problems (if
$file becomes ">something"), but in this particular case we are reading

Think C<"|something">. In C<"$cmd|"> that would result in a "Can't open
bidirectional pipe"... warning, while on its own C<"|something"> opens a
pipe for writing: whatever follows would be run via the shell, with your
output just going through to F<something>.

January Weiner said:
from a pipe anyway, and if $cmd has been manipulated (and we were
careless and haven't checked it) then the three-arg version will not be of
any help.

Forget what I've said. 3-arg B<open> is in no way safer, in this regard.
Splitting on spaces wouldn't help in all cases (though, I suppose, it
would in most). F<perlipc> suggests going B<fork>/B<exec> to avoid shell
invocation. That is what I do.
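
The F<perlipc> route looks roughly like this -- a sketch, assuming $cmd and
@args are already split into a program name and its argument list:

    # list-form pipe open: no shell is involved at all
    open( my $if, '-|', $cmd, @args ) or die "Cannot run $cmd: $!\n";

    # or spelled out with an explicit fork, as in perlipc:
    my $pid = open( my $fh, '-|' );   # forks; child's STDOUT feeds $fh
    defined $pid or die "Cannot fork: $!\n";
    if ( $pid == 0 ) {                # child: replace ourselves with $cmd
        exec $cmd, @args or die "Cannot exec $cmd: $!\n";
    }
    # parent reads from $fh as usual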

So, let me rephrase: 3-arg B<open> avoids misinterpreting redirection
metacharacters as mode specs, while staying with the shell for pipes.
Then 3-arg B<open>, used consistently (or constantly), is just a matter
of habit.

I've trusted Perl that much. What a sad day.
January Weiner said:
And anyway, I have always thought that preventing malicious input from the
users should happen on an altogether different level, starting with at
least using taint mode -- am I wrong?

No and yes. (Maybe I'm wrong, again.) A tainted string just indicates
that it wasn't preprocessed, while the amount of preprocessing is left to
the coder's option. I haven't fought taintedness a lot. A quite simple
(but non-trivial, in my case) regexp removes taintedness. Does it make a
string safe? Who knows; it depends on the task.
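
For context, that regexp untainting always goes through a capture group --
a minimal sketch, with a hypothetical whitelist pattern:

    #!/usr/bin/perl -T
    # under -T, external input stays tainted until it passes through
    # a capture; the whitelist below is illustrative, and the result
    # is only as safe as the pattern is strict
    my $input = shift @ARGV // '';
    my ( $safe ) = $input =~ m{\A([\w./-]+)\z}
        or die "Suspicious input\n";
    # $safe is untainted now; whether it is truly safe depends on the task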
January Weiner said:
$ du -hs human_est.out
1.2G human_est.out
$ du -hs nr
3.5G nr

(I think the sizes in the F</proc> output below are in bytes.)

perl -wle '
open $fh, "<", "/proc/$$/stat";
print +(split / /, <$fh>)[22]'
5775360

time perl -wle '
$x = " " x (512 * 1E6);
open $fh, "<", "/proc/$$/stat";
print +(split / /, <$fh>)[22]'
Name "main::x" used only once: possible typo at -e line 2.
1029783552

real 1m51.687s
user 0m3.588s
sys 0m7.068s

time perl -wle '
$x = " " x (256 * 1E6);
open $fh, "<", "/proc/$$/stat";
print +(split / /, <$fh>)[22]'
Name "main::x" used only once: possible typo at -e line 2.
517783552

real 0m11.334s
user 0m1.788s
sys 0m1.668s

I have only 512MB of real memory. However, looking at the B<time> output
I should agree that loading even 1.2G (virtual memory provided) would be
quite exciting. And remember, that's a string, not an array.
January Weiner said:
Maybe, but this is not going to happen. I want to stop reading a huge file
after I have collected all the information that I need from it - why should
I slurp 3.5 GB if I have what I need after reading 10k?

I wasn't talking about quitting the file. I was talking about quitting
c.l.p.m.
 
Peter J. Holzer

January Weiner said:
Maybe I'm not too bright, but I don't get it :( Would you mind being a
little more verbose? I mean, you can do the same with 3-arg open:

perl -wle '$cmd = q|rm -rf /; true|; open( $fh, "-|", "$cmd" ) or die $!'

Where is the difference?

None that I can see at the moment. But there is a difference between

my $file = get_filename_from_user();
open $fh, "$cmd $file|" or die;

and

my $file = get_filename_from_user();
open $fh, '-|', $cmd, $file or die;

In the first case, if the user enters "/dev/null; rm -rf /" as the file
name, the command "rm -rf /" will be executed, while in the second case
the single argument "/dev/null; rm -rf /" will be passed to $cmd (which
will probably complain that there is no file with this name).

Of course, if the user already has a shell and invokes your script
interactively, that doesn't make any difference, either: The user can
simply invoke "rm -rf /" from the shell with exactly the same result.

However, it is a good idea to err on the side of caution, because some
time later you might want to reuse your code in a library, CGI script
or cron job, and then you must be paranoid to avoid having your system
wiped out by accident or malice. So get into the practice of being
paranoid!

January Weiner said:
And anyway, I have always thought that preventing malicious input from the
users should happen on an altogether different level, starting with at
least using taint mode -- am I wrong?

That, too. But taint mode is only a tool which helps you to detect
untrusted input. It isn't foolproof.
January Weiner said:
$ du -hs human_est.out
1.2G human_est.out
$ du -hs nr
3.5G nr


January Weiner said:
Maybe, but this is not going to happen. I want to stop reading a huge file
after I have collected all the information that I need from it - why should
I slurp 3.5 GB if I have what I need after reading 10k?

You shouldn't. There are situations where it is a good idea to read the
whole file into memory and there are situations where it isn't. Clearly,
if you only need the first 10k of a 3.5GB file, it would be insane to read
all of it into memory. Even if you need to read the whole file, it is
probably better to read it line by line. But that depends on your data
and what information you need to extract. "Always slurp" is just as
idiotic as "always read line by line".

hp
 
