Problem in parsing from a pipe

January Weiner · Apr 21, 2009

Hello there,

I have a small problem, which I can actually solve quite easily, but I am
looking for a more elegant solution.

I am reading from a pipe
open( my $if, "$cmd|" ) or die "Cannot run $cmd: $!\n" ;

The output from this command is a series of records. Each record starts
with a recognizable line (e.g. I could catch it with /^Name=(\S+)/), but
does not have a "record end" marker. That is, I can only tell that a record
ended when a new record starts or when EOF is reached.

Now, I want to process the records one by one in one subroutine, calling
another one ("parse_record") to read exactly one record and return it:

while( my $record = parse_record( $if ) ) {
    # ... do something with $record
}

Normally, when reading a regular file, I do something like this in the
parse_record() function (see below[1] for example code): whenever I read a
line from $if, I store the position in file returned by tell() ; if I find
the beginning of the next record, I seek to the position prior to the
current line and return the current record.

Unfortunately, I can't seek in a pipe. What I do instead is to return two
records ( the "current one", completely read, and the "next one", with a
stub from parsing the "record start" line), and then pass the "next record"
information to the parse_record() subroutine.

This is not elegant, as I would like the while() loop above to be
completely unaware of the details of the parsing (e.g. I want to use it with
different parsers and file types).

One other solution that I was thinking of is to store this "next record"
line in a static variable. For example, I could make the whole parser OO
and store this "buffer" in a private variable.
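
The "private buffer" idea does not need a full class; a closure is enough. What follows is only a minimal sketch, assuming records look like the /^Name=(\S+)/ example above. The name make_record_reader, the "lines" field and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator that hands out one record per call.
# The closure keeps the already-read "Name=" line of the next record in
# $pending, so nothing has to be pushed back into the pipe.
sub make_record_reader {
    my ($fh) = @_;
    my $pending;    # start line of the next record, if already read

    return sub {
        my $record;
        while ( defined( my $line = defined $pending ? $pending : <$fh> ) ) {
            undef $pending;
            if ( $line =~ /^Name=(\S+)/ ) {
                if ($record) {      # next record begins: stash line and stop
                    $pending = $line;
                    return $record;
                }
                $record = { name => $1, lines => [] };
                next;
            }
            next unless $record;    # skip anything before the first record
            chomp $line;
            push @{ $record->{lines} }, $line;
        }
        return $record;             # last record, or undef once input is done
    };
}

# Usage with an in-memory handle standing in for the pipe.
open my $demo, '<', \"Name=a\nx=1\nName=b\ny=2\n" or die "open: $!";
my $next_record = make_record_reader($demo);
while ( my $record = $next_record->() ) {
    print "$record->{name}: @{ $record->{lines} }\n";
}
```

The calling loop only ever sees complete records, so it can stay unaware of how the parser detects record boundaries.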

Any other thing that I could do?

j.

[1] Example code 1:

sub parse_record {
    my ( $if ) = @_ ;

    my $cur_record ;
    my $fpos ;

    while( <$if> ) {
        if( /^Name=(\S+)/ ) {
            if( $cur_record ) { # we already have a record defined
                # rewind so that the next call of parse_record sees
                # this line again; we know $fpos is defined, because
                # we have a previous record
                seek $if, $fpos, 0 ;
                return $cur_record ;
            }

            $cur_record = { name => $1 } ;
            $fpos = tell $if ;
            next ;
        }

        next unless $cur_record ;
        $fpos = tell $if ; # position just after the current line

        # read the rest of the record here...
    }

    return $cur_record ;
}

Eric Pozharski · Apr 22, 2009

I have a small problem, which I can actually solve quite easily, but I am
looking for a more elegant solution.

I am reading from a pipe
open( my $if, "$cmd|" ) or die "Cannot run $cmd: $!\n" ;

Think 3-args B<open>, it's safer.

The output from this command is a series of records. Each record starts
with a recognizable line (e.g. I could catch it with /^Name=(\S+)/), but
does not have a "record end" marker. That is, I can only tell that a record
ended when a new record starts or when EOF is reached.

F<perlsyn>, "Loop Control", 3rd paragraph -- B<redo>. The section has
clear example of usage.
Now, I want to process the records one by one in one subroutine, calling
another one ("parse_record") to read exactly one record and return it:

If you could weaken that requirement

use File::Slurp;

and then C<m{^Name=(.+?)(?=^Name=)}smg>.
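
A small sketch of how that pattern behaves on a string that is already in memory (File::Slurp itself is left out so the example runs with core Perl only). The sample data and the \z variant are assumptions added for illustration, not part of the suggestion above.

```perl
use strict;
use warnings;

my $text = "Name=a\nx=1\nName=b\ny=2\n";

# Pattern as posted: each match needs another "Name=" line after it,
# so the final record is never matched.
my @as_posted = $text =~ m{^Name=(.+?)(?=^Name=)}smg;

# Variant (an assumption, not from the thread): also accept end of input.
my @with_eof = $text =~ m{^Name=(.+?)(?=^Name=|\z)}smg;

printf "as posted: %d record(s), with \\z: %d record(s)\n",
    scalar @as_posted, scalar @with_eof;
# prints "as posted: 1 record(s), with \z: 2 record(s)"
```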

p.s. And, obviously, what Ben suggested.

*CUT*

January Weiner · Apr 22, 2009

Think 3-args B<open>, it's safer.
Why?

F<perlsyn>, "Loop Control", 3rd paragraph -- B<redo>. The section has
clear example of usage.

Yes, that is very interesting, thanks.
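
For the record, one way B<redo> could be applied here, as a rough sketch only: keep the reading in a single loop and hand each finished record to a callback. The names each_record and $handle, and the "lines" field, are hypothetical. Note that this turns the control flow around (the parser drives the processing), so it is not a drop-in replacement for a parse_record() that returns one record per call.

```perl
use strict;
use warnings;

# Hypothetical driver: $handle is a callback that receives each finished record.
sub each_record {
    my ( $fh, $handle ) = @_;
    my $record;

    while ( defined( my $line = <$fh> ) ) {
        if ( $line =~ /^Name=(\S+)/ ) {
            if ($record) {
                $handle->($record);    # the previous record is complete
                undef $record;
                redo;                  # same line again, without a new read
            }
            $record = { name => $1, lines => [] };
            next;
        }
        next unless $record;           # skip anything before the first record
        chomp $line;
        push @{ $record->{lines} }, $line;
    }
    $handle->($record) if $record;     # flush the last record at EOF
    return;
}

# Usage with an in-memory handle standing in for the pipe.
open my $demo, '<', \"Name=a\nx=1\nName=b\ny=2\n" or die "open: $!";
each_record( $demo, sub { print "got record $_[0]{name}\n" } );
```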

If you could weaken that requirement

use File::Slurp;

No. Firstly, files are quite large. Secondly, I want to monitor the
progress of the parser.

Thanks.

j.

Eric Pozharski · Apr 24, 2009

Why?

perl -wle '$cmd = q|rm -rf /; true|; open $fh, "$cmd|" or die $!'
Name "main::fh" used only once: possible typo at -e line 1.
rm: cannot remove root directory `/'

*SKIP*

No. Firstly, files are quite large. Secondly, I want to monitor the
progress of the parser.

Define "quite large". As for the second point, I think it's possible to go
through the pattern one match at a time (but not at 3AM).
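
Going through a pattern one match at a time can be sketched with m//g in scalar context, where pos() gives a crude progress indicator. The pattern below is a variant written for this illustration (it captures the name and the body separately and accepts end of input), and the data is made up. It still assumes the whole input is already in memory.

```perl
use strict;
use warnings;

my $text = "Name=a\nx=1\nName=b\ny=2\nName=c\nz=3\n";
my @seen;

# In scalar context, m//g resumes where the previous match ended.
while ( $text =~ m{^Name=(\S+)\n(.*?)(?=^Name=|\z)}smg ) {
    my ( $name, $body ) = ( $1, $2 );
    push @seen, $name;
    printf "%s: matched up to offset %d of %d\n",
        $name, pos($text), length $text;
    last if $name eq 'b';    # stop as soon as we have what we need
}
```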

And a piece of advice. If you're going to stay here, always think of
C<use File::Slurp;> and find a good reason against it, because sooner or
later you'll be advised to use it anyway.

January Weiner · Apr 27, 2009

perl -wle '$cmd = q|rm -rf /; true|; open $fh, "$cmd|" or die $!'
Name "main::fh" used only once: possible typo at -e line 1.
rm: cannot remove root directory `/'

Maybe I'm not too bright, but I don't get it. Would you mind being a
little more verbose? I mean, you can do the same with 3 arg open:

perl -wle '$cmd = q|rm -rf /; true|; open( $fh, "-|", "$cmd" ) or die $!'

Where is the difference? I understand that using open( $fh, $file )
instead of open( $fh, "<$file" ) can in some cases lead to problems (if
$file becomes ">something"), but in this particular case we are reading
from a pipe anyways, and if the $cmd has been manipulated (and we were
careless and haven't checked it) then the three-arg version will not be of
any help.

And anyway, I have always thought that preventing malicious input from the
users should be happening on an altogether different level, starting with
at least using taint mode -- am I wrong?

Define "quite large". As for the second point, I think it's possible to go
through the pattern one match at a time (but not at 3AM).

$ du -hs human_est.out
1.2G human_est.out
$ du -hs nr
3.5G nr

And a piece of advice. If you're going to stay here, always think of
C<use File::Slurp;> and find a good reason against it, because sooner or
later you'll be advised to use it anyway.

Maybe, but this is not going to happen. I want to stop reading a huge file
after I have collected all the information that I need from it - why should
I slurp 3.5 gb if I have what I need after reading 10k?

j.

Eric Pozharski · Apr 28, 2009

Maybe I'm not too bright, but I don't get it. Would you mind being a
little more verbose? I mean, you can do the same with 3 arg open:

perl -wle '$cmd = q|rm -rf /; true|; open( $fh, "-|", "$cmd" ) or die $!'

Where is the difference? I understand that using open( $fh, $file )
instead of open( $fh, "<$file" ) can in some cases lead to problems (if
$file becomes ">something"), but in this particular case we are reading

Think C<"|something">. That would result in a "Can't open bidirectional
pipe"... warning. While opening C<"something|"> pipe for writing.
What would be run via shell with output of F<something> just going
through.

from a pipe anyways, and if the $cmd has been manipulated (and we were
careless and haven't checked it) then the three-arg version will not be of
any help.

Forget what I've said. 3-arg B<open> is in no way safer in this regard.
Splitting on spaces wouldn't help in all cases (though, I suppose, in
most). F<perlipc> suggests going B<fork>/B<exec> to avoid shell
invocation. That's what I do.
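
A minimal sketch of that B<fork>/B<exec> approach, along the lines of the "Safe Pipe Opens" section of F<perlipc>. The helper name open_pipe_from is made up, and $^X (the running perl) merely stands in for a real command so the example has no side effects.

```perl
use strict;
use warnings;

# Hypothetical helper: run @command without a shell and return a read handle.
sub open_pipe_from {
    my (@command) = @_;

    my $pid = open( my $fh, '-|' );    # implicit fork; child's STDOUT is the pipe
    die "Cannot fork: $!\n" unless defined $pid;

    if ( $pid == 0 ) {                 # child: replace ourselves with the command
        exec { $command[0] } @command
            or die "Cannot exec $command[0]: $!\n";
    }
    return $fh;                        # parent: read the command's output
}

# The last argument contains shell metacharacters, but no shell ever sees it.
my $fh = open_pipe_from( $^X, '-e', 'print "arg: $ARGV[0]\n"', 'a; echo b' );
print while <$fh>;
close $fh or warn "Command failed: $?\n";
```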

So, let me rephrase: 3-arg B<open> avoids misinterpreting redirection
metacharacters as mode specs, while staying with the shell for pipes. Then --
using 3-arg B<open> consistently (or constantly) is just a matter of
habit.

I've trusted Perl that much. What a sad day.

And anyway, I have always thought that preventing malicious input from the
users should be happening on an altogether different level, starting with
at least using taint mode -- am I wrong?

No and yes. (Maybe I'm wrong again.) A tainted string just indicates
that it wasn't preprocessed, while the amount of preprocessing is left to
the coder. I haven't fought taintedness a lot. A quite simple (but
non-trivial, in my case) regexp removes taintedness. Does it make a
string safe? Who knows, it depends on the task.
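
To illustrate the regexp part: under taint mode (perl -T), a value captured from a successful match is considered untainted. The sketch below is only an example; the function name and the whitelist of characters are assumptions, and whether such a whitelist is strict enough depends entirely on what the string is used for.

```perl
use strict;
use warnings;

# Hypothetical whitelist: letters, digits, "_", ".", "-" and "/" only.
# Returns the captured (and, under -T, untainted) value, or undef.
sub untaint_filename {
    my ($input) = @_;
    return $input =~ m{\A([\w.\-/]+)\z} ? $1 : undef;
}

for my $candidate ( 'data/human_est.out', '/dev/null; rm -rf /' ) {
    my $clean = untaint_filename($candidate);
    printf "%-22s => %s\n", $candidate,
        defined $clean ? "accepted" : "rejected";
}
```

Note that this particular whitelist still accepts things like "../../etc/passwd", which is exactly the "does it make a string safe?" question.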

$ du -hs human_est.out
1.2G human_est.out
$ du -hs nr
3.5G nr

Define "quite large". (I think, sizes are in bytes).

perl -wle '
open $fh, "<", "/proc/$$/stat";
print +(split / /, <$fh>)[22]'
5775360

time perl -wle '
$x = " " x (512 * 1E6);
open $fh, "<", "/proc/$$/stat";
print +(split / /, <$fh>)[22]'
Name "main::x" used only once: possible typo at -e line 2.
1029783552

real 1m51.687s
user 0m3.588s
sys 0m7.068s

time perl -wle '
$x = " " x (256 * 1E6);
open $fh, "<", "/proc/$$/stat";
print +(split / /, <$fh>)[22]'
Name "main::x" used only once: possible typo at -e line 2.
517783552

real 0m11.334s
user 0m1.788s
sys 0m1.668s

I have only 512Mb real memory. However, looking at the B<time> output I
have to agree that loading even 1.2G (virtual memory provided) would be
quite exciting. Remember, that's a string, not an array.

Maybe, but this is not going to happen. I want to stop reading a huge file
after I have collected all the information that I need from it - why should
I slurp 3.5 gb if I have what I need after reading 10k?

I wasn't about quitting the file. I was about quitting c.l.p.m.

Peter J. Holzer · Apr 28, 2009

Maybe I'm not too bright, but I don't get it. Would you mind being a
little more verbose? I mean, you can do the same with 3 arg open:

perl -wle '$cmd = q|rm -rf /; true|; open( $fh, "-|", "$cmd" ) or die $!'

Where is the difference?

None that I can see at the moment. But there is a difference between

my $file = get_filename_from_user();
open $fh, "$cmd $file|" or die;

and

my $file = get_filename_from_user();
open $fh, '-|', $cmd, $file or die;

In the first case, if the user enters "/dev/null; rm -rf /" as the file
name, the command "rm -rf /" will be executed, while in the second case
the single argument "/dev/null; rm -rf /" will be passed to $cmd (which
will probably complain that there is no file with this name).
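
The difference can be tried out harmlessly. In this sketch (assuming a Unix-like system with /bin/sh and echo), "echo INJECTED" takes the place of "rm -rf /", echo stands in for $cmd in the shell case, and $^X (the running perl) stands in for $cmd in the list case. All of these stand-ins are assumptions for the demo.

```perl
use strict;
use warnings;

my $file = '/dev/null; echo INJECTED';    # hostile "file name"

# String form: the shell parses the string and runs two commands.
open( my $sh, "echo $file|" ) or die "Cannot run shell command: $!\n";
my $via_shell = do { local $/; <$sh> };
close $sh;

# List form: no shell is involved, so the whole string is one argument.
open( my $fh, '-|', $^X, '-e', 'print scalar(@ARGV), " arg: $ARGV[0]\n"', $file )
    or die "Cannot run command: $!\n";
my $via_list = do { local $/; <$fh> };
close $fh;

print "shell form output:\n$via_shell";
print "list form output:\n$via_list";
```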

Of course, if the user already has a shell and invokes your script
interactively, that doesn't make any difference, either: The user can
simply invoke "rm -rf /" from the shell with exactly the same result.

However it is a good idea to err on the side of caution, because some
time later you might want to reuse your code in a library, CGI script
or cron job and then you must be paranoid to avoid having your system
wiped out by accident or malice. So get into the practice of being
paranoid!

And anyway, I have always thought that preventing malicious input from the
users should be happening on an altogether different level, starting with
at least using taint mode -- am I wrong?

That, too. But taint mode is only a tool which helps you to detect
untrusted input. It isn't foolproof.

$ du -hs human_est.out
1.2G human_est.out
$ du -hs nr
3.5G nr

Maybe, but this is not going to happen. I want to stop reading a huge file
after I have collected all the information that I need from it - why should
I slurp 3.5 gb if I have what I need after reading 10k?

You shouldn't. There are situations where it is a good idea to read the
whole file into memory and there are situations where it isn't. Clearly
if you only need the first 10k of a 3.5GB file it would be insane to
read all of it into memory. Even if you need to read the whole file it is
probably better to read it line by line. But that depends on your data
and what information you need to extract. "Always slurp" is just as
idiotic as "always read line by line".

hp
