Newbie question: most efficient way to search fields of this file

M

martin

Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.


I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.


col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
.... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

thanks. martin
 
X

xhoster

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?
What is the most efficient way to do this,

Efficient in your time? (depends on how good you are in every possible
language). Efficient on the computer's time? Probably assembly language.

is there a built in
function in perl that does it,

No, not directly.
how can I perl script this? should I turn the file into an array line
by line,

No. There is no reason to slurp it (if that is what you meant), although
it probably wouldn't hurt much for 1000 lines of 1000 numbers. And
certainly no reason to split every line, including the ones you don't care
about, if that is what you meant.
or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

Something like that, sure.

Assuming the list of rows is in order, something like this:

my $col=4;
my @rowlist=(2,5,10)
$col--; # convert to 0-started array
foreach my $line (@rowlist) {
defined ($_=<$fh>) or die until /^row${line}hdr,/;
chomp;
print "row${line}hdr,", (split/,/)[$col], "\n"
};

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

As someone who doesn't remember how to shell script anything more
interesting than a loop, I'd have to say that, for me at least, Perl is
better than shell scripting for this.

Xho
 
C

Charles DeRykus

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.


I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.


col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

I wouldn't go to shell-hell for this... :)

A new module File::Tablular has lots of functionality for this if
you anticipate more complexity ... or maybe DBD::CSV.

Or, if you need only simple operations Tie::File might help in
framing a data structure that you could expand by splitting the
lines.

Or, if you can't be bothered with all that... something like
this might work:

perl -F, -lane 'push( @{$line[$i++]}, @F );
END{ print @{line[$_]}[0,3] for (0, 1, 4, 9) }' csv_file > out


hth,
 
M

martin

Thanks, I should have clarified one point. with respect to your
question

" Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?"

I used 'row2hdr' simplay as a label, so number 2 there or any other
digit is not the point. The row and column labels are strings, and for
that matter could have been "apple", "orange", "tomato". so what needs
to be done is to reference entries through string labels, without
numerical characters. I don't want to use number in rows and strings
because the position of rows could change; and prefer to index entries
through matching row labels and column labels.

and by efficiency I meant a combination of both, leaning towards time
of execution. of course I was not referring to assembly though.

tx

martin
Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?


Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?
What is the most efficient way to do this,

Efficient in your time? (depends on how good you are in every possible
language). Efficient on the computer's time? Probably assembly language.

is there a built in
function in perl that does it,

No, not directly.
how can I perl script this? should I turn the file into an array line
by line,

No. There is no reason to slurp it (if that is what you meant), although
it probably wouldn't hurt much for 1000 lines of 1000 numbers. And
certainly no reason to split every line, including the ones you don't care
about, if that is what you meant.
or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

Something like that, sure.

Assuming the list of rows is in order, something like this:

my $col=4;
my @rowlist=(2,5,10)
$col--; # convert to 0-started array
foreach my $line (@rowlist) {
defined ($_=<$fh>) or die until /^row${line}hdr,/;
chomp;
print "row${line}hdr,", (split/,/)[$col], "\n"
};

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

As someone who doesn't remember how to shell script anything more
interesting than a loop, I'd have to say that, for me at least, Perl is
better than shell scripting for this.

Xho
 
G

Gunnar Hjalmarsson

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results.

What is the most efficient way to do this,

Which ways have you considered? You can always use the Benchmark module
to compare them...
 
X

Xicheng Jia

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.


I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.


col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

== and also is it best to use perl or unix shell scripting for this.
any
== input appreciated.>

You should handle this by line-mode and it can be easily done on the
command line. I dont think perl will be better than awk though:

perl -F, -anle 'print join"\t", @F[0,3] if $. =~ /^(1|2|5|10)$/'
myfile.csv

awk -F, 'NR ~ /^(1|2|5|10)$/ {print $1,$3}' myfile.csv

Xicheng
 
X

Xicheng Jia

Xicheng said:
martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.


I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.


col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

== and also is it best to use perl or unix shell scripting for this.
any
== input appreciated.>

You should handle this by line-mode and it can be easily done on the
command line. I dont think perl will be better than awk though:

perl -F, -anle 'print join"\t", @F[0,3] if $. =~ /^(1|2|5|10)$/'
myfile.csv
= awk -F, 'NR ~ /^(1|2|5|10)$/ {print $1,$3}' myfile.csv

~~change to $4

In fact, if regex is a consideration factor, 'awk' may be better than
'perl', since awk is using DFA regex engine which is faster than perl's
NFA.

Xicheng
 
A

Anno Siegel

martin said:
Thanks, I should have clarified one point. with respect to your
question

" Do you recognize row 2 because the label is row2hdr, or because it is

I used 'row2hdr' simplay as a label, so number 2 there or any other
digit is not the point. The row and column labels are strings, and for
that matter could have been "apple", "orange", "tomato". so what needs
to be done is to reference entries through string labels, without
numerical characters. I don't want to use number in rows and strings
because the position of rows could change; and prefer to index entries
through matching row labels and column labels.

and by efficiency I meant a combination of both, leaning towards time
of execution. of course I was not referring to assembly though.

You have essentially two problems then (which is good, it's called
problem separation). One is to select the lines to process, the other
is to select from those lines the columns requested.

Assume the requirements in two variables, for instance:

my @lines = qw( row2hdr rowMhdr);
my @cols = qw( col1label colNlabel col2label);

We want to select all lines that begin with one of the given headers.
That is a job for a regular expression, so let's compile one:

my $line_re = do {
my $re = join '|', sort { length( $b) <=> length( $a) } @lines;
qr/^$re/;
};

We sort long alternatives ahead of short ones in case one is a prefix
of another. Now $line_re will match the lines that interest us and no
others.

Column selection cannot be done using strings alone, because the header
strings are nowhere around when we look at a data line. Instead, we
translate the set of column headers into a set of integer indices that
point to the columns we want. For the translation we use the first line
of the data file, containing the actual headers for this set of data:

my @sel = do {
chomp( my $h_line = <DATA>);
my @headers = split /\s*,\s*/, $h_line;
my %col_of_header = map { shift( @headers) => $_ } 0 .. $#headers;
my @bad = grep !defined $col_of_header{ $_}, @cols;
die "bad header(s): @bad" if @bad;
@col_of_header{ @cols};
};

Now we're ready to print the header line and the selected data:

print join( ', ', @cols), "\n";
while ( <DATA> ) {
next unless /$line_re/;
my @recs = split /\s*,\s*/;
print join( ', ', @recs[ @sel]), "\n";
}

__DATA__
col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ...., xxxx, ...., yyyy, zzzz
... ,... , ... , ... , ...., ...
rowMhdr, ...., ..., ... , ..., ...., corner

Anno
 
M

martin

Hi everyone, thanks for all the valuable input. I need to read a bit to
digest. The last reply brought another question, that I will read a bit
on before doing a posting. Thanks again. Martin
 
X

Xicheng Jia

martin said:
Thanks, I should have clarified one point. with respect to your
question

" Do you recognize row 2 because the label is row2hdr, or because it is
=> I used 'row2hdr' simplay as a label, so number 2 there or any other
=> digit is not the point. The row and column labels are strings, and
for
=> that matter could have been "apple", "orange", "tomato". so what
needs
=> to be done is to reference entries through string labels, without
=> numerical characters. I don't want to use number in rows and strings
=> because the position of rows could change; and prefer to index
entries
=> through matching row labels and column labels.

this kind of text searching can be done in Perl like:

perl -ne ' @x = split/\s*,\s*/;
print join ",", @x[0, 3] if $.==1 || /^(row1hdr|row2hdr)/'
myfile.csv

or in awk:

awk -F' *, *' 'NR==1 || $1 ~ /^(row1hdr|row2hdr)/ {print $1,$4}'
myfile.csv

( awk's -F option on the command-line is more powerful than perl's )

Xicheng
and by efficiency I meant a combination of both, leaning towards time
of execution. of course I was not referring to assembly though.

tx

martin
Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?


Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?
What is the most efficient way to do this,

Efficient in your time? (depends on how good you are in every possible
language). Efficient on the computer's time? Probably assembly language.

is there a built in
function in perl that does it,

No, not directly.
how can I perl script this? should I turn the file into an array line
by line,

No. There is no reason to slurp it (if that is what you meant), although
it probably wouldn't hurt much for 1000 lines of 1000 numbers. And
certainly no reason to split every line, including the ones you don't care
about, if that is what you meant.
or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

Something like that, sure.

Assuming the list of rows is in order, something like this:

my $col=4;
my @rowlist=(2,5,10)
$col--; # convert to 0-started array
foreach my $line (@rowlist) {
defined ($_=<$fh>) or die until /^row${line}hdr,/;
chomp;
print "row${line}hdr,", (split/,/)[$col], "\n"
};

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

As someone who doesn't remember how to shell script anything more
interesting than a loop, I'd have to say that, for me at least, Perl is
better than shell scripting for this.

Xho
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,825
Latest member
VernonQuy6

Latest Threads

Top