Newbie question: most efficient way to search fields of this file

martin · Apr 14, 2006

Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
.... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

thanks. martin

xhoster · Apr 14, 2006

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?

What is the most efficient way to do this,

Efficient in your time? (depends on how good you are in every possible
language). Efficient on the computer's time? Probably assembly language.

is there a built in
function in perl that does it,

No, not directly.

how can I perl script this? should I turn the file into an array line
by line,

No. There is no reason to slurp it (if that is what you meant), although
it probably wouldn't hurt much for 1000 lines of 1000 numbers. And
certainly no reason to split every line, including the ones you don't care
about, if that is what you meant.

or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

Something like that, sure.

Assuming the list of rows is in order, something like this:

my $col=4;
my @rowlist=(2,5,10)
$col--; # convert to 0-started array
foreach my $line (@rowlist) {
defined ($_=<$fh>) or die until /^row${line}hdr,/;
chomp;
print "row${line}hdr,", (split/,/)[$col], "\n"
};

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

As someone who doesn't remember how to shell script anything more
interesting than a loop, I'd have to say that, for me at least, Perl is
better than shell scripting for this.

Xho

Charles DeRykus · Apr 14, 2006

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

I wouldn't go to shell-hell for this...

A new module File::Tablular has lots of functionality for this if
you anticipate more complexity ... or maybe DBD::CSV.

Or, if you need only simple operations Tie::File might help in
framing a data structure that you could expand by splitting the
lines.

Or, if you can't be bothered with all that... something like
this might work:

perl -F, -lane 'push( @{$line[$i++]}, @F );
END{ print @{line[$_]}[0,3] for (0, 1, 4, 9) }' csv_file > out

hth,

martin · Apr 14, 2006

Thanks, I should have clarified one point. with respect to your
question

" Do you recognize row 2 because the label is row2hdr, or because it is

2nd line after the header?"

I used 'row2hdr' simplay as a label, so number 2 there or any other
digit is not the point. The row and column labels are strings, and for
that matter could have been "apple", "orange", "tomato". so what needs
to be done is to reference entries through string labels, without
numerical characters. I don't want to use number in rows and strings
because the position of rows could change; and prefer to index entries
through matching row labels and column labels.

and by efficiency I meant a combination of both, leaning towards time
of execution. of course I was not referring to assembly though.

tx

martin

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?

Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

Click to expand...

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?

What is the most efficient way to do this,

Click to expand...

Efficient in your time? (depends on how good you are in every possible
language). Efficient on the computer's time? Probably assembly language.

is there a built in
function in perl that does it,

Click to expand...

No, not directly.

how can I perl script this? should I turn the file into an array line
by line,

Click to expand...

No. There is no reason to slurp it (if that is what you meant), although
it probably wouldn't hurt much for 1000 lines of 1000 numbers. And
certainly no reason to split every line, including the ones you don't care
about, if that is what you meant.

or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

Click to expand...

Something like that, sure.

Assuming the list of rows is in order, something like this:

my $col=4;
my @rowlist=(2,5,10)
$col--; # convert to 0-started array
foreach my $line (@rowlist) {
defined ($_=<$fh>) or die until /^row${line}hdr,/;
chomp;
print "row${line}hdr,", (split/,/)[$col], "\n"
};

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

Click to expand...

As someone who doesn't remember how to shell script anything more
interesting than a loop, I'd have to say that, for me at least, Perl is
better than shell scripting for this.

Xho

Gunnar Hjalmarsson · Apr 14, 2006

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results.

What is the most efficient way to do this,

Which ways have you considered? You can always use the Benchmark module
to compare them...

Xicheng Jia · Apr 14, 2006

martin said:
Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

== and also is it best to use perl or unix shell scripting for this.
any
== input appreciated.>

You should handle this by line-mode and it can be easily done on the
command line. I dont think perl will be better than awk though:

perl -F, -anle 'print join"\t", @F[0,3] if $. =~ /^(1|2|5|10)$/'
myfile.csv

awk -F, 'NR ~ /^(1|2|5|10)$/ {print $1,$3}' myfile.csv

Xicheng

Xicheng Jia · Apr 14, 2006

Xicheng said:
martin said:

Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

What is the most efficient way to do this, is there a built in
function in perl that does it,
how can I perl script this? should I turn the file into an array line
by line, or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

Click to expand...

== and also is it best to use perl or unix shell scripting for this.
any
== input appreciated.>

You should handle this by line-mode and it can be easily done on the
command line. I dont think perl will be better than awk though:

perl -F, -anle 'print join"\t", @F[0,3] if $. =~ /^(1|2|5|10)$/'
myfile.csv

= awk -F, 'NR ~ /^(1|2|5|10)$/ {print $1,$3}' myfile.csv

~~change to $4

In fact, if regex is a consideration factor, 'awk' may be better than
'perl', since awk is using DFA regex engine which is faster than perl's
NFA.

Xicheng

Anno Siegel · Apr 14, 2006

martin said:
Thanks, I should have clarified one point. with respect to your
question

" Do you recognize row 2 because the label is row2hdr, or because it is

I used 'row2hdr' simplay as a label, so number 2 there or any other
digit is not the point. The row and column labels are strings, and for
that matter could have been "apple", "orange", "tomato". so what needs
to be done is to reference entries through string labels, without
numerical characters. I don't want to use number in rows and strings
because the position of rows could change; and prefer to index entries
through matching row labels and column labels.

and by efficiency I meant a combination of both, leaning towards time
of execution. of course I was not referring to assembly though.

You have essentially two problems then (which is good, it's called
problem separation). One is to select the lines to process, the other
is to select from those lines the columns requested.

Assume the requirements in two variables, for instance:

my @lines = qw( row2hdr rowMhdr);
my @cols = qw( col1label colNlabel col2label);

We want to select all lines that begin with one of the given headers.
That is a job for a regular expression, so let's compile one:

my $line_re = do {
my $re = join '|', sort { length( $b) <=> length( $a) } @lines;
qr/^$re/;
};

We sort long alternatives ahead of short ones in case one is a prefix
of another. Now $line_re will match the lines that interest us and no
others.

Column selection cannot be done using strings alone, because the header
strings are nowhere around when we look at a data line. Instead, we
translate the set of column headers into a set of integer indices that
point to the columns we want. For the translation we use the first line
of the data file, containing the actual headers for this set of data:

my @sel = do {
chomp( my $h_line = <DATA>);
my @headers = split /\s*,\s*/, $h_line;
my %col_of_header = map { shift( @headers) => $_ } 0 .. $#headers;
my @bad = grep !defined $col_of_header{ $_}, @cols;
die "bad header(s): @bad" if @bad;
@col_of_header{ @cols};
};

Now we're ready to print the header line and the selected data:

print join( ', ', @cols), "\n";
while ( <DATA> ) {
next unless /$line_re/;
my @recs = split /\s*,\s*/;
print join( ', ', @recs[ @sel]), "\n";
}

__DATA__
col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ...., xxxx, ...., yyyy, zzzz
... ,... , ... , ... , ...., ...
rowMhdr, ...., ..., ... , ..., ...., corner

Anno

martin · Apr 15, 2006

Hi everyone, thanks for all the valuable input. I need to read a bit to
digest. The last reply brought another question, that I will read a bit
on before doing a posting. Thanks again. Martin

Xicheng Jia · Apr 15, 2006

martin said:
Thanks, I should have clarified one point. with respect to your
question

" Do you recognize row 2 because the label is row2hdr, or because it is

=> I used 'row2hdr' simplay as a label, so number 2 there or any other
=> digit is not the point. The row and column labels are strings, and
for
=> that matter could have been "apple", "orange", "tomato". so what
needs
=> to be done is to reference entries through string labels, without
=> numerical characters. I don't want to use number in rows and strings
=> because the position of rows could change; and prefer to index
entries
=> through matching row labels and column labels.

this kind of text searching can be done in Perl like:

perl -ne ' @x = split/\s*,\s*/;
print join ",", @x[0, 3] if $.==1 || /^(row1hdr|row2hdr)/'
myfile.csv

or in awk:

awk -F' *, *' 'NR==1 || $1 ~ /^(row1hdr|row2hdr)/ {print $1,$4}'
myfile.csv

( awk's -F option on the command-line is more powerful than perl's )

Xicheng

and by efficiency I meant a combination of both, leaning towards time
of execution. of course I was not referring to assembly though.

tx

martin

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?

Click to expand...

Hi, I am quite new to perl and need to filter/process a file and print
into another file the processed file's results. I have .csv file that
is created by a perl script. The fomat of the file is
'M' lines or rows, and 'N' columns. the first row and the first column
are headers or labels.
all the fields are comma separated and are numeric values with the
excpetion of header and row columns shown below which are string
values, or labels.

Also the number of rows and columns could be as large as 1000 each.

I need to extract for a given column, example col4hdr, the
corresponding value for some of the rows and then print them in a
different file.

col1label, col2label, col3label, col4label, ..., colNlabel
row1hdr, int1, int2, ..., ... , intN-1
row2hdr, ... , ....,
... ,... , ... , ... , ...., ...
rowMhdr ...., ..., ... , ..., ....

for example: if I were to extract and display the results for column4
and rows 2, 5, and 10, the output should like something like this

col1label col4label
row2hdr field value for 2,4
row5hdr field value for 5,4
row10hdr field value for 10,4

Click to expand...

Do you recognize row 2 because the label is row2hdr, or because it is
2nd line after the header?

What is the most efficient way to do this,

Click to expand...

Efficient in your time? (depends on how good you are in every possible
language). Efficient on the computer's time? Probably assembly language.

is there a built in
function in perl that does it,

Click to expand...

No, not directly.

how can I perl script this? should I turn the file into an array line
by line,

Click to expand...

No. There is no reason to slurp it (if that is what you meant), although
it probably wouldn't hurt much for 1000 lines of 1000 numbers. And
certainly no reason to split every line, including the ones you don't care
about, if that is what you meant.

or simply grep the file line by line for patterns maching
rows2hdr, row5hdr, row10hdr and then count 4 fileds till I extract the
corresponding value and store it for display.

Click to expand...

Something like that, sure.

Assuming the list of rows is in order, something like this:

my $col=4;
my @rowlist=(2,5,10)
$col--; # convert to 0-started array
foreach my $line (@rowlist) {
defined ($_=<$fh>) or die until /^row${line}hdr,/;
chomp;
print "row${line}hdr,", (split/,/)[$col], "\n"
};

and also is it best to use perl or unix shell scripting for this. any
input appreciated.

Click to expand...

As someone who doesn't remember how to shell script anything more
interesting than a loop, I'd have to say that, for me at least, Perl is
better than shell scripting for this.

Xho

Click to expand...

most efficient way to get number of files in a directory	9	Jan 3, 2010
Most efficient way to "pre-grow" a list?	30	Nov 6, 2009
looking for efficient way to parse a file	4	Jan 12, 2008
Manipulate fields	6	Apr 20, 2007
sorting file according to a unicode column	17	May 28, 2014
Accessing the same row of table , other fields - Newbie question	0	Aug 17, 2007
Extraction Fields from a file	9	Jan 12, 2007
Efficient way to sum a product of numbers...	8	Aug 31, 2009

Newbie question: most efficient way to search fields of this file

martin

xhoster

Charles DeRykus

martin

Gunnar Hjalmarsson

Xicheng Jia

Xicheng Jia

Anno Siegel

martin

Xicheng Jia

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads