strategy for parsing text file

ccc31807 · Aug 28, 2009

I've solved this problem, but I'm just curious as to how my betters
would approach this.

The file is a long file, so I have copied only the first seven records
below as an example. The file is from a table with nine fields, all of
which are named in the first nine lines. The key is a five digit
number beginning with either 91 or 92. For each record, sometimes all
fields are populated (like the first, 91709), but normally only the
first four are guaranteed to be populated while the remaining five may
or may not have values. Each datum occupies a line all to itself, and
the file does not contain record separators.

The requirement is to capture the first four fields and write to an
Excel readable file (CSV format).

My solution was pretty dirty and crude, but I'll share it later (and
take the hit for stupidity). My question is how others might approach
the problem. Below are the first seven records of the file and the
column headers.

Thanks, CC.

-------------file below--------------------
Number
BandName
Grade
Branch
Instr
PipingInst
PInstDate
DrumInst
DrumInstDate
91709
87th Cleveland Pipe Band IV
PB4
Ohio Valley
y
Tyler Tagliafero, Great Lakes
01-Mar-09
Drew Donnelly, Great Lakes
01-Mar-09
91068
Adirondack Pipes & Drums
PB5
Northeast
n
91212
Alabama Pipes & Drums
PB4
Southern
n
91801
Albany Police P&D
PB5
Northeast
y
Dan Cole, Oran Mor
01-Mar-09
92033
American Celtic Pipe Band
PB5
Metro
n
91826
Anderson Pipe Band
PB5
Southwest
y
Victor Anderson, Westminster
01-Mar-09
Tim Vermillion, Westminster
01-Mar-09
91802
AOH Pipe & Drum Band
PB5
Northeast
n

ccc31807 · Aug 28, 2009

I will assume that you are absolutely certain that none of the other
field's values will match that specification...
Absolutely!

-------------------
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

while ( <DATA> ) {
    next unless /^9[12]\d\d\d$/; # 5 digits, starts with 91 or 92
    my @record = $_;
    push @record, scalar(<DATA>) for 1..3;
    chomp @record;
    print Dumper \@record;
}

I see ... you can access the file within the while loop by using the
<> in an inner loop. I maybe should have thought of that, but I had to
produce it quickly and didn't want to experiment.

Thanks, and here is the guts of my solution. Pretty crude, but it
worked.

open INFILE, '<', 'bands.txt';
while (<INFILE>)
{
    next unless /\w/;
    print; #debugging
    chomp;
    if (/9[12]\d{3}/)
    {
        $count++;
        $key = $_;
        $flag = 1;
    }
    elsif ($flag == 1)
    {
        $bands{$key}{name} = $_;
        $flag = 2;
    }
    elsif ($flag == 2)
    {
        $bands{$key}{grade} = $_;
        $flag = 3;
    }
    elsif ($flag == 3)
    {
        $bands{$key}{branch} = $_;
        $flag = 0;
    }
}

#print the %bands hash to a .csv file

Steve C · Aug 28, 2009

RedGrittyBrick said:

#!perl
use strict;
use warnings;

my @f;
while (<DATA>) {
    chomp;
    if (/^9[12]\d{3}$/) {
        print join (',', @f), "\n" if @f;
        @f=();
    }
    push @f, $_;
}

__DATA__

I think you are losing the last record.

sln · Aug 29, 2009

This is gnarly too. It's just the inner while if you
can slurp the whole file, but somehow I don't think
you want that.

-sln

Output:

"Number","BandName","Grade","Branch","Instr","PipingInst","PInstDate","DrumInst","DrumInstDate"
"91709","87th Cleveland Pipe Band IV","PB4","Ohio Valley"
"91068","Adirondack Pipes & Drums","PB5","Northeast"
"91212","Alabama Pipes & Drums","PB4","Southern"
"91801","Albany Police P&D","PB5","Northeast"
"92033","American Celtic Pipe Band","PB5","Metro"
"91826","Anderson Pipe Band","PB5","Southwest"
"91802","AOH Pipe & Drum Band","PB5","Northeast"

==========

use strict;
use warnings;

my ($header,$line,$data) = (1);

while ($line=<DATA>)
{
    $line = '' if $line =~ /^\s*$/;
    my $end = eof(DATA);
    $data .= $line if $end;

    if ($end || $line =~ /^9[12]\d{3}/)
    {
        # process header
        if ($header) {
            $header = 0;
            my $cnt = 1;
            $data =~ /((?:^.*\n){9})/mg;
            print "\"$_\"".($cnt++ < 9 ? ',':"\n") for (split /\n/, $1);
        }
        # process record
        else {
            while ($data =~ /(^9[12]\d{3}\n(?:^(?!9[12]\d{3}).*\n){4,8})/mg)
            {
                my $cnt = 1;
                print "\"$_\"".($cnt++ < 4 ? ',':"\n") for (split /\n/, $1)[0..3];
            }
        }
        $data = $line;
        next;
    }
    $data .= $line;
}

sln · Aug 29, 2009

use strict;
use warnings;

my ($header,$line,$data) = (1);

while ($line=<DATA>)
{
    $line = '' if $line =~ /^\s*$/;
    my $end = eof(DATA);
    $data .= $line if $end;

    if ($end || $line =~ /^9[12]\d{3}/)
    {
        # process header
        if ($header) {
            $header = 0;
            my $cnt = 1;
            if ($data =~ /((?:^.*\n){9})/mg) {
                print "\"$_\"".($cnt++ < 9 ? ',':"\n") for (split /\n/, $1);
            }
        }
        # process record
        else {
            my $cnt = 1;
            if ($data =~ /(^9[12]\d{3}\n(?:^.*\n){4,8})/mg) {
                print "\"$_\"".($cnt++ < 4 ? ',':"\n") for (split /\n/, $1)[0..3];
            }
        }

        $data = $line;
        next;
    }
    $data .= $line;
}

Sorry, that is the short version. The 'while' in the earlier "process record"
branch was for the case where the file is slurped, and it used a negative
lookahead. It still works for a single record but is not needed.

-sln

John W. Krahn · Aug 29, 2009

my @data = [];
while ( <FILE> ) {
    chomp;
    /^9[12]/ && push @data, [];
    push @{ $data[ -1 ] }, qq/"$_"/;
    if ( @data == 2 || eof ) {
        no warnings 'uninitialized';
        print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";
    }
}

John

ccc31807 · Aug 29, 2009

John, sorry, but I haven't seen some of what you used. Do you mind
helping me out?

[] returns a reference to an anonymous array, right? How does it work
assigning it to an array type?

my @data = [];
while ( <FILE> ) {
chomp;

I understand the use of the conjunctive Boolean, but again, I don't
understand how pushing [] to the array works.

/^9[12]/ && push @data, [];

This pushes $_ to the end of the array, but how do you designate the
value of $_ in this case?

push @{ $data[ -1 ] }, qq/"$_"/;
if ( @data == 2 || eof ) {
no warnings 'uninitialized';

Why '8'? The problem is that the values can be anywhere from three to
eight, and you don't know how many or which ones.

print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";
}
}

When I looked at the data file, I saw this pseudocode:
read each line
    if the line is the key:
        save the value as a key
        read the next three lines
        write each value as the value of a hash element for the key

Two points -- (1) I didn't take the time to explore accessing the
lines of the file in an inner loop, although that occurred to me,
which is why Tad's example made the light bulb light up. (2) It seems
much more natural to use a hash rather than an array to hold the data
elements, and now I'm wondering if using an array to hold the records
is a better solution.

The output part of my script looks like this:
foreach my $k (keys %bands)
{
    print OUTFILE qq("$k","$bands{$k}{name}","$bands{$k}{grade}","$bands{$k}{branch}"\n);
}

To me, this looks a lot more intuitive and understandable than some of
the print statements above, which look convoluted (if not obfuscated)
to me.

CC.

John W. Krahn · Aug 29, 2009

ccc31807 said:
John, sorry, but I haven't seen some of what you used. Do you mind
helping me out?

Ok, I'll try.

[] returns a reference to an anonymous array, right? How does it work
assigning it to an array type?

Just the same as assigning any scalar to an array. The first element of
the array now contains a reference to an array.
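
A minimal sketch of what that assignment and the later push do, using two made-up field names rather than the real file. The variable name follows the code quoted in this post; everything else is illustrative.

```perl
use strict;
use warnings;

# Assigning one scalar (here an array reference) to an array
# produces a one-element array.
my @data = [];
print scalar(@data), "\n";      # 1
print ref( $data[0] ), "\n";    # ARRAY

# Pushing through the reference grows the inner array only.
push @{ $data[-1] }, 'Number', 'BandName';
print scalar(@data), "\n";              # 1
print scalar( @{ $data[0] } ), "\n";    # 2

# Pushing [] onto @data itself starts a second, empty inner array.
push @data, [];
print scalar(@data), "\n";      # 2
```

So $data[-1] always refers to the record currently being filled, and a new key line simply opens the next one.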

my @data = [];
while ( <FILE> ) {
chomp;

I understand the use of the conjunctive Boolean, but again, I don't
understand how pushing [] to the array works.

/^9[12]/ && push @data, [];

That adds a scalar value onto the end of the array. In this case the
scalar value is a reference to an array.

This pushes $_ to the end of the array, but how do you designate the
value of $_ in this case?

I don't know what you mean by "designate the value of $_"?

push @{ $data[ -1 ] }, qq/"$_"/;
if ( @data == 2 || eof ) {
no warnings 'uninitialized';

Why '8'? The problem is that the values can be anywhere from three to
eight, and you don't know how many or which ones.

print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";

I assumed that you meant that each record *should* have 9 fields, but if
that is not what you want then just remove the '[ 0 .. 8 ]' part.
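
As a rough illustration of the slice, here is one of the short records from the sample data held in a plain array (the names @record, @padded and @wanted are only for this sketch). Slicing past the end pads with undef, and a narrower slice keeps just the four guaranteed fields.

```perl
use strict;
use warnings;

my @record = ( '91068', 'Adirondack Pipes & Drums', 'PB5', 'Northeast', 'n' );

# An array slice returns one value per index, undef where the
# array has no element; hence the 'no warnings' in the code above.
my @padded = @record[ 0 .. 8 ];
print scalar(@padded), "\n";    # 9

# Only the first four fields, quoted for CSV-style output.
my @wanted = @record[ 0 .. 3 ];
print join( ',', map { qq/"$_"/ } @wanted ), "\n";
```

This does not escape embedded double quotes; for real CSV output a module such as Text::CSV would be the safer choice.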

When I looked at the data file, I saw this pseudocode:
read each line
    if the line is the key:
        save the value as a key
        read the next three lines
        write each value as the value of a hash element for the key

Two points -- (1) I didn't take the time to explore accessing the
lines of the file in an inner loop, although that occurred to me,
which is why Tad's example made the light bulb light up. (2) It seems
much more natural to use a hash rather than an array to hold the data
elements, and now I'm wondering if using an array to hold the records
is a better solution.

TMTOWTDI ;-)

John

dn.perl · Aug 29, 2009

RedGrittyBrick said:

#!perl
use strict;
use warnings;

my @f;
while (<DATA>) {
    chomp;
    if (/^9[12]\d{3}$/) {
        print join (',', @f), "\n" if @f;
        @f=();
    }
    push @f, $_;
}

__DATA__

I think you are losing the last record.

That script has one more flaw. It publishes all the elements of @f,
whereas the OP wants only the first 4 elements.
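
One possible way to address both flaws, sketched against a few lines of the sample data read through an in-memory filehandle so that it runs on its own. The @out array and the $sample string are assumptions of this sketch, not part of the quoted script.

```perl
use strict;
use warnings;

# Header plus two records from the thread's sample file.
my $sample = join "\n",
    qw(Number BandName Grade Branch Instr PipingInst PInstDate DrumInst DrumInstDate),
    '91068', 'Adirondack Pipes & Drums', 'PB5', 'Northeast', 'n',
    '91801', 'Albany Police P&D', 'PB5', 'Northeast', 'y',
    'Dan Cole, Oran Mor', '01-Mar-09', '';

open my $in, '<', \$sample or die "Cannot open in-memory file: $!";

my ( @f, @out );
while ( my $line = <$in> ) {
    chomp $line;
    if ( $line =~ /^9[12]\d{3}$/ ) {
        # A new key closes the previous record: keep its first four fields.
        push @out, join( ',', @f[ 0 .. 3 ] ) if @f;
        @f = ();
    }
    push @f, $line;
}
# The last record has no following key line, so flush it here.
push @out, join( ',', @f[ 0 .. 3 ] ) if @f;

print "$_\n" for @out;
```

The header lines are treated as the first "record", so the first output line carries the first four column names. Like the quoted script, this does no CSV quoting.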

ccc31807 · Aug 29, 2009

while ( <DATA> ) {
    next unless /^9[12]\d\d\d$/; # 5 digits, starts with 91 or 92
    my %record = (number => $_);
    $record{bandname} = <DATA>;
    $record{grade} = <DATA>;
    $record{branch} = <DATA>;
    chomp %record;
    print Dumper \%record;
}

Yes. This is almost identical to what I had after I saw your first
solution, except for a small variation in the hash variable. I chose a
hash because I anticipated a need to sort by branch and possibly by
grade.
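
For illustration only, a sketch of that kind of sort on a hash of hashes shaped like %bands, filled by hand with four records from the sample data. The tie-break on band number is an assumption of this sketch.

```perl
use strict;
use warnings;

my %bands = (
    91709 => { name => '87th Cleveland Pipe Band IV', grade => 'PB4', branch => 'Ohio Valley' },
    91068 => { name => 'Adirondack Pipes & Drums',    grade => 'PB5', branch => 'Northeast' },
    91212 => { name => 'Alabama Pipes & Drums',       grade => 'PB4', branch => 'Southern' },
    91801 => { name => 'Albany Police P&D',           grade => 'PB5', branch => 'Northeast' },
);

# Order the keys by branch, then grade, then band number.
my @sorted = sort {
       $bands{$a}{branch} cmp $bands{$b}{branch}
    or $bands{$a}{grade}  cmp $bands{$b}{grade}
    or $a <=> $b
} keys %bands;

print qq("$_","$bands{$_}{name}","$bands{$_}{grade}","$bands{$_}{branch}"\n)
    for @sorted;
```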

This was a throwaway script, that I ran exactly once, so while I agree
with checking the value of open() and using more meaningful names,
this was just the first cut and was all I needed.

Thanks for your help. I now know about using <> in inner loops.

CC.

sln · Sep 1, 2009

Another difference is that you are accumulating a hash of
all the records, while he is just making a temporary hash on a
record-by-record basis.

Neither way cares about error checking, blank lines, headers,
field position or any validation whatsoever.
So, in all the responses here, there is no method or technique being better
or worse in this light, it's just throwaway.

while (<DATA>) {
    chomp;
    (/^9[12]\d{3}$/ and
        @{$bands{$_}}{'name','grade','branch'}
            = split /\n/, <DATA>.<DATA>.<DATA>)
}

or same, but slurp file ..

$_ = join '',<DATA>;
while (/(^9[12]\d{3})\n((?:^(?!9[12]\d{3}\n).*\n){3})/mg) {
@{$bands{$1}}{'name','grade','branch'} = split /\n/, $2;
}

-sln

ccc31807 · Sep 1, 2009

So, in all the responses here, there is no method or technique being better
or worse in this light, it's just throwaway.

while (<DATA>) {
chomp;
(/^9[12]\d{3}$/ and
@{$bands{$_}}{'name','grade','branch'}
= split /\n/, <DATA>.<DATA>.<DATA>)

}

Yes! I like this!

I have developed a habit of using a hash slice when dealing with data
files that come with their own header, and use the header to populate a
hash for each line to manage and mangle the output.
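
A small sketch of that habit, assuming the header names and one record's fields have already been read into arrays (the @header, @fields and %row names are only for illustration):

```perl
use strict;
use warnings;

# Column names as they appear at the top of the file.
my @header = qw(Number BandName Grade Branch);

# One record's first four lines, already chomped.
my @fields = ( '91212', 'Alabama Pipes & Drums', 'PB4', 'Southern' );

# A hash slice assigns all name/value pairs in one statement.
my %row;
@row{@header} = @fields;
print "$row{BandName} is graded $row{Grade}\n";

# The same slice syntax works through a reference, as in the
# @{$bands{$_}}{'name','grade','branch'} form quoted above.
my %bands;
@{ $bands{ $row{Number} } }{ 'name', 'grade', 'branch' } = @fields[ 1 .. 3 ];
print "$bands{91212}{branch}\n";
```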

Sometimes I have a need to sort the data by some strange and alien
method, so I have also developed the habit of using a hash for the
data. Recently I have built several scripts that output PDFs of
multiple records categorized in various ways, and have found that
hashes are ideal for this purpose.

Anyway, the essential insight is that <> can be used to get the next
record regardless of the level of the braces.
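
To make that concrete, here is a minimal sketch with one record from the sample data read through an in-memory filehandle (the $sample string and the lexical $fh are assumptions of the sketch). Every read, in the loop condition or inside the block, simply takes the next unread line.

```perl
use strict;
use warnings;

my $sample = "91802\nAOH Pipe & Drum Band\nPB5\nNortheast\nn\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @seen;
while ( my $key = <$fh> ) {
    chomp $key;
    next unless $key =~ /^9[12]\d{3}$/;

    # These reads continue from where the loop condition left off,
    # so they return the three lines that follow the key.
    my @rest = map { scalar(<$fh>) } 1 .. 3;
    chomp @rest;

    push @seen, join( '|', $key, @rest );
}
print "$_\n" for @seen;
```

If the file ended in the middle of a record, the inner reads would return undef, so real code would want a check there.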

CC

