strategy for parsing text file

ccc31807

I've solved this problem, but I'm just curious as to how my betters
would approach this.

The file is a long file, so I have copied only the first seven records
below as an example. The file is from a table with nine fields, all of
which are named in the first nine lines. The key is a five digit
number beginning with either 91 or 92. For each record, sometimes all
fields are populated (like the first, 91709), but normally only the
first four are guaranteed to be populated while the remaining five may
or may not have values. Each datum occupies a line all to itself, and
the file does not contain record separators.

The requirement is to capture the first four fields and write to an
Excel readable file (CSV format).

My solution was pretty dirty and crude, but I'll share it later (and
take the hit for stupidity). My question is how others might approach
the problem. Below are the first seven records of the file and the
column headers.

Thanks, CC.

-------------file below--------------------
Number
BandName
Grade
Branch
Instr
PipingInst
PInstDate
DrumInst
DrumInstDate
91709
87th Cleveland Pipe Band IV
PB4
Ohio Valley
y
Tyler Tagliafero, Great Lakes
01-Mar-09
Drew Donnelly, Great Lakes
01-Mar-09
91068
Adirondack Pipes & Drums
PB5
Northeast
n
91212
Alabama Pipes & Drums
PB4
Southern
n
91801
Albany Police P&D
PB5
Northeast
y
Dan Cole, Oran Mor
01-Mar-09
92033
American Celtic Pipe Band
PB5
Metro
n
91826
Anderson Pipe Band
PB5
Southwest
y
Victor Anderson, Westminster
01-Mar-09
Tim Vermillion, Westminster
01-Mar-09
91802
AOH Pipe & Drum Band
PB5
Northeast
n
 
ccc31807

I will assume that you are absolutely certain that none of the other
field's values will match that specification...
Absolutely!

-------------------
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

while ( <DATA> ) {
    next unless /^9[12]\d\d\d$/;  # 5 digits, starts with 91 or 92
    my @record = $_;
    push @record, scalar(<DATA>) for 1..3;
    chomp @record;
    print Dumper \@record;
}

I see ... you can access the file within the while loop by using the
<> in an inner loop. I maybe should have thought of that, but I had to
produce it quickly and didn't want to experiment.
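
A minimal sketch of that inner-read idea, carried through to quoted CSV output. The helper name first_four_fields and the in-memory sample are assumptions made for illustration; a real run would open bands.txt instead.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array ref of [number, name, grade, branch].
sub first_four_fields {
    my ($fh) = @_;
    my @rows;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /^9[12]\d{3}$/;    # only a key line starts a record
        my @record = ($line);
        for ( 1 .. 3 ) {
            my $next = <$fh>;                   # inner read on the same handle
            last unless defined $next;          # input ended mid-record
            chomp $next;
            push @record, $next;
        }
        push @rows, \@record;
    }
    return \@rows;
}

# In-memory sample built from the first two records in the post.
my $sample = join "\n",
    '91709', '87th Cleveland Pipe Band IV', 'PB4', 'Ohio Valley', 'y',
    '91068', 'Adirondack Pipes & Drums', 'PB5', 'Northeast', 'n', '';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $rows = first_four_fields($fh);
close $fh;

for my $row (@$rows) {
    print join( ',', map { '"' . $_ . '"' } @$row ), "\n";
}
```

The optional fifth to ninth lines never match the key pattern, so the outer loop simply skips them.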

Thanks, and here is the guts of my solution. Pretty crude, but it
worked.

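open INFILE, '<', 'bands.txt';
while (<INFILE>)
{
    next unless /\w/;        # skip blank lines
    print; #debugging
    chomp;
    if (/9[12]\d{3}/)        # key line: start a new record
    {
        $count++;
        $key = $_;
        $flag = 1;
    }
    elsif ($flag == 1)       # first line after the key
    {
        $bands{$key}{name} = $_;
        $flag = 2;
    }
    elsif ($flag ==2)        # second line after the key
    {
        $bands{$key}{grade} = $_;
        $flag = 3;
    }
    elsif ($flag == 3)       # third line after the key
    {
        $bands{$key}{branch} = $_;
        $flag = 0;           # ignore everything until the next key
    }

}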

#print the %bands hash to a .csv file
 
Steve C

RedGrittyBrick said:
#!perl
use strict;
use warnings;

my @f;
while (<DATA>) {
    chomp;
    if (/^9[12]\d{3}$/) {
        print join (',', @f), "\n" if @f;
        @f=();
    }
    push @f, $_;
}

__DATA__

I think you are losing the last record.
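
One way to keep the last record, sketched under the assumption that the accumulate-and-flush loop above is otherwise left alone: flush the buffer once more after the loop ends. The helper name records_to_csv and the in-memory sample are hypothetical. As in the code above, fields are joined unquoted, so a value containing a comma would still need quoting.

```perl
use strict;
use warnings;

# Hypothetical helper: groups lines into records, one joined line per record.
sub records_to_csv {
    my ($fh) = @_;
    my ( @f, @out );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^9[12]\d{3}$/ ) {
            push @out, join( ',', @f ) if @f;    # emit the previous record
            @f = ();
        }
        push @f, $line;
    }
    push @out, join( ',', @f ) if @f;            # emit the final record
    return @out;
}

my $sample = "91212\nAlabama Pipes & Drums\nPB4\nSouthern\nn\n"
           . "91802\nAOH Pipe & Drum Band\nPB5\nNortheast\nn\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for records_to_csv($fh);
close $fh;
```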
 
sln

This is gnarly too. It's just the inner while if you
can slurp the whole file, but somehow I don't think
you want that.

-sln

Output:

"Number","BandName","Grade","Branch","Instr","PipingInst","PInstDate","DrumInst","DrumInstDate"
"91709","87th Cleveland Pipe Band IV","PB4","Ohio Valley"
"91068","Adirondack Pipes & Drums","PB5","Northeast"
"91212","Alabama Pipes & Drums","PB4","Southern"
"91801","Albany Police P&D","PB5","Northeast"
"92033","American Celtic Pipe Band","PB5","Metro"
"91826","Anderson Pipe Band","PB5","Southwest"
"91802","AOH Pipe & Drum Band","PB5","Northeast"

==========

use strict;
use warnings;

my ($header,$line,$data) = (1);

while ($line=<DATA>)
{
    $line = '' if $line =~ /^\s*$/;
    my $end = eof(DATA);
    $data .= $line if $end;

    if ($end || $line =~ /^9[12]\d{3}/)
    {
        # process header
        if ($header) {
            $header = 0;
            my $cnt = 1;
            $data =~ /((?:^.*\n){9})/mg;
            print "\"$_\"".($cnt++ < 9 ? ',':"\n") for (split /\n/, $1);
        }
        # process record
        else {
            while ($data =~ /(^9[12]\d{3}\n(?:^(?!9[12]\d{3}).*\n){4,8})/mg)
            {
                my $cnt = 1;
                print "\"$_\"".($cnt++ < 4 ? ',':"\n") for (split /\n/, $1)[0..3];
            }
        }
        $data = $line;
        next;
    }
    $data .= $line;
}
 
sln

use strict;
use warnings;

my ($header,$line,$data) = (1);

while ($line=<DATA>)
{
    $line = '' if $line =~ /^\s*$/;
    my $end = eof(DATA);
    $data .= $line if $end;

    if ($end || $line =~ /^9[12]\d{3}/)
    {
        # process header
        if ($header) {
            $header = 0;
            my $cnt = 1;
            if ($data =~ /((?:^.*\n){9})/mg) {
                print "\"$_\"".($cnt++ < 9 ? ',':"\n") for (split /\n/, $1);
            }
        }
        # process record
        else {
            my $cnt = 1;
            if ($data =~ /(^9[12]\d{3}\n(?:^.*\n){4,8})/mg) {
                print "\"$_\"".($cnt++ < 4 ? ',':"\n") for (split /\n/, $1)[0..3];
            }
        }
        $data = $line;
        next;
    }
    $data .= $line;
}

Sorry, here is the short version. The 'while' in the earlier "process record"
branch was there in case the file is slurped, and it used a negative lookahead.
It still works for a single record but is not needed.

-sln
 
John W. Krahn

my @data = [];
while ( <FILE> ) {
    chomp;
    /^9[12]/ && push @data, [];
    push @{ $data[ -1 ] }, qq/"$_"/;
    if ( @data == 2 || eof ) {
        no warnings 'uninitialized';
        print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";
    }
}




John
 
ccc31807

John, sorry, but I haven't seen some of what you used. Do you mind
helping me out?

[] returns a reference to an anonymous array, right? How does it work
assigning it to an array type?
my @data = [];
while ( <FILE> ) {
     chomp;

I understand the use of the conjunctive Boolean, but again, I don't
understand how pushing [] to the array works.
     /^9[12]/ && push @data, [];

This pushes $_ to the end of the array, but how do you designate the
value of $_ in this case?
     push @{ $data[ -1 ] }, qq/"$_"/;
     if ( @data == 2 || eof ) {
         no warnings 'uninitialized';

Why '8'? The problem is that the values can be anywhere from three to
eight, and you don't know how many or which ones.
         print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";
         }
     }

When I looked at the data file, I saw this pseudocode:
read each line
if the line is the key:
    save the value as a key
    read the next three lines
    write each value as the value of a hash element for the key

Two points -- (1) I didn't take the time to explore accessing the
lines of the file in an inner loop, although that occurred to me,
which is why Tad's example made the light bulb light up. (2) It seems
much more natural to use a hash rather than an array to hold the data
elements, and now I'm wondering if using an array to hold the records
is a better solution.

The output part of my script looks like this:
foreach my $k (keys %bands)
{
    print OUTFILE qq("$k","$bands{$k}{name}","$bands{$k}{grade}","$bands{$k}{branch}"\n);
}

To me, this looks a lot more intuitive and understandable than some of
the print statements above, which look convoluted (if not obfuscated)
to me.

CC.
 
John W. Krahn

ccc31807 said:
John, sorry, but I haven't seen some of what you used. Do you mind
helping me out?

Ok, I'll try. :)

[] returns a reference to an anonymous array, right? How does it work
assigning it to an array type?

Just the same as assigning any scalar to an array. The first element of
the array now contains a reference to an array.

my @data = [];
while ( <FILE> ) {
chomp;

I understand the use of the conjunctive Boolean, but again, I don't
understand how pushing [] to the array works.
/^9[12]/ && push @data, [];

That adds a scalar value onto the end of the array. In this case the
scalar value is a reference to an array.
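
A small sketch of the array-of-array-references idea, using made-up values rather than the real file. Each element of @data is a scalar holding a reference, and @{ $data[-1] } dereferences the most recent one.

```perl
use strict;
use warnings;

my @data = [];                          # one element: a reference to an empty array
push @{ $data[-1] }, 'Number';          # dereference the last element and append to it
push @data, [];                         # start a second inner array
push @{ $data[-1] }, '91709', 'PB4';    # these land in the second inner array

printf "%d inner arrays\n", scalar @data;
print join( ',', @$_ ), "\n" for @data;
```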

This pushes $_ to the end of the array, but how do you designate the
value of $_ in this case?

I don't know what you mean by "designate the value of $_"?

push @{ $data[ -1 ] }, qq/"$_"/;
if ( @data == 2 || eof ) {
no warnings 'uninitialized';

Why '8'? The problem is that the values can be anywhere from three to
eight, and you don't know how many or which ones.
print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";

I assumed that you meant that each record *should* have 9 fields, but if
that is not what you want then just remove the '[ 0 .. 8 ]' part.
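
To illustrate the difference, here is a short sketch with one five-field record taken from the sample data. An array slice past the end of the array yields undef for each missing index, so the sliced version pads the line out to nine fields, while the unsliced version prints only what was read.

```perl
use strict;
use warnings;

my @record = map { '"' . $_ . '"' }
    '91068', 'Adirondack Pipes & Drums', 'PB5', 'Northeast', 'n';

my $padded;
{
    no warnings 'uninitialized';            # indexes 5 .. 8 are undef here
    $padded = join ',', @record[ 0 .. 8 ];  # always nine fields
}
my $as_is = join ',', @record;              # only the fields present

print "$padded\n$as_is\n";
```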

When I looked at the data file, I saw this pseudocode:
read each line
if the line is the key:
    save the value as a key
    read the next three lines
    write each value as the value of a hash element for the key

Two points -- (1) I didn't take the time to explore accessing the
lines of the file in an inner loop, although that occurred to me,
which is why Tad's example made the light bulb light up. (2) It seems
much more natural to use a hash rather than an array to hold the data
elements, and now I'm wondering if using an array to hold the records
is a better solution.

TMTOWTDI ;-)

The output part of my script looks like this:
foreach my $k (keys %bands)
{
    print OUTFILE qq("$k","$bands{$k}{name}","$bands{$k}{grade}","$bands{$k}{branch}"\n);
}

To me, this looks a lot more intuitive and understandable than some of
the print statements above, which look convoluted (if not obfuscated)
to me.


John
 
ccc31807

while ( <DATA> ) {
    next unless /^9[12]\d\d\d$/;  # 5 digits, starts with 91 or 92
    my %record = (number => $_);
    $record{bandname} = <DATA>;
    $record{grade} = <DATA>;
    $record{branch} = <DATA>;
    chomp %record;
    print Dumper \%record;
}

Yes. This is almost identical to what I had after I saw your first
solution, except for a small variation in the hash variable. I chose a
hash because I anticipated a need to sort by branch and possibly by
grade.
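
A sketch of that kind of sort, assuming a %bands hash shaped like the one in my script (key => { name, grade, branch }). The four entries are copied from the sample data, and the final tie-break on the band number is my own assumption.

```perl
use strict;
use warnings;

my %bands = (
    91709 => { name => '87th Cleveland Pipe Band IV', grade => 'PB4', branch => 'Ohio Valley' },
    91068 => { name => 'Adirondack Pipes & Drums',    grade => 'PB5', branch => 'Northeast' },
    91212 => { name => 'Alabama Pipes & Drums',       grade => 'PB4', branch => 'Southern' },
    91801 => { name => 'Albany Police P&D',           grade => 'PB5', branch => 'Northeast' },
);

# Sort by branch, then by grade, then by number to break ties.
my @sorted = sort {
       $bands{$a}{branch} cmp $bands{$b}{branch}
    or $bands{$a}{grade}  cmp $bands{$b}{grade}
    or $a <=> $b
} keys %bands;

for my $k (@sorted) {
    print qq("$k","$bands{$k}{name}","$bands{$k}{grade}","$bands{$k}{branch}"\n);
}
```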

This was a throwaway script, that I ran exactly once, so while I agree
with checking the value of open() and using more meaningful names,
this was just the first cut and was all I needed.

Thanks for your help. I now know about using <> in inner loops.

CC.
 
sln

Another difference is that you are accumulating a hash of the
total of all the records, he is just making a temp hash on a record
by record basis.

Neither way cares about error checking, blank lines, headers,
field position or any validation whatsoever.
So, in all the responses here, no method or technique is better
or worse in this light; it's all just throwaway.
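
For contrast, a rough sketch of what a less throwaway version might check. It slurps the input, drops blank lines, and reports short records and duplicate keys instead of silently mis-assigning fields. The helper name parse_bands and the wording of the problem messages are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ( \%bands, \@problems ).
sub parse_bands {
    my ($fh) = @_;
    my @lines = grep { /\S/ } <$fh>;        # slurp, dropping blank lines
    chomp @lines;
    my ( %bands, @problems );
    my $key_re = qr/^9[12]\d{3}$/;
    my $i = 0;
    while ( $i < @lines ) {
        my $key = $lines[ $i++ ];
        next unless $key =~ $key_re;        # header line or optional field
        my @fields;
        # Take up to three lines, stopping early if the next key shows up.
        push @fields, $lines[ $i++ ]
            while @fields < 3 && $i < @lines && $lines[$i] !~ $key_re;
        if ( @fields < 3 ) {
            push @problems, "record $key has only " . @fields . " of 3 fields";
            next;
        }
        if ( exists $bands{$key} ) {
            push @problems, "duplicate key $key";
            next;
        }
        @{ $bands{$key} }{qw(name grade branch)} = @fields;
    }
    return ( \%bands, \@problems );
}

my $sample = "Number\nBandName\n\n92033\nAmerican Celtic Pipe Band\nPB5\nMetro\nn\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my ( $bands, $problems ) = parse_bands($fh);
close $fh;
print "$_: $bands->{$_}{name}\n" for sort keys %$bands;
print "problem: $_\n" for @$problems;
```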


while (<DATA>) {
    chomp;
    (/^9[12]\d{3}$/ and
    @{$bands{$_}}{'name','grade','branch'}
      = split /\n/, <DATA>.<DATA>.<DATA>)
}

or same, but slurp file ..

$_ = join '',<DATA>;
while (/(^9[12]\d{3})\n((?:^(?!9[12]\d{3}\n).*\n){3})/mg) {
    @{$bands{$1}}{'name','grade','branch'} = split /\n/, $2;
}

-sln
 
ccc31807

So, in all the responses here, no method or technique is better
or worse in this light; it's all just throwaway.

while (<DATA>) {
    chomp;
    (/^9[12]\d{3}$/ and
    @{$bands{$_}}{'name','grade','branch'}
      = split /\n/, <DATA>.<DATA>.<DATA>)

}

Yes! I like this!

I have developed a habit of using a hash slice when dealing with data
files that come with their own header, and use the header to populate a
hash for each line to manage and mangle the output.
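
A tiny sketch of that habit, assuming the header names have already been read into @header (only the first four from this file are shown) and one record's values into @values:

```perl
use strict;
use warnings;

my @header = qw(Number BandName Grade Branch);    # names taken from the file's header
my @values = ( '92033', 'American Celtic Pipe Band', 'PB5', 'Metro' );

my %record;
@record{@header} = @values;    # hash slice: pairs each header name with its value

print join( ',', map { '"' . $record{$_} . '"' } @header ), "\n";
```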

Sometimes I have a need to sort the data by some strange and alien
method, so I have also developed the habit of using a hash for the
data. Recently I have built several scripts that output PDFs of
multiple records categorized in various ways, and have found that
hashes are ideal for this purpose.

Anyway, the essential insight is that <> can be used to get the next
record regardless of the level of the braces.

CC
 
