Alan Mead
I have five files of contact info, one for each year of a conference.
All five have slightly different, fairly unstructured formats. One looks
like this:
Bush, George, President, 1 White House Way, Washington,
DC 00000; (e-mail address removed)
Kerry, John, 1 Main, Detroit, MI 00000; (e-mail address removed)
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newman's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; (e-mail address removed)
Blair, Tony, 1 Downing Street, London, UK 0000000
.... etc..
So the fields are comma-separated, except for the email, which follows a
semicolon and may be absent, and a record may be split over two or three
lines. In a later file, dozens of records appear on the same line.
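One way to cope with both wrapped lines and many records per line might be to ignore line breaks entirely and look for the end of each record instead. The sketch below assumes that every record ends in a postal code of five or more digits, optionally followed by "; email". That assumption, the helper name split_records, and the example.com addresses are mine, not part of the data above. A street number of five or more digits would split a record early, so treat this as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper: split free-form text into one string per record.
# Assumption: a record ends with a run of 5+ digits (the postal code),
# optionally followed by "; <email>".
sub split_records {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # undo line wrapping: all whitespace becomes one space
    my @records;
    while ($text =~ /\s*(.+?\b\d{5,}(?:\s*;\s*\S+\@\S+)?)(?=\s|\z)/g) {
        push @records, $1;
    }
    return @records;
}

# Invented sample: one wrapped record, then two records on the same line.
my $sample = <<'END';
Bush, George, President, 1 White House Way, Washington,
DC 00000; gb@example.com
Williams, Robin, 2 Main, Burbank, CA 00000 Blair, Tony, 1 Downing Street, London, UK 0000000
END

print "$_\n" for split_records($sample);
```

Each returned string is one complete record on a single line, so the comma counting no longer has to guess where a record stops.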
I'd like to output
lname=Bush
fname=George
address=President, 1 White House Way, Washington, DC 00000
email=(e-mail address removed)
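Once a record has been joined into a single string, the output above could be produced roughly as follows. This is a minimal sketch: parse_record is a hypothetical name, the example.com address is invented, and it assumes the email (if any) follows the last semicolon and that the first two comma-separated fields are always the last and first name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one already-joined record into a hash of fields.
sub parse_record {
    my ($record) = @_;
    my $email = '';
    # Strip a trailing "; email" if there is one and remember the address.
    if ($record =~ s/\s*;\s*([^;\s]+\@[^;\s]+)\s*\z//) {
        $email = $1;
    }
    # A limit of 3 keeps every later comma inside the address field.
    my ($lname, $fname, $address) = split /\s*,\s*/, $record, 3;
    return {
        lname   => $lname,
        fname   => $fname,
        address => defined $address ? $address : '',
        email   => $email,
    };
}

my $rec = parse_record(
    'Bush, George, President, 1 White House Way, Washington, DC 00000; gb@example.com'
);
print "$_=$rec->{$_}\n" for qw(lname fname address email);
```

The three-argument form of split is what avoids the comma-counting problem: however many commas the address contains, only the first two are treated as field separators.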
Any ideas how to parse this using Perl? So far I can parse about 60% of
the records with the hack below. It gets tripped up when a record
contains many commas (some people have five lines of address with
embedded commas); in those cases it parses the first half of the
record fairly well and then tries to parse the second half as a new
record.
-Alan
my $i=0;
while($i<=$count) {
    $i++;
    my($lname,$fname,$address,$email)=('','','','');
    my $line = $lines{$i};
    if ($line =~ /[,;]$/) {    # clearly more on next line
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    if ( (scalar split/,/,$line) > 4) {    # a proper name and address will
                                           # have at least 5 parts
        if ($line =~ /@/) {
            my @bits = split(/;/,$line);   # email is last element when split
                                           # on semicolons, so save it
            $email = pop(@bits);
            $line = join(';',@bits);       # put line back together (just
                                           # in case there's more than one
                                           # semicolon in the record)
        }
        my @bits = split(/,/,$line);       # now split on commas
        $lname = shift @bits;              # lname is first bit
        $fname = shift @bits;              # followed by fname
        $address = join(',',@bits);        # the rest is the address
    } else {
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    ....
}