Reading poorly structured data


Alan Mead

I have five files of contact info (one for each year of a conference).
All five have slightly different, fairly unstructured formats. One looks
like this:

Bush, George, President, 1 White House Way, Washington,
DC 00000; (e-mail address removed)
Kerry, John, 1 Main, Detroit, MI 00000; (e-mail address removed)
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; (e-mail address removed)
Blair, Tony, 1 Downing Street, London, UK 0000000
.... etc..

So the fields are comma-separated, except for the email, which follows a
semicolon and may be absent, and a record may be split over two or three
lines.

In a later file dozens of records appear on the same line.

I'd like to output

lname=Bush
fname=George
address=President, 1 White House Way, Washington, DC 00000
email=(e-mail address removed)

Any ideas how to parse this using Perl? So far I can parse about 60% of
the records with the hack below. It gets tripped up when the number
of commas in a record is large (some people have five lines of
address with embedded commas), in which case it parses the
first half of the record fairly well and then tries to parse the
second half as a new record.

-Alan

my $i = 0;
while ($i <= $count) {
    $i++;
    my ($lname, $fname, $address, $email) = ('', '', '', '');
    my $line = $lines{$i};
    if ($line =~ /[,;]$/) {    # clearly more on next line
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    # a proper name and address will have at least 5 parts
    if ((scalar split /,/, $line) > 4) {
        if ($line =~ /@/) {
            # email is last element when split on semicolons, so save it
            my @bits = split(/;/, $line);
            $email = pop(@bits);
            # put line back together (just in case there's more than
            # one semicolon in the record)
            $line = join(';', @bits);
        }
        my @bits = split(/,/, $line);    # now split on commas
        $lname   = shift @bits;          # lname is first bit
        $fname   = shift @bits;          # followed by fname
        $address = join(',', @bits);     # the rest is the address
    } else {
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    ....
}
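
Once a record has been joined onto one line, the field-splitting step above might be written more compactly using the LIMIT argument of split, so that every comma after the first two stays inside the address. This is only a sketch under that assumption. The names parse_record and format_record and the example.com address are made up for illustration, since the real addresses were removed from the post.

```perl
use strict;
use warnings;

# Split one complete record (already joined onto a single line) into fields.
sub parse_record {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;                # trim surrounding whitespace
    my %record;

    # The e-mail, when present, is the last thing after a semicolon.
    if ($text =~ s/\s*;\s*(\S+\@\S+)\s*$//) {
        $record{email} = $1;
    }

    # A LIMIT of 3 keeps all later commas inside the address field.
    @record{qw(lname fname address)} = split /\s*,\s*/, $text, 3;
    return \%record;
}

# Render a record as key=value lines, skipping fields that are missing.
sub format_record {
    my ($record) = @_;
    return join '', map  { "$_=$record->{$_}\n" }
                    grep { defined $record->{$_} }
                    qw(lname fname address email);
}

# Hypothetical sample record; the e-mail address is a placeholder.
print format_record(parse_record(
    'Bush, George, President, 1 White House Way, Washington, DC 00000; someone@example.com'
));
```

This does nothing about finding where one record ends and the next begins; it only covers the split into lname, fname, address and email.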
 

A. Sinan Unur

Here is somewhat of a kludge that "works" for the snippet you posted. Hope
this helps.

#! perl

use strict;
use warnings;

use File::Slurp;

my $input = read_file(\*DATA);
$input =~ tr/\n/ /;

my @records;

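while ($input =~ /\S/) {                 # stop once only whitespace remains
    $input =~ s/^\s+//;                  # drop the space left by a joined newline
    my %record;
    $record{lname} = grab_name($input);
    $record{fname} = grab_name($input);
    # Kludge: "two capital letters, space, digits" marks the end of the address.
    $input =~ /[A-Z]{2} \d+/g or last;   # no match: stop instead of emitting garbage
    $record{address} = substr $input, 0, pos($input);
    $input = substr $input, pos($input);
    if ($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
        $record{email} = $1;
        $input = substr $input, pos $input;
    }
    push @records, \%record;
}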

use Data::Dumper;
print Dumper \@records;

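# Remove the text up to the first comma from the caller's string
# (through the $_[0] alias) and return it.
sub grab_name {
    my $off  = index $_[0], ',';
    my $name = substr $_[0], 0, $off;
    $_[0] = substr $_[0], $off + 2;    # skip the comma and the space after it
    return $name;
}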

__DATA__
Bush, George, President, 1 White House Way, Washington,
DC 00000; (e-mail address removed)
Kerry, John, 1 Main, Detroit, MI 00000; (e-mail address removed)
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; (e-mail address removed)
Blair, Tony, 1 Downing Street, London, UK 0000000
 

Alan Mead

And so it does very nicely. I think you are making use of the fact that
these all had a pair of capital letters near the end (including the
convenient UK) but there is a 'D.C.' in my data and some other
addresses outside the US (that lack this feature). I should have included
a better sample. But this may get me to 95% ... The way you've slurped the
file makes this perfectly applicable to the rest of the files which is a
REALLY BIG help.

Thanks!

-Alan
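
One way the end-of-address test might be widened to cover cases like 'D.C.' is to keep the patterns in a list and join them into a single alternation. The sketch below assumes every address still ends in some recognisable code followed by digits, which may not hold for all non-US addresses. The names @end_patterns and split_records, the second pattern, and the sample data are illustrative and not taken from the real files.

```perl
use strict;
use warnings;

# Hypothetical list of "end of address" patterns; extend it as new
# formats turn up in the data.
my @end_patterns = (
    qr/[A-Z]{2}\s+\d+/,     # state or country code, then numeric postal code
    qr/D\.C\.\s+\d+/,       # "D.C." written with periods
);
my $end_of_address = join '|', @end_patterns;

# One record: everything up to the first end-of-address match,
# optionally followed by "; e-mail".
my $record_re = qr/\s*(.+?(?:$end_of_address))(?:\s*;\s*(\S+\@\S+))?/s;

# Return a list of { body => ..., email => ... } hashes. Newlines are
# flattened first, so it does not matter whether a record spans several
# lines or several records share one line.
sub split_records {
    my ($text) = @_;
    $text =~ tr/\n/ /;
    my @records;
    while ($text =~ /\G$record_re/gc) {
        push @records, { body => $1, email => $2 };
    }
    return @records;
}

# Made-up sample data with placeholder e-mail addresses.
my @found = split_records(<<'END');
Bush, George, President, 1 White House Way, Washington,
DC 00000; someone@example.com
Williams, Robin, 2 Main, Burbank, CA 00000 Doe, Jane, 5 Elm, Washington, D.C. 20001
END
print scalar(@found), " records found\n";
```

Each body still has to be split into last name, first name and address afterwards. Anything that matches none of the patterns stops the loop, so any leftover text would need checking by hand.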
 

A. Sinan Unur

$input =~ /[A-Z]{2} \d+/g;
....

Actually, that is a stand-in for some kind of country/state code followed
by a numeric postal code, because all your addresses seemed to end with
that.

The "two capital letters followed by some digits as end of mailing address
indicator" was one of the things that made the code kludgy.

I am sure others will provide better ways once the sun comes up. Good luck.

Sinan.
 
