Reading poorly structured data

Alan Mead · Dec 8, 2004

I have five files of contact info (one for each year of a conference).
All five have slightly different, fairly unstructured formats. One looks
like this:

Bush, George, President, 1 White House Way, Washington,
DC 00000; (e-mail address removed)
Kerry, John, 1 Main, Detroit, MI 00000; (e-mail address removed)
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; (e-mail address removed)
Blair, Tony, 1 Downing Street, London, UK 0000000
.... etc..

So the fields are comma-separated, except for email which may be absent,
and the record may be split over two or three lines.
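
A minimal sketch of the line-joining step this implies, assuming (as in the sample above) that a physical line ending in a comma or semicolon always continues on the next line. The helper name join_continuations is hypothetical and not part of the thread's code.

```perl
use strict;
use warnings;

# Hypothetical helper: glue physical lines into logical records.
# Assumption: a line ending in ',' or ';' continues on the next line.
sub join_continuations {
    my @lines = @_;
    my (@records, $pending);
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;            # trim both ends
        next unless length $line;           # skip blank lines
        $pending = defined $pending ? "$pending $line" : $line;
        next if $pending =~ /[,;]$/;        # more to come on the next line
        push @records, $pending;
        undef $pending;
    }
    push @records, $pending if defined $pending;   # unterminated last record
    return @records;
}

my @records = join_continuations(
    'Bush, George, President, 1 White House Way, Washington,',
    'DC 00000',
    'Williams, Robin, 2 Main, Burbank, CA 00000',
);
print "$_\n" for @records;
```

This only handles records split across lines; it does nothing for files where several records share one line.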

In a later file dozens of records appear on the same line.

I'd like to output

lname=Bush
fname=George
address=President, 1 White House Way, Washington, DC 00000
email=(e-mail address removed)

Any ideas how to parse this using Perl? So far I can parse about 60% of
the records with the hack below. It gets tripped up when a record
contains many commas (some people have five lines of address with
embedded commas); in those cases it parses the first half of the
record fairly well and then tries to parse the second half as a new
record.

-Alan

my $i = 0;
while ($i <= $count) {
    $i++;
    my ($lname, $fname, $address, $email) = ('', '', '', '');
    my $line = $lines{$i};
    if ($line =~ /[,;]$/) {    # clearly more on next line
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    # count the comma-separated parts; a proper name and address will
    # have at least 5 parts
    my @parts = split /,/, $line;
    if (@parts > 4) {
        if ($line =~ /@/) {
            # email is last element when split on semicolons, so save it
            my @bits = split(/;/, $line);
            $email = pop(@bits);
            # put line back together (just in case there's more than
            # one semicolon in the record)
            $line = join(';', @bits);
        }
        my @bits = split(/,/, $line);    # now split on commas
        $lname = shift @bits;            # lname is first bit
        $fname = shift @bits;            # followed by fname
        $address = join(',', @bits);     # the rest is the address
    } else {
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    ....
}
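
One way to keep embedded commas from derailing the field split is to cap the number of fields that split returns. A minimal sketch, assuming the record has already been joined into a single string and that the e-mail address, if present, follows the last semicolon. The name parse_record and the sample address are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one already-joined record into its fields.
sub parse_record {
    my ($record) = @_;
    my $email = '';
    # Assumption: the e-mail, if present, follows the last semicolon.
    if ($record =~ s/;\s*(\S+\@\S+)\s*$//) {
        $email = $1;
    }
    # LIMIT of 3: everything after the second comma stays in the address,
    # however many commas it contains.
    my ($lname, $fname, $address) = split /\s*,\s*/, $record, 3;
    return {
        lname   => $lname   // '',
        fname   => $fname   // '',
        address => $address // '',
        email   => $email,
    };
}

my $r = parse_record(
    'Bush, George, President, 1 White House Way, Washington, DC 00000; someone@example.com'
);
print "$_=$r->{$_}\n" for qw(lname fname address email);
```

The LIMIT argument does not help find where a record ends; it only makes the split safe once the record boundaries are known.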

A. Sinan Unur · Dec 8, 2004

Here is somewhat of a kludge that "works" for the snippet you posted. Hope
this helps.

#! perl

use strict;
use warnings;

use File::Slurp;

my $input = read_file(\*DATA);
$input =~ tr/\n/ /;

my @records;

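while (length $input) {
    $input =~ s/^\s+//;    # drop whitespace left over from the previous record
    last unless length $input;
    my %record;
    $record{lname} = grab_name($input);
    $record{fname} = grab_name($input);
    # kludge: two capital letters followed by digits mark the end of the
    # address; stop if no such marker is left in the input
    last unless $input =~ /[A-Z]{2} \d+/g;
    $record{address} = substr $input, 0, pos($input);
    $input = substr $input, pos($input);
    if ($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
        $record{email} = $1;
        $input = substr $input, pos $input;
    }
    push @records, \%record;
}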

use Data::Dumper;
print Dumper \@records;

sub grab_name {
    my $off = index $_[0], ',';
    my $name = substr $_[0], 0, $off;
    $_[0] = substr $_[0], $off + 2;    # skip the comma and the space after it
    return $name;
}

__DATA__
Bush, George, President, 1 White House Way, Washington,
DC 00000; (e-mail address removed)
Kerry, John, 1 Main, Detroit, MI 00000; (e-mail address removed)
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; (e-mail address removed)
Blair, Tony, 1 Downing Street, London, UK 0000000

Alan Mead · Dec 8, 2004

> Here is somewhat of a kludge that "works" for the snippet you posted. Hope
> this helps.

And so it does very nicely. I think you are making use of the fact that
these all had a pair of capital letters near the end (including the
convenient UK) but there is a 'D.C.' in my data and some other
addresses outside the US (that lack this feature). I should have included
a better sample. But this may get me to 95% ... The way you've slurped the
file makes this perfectly applicable to the rest of the files which is a
REALLY BIG help.

Thanks!

-Alan

A. Sinan Unur · Dec 8, 2004

> $input =~ /[A-Z]{2} \d+/g;
>
> And so it does very nicely. I think you are making use of the fact
> that these all had a pair of capital letters near the end (including
> the convenient UK) but there is a 'D.C.' in my data and some other
> addresses outside the US (that lack this feature).

Actually, that is a standing for some kind of Country/State Code with
numeric postal code match because all your addresses seemed to end with
that.

The "two capital letters followed by some digits as end of mailing address
indicator" was one of the things that made the code kludgy.
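
To make that limitation concrete, here is a small sketch that applies the same pattern to a few address endings. The first two come from the posted sample; the 'D.C.' string is an invented example of the case mentioned above, and has_terminator is a hypothetical helper name.

```perl
use strict;
use warnings;

# Same end-of-address pattern as in the script above.
my $end_of_address = qr/[A-Z]{2} \d+/;

# Hypothetical helper: 1 if the pattern finds an address terminator, else 0.
sub has_terminator {
    my ($text) = @_;
    return $text =~ $end_of_address ? 1 : 0;
}

for my $sample (
    'Burbank, CA 00000',
    'London, UK 0000000',
    'Washington, D.C. 00000',    # invented: periods break the two-capitals run
) {
    printf "%-25s %s\n", $sample, has_terminator($sample) ? 'match' : 'no match';
}
```

Any address without two adjacent capital letters directly before the postal code falls through in the same way.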

I am sure others will provide better ways once the sun comes up. Good luck.

Sinan.

A. Sinan Unur · Dec 8, 2004

> Actually, that is a standing for some kind of Country/State Code with
                      ^^^^^^^^
I meant 'stand-in'. Sorry.

Sinan
