suggestions on intelligent processing of data sets in a file

alt.testing · May 9, 2007

Hi all,
I am writing a script to parse files and insert the data into MySQL.
The task is simple enough for files containing "standard" fields.
However, there are many files for which this is not the case.
Some of the files even vary in the number of fields therein.

Example: (fields are email, name, postcode, phone)
(e-mail address removed), Firstname Lastname
(e-mail address removed), Firstname Lastname, 2004, 0412 321 512
(e-mail address removed), Firstname Lastname, 0412 321 512

Now, other than the obvious and easy solution of breaking up the files
into chunks that are "known" and internally consistent in terms of
data fields, I want to build a mechanism that can:

1. Autodetect the number of fields line by line and build the data
structure as it goes.
2. Verify (or guess) the "type" of each field using regex.

I don't mind using modules, but would prefer ones shipped as
standard. Otherwise I'll build my own, as I really want to start a bit
of "OO", and this could be a good start.

I have a feeling that creating a class, and building some methods
that can create objects (one per data set) that reference/manipulate
the actual data structures (or something similar), might be a good
approach. This way operations can actually be built on the fly? Mind
you, I've not yet created a module, so this is my first time. Is this
the best approach, or something else, perhaps?

Could anyone suggest some things that I might try?

tia

Full Context (some rough ideas as a starting point)
===============================================================================
#!/usr/bin/perl

use strict;
use warnings;

use DBI;

my $email_index;
my $name_index;
my $location_index;
my $mobile_index;

my $input_file = $ARGV[0];
my @working_data_array;
my $email;
my $mobile;
my $name;
my $location;
my $counter;

my $email_regex = qr/^ *[a-zA-Z0-9_.-]*@[a-zA-Z0-9_.-]*\.[a-zA-Z0-9_.-]*/;
my $mobile_regex = qr/^ *[04][0-9 ]{8,12}/;
my $name_regex = qr/^ *[a-z -]*/;
my $location_regex = qr/^ *[a-zA-Z0-9 ]*/;

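set_indexes();

open( my $in_fh, '<', $input_file ) or die "Cannot open $input_file: $!";

while ( <$in_fh> ) {
    next unless ( /@/ );    # skip lines that contain no e-mail address
    chomp;
    @working_data_array = split( /,/ );

    # Fields missing from the line, or not named on the command line,
    # are printed as empty strings instead of triggering warnings.
    ( $email, $name, $location, $mobile ) =
        map { defined $_ && defined $working_data_array[$_] ? $working_data_array[$_] : '' }
        ( $email_index, $name_index, $location_index, $mobile_index );

    print "$email$name$location$mobile\n";
}

close $in_fh;

exit;

# Map the field names that follow the file name on the command line
# (for example: script.pl data.txt email name location mobile)
# to column positions. $ARGV[0] is the file itself, so start at 1.
sub set_indexes {
    for $counter ( 1 .. $#ARGV ) {
        $email_index    = $counter - 1 if ( $ARGV[$counter] =~ /email/ );
        $name_index     = $counter - 1 if ( $ARGV[$counter] =~ /name/ );
        $location_index = $counter - 1 if ( $ARGV[$counter] =~ /location/ );
        $mobile_index   = $counter - 1 if ( $ARGV[$counter] =~ /mobile/ );
    }
}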

Tad McClellan · May 9, 2007

Some of the files even vary in the number of fields therein.

Example: (fields are email, name, postcode, phone)
(e-mail address removed), Firstname Lastname
(e-mail address removed), Firstname Lastname, 2004, 0412 321 512
(e-mail address removed), Firstname Lastname, 0412 321 512

I want to build a mechanism that can:

1. Autodetect the number of fields line by line and build the data
structure as it goes.
2. Verify (or guess) the "type" of each field using regex.

------------------------
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

while ( <DATA> ) {
    chomp;
    my %record;
    foreach my $part ( split /,\s*/ ) {
        if ( $part =~ /^\d+$/ )            # all digits
            { $record{postcode} = $part }
        elsif ( $part =~ /^[\d\s]+$/ )     # digits with spaces
            { $record{phone} = $part }
        elsif ( $part =~ /@/ )             # contains at-sign
            { $record{email} = $part }
        else
            { $record{name} = $part }
    }
    print Dumper \%record;
}

__DATA__
(e-mail address removed), Firstname Lastname
(e-mail address removed), Firstname Lastname, 2004, 0412 321 512
(e-mail address removed), Firstname Lastname, 0412 321 512
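
The original post also asks whether a class would be a sensible way to package this. As a rough sketch only, the same first-match-wins idea can be wrapped in a small object that holds an ordered list of type rules. The package name FieldGuesser, its method names, the stricter patterns and the placeholder address are all assumptions made for illustration; none of them come from the thread or from a standard module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical class: wraps an ordered list of [ type, pattern ] rules.
package FieldGuesser;

# Default rules, checked in order; the first match decides the type.
# These patterns are assumptions fitted to the sample lines only.
my @DEFAULT_RULES = (
    [ email    => qr/@/ ],                        # contains an at-sign
    [ postcode => qr/^\d{4}$/ ],                  # exactly four digits
    [ phone    => qr/^[\d\s]{8,}$/ ],             # digits and spaces, 8+ chars
    [ name     => qr/^[A-Za-z][A-Za-z' -]*$/ ],   # letters, spaces, ' and -
);

# Constructor; a caller may pass its own rules => [ [ type, qr/.../ ], ... ].
sub new {
    my ( $class, %args ) = @_;
    my $self = { rules => $args{rules} || \@DEFAULT_RULES };
    return bless $self, $class;
}

# Return the type name for one field, or 'unknown' if no rule matches.
sub guess_type {
    my ( $self, $field ) = @_;
    for my $rule ( @{ $self->{rules} } ) {
        my ( $type, $pattern ) = @$rule;
        return $type if $field =~ $pattern;
    }
    return 'unknown';
}

# Split one line on commas and return a hash reference of type => value.
# If two fields guess to the same type, the later one wins.
sub parse_line {
    my ( $self, $line ) = @_;
    chomp $line;
    my %record;
    for my $field ( split /,/, $line ) {
        $field =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $field eq '';
        $record{ $self->guess_type($field) } = $field;
    }
    return \%record;
}

package main;

my $guesser = FieldGuesser->new;

# Placeholder address; the real ones were removed from the sample data.
my $record = $guesser->parse_line(
    'someone@example.com, Firstname Lastname, 2004, 0412 321 512' );

print "$_ = $record->{$_}\n" for sort keys %$record;
```

Keeping the rules in an array rather than an if/elsif chain means the order of the checks is data, so a file with a different layout can be given its own rule list through new( rules => ... ) without touching the parsing code. Fields that match nothing land under 'unknown', which makes bad guesses easy to spot before anything is inserted into the database.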

alt.testing · May 14, 2007

thanks Tad
