AcCeSsDeNiEd said:
Well, I've just given up doing this programmatically.
I've taken a closer look at the naming conventions.
One method I thought of was to split the name from the numbers.
But I've come across files that do not have numbers just after the
client's name.
The client name always comes on the left of the filename, but the rest
of the filename is just too 'gibberish'.
Not gonna happen. At least not until computers are capable of AI.
Thx for the help anyways.
My company will just have to hire temp staff to clean up this mess.
Btw, we have 400k files.
So good luck on the manual process.
A list of valid names (even a modest one) could help sort out the
clear cases. If everything in the formats "first middle last",
"first last" and "first middle" with verified "first" and "last"
(and a middle matching something like /[[:upper:]]\./) were accepted
automatically, that could reduce the amount of manual processing
considerably. I am appending a sketch of how this could work.
Name lists are available from the US Census Bureau; typical file names
are dist.all.last, dist.female.first, and dist.male.first.
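
To show what the lookup relies on, here is a minimal sketch of reading such a list into a hash. It assumes the files are plain text with the upper-case name in the first whitespace-delimited column, followed by frequency, cumulative frequency, and rank; the three sample lines are only illustrative and stand in for a real file such as dist.all.last.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Illustrative lines in the assumed layout of the Census files:
# name, frequency (%), cumulative frequency (%), rank.
# Only the first column is used.
my $sample = <<'END';
SMITH          1.006  1.006      1
JOHNSON        0.810  1.816      2
JONES          0.621  3.136      4
END

# Read from the string as if it were a file such as dist.all.last.
open my $in, '<', \$sample or die "Can't read sample data: $!";
my %last;
@last{ map /(\S+)/, <$in> } = ();   # keys are the names, values stay undef
close $in;

# The list is upper case, so candidates are upcased before the lookup.
for my $candidate ( qw( Jones 12345.pdf) ) {
    printf "%-10s %s\n", $candidate,
        exists $last{ uc $candidate} ? 'known last name' : 'not in list';
}
```

Only the keys matter, so the hash slice assigns no values and the test is done with exists. Anything that is not in the list, such as a leftover piece of the filename, simply fails the lookup.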
Anno
#!/usr/bin/perl
use strict; use warnings; $| = 1;

my ( %first, %last);
my $namedir = "$ENV{HOME}/dict/us-census-names";

# Add the first field of every line in $namedir/$file as a key of %$names.
sub load_names {
    my ( $names, $file) = @_;
    open my $in, '<', "$namedir/$file"
        or die "Can't read $namedir/$file: $!";
    @$names{ map /(\S+)/, <$in> } = ();
}

load_names( \%first, $_) for qw( dist.female.first dist.male.first);
load_names( \%last, 'dist.all.last');
my ( @accepted, @rejected);
while ( <DATA> ) {
    chomp;
    # Look at the first three fields only; default missing ones to ''
    # so that short lines don't trigger "uninitialized" warnings.
    my ( $first, $middle, $last) = ( split( ' ', $_), '', '');
    unless ( exists $first{ uc $first} ) {
        push @rejected, "$first $middle $last";
        next;
    }
    if ( $middle =~ /^[[:upper:]]\.$/ ) {
        # middle initial present: "first middle last" or "first middle"
        if ( exists $last{ uc $last} ) {
            push @accepted, "$first $middle $last";
        }
        else {
            push @accepted, "$first $middle";
        }
    }
    else {
        # no middle initial: the second field must be a known last name
        $last = $middle;
        if ( exists $last{ uc $last} ) {
            push @accepted, "$first $last";
        }
        else {
            push @rejected, "$first $last";
        }
    }
}
print "accepted:\n";
print "$_\n" for @accepted;
print "\nrejected:\n";
print "$_\n" for @rejected;
__DATA__
Mike 12345.pdf
Mike G. 2332445-withdrawal.pdf
Mike G. 12345.pdf
Mike G. Johnson 12345.pdf
Mike F. Smith 12345.pdf
Mike F. Jones 12345.pdf
Mike F. Jones 12345 (01).pdf
Mike F. Jones 12345 (02).pdf
Mike F. 2332445-withdrawal.pdf
Mike F. 434324.sign.pdf
Mike F. 434324.everywhere_a_sign.pdf
Mike 12345.pdf