Grouping like items together

AcCeSsDeNiEd · Nov 15, 2005

I have several tens of thousands of files with no directories.

I'm trying to group the 'similar' files together and place them in a directory.

Examples of such files:

Mike F. 2332445-withdrawal.pdf
Mike F. 43565654-letter.pdf
Mike F. 434324.sign.pdf
Dawn M. Yang letter of acceptance.pdf
Dawn M. Yang (01).pdf
Dawn M. Yang 4355434 SOA.pdf

I'm trying to group these files by their names.
The names are not in a fixed format. E.g., not every name has a middle name.
If these names were in a list, how would I match and group them together? How would I know the group name?

Thx.

To e-mail, remove the obvious

usenet · Nov 15, 2005

AcCeSsDeNiEd said:
E.g of such files:

Mike F. 2332445-withdrawal.pdf
Mike F. 43565654-letter.pdf
Mike F. 434324.sign.pdf
Dawn M. Yang letter of acceptance.pdf
Dawn M. Yang (01).pdf
Dawn M. Yang 4355434 SOA.pdf

I'm trying to group these files by their names.
The names are not in a fixed format....

This is a dreadful question (meaning it is very hard to ascertain your
intent). The best way to get a good answer is to ask a good question.
You have asked a very bad question, so you will only get a very bad
answer (as I believe PG has already provided).

But, unlike PG, I am here to help you, not berate you. First of all,
you should read the posting guidelines for this group. They can be
found on-line at:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

These guidelines exist for YOUR benefit (because they show you how to
compose effective posts which are much more likely to get effective
responses - without getting flamed).

Next, (at a VERY MINIMUM) you need to tell us EXACTLY what you want to
do. So I am gonna ask you a question. The question is not simply
rhetorical; I want you to sit down at your keyboard and actually type
out an answer and post it here. Here is the question:

What would you want the directory structure to look like (i.e., what
would be the names of the subdirectories - give me a complete list of
exactly what you want the subdirectory names to be) if your filenames looked
like this (and pay attention to the filenames - some are identical for
quite a number of characters):

Mike 12345.pdf
Mike G. 2332445-withdrawal.pdf
Mike G. 12345.pdf
Mike G. Johnson 12345.pdf
Mike F. Smith 12345.pdf
Mike F. Jones 12345.pdf
Mike F. Jones 12345 (01).pdf
Mike F. Jones 12345 (02).pdf
Mike F. 2332445-withdrawal.pdf
Mike F. 434324.sign.pdf
Mike F. 434324.everywhere_a_sign.pdf
Mike 12345.pdf
Mike Carlson 12345.pdf
Mike C. 12345.pdf

If my question is not clear, let me put it another way: If you were
manually creating directories to organize these filenames, what
directories would you create? I would like you to actually post the
answer to that question so we can better understand your intent.

Tad McClellan · Nov 15, 2005

AcCeSsDeNiEd said:
I'm trying to group the 'similar' files together and place them in a directory.

One step of the solution would be to get them sorted by "name"...

E.g of such files:

Mike F. 2332445-withdrawal.pdf
Mike F. 43565654-letter.pdf
Mike F. 434324.sign.pdf
Dawn M. Yang letter of acceptance.pdf
Dawn M. Yang (01).pdf
Dawn M. Yang 4355434 SOA.pdf

.... so your test data should not be already sorted.

I'm trying to group these files by their names.

Another part of the solution then would be to identify where
the "names" end.

The names are not in a fixed format.

Then you will need to identify every case so that you can write
code that will handle every case.

E.g, not all names may have a middle name.

But you identify only one of the cases, and provide none of that one
case in your test data.

Do you also have:

Mike F. Smith 1234.pdf

where you need it to be grouped with "Mike F."?

You make it too hard to help you...

if these names were in a list, how do I match and group them together?

You need to separate the "name" from "the rest" to start with.

I will assume that each component of a "name" starts with an
upper case letter, and that the first part after the name
does NOT start with an upper case letter.

If you had lines like the above in a file, then this seems
to do a credible job of identifying the "name" part:

----------------------------------------
#!/usr/bin/perl
use warnings;
use strict;

while ( <> ) {
    # capture the leading run of capitalised words, each followed by a space
    next unless /^(([A-Z]\S+ )+)/;
    chop(my $name = $1);    # drop the trailing space
    print "'$name'\n";
}
----------------------------------------

How would I know the group
name?

See above.
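
Once the "name" part is known, the grouping itself can be a hash of arrays keyed on that name. What follows is only a minimal sketch under the same assumption as above (every component of a name starts with an upper case letter and what follows the name does not). The sub names `extract_name` and `group_by_name` and the `(unmatched)` bucket are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Extract the leading run of capitalised words with the same kind of
# pattern as the script above; returns undef if nothing matches.
sub extract_name {
    my ($filename) = @_;
    return unless $filename =~ /^((?:[A-Z]\S+ )+)/;
    ( my $name = $1 ) =~ s/ \z//;    # drop the trailing space
    return $name;
}

# Build a hash of arrays: name => [ filenames that start with it ].
sub group_by_name {
    my %group;
    for my $filename (@_) {
        my $name = extract_name($filename);
        $name = '(unmatched)' unless defined $name;    # hypothetical bucket
        push @{ $group{$name} }, $filename;
    }
    return \%group;
}

# Deliberately unsorted test data.
my @files = (
    'Dawn M. Yang (01).pdf',
    'Mike F. 434324.sign.pdf',
    'Dawn M. Yang letter of acceptance.pdf',
    'Mike F. 2332445-withdrawal.pdf',
);

my $group = group_by_name(@files);
for my $name ( sort keys %$group ) {
    print "$name\n";
    print "    $_\n" for @{ $group->{$name} };
}
```

The keys of the hash are the candidate group names. Anything that lands in the `(unmatched)` bucket still has to be looked at by hand.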

AcCeSsDeNiEd · Nov 15, 2005

Impossible. Your files are in a directory. Only exceptions which come
to mind, would be you are running an ENIAC machine, circa 1950, or
an old IBM 600 series machine which stores data on punch cards.

Dude, I didn't mean that in a literal sense.
LOL. I just left out their tiny details so the post wouldn't be too
long.

What I meant was that the files are not grouped properly.
These are basically client files that are kept in each staff's
directory. But within the staff's directory, all the client files are
just dumped there without even being sorted into a folder with regard
to the client name. It seems that someone thought they would rather just
add the client name to the file name. Costly mistake which I have to
clean up for them now.

One staff member can have as many as 6k files, so scrolling down the
list is getting rather slow.

It pisses me off that I told them 4 yrs ago not to do this, and
now I'm still the one who ends up doing the cleaning up.
Sorry......

Is it you want to create directories based on names but your mind
went up in stinky Chong smoke and you forgot to mention this?

Yes. But these names can only be extracted from the file names.
But how would I know which files are *like* and *what* is like about
them? So that I can create the directory and push the files over to it.

To e-mail, remove the obvious

AcCeSsDeNiEd · Nov 16, 2005

Well, I've just given up doing this programmatically.
I've taken a closer look at the naming conventions.
One method I thought of was to split the name from the numbers.
But I've come across files that do not have numbers just after the client's name.
The client name always comes on the left of the filename, but the rest of the filename is just too
'gibberish'.

Not gonna happen. At least not until computers are capable of AI.

Thx for the help anyways.

My coy will just have to hire temp staff to clean up this mess.

Btw, we have 400k files.
So good luck on the manual process.

To e-mail, remove the obvious

A. Sinan Unur · Nov 16, 2005

Well, I've just given up doing this programmatically.

Please quote some context when you reply.

I've taken a closer look at the naming conventions.
One method I thought of was to split the name from the numbers.
But I've come across files that do not have numbers just after the
client's name. The client name always comes on the left of the
filename, but the rest of the filename is just too 'gibberish'.

Not gonna happen. At least not until computers are capable of AI.

As a first stab, grouping files on the basis of closeness of their
names may reduce the amount of work needed.

See if

http://search.cpan.org/~jgoldberg/Text-LevenshteinXS-0.03/LevenshteinXS.pm

helps. I could see myself using something like this to first
distribute files into sub-directories. Then the manual work of checking
for incorrectly identified files ought to be less.

Sinan
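
A rough sketch of that first pass, for illustration only. The `distance` sub here is a plain-Perl stand-in so that the example runs without the XS module installed. The prefix length of 10, the threshold of 2, and the sub name `group_by_closeness` are arbitrary assumptions that would need tuning against the real file names.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw(min);

# Plain-Perl Levenshtein distance (stand-in for an XS implementation).
sub distance {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur,
              min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Greedy grouping: a file joins the first group whose key (a lowercased
# filename prefix) is within $max_dist edits, otherwise it starts a group.
sub group_by_closeness {
    my ( $files, $prefix_len, $max_dist ) = @_;
    my @groups;    # each group: { key => prefix, files => [ ... ] }
  FILE: for my $file (@$files) {
        my $key = lc( substr( $file, 0, $prefix_len ) );
        for my $g (@groups) {
            if ( distance( $key, $g->{key} ) <= $max_dist ) {
                push @{ $g->{files} }, $file;
                next FILE;
            }
        }
        push @groups, { key => $key, files => [$file] };
    }
    return @groups;
}

my @files = (
    'Mike F. 2332445-withdrawal.pdf',
    'Mike F. 43565654-letter.pdf',
    'Mike F. 434324.sign.pdf',
    'Dawn M. Yang letter of acceptance.pdf',
    'Dawn M. Yang (01).pdf',
    'Dawn M. Yang 4355434 SOA.pdf',
);

for my $g ( group_by_closeness( \@files, 10, 2 ) ) {
    print "[$g->{key}]\n";
    print "    $_\n" for @{ $g->{files} };
}
```

Comparing every file against every group gets slow as the number of groups grows, so for a very large collection it would probably make sense to run this per staff directory rather than over everything at once.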

ekkehard.horner · Nov 16, 2005

AcCeSsDeNiEd said:
I have several 10s of thousands files with no directories.

I'm trying to group the 'similar' files together and place them in a directory.

E.g of such files:

Mike F. 2332445-withdrawal.pdf
Mike F. 43565654-letter.pdf
Mike F. 434324.sign.pdf
Dawn M. Yang letter of acceptance.pdf
Dawn M. Yang (01).pdf
Dawn M. Yang 4355434 SOA.pdf

I'm trying to group these files by their names.
The names are not in a fixed format. E.g, not all names may have a middle name.
if these names were in a list, how do I match and group them together? How would I know the group
name?

-----------

AcCeSsDeNiEd said:
Well, I've just given up doing this programmatically.
I've taken a closer look at the naming conventions.
One method I thought of was to split the name from the numbers.
But I've come across files that do not have numbers just after the client's name.
The client name always comes on the left of the filename, but the rest of the filename is just too
'gibberish'.

Not gonna happen. At least not until computers are capable of AI.

Thx for the help anyways.

My coy will just have to hire temp staff to clean up this mess.

Btw, we have 400k files.
So good luck on the manual process.

How about starting with a list of Users (storing Name, TargetDir, and
a (growing) list of alias names given as RegExps). First move the 'clear
cases' to the TargetDirs, then view the remaining files to improve the
alias names.
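
Something along these lines, as a sketch only. The two users, their target directories and the alias patterns are invented for illustration, and `target_for` is a made-up helper. A file counts as a 'clear case' only if exactly one user matches it.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical user list: Name, TargetDir and a list of alias RegExps
# that grows as the leftover files are reviewed.
my @users = (
    {
        name       => 'Mike Forster',
        target_dir => 'Mike_Forster',
        aliases    => [ qr/^Mike\s+F\.?\s/i, qr/^Mike\s+Forster\b/i ],
    },
    {
        name       => 'Dawn Yang',
        target_dir => 'Dawn_Yang',
        aliases    => [ qr/^Dawn\s+(?:M\.?\s+)?Yang\b/i ],
    },
);

# Return the TargetDir of the only user whose aliases match, or undef
# when no user (or more than one user) matches.
sub target_for {
    my ($filename) = @_;
    my @hits = grep {
        my $user = $_;
        scalar grep { $filename =~ $_ } @{ $user->{aliases} };
    } @users;
    return @hits == 1 ? $hits[0]{target_dir} : undef;
}

my @files = (
    'Mike F. 2332445-withdrawal.pdf',
    'DAWN M. YANG 4355434 SOA.pdf',
    'Dawn Yang (01).pdf',
    'Mike 12345.pdf',
);

my ( %clear, @remaining );
for my $file (@files) {
    my $dir = target_for($file);
    if ( defined $dir ) { push @{ $clear{$dir} }, $file }
    else                { push @remaining, $file }
}

for my $dir ( sort keys %clear ) {
    print "$dir:\n";
    print "    $_\n" for @{ $clear{$dir} };
}
print "remaining (review these and extend the aliases):\n";
print "    $_\n" for @remaining;
```

The patterns are case-insensitive, so file names written entirely in capitals are handled by the same aliases.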

Anno Siegel · Nov 16, 2005

AcCeSsDeNiEd said:
Well, I've just given up doing this programmatically.
I've taken a closer look at the naming conventions.
One method I thought of was to split the name from the numbers.
But I've come across files that do not have numbers just after the
client's name.
The client name always comes on the left of the filename, but the rest
of the filename is just too
'gibberish'.

Not gonna happen. At least not until computers are capable of AI.

Thx for the help anyways.

My coy will just have to hire temp staff to clean up this mess.

Btw, we have 400k files.
So good luck on the manual process.

A list of valid names (even a modest one) could help sort out the
clear cases. If everything in the formats "first middle last",
"first last" and "first middle" with verified "first" and "last"
(and a middle something like /[[:upper:]]\./) were accepted automatically,
that could reduce the amount of manual processing considerably. I am
appending a sketch of how this could work.

Name lists are available from the US Census Bureau, typical file names
are dist.all.last, dist.female.first, and dist.male.first.

Anno

#!/usr/bin/perl
use strict; use warnings; $| = 1; # @^~`
use Vi::QuickFix; # CPAN debugging aid, not a core module; drop this line if it is not installed

my ( %first, %last);
my $namedir = "$ENV{ HOME}/dict/us-census-names";
my $in;
open $in, $_ or die "Can't read $_: $!" for "$namedir/dist.female.first";
@first{ map /(\S+)/, <$in> } = ();
open $in, $_ or die "Can't read $_: $!" for "$namedir/dist.male.first";
@first{ map /(\S+)/, <$in> } = ();
open $in, $_ or die "Can't read $_: $!" for "$namedir/dist.all.last";
@last{ map /(\S+)/, <$in> } = ();

my ( @accepted, @rejected);
while ( <DATA> ) {
    chomp;
    my ( $first, $middle, $last) = split;
    # the census lists are in upper case, hence the uc
    unless ( exists $first{ uc $first} ) {
        push @rejected, "$first $middle $last";
        next;
    }
    if ( $middle =~ /[[:upper:]]\./ ) {
        if ( exists $last{ uc $last} ) {
            push @accepted, "$first $middle $last";
        }
        else {
            # "first middle" with no verified last name
            push @accepted, "$first $middle";
        }
    }
    else {
        # no middle initial: the second word must be the last name
        $last = $middle;
        if ( exists $last{ uc $last} ) {
            push @accepted, "$first $last";
        }
        else {
            push @rejected, "$first $last";
        }
    }
}
print "accepted:\n";
print "$_\n" for @accepted;
print "\nrejected:\n";
print "$_\n" for @rejected;

__DATA__
Mike 12345.pdf
Mike G. 2332445-withdrawal.pdf
Mike G. 12345.pdf
Mike G. Johnson 12345.pdf
Mike F. Smith 12345.pdf
Mike F. Jones 12345.pdf
Mike F. Jones 12345 (01).pdf
Mike F. Jones 12345 (02).pdf
Mike F. 2332445-withdrawal.pdf
Mike F. 434324.sign.pdf
Mike F. 434324.everywhere_a_sign.pdf
Mike 12345.pdf

Dr.Ruud · Nov 17, 2005

AcCeSsDeNiEd:

I'm trying to group the 'similar' files together and place them in a
directory.

E.g of such files:

Mike F. 2332445-withdrawal.pdf
Mike F. 43565654-letter.pdf
Mike F. 434324.sign.pdf
Dawn M. Yang letter of acceptance.pdf
Dawn M. Yang (01).pdf
Dawn M. Yang 4355434 SOA.pdf

I'm trying to group these files by their names.
The names are not in a fixed format. E.g, not all names may have a
middle name.
if these names were in a list, how do I match and group them
together? How would I know the group name?

#!/usr/bin/perl
use strict; use warnings;

{ local ($,,$\) = ("\t", "\n");

for (<>) {

chomp;

/^(                        # start a capturing group
    [[:upper:]]            # a Word should start with a capital
    [[:lower:][:punct:]]+  # followed by 1 or more specific chars
    (?:                    # start a non-capturing group
      \s+                  # 1 or more wsp chars
      [[:upper:]]          # followed by another Word
      [[:lower:][:punct:]]+
    )*                     # 0 or more trailing Words
)/x;                       # end of capturing group

print "[$1]", $_;
}
}

$ names.pl < names.inp
[Mike F.] Mike F. 2332445-withdrawal.pdf
[Mike F.] Mike F. 43565654-letter.pdf
[Mike F.] Mike F. 434324.sign.pdf
[Dawn M. Yang] Dawn M. Yang letter of acceptance.pdf
[Dawn M. Yang] Dawn M. Yang (01).pdf
[Dawn M. Yang] Dawn M. Yang 4355434 SOA.pdf

You can use a hash to convert from name to group, with entries like:

"Mike F." => "Mike_Forster"
"Dawn M. Yang" => "Dawn_Yang"

AcCeSsDeNiEd · Nov 17, 2005

Name lists are available from the US Census Bureau, typical file names
are dist.all.last, dist.female.first, and dist.male.first.

Thx for the help. But more than half the names are not English.
And the whole filename is in caps.
Sigh...

To e-mail, remove the obvious
