General natural language analysis question: where do I start?

Ted Byers

At this point, I don't even know what sort of query to submit to
google to find resources to help find an automated solution to this
problem. I can do it manually, but that is quite tedious as I have a
couple thousand distinct strings to process, and for all I know, I
could have thousands more a month from now.

This is a business problem, in that the data represents company data
in which the company has provided a description of the business.
E.g.:

"barber & hair salon"
"barber /beauty salon"
"barber college"
"barber salon"
"barber school"
"barber shop "
"barber shop & hair salon"
"barber shop and beauty salon"
"barber shop"
"barber shop/ bar & grill "
"barber shop/ hair salon"
"barber shop/natural hair salon"
"barbershop"
"barbershop/hair salon"
"hair salon "
"hair salon "
"hair salon & day spa"
"hair salon and spa"
"hair salon"
"hair salon, nails, tanning, products, bistro, crafts & food consignments"
"hair salon, spa, herbal clinic, boutique all in 1"
"hair salon/ club"
"hair salon/ spa"
"hair salon/nail shop"
"hair school"
"hair store"
"hair studio and hair product distribution"
"hair supply store"

What I need to do is reduce the number of "business types" in the data
to a few rational choices. I can tell from visual inspection that
businesses with most of the labels listed above can be grouped as
"personal grooming services". However, the school/college type
businesses would not belong in such a group, and neither would those
with the last three labels.

This task, as I said, is rather easy, but tedious and time consuming,
to handle manually.

The question is, "Is there a perl package or other resource that would
make this task something I can automate?" Or, if you have experience
with this sort of thing, can you advise on a suitable search in google
that will produce more useful information than random noise? I ask
here because this strikes me as a kind of task that perl would be
particularly good at (I have already made a start, using perl, to
clean up the data: e.g. to remove irrelevant characters, spelling
mistakes, &c.).

Any information you can provide would be appreciated.

Thanks

Ted
 
ccc31807

Ted,

Looking at your data, I see that every row contains either 'barber' or
'hair' and that it would be trivial to filter your data according to
this criterion, like this maybe:

push @grooming, $_ if $_ =~ /(barber|hair)/;

Obviously, you need some sets of eyes to decide if a 'barber school'
or 'hair supply store' should be included. My approach might be to use
automation to do some gross sorting and use humans to fine-tune your
data.

At the same time, you might develop some heuristics to improve your
automation, realizing that you can't depend on automation for absolute
perfection.
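
To make the "gross sorting plus human fine-tuning" idea concrete, here is a minimal sketch. The rule table is purely hypothetical (the 'education' and 'retail' group names and every pattern are my assumptions, not something taken from the real data): rules are tried in order, the exceptions come before the broad barber/hair rule, and anything that matches no rule lands in a review list for a person to look at.

```perl
use strict;
use warnings;

# Hypothetical rule table: checked in order, first match wins.
# The exceptions (schools, supply stores) sit before the broad rule.
my @rules = (
    [ qr/\b(?:school|college)\b/,            'education' ],
    [ qr/\b(?:store|supply|distribution)\b/, 'retail' ],
    [ qr/barber|hair|salon/,                 'personal grooming services' ],
);

sub classify {
    my ($label) = @_;
    my $text = lc $label;
    $text =~ s/\s+/ /g;      # collapse runs of whitespace
    $text =~ s/^ | $//g;     # trim leading/trailing space
    for my $rule (@rules) {
        my ($pattern, $group) = @$rule;
        return $group if $text =~ $pattern;
    }
    return;                  # no rule matched: leave it for a human
}

my (%grouped, @review);
for my $label ('barber shop ', 'barber college', 'hair supply store', 'day spa') {
    my $group = classify($label);
    if (defined $group) { push @{ $grouped{$group} }, $label }
    else                { push @review, $label }
}

print "$_: @{ $grouped{$_} }\n" for sort keys %grouped;
print "needs review: @review\n";
```

The ordering of the rules is the heuristic: each time a human corrects a wrong guess, the fix can usually be expressed as one more exception placed above the broad rule.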

CC
 
Ted Byers

Thanks.

I had noticed, but that was just one illustrative selection. Going
through the rest of the data since I originally posted, I found other
items that ought to be grouped with barber shops but which include
neither hair nor barber. I have, in fact, a file with almost 3000
records covering every imaginable kind of business, and some for which
I have no idea what the business actually does.

As we're looking at a "simple" classification with something of the
order of 100 logical groups, it would be at least as time consuming to
manually come up with a filter for each group as it is to simply
manually reclassify each using any decent spreadsheet. I was hoping
that there was a package, with a dictionary, that was able to produce
a relation between a set of phrases and a set of synonymous words that
would accelerate the process.

Thanks again,

Ted
 
Jürgen Exner

Ted Byers said:
> What I need to do is reduce the number of "business types" in the data
> to a few rational choices.

There are people who have done that already. You can find their
classification and "business types" in any yellow pages book.

> I can tell, from visual inspection, that
> the businesses with most of the above listed labels, can be grouped as
> "personal grooming services". However, the school/college type
> businesses would not be appropriately included in such a group.
> Neither would those with the last three labels be appropriately
> included in such a group.

No chance but to manually classify them. You might be able to automate
some of it (e.g. for "barber shop"), but otherwise you need semantic
knowledge.

jue
 
Peter J. Holzer

[list with a few surprises deleted]

> There are people who have done that already. You can find their
> classification and "business types" in any yellow pages book.

AIUI coming up with a classification isn't the problem. Assigning free
text descriptions to classes is.

> No chance but to manually classify them. You might be able to automate
> some of it (e.g. for "barber shop"), but otherwise you need semantic
> knowledge.

I agree with that. It may help to have a program which tries to guess
the classification of each term and lets the user manually override it.
The guess could be implemented with Bayesian logic or something
similar.
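
A minimal sketch of what I mean, assuming a plain word-based naive Bayes classifier with add-one smoothing, written with core Perl only. The training labels come from the list earlier in the thread, but the class names 'education', 'retail' and 'restaurants' and the %override table are hypothetical. Labels that have already been classified by hand are fed to train(); guess() then proposes a class for a new label unless the user has overridden it.

```perl
use strict;
use warnings;

my (%class_count, %word_count, %total_words, %vocab, %override);
my $n_examples = 0;

# Split a label into lowercase words, dropping punctuation.
sub tokens {
    my ($label) = @_;
    return grep { length } split /[^a-z0-9]+/, lc $label;
}

# Record one manually classified label as training data.
sub train {
    my ($label, $class) = @_;
    $class_count{$class}++;
    $n_examples++;
    for my $w (tokens($label)) {
        $word_count{$class}{$w}++;
        $total_words{$class}++;
        $vocab{$w} = 1;
    }
}

# Return the user's override if there is one, else the most probable class.
sub guess {
    my ($label) = @_;
    return $override{$label} if exists $override{$label};
    my $v = scalar(keys %vocab) || 1;
    my ($best, $best_score);
    for my $class (sort keys %class_count) {
        # log prior plus log likelihood of each word, add-one smoothed
        my $score = log($class_count{$class} / $n_examples);
        for my $w (tokens($label)) {
            my $count = $word_count{$class}{$w} || 0;
            $score += log(($count + 1) / (($total_words{$class} || 0) + $v));
        }
        ($best, $best_score) = ($class, $score)
            if !defined($best_score) || $score > $best_score;
    }
    return $best;
}

train('barber shop',        'personal grooming services');
train('hair salon',         'personal grooming services');
train('hair salon and spa', 'personal grooming services');
train('barber school',      'education');
train('hair school',        'education');
train('barber college',     'education');
train('hair supply store',  'retail');
train('hair store',         'retail');

# A manual decision always wins over the guess.
$override{'barber shop/ bar & grill'} = 'restaurants';

print "$_ => ", guess($_), "\n"
    for 'barbershop/hair salon', 'beauty school', 'barber shop/ bar & grill';
```

Each manual correction can be passed back to train(), so the guesses should improve as the user works through the file. With only a few words per label the guesses will stay rough, which is why the override is there.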

hp
 