Ted Byers
At this point, I don't even know what sort of query to submit to
google to find resources to help find an automated solution to this
problem. I can do it manually, but that is quite tedious as I have a
couple thousand distinct strings to process, and for all I know, I
could have thousands more a month from now.
This is a business problem, in that the data represents company data
in which the company has provided a description of the business.
E.g.:
"barber & hair salon"
"barber /beauty salon"
"barber college"
"barber salon"
"barber school"
"barber shop "
"barber shop & hair salon"
"barber shop and beauty salon"
"barber shop"
"barber shop/ bar & grill "
"barber shop/ hair salon"
"barber shop/natural hair salon"
"barbershop"
"barbershop/hair salon"
"hair salon "
"hair salon "
"hair salon & day spa"
"hair salon and spa"
"hair salon"
"hair salon, nails, tanning, products, bistro, crafts & food consignments"
"hair salon, spa, herbal clinic, boutique all in 1"
"hair salon/ club"
"hair salon/ spa"
"hair salon/nail shop"
"hair school"
"hair store"
"hair studio and hair product distribution"
"hair supply store"
What I need to do is reduce the number of "business types" in the data
to a few rational choices. I can tell, from visual inspection, that
the businesses with most of the labels listed above can be grouped as
"personal grooming services". However, the school/college type
businesses would not be appropriately included in such a group.
Neither would those with the last three labels be appropriately
included in such a group.
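To make the goal concrete, here is a minimal sketch of the kind of grouping I mean, done with ordered keyword rules. It is written in Python purely for illustration (the same logic would port directly to perl). The rule list, the function name classify, and every group name other than "personal grooming services" are placeholders I made up for this example, not an established taxonomy.

```python
import re

# Hypothetical rules, checked in order; the first match wins.
# Exclusions (schools, stores) come first so they are not swallowed
# by the broader grooming rule.
RULES = [
    (re.compile(r"\b(school|college)\b"), "education"),
    (re.compile(r"\b(store|supply|distribution)\b"), "retail/distribution"),
    (re.compile(r"\b(barber\s*shop|barber|salon|spa)\b"), "personal grooming services"),
]

def classify(label):
    """Return the group of the first matching rule, or 'unclassified'."""
    # Lowercase and collapse runs of whitespace before matching.
    text = " ".join(label.lower().split())
    for pattern, group in RULES:
        if pattern.search(text):
            return group
    return "unclassified"

if __name__ == "__main__":
    for label in ["barber shop ", "barber college", "hair supply store"]:
        print(repr(label), "->", classify(label))
```

Anything that falls through to "unclassified" would still need a manual look, but that should be a much shorter list than the full set of distinct strings.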
This task, as I said, is rather easy, but tedious and time consuming,
to handle manually.
The question is, "Is there a Perl package or other resource that would
make this task something I can automate?" Or, if you have experience
with this sort of thing, can you advise on a suitable search in Google
that will produce more useful information than random noise? I ask
here because this strikes me as a kind of task that Perl would be
particularly good at (I have already made a start, using Perl, to
clean up the data: e.g. to remove irrelevant characters, spelling
mistakes, &c.).
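As an illustration of that kind of cleanup, here is a minimal sketch, again in Python for illustration only and not the actual code I have been using. The function name normalize and the particular substitutions are assumptions chosen to fit the sample labels above.

```python
import re

def normalize(label):
    """Collapse trivial spelling/punctuation variants of a label."""
    text = label.lower().replace("&", " and ")
    # Give every slash the same spacing: "a/ b" and "a/b" become "a / b".
    text = re.sub(r"\s*/\s*", " / ", text)
    # Treat "barber shop" and "barbershop" as the same term.
    text = re.sub(r"\bbarber\s+shop\b", "barbershop", text)
    # Trim the ends and collapse runs of whitespace.
    return " ".join(text.split())

if __name__ == "__main__":
    samples = ["barber shop/ hair salon", "barbershop/hair salon", "hair salon "]
    print(sorted({normalize(s) for s in samples}))
```

Running the labels through something like this first reduces the number of distinct strings before any grouping is attempted.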
Any information you can provide would be appreciated.
Thanks
Ted