Phil Rhoades [2008-04-26 19:13]:
This might work, but it would be more difficult without regexes -
the current application does a system call to agrep but of course
it is very slow for large numbers of calls. A typical call is
something like:
agrep -2 "Smith\|J.*12345" list1.txt list2.txt list3.txt
This allows two differences on a minimum amount of information
consisting of last name, first initial and zip code. If I use
the Enumerable version, I would have to use the whole, delimited,
name & address string and increase the differences/distance
number?
i think something like this could work in your case (requires the
Text gem):
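require 'text'  # provides Text::Levenshtein

# File.foreach closes the file when done; File.open without a block would not
File.foreach('list1.txt').select { |line|
  # extract name and zip code from line; skip lines that don't match at all
  next false unless line =~ /\A(.*?\|.).*\b(\d{5})\b/  # adjust appropriately!
  # name may have two errors, zip only one -- or whatever...
  Text::Levenshtein.distance($1, 'Smith|J') <= 2 &&
    Text::Levenshtein.distance($2, '12345') <= 1
}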
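to try the idea without the gem or the real files, here is a self-contained sketch. the edit_distance method is a plain dynamic-programming stand-in for Text::Levenshtein.distance, and the pipe-delimited record layout, the fuzzy_select helper and the SAMPLE data are all made up for illustration. it also shows that the whole name & address string is not needed: only the extracted fields are compared, each with its own error limit.

```ruby
# Plain dynamic-programming edit distance (insertions, deletions,
# substitutions). Stand-in for Text::Levenshtein.distance.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: keeps the lines whose "last name|first initial"
# and five-digit zip code are within the given error limits.
def fuzzy_select(lines, name, zip, name_errors = 2, zip_errors = 1)
  lines.select do |line|
    m = line.match(/\A(.*?\|.).*\b(\d{5})\b/)
    next false unless m  # line has no name/zip in the assumed layout
    edit_distance(m[1], name) <= name_errors &&
      edit_distance(m[2], zip) <= zip_errors
  end
end

# Made-up records, assumed layout: last|first|street|city|zip
SAMPLE = [
  "Smith|John|1 Main St|Springfield|12345",
  "Smyth|Jon|1 Main St|Springfield|12346",
  "Jones|Jane|9 Oak Ave|Shelbyville|54321",
]

matches = fuzzy_select(SAMPLE, 'Smith|J', '12345')
puts matches
```

with the sample data this keeps the Smith and Smyth records and drops Jones. the regexp would need adjusting to whatever the real list files look like.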
Did you just do that hack now?
that's right. but i just read a bit on agrep's algorithm and it
might be fun to implement it in ruby (though a bit slow, probably).
as an alternative, it might even be worth writing ruby bindings to
agrep. who knows, if time permits... ;-)
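for the curious, a rough sketch of what such a ruby version might look like. it follows the bit-parallel approach (Wu and Manber's "bitap") that agrep is based on, but it only handles plain strings, not regexps like the "Smith\|J.*12345" pattern above, and it is untuned. the method name fuzzy_match? is made up here.

```ruby
# Returns true if `pattern` occurs somewhere in `text` with at most
# `max_errors` insertions, deletions or substitutions.
# Bit-parallel (shift-and) sketch; plain strings only, case-sensitive.
def fuzzy_match?(text, pattern, max_errors)
  m = pattern.length
  return true if m <= max_errors  # every pattern char may be an error

  # masks[ch] has bit i set if pattern[i] == ch
  masks = Hash.new(0)
  pattern.each_char.with_index { |ch, i| masks[ch] |= 1 << i }

  full   = (1 << m) - 1   # keeps the state words m bits wide
  accept = 1 << (m - 1)   # bit set when the whole pattern has matched

  # states[d]: bit i set if pattern[0..i] matches a suffix of the text
  # read so far with at most d errors
  states = Array.new(max_errors + 1) { |d| (1 << d) - 1 }

  text.each_char do |ch|
    mask = masks[ch]
    prev_old = states[0]
    states[0] = ((states[0] << 1) | 1) & mask
    (1..max_errors).each do |d|
      old = states[d]
      states[d] = ((((old << 1) | 1) & mask) |  # characters agree
                   prev_old |                    # extra char in text
                   (prev_old << 1) |             # substituted char
                   (states[d - 1] << 1) |        # char missing from text
                   1) & full
      prev_old = old
    end
    return true if states[max_errors] & accept != 0
  end
  false
end

puts fuzzy_match?("Smyth|John|12345", "Smith", 1)
puts fuzzy_match?("Jones|Jane|54321", "Smith", 2)
```

the first call should find "Smith" with one substitution, the second should find nothing. each text character costs a handful of integer operations per allowed error, so the work grows with the text length times the number of errors.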
- how do I get/install it? (Fedora 8).
well, i don't think that particular implementation suits your needs
and is obviously easily adapted (after all, it's just a select with
an appropriate block utilizing Text::Levenshtein.distance). but you
can get ruby-nuggets from rubyforge (gem install ruby-nuggets), or,
if the new version hasn't found its way onto the mirrors yet, from
our own gem server at
http://prometheus.khi.uni-koeln.de/rubygems/.
cheers
jens