Howto get array.agrep (NOT array.grep)

Phil Rhoades · Apr 26, 2008

People,

Is there some way to get agrep working with Ruby arrays? agrep has
some nice, useful features that grep doesn't.

Thanks,

Phil.
--
Philip Rhoades

Pricom Pty Limited (ACN 003 252 275 ABN 91 003 252 275)
GPO Box 3411
Sydney NSW 2001
Australia
Fax: +61

0)2-8221-9599
E-mail: (e-mail address removed)

Phrogz · Apr 26, 2008

Phil said:
Is there some way to get agrep working with Ruby arrays? - agrep has
some nice, useful features that grep doesn't . .

Perhaps if you explained what this mysterious 'agrep' was, we might
help.
Something from another language? A unix utility?

Give us a sample array, and what you'd like the result to be after
calling this method on that array.

Phil Rhoades · Apr 26, 2008

NAME
agrep - print lines approximately matching a pattern

SYNOPSIS
agrep [OPTION]... PATTERN [FILE]...

DESCRIPTION
Searches for approximate matches of PATTERN in each FILE or standard
input. Example: 'agrep -2 optimize foo.txt' outputs all lines in file
'foo.txt' that match "optimize" within two errors. E.g. lines which
contain "optimise", "optmise", and "opitmize" all match.

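To make "within two errors" concrete: a line matches when some substring of it can be turned into the pattern with at most two single-character insertions, deletions or substitutions. The following is only a minimal pure-Ruby sketch of that idea for literal patterns. It uses the textbook dynamic-programming approach, not agrep's own (much faster) algorithm, and the names min_substring_distance and agrep_lines are made up for illustration.

```ruby
# Smallest number of single-character edits (insert, delete, substitute)
# needed to turn some substring of +text+ into +pattern+.
def min_substring_distance(pattern, text)
  chars = pattern.chars
  # column[i] is the cheapest way to match the first i pattern characters
  # against a substring ending at the current text position.
  column = (0..chars.size).to_a
  best = column.last
  text.each_char do |t|
    diagonal = column[0] # always 0: a match may start anywhere in the text
    (1..chars.size).each do |i|
      above_left = diagonal
      diagonal = column[i]
      cost = chars[i - 1] == t ? 0 : 1
      column[i] = [above_left + cost, column[i] + 1, column[i - 1] + 1].min
    end
    best = column.last if column.last < best
  end
  best
end

# Hypothetical helper: keep the lines that contain an approximate match.
def agrep_lines(lines, pattern, max_errors)
  lines.select { |line| min_substring_distance(pattern, line) <= max_errors }
end

lines = ["we optimise code", "please optmise this", "opitmize it", "nothing here"]
hits = agrep_lines(lines, "optimize", 2)
```

Here hits holds the first three lines; "nothing here" is rejected because no part of it is within two edits of "optimize".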
Simon Krahnke · Apr 26, 2008

* Phil Rhoades said:
NAME
agrep - print lines approximately matching a pattern

Enumerable#grep can do that, if you pass it the right block. When you pass
a block to grep it's the block's job to match the elements.

Now the interesting question is: What would that block look like?

mfg, simon .... l

Ryan Davis · Apr 26, 2008

Enumerable#grep can do that, if you pass it the right block. When you pass
a block to grep it's the block's job to match the elements.

no.

enum.grep(pattern) => array
enum.grep(pattern) {| obj | block } => array
------------------------------------------------------------------------
Returns an array of every element in _enum_ for which +Pattern ===
element+. If the optional _block_ is supplied, each matching
element is passed to it, and the block's result is stored in the
output array.

The block just morphs the result, it doesn't morph the match.
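
A small illustration of that distinction, as a sketch with made-up sample data: the pattern decides which elements are kept (via ===), and the block only transforms the elements that were kept. Because selection goes through ===, any approximate matching would have to live in the pattern object, not in the block. The lambda form relies on Proc#===, which is available in Ruby 1.9 and later.

```ruby
words = %w[apple banana cherry]

# Selection only: elements for which /an/ === element is true.
plain = words.grep(/an/)

# Same selection, but each selected element is passed through the block.
mapped = words.grep(/an/) { |w| w.upcase }

# Any object that responds to === can serve as the pattern, for example a
# lambda (Proc#=== calls the proc with the element).
five_letters = words.grep(->(w) { w.length == 5 })
```

With this data, plain is ["banana"], mapped is ["BANANA"] and five_letters is ["apple"].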

Jens Wille · Apr 26, 2008

hi phil!

if all you want is getting all the strings within a certain edit
distance of your pattern, have a look at [1]. it doesn't support
regular expressions in the pattern because i don't know how to achieve
that easily without re-implementing agrep's algorithm ;-) it's
really just a quick hack that might get you started, hopefully.

[1]
<http://prometheus.rubyforge.org/ruby-nuggets/classes/Enumerable.html#M000091>

cheers
jens

--
Jens Wille, Dipl.-Bibl. (FH)
prometheus - Das verteilte digitale Bildarchiv für Forschung & Lehre
Kunsthistorisches Institut der Universität zu Köln
Albertus-Magnus-Platz, D-50923 Köln
Tel.: +49 (0)221 470-6668, E-Mail: (e-mail address removed)
http://www.prometheus-bildarchiv.de/

Phil Rhoades · Apr 26, 2008

jens,

This might work, but it would be more difficult without regexes. The
current application does a system call to agrep, which of course is very
slow for large numbers of calls. A typical call is something like:

agrep -2 "Smith\|J.*12345" list1.txt list2.txt list3.txt

This allows two differences on a minimum amount of information
consisting of last name, first initial and zip code. If I use the
Enumerable version, I would have to use the whole, delimited, name &
address string and increase the differences/distance number?

Did you just do that hack now? - how do I get/install it? (Fedora 8).

Thanks,

Phil.

Jens Wille · Apr 26, 2008

i think something like that could work in your case (requires the
Text gem):

require 'text' # the Text gem: gem install text

File.readlines('list1.txt').select { |line|
  # extract name and zip code from line; skip lines that don't fit
  next false unless line =~ /\A(.*?\|.).*\b(\d{5})\b/ # adjust appropriately!

  # name may have two errors, zip only one -- or whatever...
  Text::Levenshtein.distance($1, 'Smith|J') <= 2 &&
    Text::Levenshtein.distance($2, '12345') <= 1
}
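
For trying the idea without the Text gem or any input files, here is a self-contained sketch of the same per-field check. The levenshtein method is a plain textbook implementation standing in for Text::Levenshtein.distance, and the record layout (pipe-delimited, last name first, five-digit zip last) and the sample records are assumptions for illustration only.

```ruby
# Textbook Levenshtein distance between two whole strings.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Made-up records in an assumed layout: last|first|street|city|zip
records = [
  "Smith|John|1 Main St|Springfield|12345",
  "Smyth|Jon|9 Elm Rd|Shelbyville|12346",
  "Smith|Jane|2 High St|Ogdenville|99999",
  "Jones|Mary|4 Oak Ave|Capital City|54321"
]

matches = records.select do |line|
  # Pull out "last name|first initial" and the zip code; skip odd lines.
  m = line.match(/\A(.*?\|.).*\b(\d{5})\b/)
  next false unless m

  # Allow two errors in the name part, one in the zip code.
  levenshtein(m[1], 'Smith|J') <= 2 && levenshtein(m[2], '12345') <= 1
end
```

With these records, matches contains the first two lines: the third has the right name but the wrong zip code, and the fourth fails on the name.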

Did you just do that hack now?

that's right. but i just read a bit on agrep's algorithm and it
might be fun to implement it in ruby (though a bit slow, probably).
as an alternative, it might be even worth writing ruby bindings to
agrep. who knows, if time permits... ;-)

- how do I get/install it? (Fedora 8).

well, i don't think that particular implementation suits your needs
and is obviously easily adapted (after all, it's just a select with
an appropriate block utilizing Text::Levenshtein.distance). but you
can get ruby-nuggets from rubyforge (gem install ruby-nuggets), or,
if the new version hasn't found its way onto the mirrors yet, from
our own gem server at http://prometheus.khi.uni-koeln.de/rubygems/.

cheers
jens

Phil Rhoades · Apr 26, 2008

jens,

I see what you are doing, but this would have to be repeated for the
three different lists (list1.txt, list2.txt, list3.txt). I guess that
should still be faster than a single system call.

that's right. but i just read a bit on agrep's algorithm and it
might be fun to implement it in ruby (though a bit slow, probably).

I don't know if it helps but there is this:

http://www.koders.com/ruby/fidCEAEDCAA28D4A59A76ADF20A0DA2A3858438834D.aspx

as an alternative, it might be even worth writing ruby bindings to
agrep. who knows, if time permits... ;-)

I was wondering about something like that but I have never created a
Ruby binding before . .

well, i don't think that particular implementation suits your needs
and is obviously easily adapted (after all, it's just a select with
an appropriate block utilizing Text::Levenshtein.distance). but you
can get ruby-nuggets from rubyforge (gem install ruby-nuggets), or,
if the new version hasn't found its way onto the mirrors yet, from
our own gem server at http://prometheus.khi.uni-koeln.de/rubygems/.

Thanks!

Phil.

Jens Wille · Apr 26, 2008

Phil Rhoades [2008-04-26 22:26]:

I see what you are doing but this would have to be repeated for
the three different lists (list1.txt, list2.txt, list3.txt)

well, yeah. but that's not really a problem, is it?

%w[list1.txt list2.txt list3.txt].inject([]) { |matches, file|
  matches + File.readlines(file).select { |line|
    # ...same as before...
  }
}
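
The same collection step can also be written with flat_map (Ruby 1.9.2 and later). This is just a sketch: select_lines is a hypothetical helper name, the temporary files and their contents are made up so that the example runs on its own, and the start_with? test is a stand-in for the real approximate-match block.

```ruby
require 'tmpdir'

# Collect, across all +paths+, the lines for which the block returns true.
# File.foreach reads line by line and closes each file when it is done.
def select_lines(paths, &matcher)
  paths.flat_map { |path| File.foreach(path).select(&matcher) }
end

matches = Dir.mktmpdir do |dir|
  # Create three small sample lists inside a throwaway directory.
  samples = {
    'list1.txt' => "Smith|J\nJones|M\n",
    'list2.txt' => "Brown|A\n",
    'list3.txt' => "Smith|K\n"
  }
  paths = samples.map do |name, body|
    path = File.join(dir, name)
    File.write(path, body)
    path
  end

  # Placeholder predicate; the approximate-match block would go here.
  select_lines(paths) { |line| line.start_with?('Smith') }
end
```

Lines keep their trailing newline, so matches is ["Smith|J\n", "Smith|K\n"] here.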

I don't know if it helps but there is this:

http://www.koders.com/ruby/fidCEAEDCAA28D4A59A76ADF20A0DA2A3858438834D.aspx

=> http://amatch.rubyforge.org

silly me!! totally forgot about that one ;-) thanks for the reminder!

maybe i'll be able to come up with something that wraps flori's
Amatch into (Enumerable|File)#agrep.

I was wondering about something like that but I have never
created a Ruby binding before . .

neither have i. but that shouldn't stop us, right? ;-)

cheers
jens

Jens Wille · Apr 26, 2008

Jens Wille [2008-04-26 22:45]:

maybe i'll be able to come up with something that wraps flori's
Amatch into (Enumerable|File)#agrep.

that was actually pretty easy and is definitely an improvement (see
ruby-nuggets v0.1.9), but it still won't give us support for regular
expression patterns :-(

i also added IO::agrep, so you would now be able to do:

%w[list1.txt list2.txt list3.txt].inject([]) { |matches, file|
  matches + File.agrep(file, /Smith\|J.*12345/, 2)
}

-- if only you had regular expressions at your disposal!

cheers
jens

Phil Rhoades · Apr 26, 2008

jens,

Yes, regular expression support would be nice! I guess it will be there sometime.

Thanks for looking at this!

Regards,

Phil.
