How to build an index of phrases in a phrase/sentence?

D

Dan Fitzpatrick

I am trying to build an indexing structure on some phrases. Most phrases
will have 2 - 5 parts (words). The resulting array will be dumped into
an index to find the matching phrases. I don't want to do wildcard
searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches
for "is some" or "some text", I want it to find this phrase. I don't
want a search for "is text" to find this phrase though.

My solution so far can find all but the middle elements. In this case,
"is some". But when the original phrase has more parts, then more middle
parts are not added to the array.

text = "This is some text"
#=> "This is some text"
ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
#=> ["This", "This is", "This is some", "This is some text"]
ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
#=> ["text", "some text", "is some text", "This is some text"]
text.split(/\W/).collect{|w| w}
=> ["This", "is", "some", "text"]

Is there an better Ruby way to do this? Or is there a better data
structure for retrieving a word or an exact phrase within a
phrase/sentence without wild-carding the search.

Thanks,

Dan
 
J

James Edward Gray II

I am trying to build an indexing structure on some phrases. Most
phrases will have 2 - 5 parts (words). The resulting array will be
dumped into an index to find the matching phrases. I don't want to
do wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone
searches for "is some" or "some text", I want it to find this
phrase. I don't want a search for "is text" to find this phrase
though.

Perhaps I'm not understanding, but how does the following fail to
meet your needs?

"This is some text".match(/is some/i)

James Edward Gray II
 
A

Ara.T.Howard

I am trying to build an indexing structure on some phrases. Most phrases
will have 2 - 5 parts (words). The resulting array will be dumped into an
index to find the matching phrases. I don't want to do wildcard searching
on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for
"is some" or "some text", I want it to find this phrase. I don't want a
search for "is text" to find this phrase though.

Perhaps I'm not understanding, but how does the following fail to meet your
needs?

"This is some text".match(/is some/i)

exponential run time vs. O(1) runtime once the index is built maybe?

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple. My religion is kindness.
| --Tenzin Gyatso
===============================================================================
 
D

Dan Fitzpatrick

I have a few million phrases to index so I don't want to loop through
every phrase and test it with a regular expression.

Below is my working solution. Sometimes just describing the problem
helps the brain to take a new approach:

text = "This is some text"
parts = text.split(/\W/)
phrases = []
parts.each_index do |s|
s.upto(parts.size - 1){|e| phrases.push(parts.slice(s..e))}
end
phrases.each{|p| puts p.join(' ')}

Dan


I am trying to build an indexing structure on some phrases. Most
phrases will have 2 - 5 parts (words). The resulting array will be
dumped into an index to find the matching phrases. I don't want to do
wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone
searches for "is some" or "some text", I want it to find this phrase.
I don't want a search for "is text" to find this phrase though.


Perhaps I'm not understanding, but how does the following fail to meet
your needs?

"This is some text".match(/is some/i)

James Edward Gray II
 
A

Ara.T.Howard

I am trying to build an indexing structure on some phrases. Most phrases will
have 2 - 5 parts (words). The resulting array will be dumped into an index to
find the matching phrases. I don't want to do wildcard searching on the
resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for
"is some" or "some text", I want it to find this phrase. I don't want a
search for "is text" to find this phrase though.

My solution so far can find all but the middle elements. In this case, "is
some". But when the original phrase has more parts, then more middle parts
are not added to the array.

text = "This is some text"
#=> "This is some text"
ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
#=> ["This", "This is", "This is some", "This is some text"]
ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
#=> ["text", "some text", "is some text", "This is some text"]
text.split(/\W/).collect{|w| w}
=> ["This", "is", "some", "text"]

Is there an better Ruby way to do this? Or is there a better data structure
for retrieving a word or an exact phrase within a phrase/sentence without
wild-carding the search.

maybe

harp:~ > cat a.rb
class String
def subphrases delim = %r/\s+/
words = strip.split delim
return [] if words.empty?
words.inject([]){|a,w| a << [a[-1],w].compact.join(' ')} +
words[1..-1].join(' ').subphrases
end
end

p "This is some text".subphrases
p "foo".subphrases
p "".subphrases

harp:~ > ruby a.rb
["This", "This is", "This is some", "This is some text", "is", "is some", "is some text", "some", "some text", "text"]
["foo"]
[]

you may want to check out glimpse.

cheers.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple. My religion is kindness.
| --Tenzin Gyatso
===============================================================================
 
G

Gene Tani

L

Luca Pireddu

Dan said:
I am trying to build an indexing structure on some phrases. Most phrases
will have 2 - 5 parts (words). The resulting array will be dumped into
an index to find the matching phrases. I don't want to do wildcard
searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches
for "is some" or "some text", I want it to find this phrase. I don't
want a search for "is text" to find this phrase though.

My solution so far can find all but the middle elements. In this case,
"is some". But when the original phrase has more parts, then more middle
parts are not added to the array.

text = "This is some text"
#=> "This is some text"
ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
#=> ["This", "This is", "This is some", "This is some text"]
ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
#=> ["text", "some text", "is some text", "This is some text"]
text.split(/\W/).collect{|w| w}
=> ["This", "is", "some", "text"]

Is there an better Ruby way to do this? Or is there a better data
structure for retrieving a word or an exact phrase within a
phrase/sentence without wild-carding the search.

Thanks,

Dan

I think what you want is a suffix tree.

http://en.wikipedia.org/wiki/Suffix_tree
http://www.google.com/search?q=suffix+tree&ie=UTF-8&oe=UTF-8

Luca
 
G

Gavin Kistner

--Apple-Mail-2--655577882
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=US-ASCII;
delsp=yes;
format=flowed

Oops, I sent that last reply without performance results.

I loaded in a file with 496 words and calculated all the sub-phrases,
and measured the time needed to 'parse' the information, and how much
memory was used. I then timed how long it took to match a sub-phrase
in the middle of the file, and also to 'find' a phrase that didn't
exist.

user system total real
create array: 12.910000 0.950000 13.860000 ( 15.403606)
run 100: 15.570000 0.220000 15.790000 ( 17.475724)
array - 158MB of VM

user system total real
create set: 16.040000 1.100000 17.140000 ( 21.742738)
run 100k: 0.910000 0.000000 0.910000 ( 1.088728)
set - 159MB of VM

user system total real
create matcher: 85.430000 1.340000 86.770000 ( 96.524512)
run 100k: 10.050000 0.160000 10.210000 ( 11.245722)
matcher - 68MB of VM

Note that the array was measuring only 100 iterations of the 2-phrase
match, while the other two performed 100 *thousand* iterations.


The array technique is thus over 10,000 slower to find a match than
the technique using the Set. and about 1,000 slower than the Trie
version, but does setup the data structure more quickly than either.


The Trie method's memory use should also increase at a slower rate
than the others as the source string increases, but I don't know how
to use Ruby to measure VM of a process, so I can't make a simple
automated test to graph this.


--
"Despite the surge of power you feel upon learning Ruby,
resist the urge to trip others or slap them in the bald head.
DO NOT LORD YOUR RUBYNESS OVER OTHERS!"
- Why the Lucky Stiff


--Apple-Mail-2--655577882--
 
G

Gavin Kistner

--Apple-Mail-2--641257314
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=US-ASCII;
delsp=yes;
format=flowed

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

Direct Answer
-------------------------------------------------
def phrases( string )
pieces = string.split( /\s/ )
out = []
pieces.each_index{ |start_index|
(start_index+1).upto( pieces.length ){ |end_index|
out << pieces[start_index...end_index].join(' ')
}
}
out
end

all = phrases( "It's the end of the world as we know it." )
p all
#=> ["It's", "It's the", "It's the end", "It's the end of", "It's the
end of the", "It's the end of the world", "It's the end of the world
as", "It's the end of the world as we", "It's the end of the world as
we know", "It's the end of the world as we know it.", "the", "the
end", "the end of", "the end of the", "the end of the world", "the
end of the world as", "the end of the world as we", "the end of the
world as we know", "the end of the world as we know it.", "end", "end
of", "end of the", "end of the world", "end of the world as", "end of
the world as we", "end of the world as we know", "end of the world as
we know it.", "of", "of the", "of the world", "of the world as", "of
the world as we", "of the world as we know", "of the world as we know
it.", "the", "the world", "the world as", "the world as we", "the
world as we know", "the world as we know it.", "world", "world as",
"world as we", "world as we know", "world as we know it.", "as", "as
we", "as we know", "as we know it.", "we", "we know", "we know it.",
"know", "know it.", "it."]

p all.include?( "end of the world" )
#=> true

p all.include?( "end the world" )
#=> false



Better/Faster
-------------------------------------------------
def phrase_matches( string )
require 'set'
pieces = string.split( /\s/ )
out = Set.new
pieces.each_index{ |start_index|
(start_index+1).upto( pieces.length ){ |end_index|
out.add( pieces[start_index...end_index].join(' ') )
}
}
out
end

all = phrase_matches( "It's the end of the world as we know it." )
p all
p all.include?( "end of the world" )
p all.include?( "end the world" )



Complex-But-Memory-Efficient Answer
-------------------------------------------------
class TrieNode
attr_accessor :children

def initialize
@children = {}
end

def add_path( array )
node = self
array.each{ |item| node = node.children[ item ] ||=
TrieNode.new }
end

def includes_path?( array )
node = self
array.each{ |item| return false unless node = node.children
[ item ] }
true
end

def to_hier( depth=0 )
tabs = "-"*depth
out = ''
@children.each{ |char,node|
out << "#{tabs}#{char}\n"
out << node.to_hier( depth+1 )
}
out
end
end

class PhraseMatcher
MATCH_WORDS = /[a-z']+/

def initialize( string )
@root = TrieNode.new
pieces = string.downcase.scan( MATCH_WORDS )
pieces.each_index{ |start_index|
(start_index+1).upto( pieces.length ){ |end_index|
@root.add_path( pieces[start_index...end_index] )
}
}
end

def includes_phrase?( string )
@root.includes_path?( string.scan( MATCH_WORDS) )
end
end

sub_phrases = PhraseMatcher.new( "It's the end of the world as we
know it, and I feel fine." )

p sub_phrases.includes_phrase?( "end of the world" )
#=> true

p sub_phrases.includes_phrase?( "end the world" )
#=> false

sub_phrases.instance_eval{ puts @root.to_hier }
#=> it
#=> -and
#=> --i
#=> ---feel
#=> ----fine
#=> world
#=> -as
#=> --we
#=> ---know
#=> ----it
#=> -----and
#=> ------i
#=> -------feel
#=> --------fine
#=> and
#=> -i
#=> --feel
#=> ---fine
#=> of
#=> -the
#=> --world
#=> ---as
#=> ----we
#=> -----know
#=> ------it
#=> -------and
#=> --------i
#=> ---------feel
#=> ----------fine
#=> it's
#=> -the
#=> --end
#=> ---of
#=> ----the
#=> -----world
#=> ------as
#=> -------we
#=> --------know
#=> ---------it
#=> ----------and
#=> -----------i
#=> ------------feel
#=> -------------fine
#=> we
#=> -know
#=> --it
#=> ---and
#=> ----i
#=> -----feel
#=> ------fine
#=> know
#=> -it
#=> --and
#=> ---i
#=> ----feel
#=> -----fine
#=> end
#=> -of
#=> --the
#=> ---world
#=> ----as
#=> -----we
#=> ------know
#=> -------it
#=> --------and
#=> ---------i
#=> ----------feel
#=> -----------fine
#=> the
#=> -world
#=> --as
#=> ---we
#=> ----know
#=> -----it
#=> ------and
#=> -------i
#=> --------feel
#=> ---------fine
#=> -end
#=> --of
#=> ---the
#=> ----world
#=> -----as
#=> ------we
#=> -------know
#=> --------it
#=> ---------and
#=> ----------i
#=> -----------feel
#=> ------------fine
#=> as
#=> -we
#=> --know
#=> ---it
#=> ----and
#=> -----i
#=> ------feel
#=> -------fine
#=> fine
#=> i
#=> -feel
#=> --fine
#=> feel
#=> -fine


--Apple-Mail-2--641257314--
 
G

Gavin Kistner

One last followup (sorry, I'm bored onboard a plane) :)

I did one manual test of RAM comparing the VM used by the Set storage
versus the Trie storage, comparing the previously-measured 496 word
document with a document that had 1007 words. The results were as I
expected:

469 words:
create set: 16.040000 1.100000 17.140000 ( 21.742738)
159MB of VM

create matcher: 85.430000 1.340000 86.770000 ( 96.524512)
68MB of VM


1007 words:
create set: 137.470000 9.400000 146.870000 (166.828737)
~1GB of VM

create matcher: 746.690000 11.050000 757.740000 (806.450292)
149MB of VM

Conclusion: if you have the RAM to spare, the Set-based approach is
quite speedy, but it gets greedy as your full phrase base grows. If
you need to save some memory and can spare the time, go with the Trie
based approach.



Now, having done all this work...if all you want is sub-phrase
matching, why not use a regexp?


469 words:
user system total real
create clean string: 0.010000 0.010000 0.020000 ( 0.003050)
run 100k matches: 10.750000 0.140000 10.890000 ( 15.839430)
28MB of VM

1007 words:
user system total real
create clean string: 0.010000 0.010000 0.020000 ( 0.432572)
run 100k matches: 19.350000 0.200000 19.550000 ( 27.612700)
28MB of VM



[Slim:~/Desktop/Match Phrases] gavinkis% cat regexp.rb
require 'benchmark'

cleaned = nil
matcher = Regexp.new( "\\b#{ARGV[1]}\\b" )

Benchmark.bm( 20 ){ |x|
x.report( "create clean string:" ){
cleaned = IO.read( ARGV[0] ).downcase.scan( /[a-z']
+/ ).join( ' ' )
}
x.report( "run 100k matches:"){
100_000.times{
cleaned =~ matcher
cleaned =~ /the brown fox/
}
}
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,994
Messages
2,570,222
Members
46,810
Latest member
Kassie0918

Latest Threads

Top