string search function

gyromagnetic

Hi,
I have written a function that searches a text string for various
words. The text is searched using a boolean 'and' or a boolean 'or' of
the input list of search terms.

Since I need to use this function for many long strings and many
search words, I would like to use as efficient a method as possible.

Are there improvements that can be made to the code below? Are there
better alternatives?
I am currently using Python 2.3.

Thanks.

-g


-----

def bsearch(fterms, stext, btype='AND'):
    if btype == 'AND':  # boolean 'and' search
        found = True
        for f in fterms:
            if f not in stext:
                found = False
                break
    else:  # boolean 'or' search
        found = False
        for f in fterms:
            if f in stext:
                found = True
                break

    return found



if __name__ == '__main__':
    stext = "hello to you all"
    terms = ['hello', 'goodbye']
    btype = 'OR'

    if bsearch(terms, stext, btype):
        print 'found'
    else:
        print 'not found'
 

Christopher T King

> Are there improvements that can be made to the code below? Are there
> better alternatives?

There are two ways I can think of:

1)

You can replace the 'or' search with a regular expression:

import re

def bsearch(fterms, stext, btype='AND'):
    if btype == 'AND':  # boolean 'and' search
        ...               # unchanged from the original
    else:  # boolean 'or' search
        # note: re.escape() each term if it may contain regexp metacharacters
        found = bool(re.search('|'.join(fterms), stext))

    return found

Assuming Python's regexps are somewhat optimized, this should be about the
fastest search you can get with arbitrary strings. Unfortunately, there's
no direct 'and' equivalent with regexps (it can be done, but it's helluva
ugly). If you want to optimize 'and', you'll need to use whatever
advanced algorithms our friends at Google use ;)
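For the curious, here is a rough sketch of what that "helluva ugly" AND
regexp could look like, using lookahead assertions (the and_pattern()
helper below is my own illustration, not part of the original code):

import re

def and_pattern(fterms):
    # One lookahead per term; the combined pattern matches only if every
    # term occurs somewhere in the text.  re.escape() guards against
    # metacharacters in the search terms.
    return ''.join(['(?=.*%s)' % re.escape(f) for f in fterms])

stext = "hello to you all"
print bool(re.match(and_pattern(['hello', 'you']), stext, re.DOTALL))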

2)

Assuming your search terms don't contain whitespace, and you want to
compare only entire words (not substrings), replace the string search with
a list search, simply by adding the line 'stext = stext.split()' to the
top of the search function. This will greatly speed up the search by
avoiding checking partial words. The same effect can be achieved with the
regexp above by adding '\s' to the beginning and end of it.
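As a concrete sketch of both variants (the whole-word list search, and a
word-bounded regexp; note I've used '\b' word boundaries instead of '\s',
an adjustment on my part so the first and last words of the text still
match):

import re

def bsearch_or_words(fterms, stext):
    # Whole-word 'or' search: split the text into words first, so
    # 'hell' no longer matches inside 'hello'.
    words = stext.split()
    for f in fterms:
        if f in words:
            return True
    return False

def bsearch_or_regexp(fterms, stext):
    # Word-bounded 'or' regexp version of the same idea.
    pattern = r'\b(?:%s)\b' % '|'.join([re.escape(f) for f in fterms])
    return bool(re.search(pattern, stext))

print bsearch_or_words(['hello', 'goodbye'], "hello to you all")   # True
print bsearch_or_regexp(['hell'], "hello to you all")              # False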

Continuing along this line (whole words, no whitespace), you can replace
the lists with sets for an even greater speedup. The resulting code would
look like this (with a bit of code stolen from sets.py):

from sets import Set
from itertools import ifilter

def ispartialsubset(self, other):
    """Report whether another set contains at least one member of this set."""
    self._binary_sanity_check(other)
    for elt in ifilter(other._data.has_key, self):
        return True
    return False

Set.ispartialsubset = ispartialsubset

def bsearch(fterms, stext, btype='AND'):
    stext = stext.split()
    if btype == 'AND':  # boolean 'and' search
        found = Set(fterms).issubset(Set(stext))
    else:  # boolean 'or' search
        found = Set(fterms).ispartialsubset(Set(stext))
    return found

If the same stext or fterms is used multiple times, you could move the
.split() and/or the Set() transformations outside of the function, so they
only have to be done once.
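For example, a minimal sketch of building the term set once and reusing
it against many texts (the names here are my own):

from sets import Set

# Build the term set once, outside the search function.
term_set = Set(['hello', 'goodbye'])

def bsearch_and(term_set, stext):
    # 'and' search against a precomputed term set.
    return term_set.issubset(Set(stext.split()))

for stext in ["hello to you all", "goodbye cruel world", "say nothing"]:
    print stext, '->', bsearch_and(term_set, stext)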

Again, this method (using sets) only works if your words contain no
whitespace, and you're not interested in matching parts of words. It's
/much/ faster than a string search, though.
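If you want to verify that on your own data, a quick timeit harness along
these lines would do it (this assumes bsearch() is defined in the script
you run it from, and the sample text is made up):

import timeit

setup = """
from __main__ import bsearch
stext = 'hello to you all ' * 1000
terms = ['hello', 'goodbye']
"""

timer = timeit.Timer("bsearch(terms, stext, 'OR')", setup)
print timer.timeit(number=1000)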

Hope this helps!
 
