Tim said:
James said:
If the keys in your word_list are more than just words, then the
regexp may not find them all, and thus not replace them all. In
that case you may have to resort to my 2nd regexp which builds
the 5k branch regexp from your actual dictionary keys:
This method on the above dictionary (modified)
d = {
'hello': 'goodbye',
'world': 'python',
'stuff with spaces?': 'tadah!',
}
would create a regexp of
\b(hello|world|stuff\ with\ spaces\?)\b
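In code, that key-driven approach might look like the following sketch (the word_list and sample text are assumptions for illustration; the branch regexp is built from the actual dictionary keys, and a callback picks the replacement):

```python
import re

# Hypothetical dictionary mirroring the one above.
word_list = {
    'hello': 'goodbye',
    'world': 'python',
    'stuff with spaces?': 'tadah!',
}

# Escape each key so regexp metacharacters are matched literally.
pattern = re.compile(
    r'\b(%s)\b' % '|'.join(re.escape(k) for k in word_list))

def replace(match):
    # Look the matched keyword back up in the dictionary.
    return word_list[match.group(0)]

print(pattern.sub(replace, 'say hello to the world'))
```

Note that keys which end in a non-word character (like the trailing `?`) can still be missed because of the closing `\b`; that is exactly the caveat above.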
There's a subtle issue with the order of the keywords. Python's
(ir)regular expressions do not strictly guarantee to find the longest
match. For example,
re.findall(r'\b(about|about\-faced)\b',
'does about-faced match?')
returns ['about'], which is the first complete match, not the longest.
Sorting the keywords into longest-first order avoids the problem.
That ordering has considerable performance implications as
len(word_list) grows, unless you can figure out a way to determine
that some replacements are more probable than others and push them
to the front of the regexp; but that's more complex and requires
knowledge of your word-list.
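A small demonstration of the ordering fix (patterns built from the same two keywords, with and without the longest-first sort):

```python
import re

words = ['about', 'about-faced']

# Unsorted: the shorter branch appears first, so it wins.
unsorted_re = r'\b(%s)\b' % '|'.join(re.escape(w) for w in words)

# Longest-first: the longer branch is tried before its own prefix.
words.sort(key=len, reverse=True)
longest_first_re = r'\b(%s)\b' % '|'.join(re.escape(w) for w in words)

print(re.findall(unsorted_re, 'does about-faced match?'))
print(re.findall(longest_first_re, 'does about-faced match?'))
```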
Another method is to factor out common prefixes, so that the
re module's search-and-backtrack engine never has a choice of
multiple paths. Instead of:
\b(abolished|abolition)\b
we'd use:
\b(aboli(shed|tion))\b
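A quick check that the factored form matches exactly the same words as the plain alternation (hand-written patterns for illustration, using non-capturing groups as the generator below does):

```python
import re

text = 'abolished the abolition act'

# Plain alternation: at 'aboli' the engine may try both branches.
plain = re.compile(r'\b(?:abolished|abolition)\b')

# Common prefix factored out: after 'aboli' there is only one path.
factored = re.compile(r'\b(?:aboli(?:shed|tion))\b')

print(plain.findall(text))
print(factored.findall(text))
```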
The RE-generator is of course more complex than Tim's one-liner, but I
happen to have code, which I'll include below. It's probably academic,
as I'd agree with Tim that his first solution is the better candidate
at this point.
With my giant prefix-combined RE's, I can re.compile one with 4000
words randomly chosen from a Unix words file, but 5000 results in
"regular expression code size limit exceeded". Tim's version which
doesn't combine prefixes tops out a little lower. This is on 32-bit
Windows, standard distribution. One could, of course, build multiple
smaller RE's, but that starts to suck.
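The "multiple smaller RE's" fallback could be sketched roughly like this (chunked_findall is a hypothetical helper, not from the post):

```python
import re

def chunked_findall(keywords, text, chunk=1000):
    """Work around the compiled-pattern size limit by splitting
    the keyword list across several smaller alternation regexps.
    (Hypothetical helper for illustration.)"""
    hits = []
    for i in range(0, len(keywords), chunk):
        # Longest-first within each chunk, as discussed above.
        group = sorted(keywords[i:i + chunk], key=len, reverse=True)
        pattern = re.compile(
            r'\b(%s)\b' % '|'.join(re.escape(w) for w in group),
            re.IGNORECASE)
        hits.extend(pattern.findall(text))
    return hits
```

One reason this starts to suck: results arrive per chunk rather than in text order, and a keyword in one chunk can match inside text that a longer keyword in another chunk would have claimed, so the single-RE longest-match behavior is lost.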
-Bryan Olson
from random import sample
import re

def build_big_re(str_list):
    """Build a non-backtracking regular expression
    matching any of the words in str_list.
    """
    def trie_insert(table, word):
        if word:
            submap = table.setdefault(word[0], {})
            trie_insert(submap, word[1:])
        else:
            # Empty-string key marks the end of a word.
            table[""] = {}

    def sequentialize(table, lst):
        if table:
            # Longest key (never the "" terminator) first.
            keys = sorted(table, key=len, reverse=True)
            lst.append('(?:')
            separator = ''
            for key in keys:
                lst.append(separator + re.escape(key))
                submap = table[key]
                # Collapse single-child chains into one literal run.
                while len(submap) == 1:
                    key, = submap
                    submap = submap[key]
                    lst.append(re.escape(key))
                sequentialize(submap, lst)
                separator = '|'
            lst.append(')')

    table = {}
    for word in str_list:
        trie_insert(table, word)
    lst = [r'\b']
    sequentialize(table, lst)
    lst.append(r'\b')
    re_str = "".join(lst)
    # print("RE is: '%s'\n" % re_str)
    return re.compile(re_str, re.IGNORECASE)

def simple_re(str_list):
    # Longest-first, so a keyword is never shadowed by its own prefix.
    str_list = sorted(str_list, key=len, reverse=True)
    re_str = r'\b(%s)\b' % (
        '|'.join(re.escape(s) for s in str_list))
    # print("RE is", re_str)
    return re.compile(re_str, re.IGNORECASE)

def testy():
    words_file = r'/usr/share/dict/words'  # Common Unix
    # words_file = r'C:\z\wordlist.txt'    # Just my box
    nkeywords = 3500
    with open(words_file, 'r') as f:
        allwords = [w.strip() for w in f]
    keywords = sample(allwords, nkeywords)
    bigtext = ' '.join(allwords)
    print('Building simple RE...')
    simpre = simple_re(keywords)
    print('Run simple re...')
    z1 = simpre.findall(bigtext)
    print('Done.\n')
    print('Building complex RE...')
    compre = build_big_re(keywords)
    print('Run complex re...')
    z2 = compre.findall(bigtext)
    print('Done.')
    assert z1 == z2

testy()