fast regex


james_027

hi,

I was working with regex on a very large text, really large, but I am
time-constrained. Does Python have any other regex library or string
manipulation library that works really fast?

Thanks,
James
 

Javier Collado

Hello,

2010/5/6 james_027 said:
I was working with regex on a very large text, really large, but I am
time-constrained. Does Python have any other regex library or string
manipulation library that works really fast?

re2 (http://code.google.com/p/re2/) is supposed to be faster than the
standard library in Python. Unfortunately, it's implemented in C++ and
there isn't an official Python wrapper for it. However, you can find a
wrapper that may be useful to you here:
http://github.com/facebook/pyre2
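
For what it's worth, a minimal sketch of using such a wrapper, assuming it
installs an re-compatible module named re2 (as the pyre2 project intends),
would look like:

import re2 as re   # drop-in replacement for the stdlib re module

pattern = re.compile(r'\bfoo\b', re.IGNORECASE)   # hypothetical pattern
result = pattern.sub('bar', 'Foo foo FOO')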

Best regards,
Javier
 

John Bokma

james_027 said:
I was working with regex on a very large text, really large, but I am
time-constrained. Does Python have any other regex library or string
manipulation library that works really fast?

Hard to answer without seeing your regex and requirements first.
Your question is like: I had a meal yesterday and it upset my
stomach. Can you suggest a different meal for today?
 

james_027

Hard to answer without seeing your regex and requirements first.
Your question is like: I had a meal yesterday and it upset my
stomach. Can you suggest a different meal for today?

I am doing something like this

for key, value in words_list.items():
    compile = re.compile(r"""\b%s\b""" % key, re.IGNORECASE)
    search = compile.sub(value, content)

where the content is a large text of about 500,000 characters and the
word list has about 5,000 entries.

Any optimization for the code above?
 

Tim Chase

for key, value in words_list.items():
    compile = re.compile(r"""\b%s\b""" % key, re.IGNORECASE)
    search = compile.sub(value, content)

where the content is a large text of about 500,000 characters and the
word list has about 5,000 entries.

You don't specify what you want to do with "search" vs.
"content"...are you then reassigning

content = search

so that subsequent replacements happen? (your current version
creates "search", only to discard it)
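
In other words, something like this (just a sketch, reusing your
words_list/content names), so each pass feeds its result back in:

import re

for key, value in words_list.items():
    # escape the key in case it contains regex metacharacters
    pattern = re.compile(r"\b%s\b" % re.escape(key), re.IGNORECASE)
    content = pattern.sub(value, content)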

My first thought would be to make use of re.sub()'s ability to
take a function and do something like

# a regexp that finds all possible
# matches/words of interest
r = re.compile(r'\b[a-zA-Z]+\b')

def replacer(match):
    text = match.group(0)
    # assuming your dict.keys() are all lowercase:
    return word_list.get(text.lower(), text)

results = r.sub(replacer, content)

This does a replacement for every word in the input corpus
(possibly with itself), but only takes one pass through the
source text. If you wanted to get really fancy (and didn't butt
up against the max size for a regexp), I suppose you could do
something like

r = re.compile(r'\b(%s)\b' % (
        '|'.join(re.escape(s) for s in words_list.keys())),
    re.IGNORECASE)

def replacer(match):
    return word_list[match.group(0).lower()]  # assume lower keys

results = r.sub(replacer, content)

which would only do replacements on your keys rather than every
"word" in your input, but I'd start with the first version before
abusing programmatic regexp generation.

-tkc
 

Tim Chase

[your reply appears to have come only to me instead of the
mailing list; CC'ing c.l.p in reply]

james_027 said:
When you say "This does a replacement for every word in the input corpus
(possibly with itself), but only takes one pass through the source text",
it sounds great, but I am kinda lost with your code; sorry, I am a regex
newbie.

In the calling statement

results = r.sub(replacer, content)

the replacer is a function that needs a parameter. How does the replacer
know what parameter?

The documentation on the .sub() method says that the replacement
can either be some text (as you used) or something callable (in
my case a function, but could be an object with a __call__ method
too), and that this callable is passed the match object that's found.

james_027 said:
Why is r = re.compile(r'\b[a-zA-Z]+\b') when the words I want to find should
be in the word_list?

You don't detail what sorts of words are in your word_list.keys()
but you'd want your pattern to match those. My first regexp
(that you quote) matches single words, but rather loosely (thus
my caveat about "replace every word in the input"). The
replacement function checks to see whether the matched word is in your
word-list and, if so, does the replacement; otherwise it just returns
the input. To watch it in action, you can try this:

import re

d = {  # keys are all lowercase
    'hello': 'goodbye',
    'world': 'Python',
    }

def replacer(match):
    text = match.group(0)
    replacement = d.get(text.lower(), text)

    # see what we're doing for explanation purposes
    print "Replacing %r with %r" % (text, replacement)

    return replacement

r = re.compile(r'\b[a-zA-Z]+\b')
print r.sub(replacer, "Hello there world, this is a test")

james_027 said:
Is the "match" here the match of the regex or just a variable name?

If the keys in your word_list are more than just words, then the
regexp may not find them all, and thus not replace them all. In
that case you may have to resort to my 2nd regexp which builds
the 5k branch regexp from your actual dictionary keys:

This method on the above dictionary (modified)

d = {
    'hello': 'goodbye',
    'world': 'python',
    'stuff with spaces?': 'tadah!',
    }

would create a regexp of

\b(hello|world|stuff\ with\ spaces\?)\b

This has considerable performance implications as len(word_list)
grows, unless you can figure a way to determine that some
replacements are more probable than others and push them to the
front of this regexp, but that's more complex and requires
knowledge of your word-list.

However, if all your keys are simply alpha (or alphanumeric, or
findable by a simple regexp; likely one that doesn't include
whitespace), you'll likely get much better performance with a
generic regexp that over-captures, tries to find a replacement in
your dict, returning that as the replacement; or if it's not in
the dict, returning the original text unchanged. My simple test
would be:

test_regex = r'\w+'
r = re.compile(r'^\b%s\b$' % test_regex)
# added the "^....$" to anchor for testing purposes

for key in word_list:  # keys by default
    if not r.match(key):
        print "Failed to match %r" % key
        break

If this passes, then the regexp should likely be sufficient to
capture everything needed to use my replacer() function above.

-tkc
 

Patrick Maupin

I am doing something like this

for key, value in words_list.items():
    compile = re.compile(r"""\b%s\b""" % key, re.IGNORECASE)
    search = compile.sub(value, content)

where the content is a large text of about 500,000 characters and the
word list has about 5,000 entries.

Any optimization for the code above?

Sure.

for key, value in words_list.items():
    pass
compile = re.compile(r"""\b%s\b""" % key, re.IGNORECASE)
search = compile.sub(value, content)
 

Bryan

Tim said:
James said:
[Tim had written:]
If the keys in your word_list are more than just words, then the
regexp may not find them all, and thus not replace them all.  In
that case you may have to resort to my 2nd regexp which builds
the 5k branch regexp from your actual dictionary keys:

This method on the above dictionary (modified)

   d = {
     'hello': 'goodbye',
     'world': 'python',
     'stuff with spaces?': 'tadah!',
     }

would create a regexp of

   \b(hello|world|stuff\ with\ spaces\?)\b


There's a subtle issue of the order of the keywords. Python's
(ir)regular expressions do not strictly guarantee to find the longest
match. For example,

re.findall(r'\b(about|about\-faced)\b',
'does about-faced match?')

returns ['about'], which is the first complete match, not the longest.
Sorting the keywords into longest-first order avoids the problem.
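
For example (a quick sketch), putting the longer alternative first changes
the result:

import re

print re.findall(r'\b(about\-faced|about)\b',
                 'does about-faced match?')
# -> ['about-faced']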

This has considerable performance implications as len(word_list)
grows, unless you can figure a way to determine that some
replacements are more probable than others and push them to the
front of this regexp, but that's more complex and requires
knowledge of your word-list.

There is another method which is to factor out common prefixes, so the
re module's search-and-backtrack engine never has a choice of multiple
paths. Instead of:

\b(abolished|abolition)\b

we'd use:

\b(aboli(shed|tion))\b

The RE-generator is of course more complex than Tim's one-liner, but I
happen to have code, which I'll include below. It's probably academic,
as I'd agree with Tim that his first solution is the better candidate
at this point.

With my giant prefix-combined RE's, I can re.compile one with 4000
words randomly chosen from a Unix words file, but 5000 results in
"regular expression code size limit exceeded". Tim's version which
doesn't combine prefixes tops out a little lower. This is on 32-bit
Windows, standard distribution. One could, of course, build multiple
smaller RE's, but that starts to suck.

-Bryan Olson


from random import sample
import re


def build_big_re(str_list):
    """ Build a non-backtracking regular expression
        matching any of the words in str_list.
    """

    def trie_insert(table, str):
        if str:
            submap = table.setdefault(str[0], {})
            trie_insert(submap, str[1:])
        else:
            table[""] = {}

    def sequentialize(table, lst):
        if table:
            keys = table.keys()
            keys.sort(key=len, reverse=True)
            lst.append('(?:')
            seperator = ''
            for key in keys:
                lst.append(seperator + re.escape(key))
                submap = table[key]
                while len(submap) == 1:
                    key = submap.keys()[0]
                    submap = submap[key]
                    lst.append(re.escape(key))
                sequentialize(submap, lst)
                seperator = '|'
            lst.append(')')

    table = {}
    for str in str_list:
        trie_insert(table, str)
    lst = [r'\b']
    sequentialize(table, lst)
    lst.append(r'\b')
    re_str = "".join(lst)
    # print "RE is: '%s'\n" % re_str
    return re.compile(re_str, re.IGNORECASE)


def simple_re(str_list):
    str_list = sorted(str_list, key=len, reverse=True)
    re_str = r'\b(%s)\b' % (
        '|'.join(re.escape(s) for s in str_list))
    # print "RE is", re_str
    return re.compile(re_str, re.IGNORECASE)


def testy():

    words_file = r'/usr/share/dict/words'  # Common Unix
    # words_file = r'C:\z\wordlist.txt'    # Just my box

    nkeywords = 3500

    from random import sample
    with open(words_file, 'r') as f:
        allwords = [w.strip() for w in f.readlines()]
    keywords = sample(allwords, nkeywords)
    bigtext = ' '.join(allwords)

    print 'Building simple RE...'
    simpre = simple_re(keywords)
    print 'Run simple re...'
    z1 = simpre.findall(bigtext)
    print 'Done.\n'

    print 'Building complex RE...'
    compre = build_big_re(keywords)
    print 'Run complex re...'
    z2 = compre.findall(bigtext)
    print 'Done.'

    assert z1 == z2


testy()
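
(If you did have to fall back to multiple smaller RE's, a rough sketch,
reusing the simple_re() helper above, might chunk the keyword list like
this:)

def chunked_res(str_list, chunk_size=1000):
    # sort longest-first, then build one combined RE per chunk; each RE
    # then has to be applied to the text in turn, which costs extra passes
    str_list = sorted(str_list, key=len, reverse=True)
    return [simple_re(str_list[i:i + chunk_size])
            for i in range(0, len(str_list), chunk_size)]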
 

MRAB

Bryan said:
Tim said:
James said:
[Tim had written:]
If the keys in your word_list are more than just words, then the
regexp may not find them all, and thus not replace them all. In
that case you may have to resort to my 2nd regexp which builds
the 5k branch regexp from your actual dictionary keys:
r = re.compile(r'\b(%s)\b' % (
    '|'.join(re.escape(s) for s in words_list.keys())
    ),
    re.IGNORECASE)
This method on the above dictionary (modified)

d = {
    'hello': 'goodbye',
    'world': 'python',
    'stuff with spaces?': 'tadah!',
    }

would create a regexp of

\b(hello|world|stuff\ with\ spaces\?)\b


There's a subtle issue of the order of the keywords. Python's
(ir)regular expressions do not strictly guarantee to find the longest
match. For example,

re.findall(r'\b(about|about\-faced)\b',
'does about-faced match?')

returns ['about'], which is the first complete match, not the longest.
Sorting the keywords into longest-first order avoids the problem.

This has considerable performance implications as len(word_list)
grows, unless you can figure a way to determine that some
replacements are more probable than others and push them to the
front of this regexp, but that's more complex and requires
knowledge of your word-list.

There is another method which is to factor out common prefixes, so the
re module's search-and-backtrack engine never has a choice of multiple
paths. Instead of:

\b(abolished|abolition)\b

we'd use:

\b(aboli(shed|tion))\b

The RE-generator is of course more complex than Tim's one-liner, but I
happen to have code, which I'll include below. It's probably academic,
as I'd agree with Tim that his first solution is the better candidate
at this point.

With my giant prefix-combined RE's, I can re.compile one with 4000
words randomly chosen from a Unix words file, but 5000 results in
"regular expression code size limit exceeded". Tim's version which
doesn't combine prefixes tops out a little lower. This is on 32-bit
Windows, standard distribution. One could, of course, build multiple
smaller RE's, but that starts to suck.
There's an alternative regex module at:

http://bugs.python.org/issue2636

and also at:

http://pypi.python.org/pypi/regex/0.1.2010414

It looks for common prefixes and suffixes internally and can handle much
larger regexes.
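
A sketch of how it would be used, assuming the module is installed as
"regex" and reusing the word_list/content names from earlier in the thread
(with lowercase keys, as in Tim's examples):

import regex

# build the same big alternation, but with the alternative engine
big_re = regex.compile(r'\b(%s)\b' %
                       '|'.join(regex.escape(s) for s in word_list),
                       regex.IGNORECASE)
results = big_re.sub(lambda m: word_list[m.group(0).lower()], content)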
 

Lawrence D'Oliveiro

james_027 said:
I was working with regex on a very large text, really large, but I am
time-constrained.

“Fast regex” is a contradiction in terms. You use regexes when you want ease
of definition and application, not speed.

For speed, consider hand-coding your own state machine. Preferably in a
compiled language like C.
 

Paul Rubin

Lawrence D'Oliveiro said:
“Fast regex” is a contradiction in terms. You use regexes when you want ease
of definition and application, not speed.

For speed, consider hand-coding your own state machine. Preferably in a
compiled language like C.

But, nothing stops a regexp library from doing that, and some of them do.
 

Bryan

Lawrence said:
“Fast regex” is a contradiction in terms. You use
regexes when you want ease of definition and
application, not speed.

Python or Perl regexes are not actually regular expressions. Real
regular expression compilers produce blazing fast results, but they
cannot support many of the features offered by the search-and-
backtrack engines that Python and Perl use.

For speed, consider hand-coding your own state
machine. Preferably in a compiled language like C.

The speed of a real regular expression engine is hard to beat.

I assume you're not actually suggesting hand-writing a state machine
for the problem at issue here, which requires recognizing about 5000
different words.
 

Nobody

“Fast regex” is a contradiction in terms.

Not at all. A properly-written regexp engine will be limited only by
memory bandwidth, provided that the state table fits into the primary
cache.

You use regexes when you
want ease of definition and application, not speed.

Other way around.

For speed, consider hand-coding your own state machine. Preferably in a
compiled language like C.

Or use a decent regexp library.

Even if you want to use non-regular expressions (e.g. backreferences), a
decent engine will still use a DFA, backtracking only where strictly
necessary.
 
