88k regex = RuntimeError

jodawi

I need to find a bunch of C function declarations by searching
thousands of source or html files for thousands of known function
names. My initial simple approach was to do this:

rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
# giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b

for root, dirs, files in os.walk( ... ):
    ....
    for fileName in files:
        ....
        filePath = os.path.join(root, fileName)
        file = open(filePath, "r")
        contents = file.read()
        ....
        result = re.search(rxAllSupported, contents)

but this happens:

    result = re.search(rxAllSupported, contents)
  File "C:\Python24\Lib\sre.py", line 134, in search
    return _compile(pattern, flags).search(string)
RuntimeError: internal error in regular expression engine

I assume it's hitting some limit, but don't know where the limit is to
remove it. I tried stepping into it repeatedly with Komodo, but didn't
see the problem.

Suggestions?
 
Diez B. Roggisch

jodawi said:
I assume it's hitting some limit, but don't know where the limit is to
remove it. I tried stepping into it repeatedly with Komodo, but didn't
see the problem.

That's because the limit is buried in the C library that actually
implements the engine. There was a discussion about this a few weeks ago,
and AFAIK there isn't much you can do about it.

jodawi said:
Suggestions?

Yes. Don't do it :) After all, what you are doing is nothing but a simple
word search. If I had that problem, my naive approach would be to simply
tokenize the sources and check each word against your set of function
names. A bit of state-keeping to track the position, and you're done.
Check out pyparsing; it might help you with the tokenization.
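
A minimal sketch of that tokenize-and-look-up idea, written for Python 3 and using a plain regex as a stand-in for a real tokenizer (pyparsing is not used here). The function name find_names, the TOKEN pattern, and the sample data are all made up for illustration; the "state" kept is just the line number and column of each hit.

```python
import re

# Identifier-shaped tokens; a rough stand-in for a real C tokenizer (assumption).
TOKEN = re.compile(r"[A-Za-z_]\w*")


def find_names(text, wanted):
    """Yield (line_number, column, name) for every token that is in `wanted`."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            name = match.group(0)
            if name in wanted:  # hash lookup in a set, no giant alternation needed
                yield line_number, match.start() + 1, name


# Hypothetical sample standing in for gAllSupported and one source file.
wanted = {"AAFoo", "zFoo"}
sample = "int AAFoo(int x);\nvoid other(void);\n  zFoo();\n"
hits = list(find_names(sample, wanted))
print(hits)
```

Because the regex stays tiny no matter how many names are in the set, the size of the name list never reaches the regex engine at all.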


I admit that the apparent ease of the regular expression would have lured me
into the same trap.

Diez
 
Tim N. van der Leeuw

Why don't you create a regex that finds all C function declarations
(and returns the function names), apply re.findall() with that regex
to all files, and then check those function names against the set of
allSupported?

You might even be able to find a regex for C function declarations on
the web.

Your gAllSupported can be a set(); you can then create the intersection
between gAllSupported and the function-names found by your regex.
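
Roughly, that could look like the sketch below (Python 3). The pattern is a deliberately crude assumption, "an identifier followed by an opening parenthesis", so it also picks up calls and keywords such as `if`; the intersection with the set is what filters those out. A real regex for C declarations would be considerably more involved. The name supported_in and the sample data are hypothetical.

```python
import re

# Crude, hypothetical pattern: any identifier directly followed by "(".
# It matches declarations, calls, and keywords like "if" alike.
NAME_BEFORE_PAREN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def supported_in(source, supported):
    """Return the subset of `supported` that appears before a "(" in `source`."""
    return supported.intersection(NAME_BEFORE_PAREN.findall(source))


# Hypothetical stand-ins for gAllSupported and one file's contents.
supported = {"AAFoo", "ABFoo", "zFoo"}
source = "int AAFoo(int x);\nvoid other(void);\nif (x) zFoo();"
found = supported_in(source, supported)
print(sorted(found))
```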

Cheers,

--Tim
 
Peter Otten

jodawi said:
Suggestions?

One workaround may be as easy as

import re

wanted = set(["foo", "bar", "baz"])
file_content = "foo bar-baz ignored foo()"

r = re.compile(r"\w+")
found = [name for name in r.findall(file_content) if name in wanted]

print found
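
Plugged into the directory walk from the original post, the same filter might look like this sketch (Python 3). The function name scan_tree and the errors="replace" setting are assumptions for illustration, not part of the original code.

```python
import os
import re

WORD = re.compile(r"\w+")


def scan_tree(top, wanted):
    """Yield (path, names) for each file under `top` that mentions a wanted name."""
    for root, dirs, files in os.walk(top):
        for file_name in files:
            path = os.path.join(root, file_name)
            # errors="replace" is an assumption: it keeps files with odd
            # encodings from aborting the whole scan.
            with open(path, "r", errors="replace") as handle:
                contents = handle.read()
            found = [name for name in WORD.findall(contents) if name in wanted]
            if found:
                yield path, found
```

Usage would be something like `for path, names in scan_tree(some_directory, wanted): ...`, where `some_directory` is whatever root the walk should start from.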

Peter
 
Kent Johnson

jodawi said:
I need to find a bunch of C function declarations by searching
thousands of source or html files for thousands of known function
names. My initial simple approach was to do this:

rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
# giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b

Maybe you can be more clever about the regex? If the names above are
representative then something like r'\b(\w{1,2})Foo\b' might work.
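
As a sketch only (Python 3), and assuming the real names really do follow that "one or two characters plus Foo" shape, which the placeholder names in the post may not reflect: the short pattern will also match names that are not supported, so a set check is still needed afterwards. The names find_supported and supported are hypothetical.

```python
import re

# Pattern from the suggestion above; only valid if the names fit this shape.
rx = re.compile(r"\b(\w{1,2})Foo\b")

# Hypothetical sample of the supported names.
supported = {"AAFoo", "ABFoo", "zFoo"}


def find_supported(text):
    """Return pattern matches in order of appearance, keeping only supported names."""
    return [m.group(0) for m in rx.finditer(text) if m.group(0) in supported]


# QQFoo matches the pattern but is not supported; LongFoo does not match at all.
print(find_supported("AAFoo(); QQFoo(); zFoo; LongFoo"))
```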
 
Tim N. van der Leeuw

This is basically the same idea as what I tried to describe in my
previous post but without any samples.
I wonder if it's more efficient to create a new list using a
list-comprehension, and checking each entry against the 'wanted' set,
or to create a new set which is the intersection of set 'wanted' and
the iterable of all matches...

Your sample code would then look like this:
import re
r = re.compile(r"\w+")
file_content = "foo bar-baz ignored foo()"
wanted = set(["foo", "bar", "baz"])
found = wanted.intersection(name for name in r.findall(file_content))
print found
# output: set(['baz', 'foo', 'bar'])

Anyone who has an idea what is faster? (This dataset is so limited that
it doesn't make sense to do any performance-tests with it)
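
One way to get a rough answer is to inflate the sample and time both variants with timeit, as in this sketch (Python 3). The repeat count and the function names are arbitrary choices for illustration; no timings are claimed here, and on real files disk access would likely dominate.

```python
import re
import timeit

r = re.compile(r"\w+")
wanted = set(["foo", "bar", "baz"])
# Synthetic input: the tiny sample repeated so the timer has something to measure.
file_content = "foo bar-baz ignored foo() " * 2000


def with_list():
    # List comprehension: keeps duplicates and order of appearance.
    return [name for name in r.findall(file_content) if name in wanted]


def with_set():
    # Set intersection: each wanted name appears at most once.
    return wanted.intersection(r.findall(file_content))


if __name__ == "__main__":
    for func in (with_list, with_set):
        seconds = timeit.timeit(func, number=20)
        print("%-10s %.3f s" % (func.__name__, seconds))
```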

Cheers,

--Tim
 
Peter Otten

Tim said:
found = wanted.intersection(name for name in r.findall(file_content))

Just

found = wanted.intersection(r.findall(file_content))

Tim said:
Anyone who has an idea what is faster? (This dataset is so limited that
it doesn't make sense to do any performance-tests with it)

I guess that your approach would be a bit faster, though most of the time
will be spent on IO anyway. The result would be slightly different (a set
drops duplicates and ordering, the list keeps both), and again yours
(without duplicates) seems more useful.

However, I'm not sure whether the OP would rather stop at the first match or
need a match object and not just the text. In that case:

matches = (m for m in r.finditer(file_content) if m.group(0) in wanted)
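
For the "stop at the first match" case, a sketch along these lines (Python 3) takes only the first item from that generator, so the scan ends as soon as a wanted name turns up. The helper name first_wanted is made up for illustration.

```python
import re

r = re.compile(r"\w+")
wanted = set(["foo", "bar", "baz"])


def first_wanted(text):
    """Return the match object for the first wanted word in `text`, or None."""
    matches = (m for m in r.finditer(text) if m.group(0) in wanted)
    # next() pulls a single item, so the rest of the text is never scanned.
    return next(matches, None)


m = first_wanted("ignored bar-baz foo()")
if m is not None:
    print(m.group(0), m.start())
```

The match object carries the position (m.start(), m.end()), which the plain findall() variants throw away.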

Peter
 
