Compare regular expressions

  • Thread starter Thomas Dybdahl Ahle
  • Start date
T

Thomas Dybdahl Ahle

Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...
 
D

Diez B. Roggisch

Thomas said:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...

It's not. For the simplest of expressions one might come up with a
relation between them, but even that would be hard. General case? No chance.

Diez
 
K

Karthik Gurusamy

Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...


What you want is finding if R2 is a superset of R1 for two given
regular languages R1 and R2. I know of some methods for finding
intersection of two regular languages; and I think the time/space
complexity is big.

So the simple answer is it is not feasible to provide such support for
two generic r.e.s without a large time/space usage. You may consult
any of the math/theory groups for more insights.

If you know already R2 >= R1 (that is you precompute and remember),
then it's a trivial to skip checking for R1 if R2 turned up negative.
You can even arrange all the Rs in a binary tree like fashion and skip
checking a whole subtree if the sub-tree's root node gave negative for
r.e. match.

Karthik
 
P

Paddy

Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...

you could OR all the individual RE's test them all at once then find
out which matched.

big_re = "|".join( r"(?P<__match_%i__>%s)" % (i, r)
for i,r in enumerate(regexps) )

now if one of the regexps matches then the corresponding named
group should have a non-empty string value.

- Paddy.
 
A

Adam Atlas

It's not. For the simplest of expressions one might come up with a
relation between them, but even that would be hard. General case? No chance.

I wouldn't say there's 'no chance'. It would require external parsing,
for sure, but if anything, it may only be impossible for unusual
cases.

(Maybe I'll try this the next time I feel like writing parsers for fun
(which is surprisingly frequently).)
 
D

Diez B. Roggisch

Adam said:
I wouldn't say there's 'no chance'. It would require external parsing,
for sure, but if anything, it may only be impossible for unusual
cases.

The external parsing of regular expressions is cheap. But once you are
there, how do you proceed? That's where it becomes hard. I can think of
some simple strategies, but nothing that can't be easily found a
counter-example for.

Diez
 
D

Diez B. Roggisch

Paddy said:
you could OR all the individual RE's test them all at once then find
out which matched.

big_re = "|".join( r"(?P<__match_%i__>%s)" % (i, r)
for i,r in enumerate(regexps) )

now if one of the regexps matches then the corresponding named
group should have a non-empty string value.

This doesn't answer the question if two of the sub-expressions matched.

Diez
 
T

Thomas Dybdahl Ahle

Den Mon, 16 Apr 2007 11:50:40 +0200 skrev Thomas Dybdahl Ahle:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a
line, as most of the time, one of them having a match means none of the
others can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in: "^strange line (\w+) and (\w+)$"
and "^strange line (\w+) (?:.*?)$" in which case I'd like to first test
the seccond and only if it mathces test the seccond.

Do anybody know if such a test is possible? if exp0.contains(exp1): ...

I found this link: http://terpstra.ca/compare.html which seams to compare
expressions. Not python expressions though.
Sadly it writes nothing about the way it does the thing, and if it will
always work.
 
P

Paddy

Paddy schrieb:







This doesn't answer the question if two of the sub-expressions matched.

Diez

Or three, or four...
If the frequencies of occurence are right and the OP wants all, then
he could use the above to get any then follow up with a search for
all
the rest to the right of the matching regexp portion to get any more.
But thats complex. Better to just do individual matches in a loop
I'd think.

- Paddy.
 
T

Thomas Dybdahl Ahle

Den Tue, 17 Apr 2007 11:59:15 -0700 skrev Paddy:
Hm, if I were to OR them all the time, then I wouldn't get any boost from
compiling.

I'll probably just stick with a try them all solution, and then change it
if I run into something that does the right thing. I believe it can make
the code something like 7 or 10 times faster.
 
C

Charles Sanders

Thomas said:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

Not what you want, and would be a LOT of work, but if I
remember correctly (from university more than 20 years ago)
there is an algorithm that could be adapted to return a list
of all the regular expressions that match a string.

I thing the base algorithms were documented in Aho and Ullman
("The Dragon book" if I remember correctly). There was one
for converting a regular expression into a Non-deterministic
Finite-state Automaton, and another for converting the NFA
to a Deterministic Finite-state Automaton. The book mentioned
that this also handles multiple regular expressions which can
be treated as the sub-expressions combined using the or operator.
Other, newer books on compiler design would probably contain
similar information in their sections on lexical analysis.

The algorithm given (if my memory is correct) only handled the
case where a true/false match/no-match result was required, but
mentioned how to generalise it to return which of several
expressions separated by or operators was matched. I think
an additional generalisation to return a set of matches rather
than one would be relatively straight-forward.

I believe these algorithms are the basis of lex/flex and
similar utilities, as well as the regular expression handling
of many languages.

Of course, I would like to emphasise that all this is based on
memory of university courses more than 20 years ago, so the
details could be wrong.

Charles
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,240
Members
46,828
Latest member
LauraCastr

Latest Threads

Top