Generating text from a regular expression

N

Nathan Harmston

Hi everyone,

I have a slightly complicated/medium sized regular expression and I
want to generate all possible words that it can match (to compare
performance of regex against an acora based matcher). Using the
regular expression as a grammar to generate all words in its language.
I was wondering if this possible in Python or possible using anything.
Google doesnt seem to give any obvious answers.

Many thanks in advance,

Nathan
 
G

Grant Edwards

I have a slightly complicated/medium sized regular expression and I
want to generate all possible words that it can match

I was wondering if this possible in Python or possible using
anything. Google doesnt seem to give any obvious answers.

We did this one a couple weeks ago.

It's not possible in the general case (there are an infinite number of
matching words for many/most regular expressions).
 
P

Paul McGuire

Hi everyone,

I have a slightly complicated/medium sized regular expression and I
want to generate all possible words that it can match (to compare
performance of regex against an acora based matcher).

The pyparsing wiki Examples page includes this regex inverter:
http://pyparsing.wikispaces.com/file/view/invRegex.py

From the module header:
# Supports:
# - {n} and {m,n} repetition, but not unbounded + or * repetition
# - ? optional elements
# - [] character ranges
# - () grouping
# - | alternation

-- Paul
 
N

Nathan Harmston

Thanks everyone, the invRegexInf is perfect.

Thanks again,

Nathan

I have a slightly complicated/medium sized regular expression and I
want to generate all possible words that it can match (to compare
performance of regex against an acora based matcher).

The pyparsing wiki Examples page includes this regex inverter:
http://pyparsing.wikispaces.com/file/view/invRegex.py
From the module header:

# Supports:
# - {n} and {m,n} repetition, but not unbounded + or * repetition
# - ? optional elements
# - [] character ranges
# - () grouping
# - | alternation

I took the liberty of modifying your invRegex.py example, adding support
for infinite repeaters. It depends on two other modules:

mergeinf.py (from http://code.activestate.com/recipes/577041) provides the
infinite merge operation.

enumre.py provides the basic functions (merge, prod, repeat, closure)
necessary to enumerate the language generated by a given regular
expression, even if it contains unbounded repeaters like *,+.  The key is
to generate shorter strings before longer ones, so in 'a*|b*' it doesn't
get stuck generating infinite a's before any b.

By example, "(a|bc)*d" corresponds to this code:

     prod(
       closure(
         merge(
           'a',
            prod('b','c'))),
       'd')

which returns an infinite generator starting with:

d
ad
aad
bcd
aaad
abcd
bcad
aaaad
aabcd
abcad
bcaad
bcbcd
aaaaad
aaabcd
aabcad
...


I got the idea from
http://userweb.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf

Finally, invRegexInf.py is based on your original regex parser. I only
modified the generation part, taking advantage of the above
infrastructure; the parser itself remains almost the same. It essentially
saves oneself the very tedious work of converting a regular expression
into the equivalent sequence of function calls as shown above. (I hope I
got it right: I like pyparsing a lot and use it whenever I feel it's
appropriate, but not as often as to remember the details...)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top