Enumerating Regular Expressions

blair.bethwaite · May 9, 2006

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Cheers,
-Blair

James Stroud · May 9, 2006

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Cheers,
-Blair

You mean like re.compile(r'.*') ?

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

blair.bethwaite · May 9, 2006

James said:
You mean like re.compile(r'.*') ?

No. I mean like:

James Stroud · May 9, 2006

No. I mean like:

a
b

You see the difficulty don't you? How will the computer know in advance
that the regex matches only a finite set of possible strings?

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Michael J. Fromberger · May 9, 2006

James Stroud said:
You see the difficulty don't you? How will the computer know in advance
that the regex matches only a finite set of possible strings?

You don't. Hence, you want something that behaves like a generator, and
will produce the strings one at a time. Preferably, for the purposes of
useful computation, in some canonical order.

I'm sorry to say I don't know of an existing Python module to do this,
although you could write one for at least the basic regular expression
operators if you wanted. The basic problem isn't all that hard to
solve, though the full generality of the re module's input syntax is far
more expressive than truly "regular" expressions from language theory.

Cheers,
-M

blair.bethwaite · May 9, 2006

James said:
You see the difficulty don't you? How will the computer know in advance
that the regex matches only a finite set of possible strings?

Well sure it might be a little difficult to figure _that_ out, although
probably not all that hard if you converted to an FSA or something. I
imagine detecting looping would be as simple as applying the right
graph algorithm.

But that's not the point, you don't strictly need to know in advance
whether the regex represents a finite or infinite language. I just
want to enumerate it, if it's going to take longer than the life of the
universe and a stack size to match to do it, then so be it. It's
surely up to the user to be careful about how they form their
expressions. One way to get around it would be to disallow the *
operator in enumerations.

Cheers,
-Blair

blair.bethwaite · May 9, 2006

Michael said:
You don't. Hence, you want something that behaves like a generator, and
will produce the strings one at a time. Preferably, for the purposes of
useful computation, in some canonical order.

Precisely, probably in order of length and value.

I'm sorry to say I don't know of an existing Python module to do this,
although you could write one for at least the basic regular expression
operators if you wanted. The basic problem isn't all that hard to
solve, though the full generality of the re module's input syntax is far
more expressive than truly "regular" expressions from language theory.

I thought that was the case, I've found a paper on the topic at least.
Maybe once I've finished some other work I'll give it a shot. It seems
like a fairly useful thing to be able to do with a regular expression
so I just guessed that somebody must have done it already. Perhaps I
can find it for another language. I'm not so sure I fancy the idea of
digging around in the re module in order to add the functionality
there...

-Blair

Dale Strickland-Clark · May 9, 2006

Any regular expression that has an asterisk in it has an infinite number of
possible matches.

If it has two asterisks, that's an infinite number squared and that's a
really big number.

You wouldn't want to print them out.

blair.bethwaite · May 9, 2006

Dale said:
Any regular expression that has an asterisk in it has an infinite number of
possible matches.

If it has two asterisks, that's an infinite number squared and that's a
really big number.

You wouldn't want to print them out.

We've been over this already. Why are people getting stuck on infinite
regular languages? I've made it quite clear that I'm only really
interested in doing this for finite languages, but that shouldn't
matter anyway. Maybe I should phrase it differently; I want a tool
that can enumerate a regex, with support for generating each string
described by the regex in some predefined order. So something like:
# pattern is the regex representing language L
reg_lang_gen = re.compile('pattern').enumerator()
while len(s) < n:
s = reg_lang_gen.next() # get next string in L
...

It's easy to imagine a problem where you'd want to enumerate a regex
that represents an infinite regular language. E.g. a particularly dumb
bruteforce password cracker, you might want to keep generating strings
from the language of choice until you find one that matches your
encrpyted version, naturally you might want to stop if you hadn't found
it within 10 characters or so.

Cheers,
-B

Diez B. Roggisch · May 9, 2006

I thought that was the case, I've found a paper on the topic at least.

Maybe once I've finished some other work I'll give it a shot. It seems
like a fairly useful thing to be able to do with a regular expression
so I just guessed that somebody must have done it already.

Just wandering: whatfor do you perceive it useful? I've been done quite a
few things with rexes - yet never it occured to me that I'd be in need of
enumeration all the gazillion of possible matches. YMMV - so what for?

Regards,

Diez

Tim Chase · May 9, 2006

Why are people getting stuck on infinite regular

languages? I've made it quite clear that I'm only really
interested in doing this for finite languages, but that
shouldn't matter anyway.

The power of regular expressions is that they define a
consice means to encapsulate an infinite number of inputs.
If you're not tapping this power, the general wisdom is to
not use regular expressions.

That said, unless you can ensure you don't have an infinite
grammer (in particular, no asterisks or plus-signs or
unbounded ranges like "{4,}"), you then have the problem of
how to navigate that infinite set.

Taking for example the most common/simple regexp:

".*"

That matches any and every string. There's a whole library
of Congress for which every written work will match that
string. It sounds a lot like Jorge Luis Borges's 1956 "The
Library of Babel".

Do you begin searching depth-first with "a" and then "aa"
and then "aaa"? Or do you begin breadth-first with "a",
then "b"...then "z", then "aa", "ab", ...? Depth-first is
far more efficient when searching a branching set of data,
but when you have infinite depth, it will take a while

Alternatively, you can use some sort of iterative-deepening
search which puts an increasing bound on how deep the
depth-first search will go before giving up for that
particular iteration. Googling on "iteritave deepening"
should bring back some ideas for that one.

If you truely have a finite grammer, it would be feasible to
write some sort of parser, ugly as it might be. As a first
pass, I'd have my parser emit python code to write N nested
loops to try each combination of each element, so something like

"[a-zA-Z_][a-zA-Z0-9_]"

became something like

def tries():
s1 = set(range(ord('a'),ord('z')+1))
s1.add(range(ord('A'),ord('Z')+1))
s1.add(ord('_'))
s2 = set(range(ord('a'),ord('z')+1))
s2.add(range(ord('A'),ord('Z')+1))
s2.add(range(ord('0'),ord('9')+1))
s2.add(ord('_'))
for a1 in s1:
for a2 in s2:
yield "%s%s" % (chr(a1), chr(a2))

Thus, the input regexp would generate python code that you
could then include in your project. Ugly, but a possible
solution. There are lots of peculiarities though that would
need to be handled. Branching, escaped characters in
character sets (well, escaping in general is a headache).

An alternative might be brute-forcing your regexp:

for a1 in range(0,256):
for a2 in range(0,256):
s = "%s%s" % (chr(a1),chr(a2))
m = re.match(s)
if m:
yield s

nested to however deep you're interested in.

All will be slow. All are pretty ugly.

-tkc

Boris Borcic · May 9, 2006

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Cheers,
-Blair

By hand write down a generator that will solve the simplest case of '.*' as a
regexp, and filter the output of that by the given regexp ?

- seriously, that's an interesting question, thanks for asking it

Mirco Wahab · May 9, 2006

Hi blair.bethwaite

I want a tool that can enumerate a regex,
with support for generating each string
described by the regex in some predefined order.

If you run the regex against some target
string, this is gonna be easy (but maybe
not what you want).

If you have the string 'Python' and run
a regex pattern \w+(greedy!) against it,
the result should read:

match

ython
match

ytho
match

yth
match

yt
match

y
match

match:.ython
match:.ytho
match:.yth
match:.yt
match:.y
match:..thon
match:..tho
match:..th
match:..t
match:...hon
match:...ho
match:...h
match:....on
match:....o
match:.....n

These are then your strings resulting from "\w+"
This can be extracted by implanting code into
the regex, it generates the strings for you then.

I dont exaclty how to do this in Python,
but in the 'Gem-Liar' you would do a simple:

$re = qr/
\w+ ## <-- insert your regex code here
(?{print "match:",'.'x$-[0],$&,"\n"}) (?!) ## this prints it
/x;

$_ = "Python!"; ## against which string
1 while /$re/g; ## print them out

This should somehow work in Python
too, but my skills aren't up to the
task ;-)

Regards

M.

Kent Johnson · May 9, 2006

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Make a generator that yields *all* strings in your chosen alphabet (see
the monthly threads about permutations and combinations for hints).
Filter with the regex. Halting is left as an exercise for the reader.
(Halting when the length reaches a predetermined limit would be one way
to do it.)

Kent

Karthik Gurusamy · May 10, 2006

Well sure it might be a little difficult to figure _that_ out, although
probably not all that hard if you converted to an FSA or something. I
imagine detecting looping would be as simple as applying the right
graph algorithm.

That's right. For finite regular language, you don't allow cycles. That
means the graph must be a DAG (directed acyclic graph. If not directed
it must be a tree, simple to check as node-count is edge-count plus
one).

Once you have a DAG, the problem becomes enumerating all paths from
root node to any final node. This is a pretty straighforward problem in
graph theory. I think you can simply find all from root-node to any
other node and discard any of the path ending in a non-final state
node.

Karthik

Utility to locate errors in regular expressions	3	May 24, 2013
JavaScript Challenge: Validating Email Addresses	1	Oct 6, 2023
Large regular expressions	1	Mar 15, 2010
How to print all expressions that match a regular expression	23	Feb 6, 2010
Trouble with regular expressions	6	Feb 7, 2009
The power of regular expressions without regular expressions.	0	Jul 17, 2013
Regular expressions	8	Dec 26, 2011
Use Regular Expressions to extract URL's	3	Apr 30, 2010

Enumerating Regular Expressions

blair.bethwaite

James Stroud

blair.bethwaite

James Stroud

Michael J. Fromberger

blair.bethwaite

blair.bethwaite

Dale Strickland-Clark

blair.bethwaite

Diez B. Roggisch

Tim Chase

Boris Borcic

Mirco Wahab

Kent Johnson

Karthik Gurusamy

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads