Enumerating Regular Expressions

  • Thread starter blair.bethwaite
  • Start date
B

blair.bethwaite

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Cheers,
-Blair
 
J

James Stroud

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Cheers,
-Blair

You mean like re.compile(r'.*') ?



--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
J

James Stroud

No. I mean like:


a
b

You see the difficulty don't you? How will the computer know in advance
that the regex matches only a finite set of possible strings?

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
M

Michael J. Fromberger

James Stroud said:
You see the difficulty don't you? How will the computer know in advance
that the regex matches only a finite set of possible strings?

You don't. Hence, you want something that behaves like a generator, and
will produce the strings one at a time. Preferably, for the purposes of
useful computation, in some canonical order.

I'm sorry to say I don't know of an existing Python module to do this,
although you could write one for at least the basic regular expression
operators if you wanted. The basic problem isn't all that hard to
solve, though the full generality of the re module's input syntax is far
more expressive than truly "regular" expressions from language theory.

Cheers,
-M
 
B

blair.bethwaite

James said:
You see the difficulty don't you? How will the computer know in advance
that the regex matches only a finite set of possible strings?

Well sure it might be a little difficult to figure _that_ out, although
probably not all that hard if you converted to an FSA or something. I
imagine detecting looping would be as simple as applying the right
graph algorithm.

But that's not the point, you don't strictly need to know in advance
whether the regex represents a finite or infinite language. I just
want to enumerate it, if it's going to take longer than the life of the
universe and a stack size to match to do it, then so be it. It's
surely up to the user to be careful about how they form their
expressions. One way to get around it would be to disallow the *
operator in enumerations.

Cheers,
-Blair
 
B

blair.bethwaite

Michael said:
You don't. Hence, you want something that behaves like a generator, and
will produce the strings one at a time. Preferably, for the purposes of
useful computation, in some canonical order.

Precisely, probably in order of length and value.
I'm sorry to say I don't know of an existing Python module to do this,
although you could write one for at least the basic regular expression
operators if you wanted. The basic problem isn't all that hard to
solve, though the full generality of the re module's input syntax is far
more expressive than truly "regular" expressions from language theory.

I thought that was the case, I've found a paper on the topic at least.
Maybe once I've finished some other work I'll give it a shot. It seems
like a fairly useful thing to be able to do with a regular expression
so I just guessed that somebody must have done it already. Perhaps I
can find it for another language. I'm not so sure I fancy the idea of
digging around in the re module in order to add the functionality
there...

-Blair
 
D

Dale Strickland-Clark

Any regular expression that has an asterisk in it has an infinite number of
possible matches.

If it has two asterisks, that's an infinite number squared and that's a
really big number.

You wouldn't want to print them out.
 
B

blair.bethwaite

Dale said:
Any regular expression that has an asterisk in it has an infinite number of
possible matches.

If it has two asterisks, that's an infinite number squared and that's a
really big number.

You wouldn't want to print them out.

We've been over this already. Why are people getting stuck on infinite
regular languages? I've made it quite clear that I'm only really
interested in doing this for finite languages, but that shouldn't
matter anyway. Maybe I should phrase it differently; I want a tool
that can enumerate a regex, with support for generating each string
described by the regex in some predefined order. So something like:
# pattern is the regex representing language L
reg_lang_gen = re.compile('pattern').enumerator()
while len(s) < n:
s = reg_lang_gen.next() # get next string in L
...

It's easy to imagine a problem where you'd want to enumerate a regex
that represents an infinite regular language. E.g. a particularly dumb
bruteforce password cracker, you might want to keep generating strings
from the language of choice until you find one that matches your
encrpyted version, naturally you might want to stop if you hadn't found
it within 10 characters or so.

Cheers,
-B
 
D

Diez B. Roggisch

I thought that was the case, I've found a paper on the topic at least.
Maybe once I've finished some other work I'll give it a shot. It seems
like a fairly useful thing to be able to do with a regular expression
so I just guessed that somebody must have done it already.

Just wandering: whatfor do you perceive it useful? I've been done quite a
few things with rexes - yet never it occured to me that I'd be in need of
enumeration all the gazillion of possible matches. YMMV - so what for?

Regards,

Diez
 
T

Tim Chase

Why are people getting stuck on infinite regular
languages? I've made it quite clear that I'm only really
interested in doing this for finite languages, but that
shouldn't matter anyway.

The power of regular expressions is that they define a
consice means to encapsulate an infinite number of inputs.
If you're not tapping this power, the general wisdom is to
not use regular expressions.

That said, unless you can ensure you don't have an infinite
grammer (in particular, no asterisks or plus-signs or
unbounded ranges like "{4,}"), you then have the problem of
how to navigate that infinite set.

Taking for example the most common/simple regexp:

".*"

That matches any and every string. There's a whole library
of Congress for which every written work will match that
string. It sounds a lot like Jorge Luis Borges's 1956 "The
Library of Babel".

Do you begin searching depth-first with "a" and then "aa"
and then "aaa"? Or do you begin breadth-first with "a",
then "b"...then "z", then "aa", "ab", ...? Depth-first is
far more efficient when searching a branching set of data,
but when you have infinite depth, it will take a while ;)
Alternatively, you can use some sort of iterative-deepening
search which puts an increasing bound on how deep the
depth-first search will go before giving up for that
particular iteration. Googling on "iteritave deepening"
should bring back some ideas for that one.

If you truely have a finite grammer, it would be feasible to
write some sort of parser, ugly as it might be. As a first
pass, I'd have my parser emit python code to write N nested
loops to try each combination of each element, so something like

"[a-zA-Z_][a-zA-Z0-9_]"

became something like

def tries():
s1 = set(range(ord('a'),ord('z')+1))
s1.add(range(ord('A'),ord('Z')+1))
s1.add(ord('_'))
s2 = set(range(ord('a'),ord('z')+1))
s2.add(range(ord('A'),ord('Z')+1))
s2.add(range(ord('0'),ord('9')+1))
s2.add(ord('_'))
for a1 in s1:
for a2 in s2:
yield "%s%s" % (chr(a1), chr(a2))

Thus, the input regexp would generate python code that you
could then include in your project. Ugly, but a possible
solution. There are lots of peculiarities though that would
need to be handled. Branching, escaped characters in
character sets (well, escaping in general is a headache).

An alternative might be brute-forcing your regexp:

for a1 in range(0,256):
for a2 in range(0,256):
s = "%s%s" % (chr(a1),chr(a2))
m = re.match(s)
if m:
yield s

nested to however deep you're interested in.

All will be slow. All are pretty ugly.

-tkc
 
B

Boris Borcic

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Cheers,
-Blair

By hand write down a generator that will solve the simplest case of '.*' as a
regexp, and filter the output of that by the given regexp ?

- seriously, that's an interesting question, thanks for asking it ;)
 
M

Mirco Wahab

Hi blair.bethwaite
I want a tool that can enumerate a regex,
with support for generating each string
described by the regex in some predefined order.

If you run the regex against some target
string, this is gonna be easy (but maybe
not what you want).

If you have the string 'Python' and run
a regex pattern \w+(greedy!) against it,
the result should read:

match:python
match:pytho
match:pyth
match:pyt
match:py
match:p
match:.ython
match:.ytho
match:.yth
match:.yt
match:.y
match:..thon
match:..tho
match:..th
match:..t
match:...hon
match:...ho
match:...h
match:....on
match:....o
match:.....n

These are then your strings resulting from "\w+"
This can be extracted by implanting code into
the regex, it generates the strings for you then.

I dont exaclty how to do this in Python,
but in the 'Gem-Liar' you would do a simple:

$re = qr/
\w+ ## <-- insert your regex code here
(?{print "match:",'.'x$-[0],$&,"\n"}) (?!) ## this prints it
/x;

$_ = "Python!"; ## against which string
1 while /$re/g; ## print them out


This should somehow work in Python
too, but my skills aren't up to the
task ;-)

Regards

M.
 
K

Kent Johnson

Hi all,

Does anybody know of a module that allows you to enumerate all the
strings a particular regular expression describes?

Make a generator that yields *all* strings in your chosen alphabet (see
the monthly threads about permutations and combinations for hints).
Filter with the regex. Halting is left as an exercise for the reader.
(Halting when the length reaches a predetermined limit would be one way
to do it.)

Kent
 
K

Karthik Gurusamy

Well sure it might be a little difficult to figure _that_ out, although
probably not all that hard if you converted to an FSA or something. I
imagine detecting looping would be as simple as applying the right
graph algorithm.

That's right. For finite regular language, you don't allow cycles. That
means the graph must be a DAG (directed acyclic graph. If not directed
it must be a tree, simple to check as node-count is edge-count plus
one).

Once you have a DAG, the problem becomes enumerating all paths from
root node to any final node. This is a pretty straighforward problem in
graph theory. I think you can simply find all from root-node to any
other node and discard any of the path ending in a non-final state
node.

Karthik
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,294
Messages
2,571,511
Members
48,213
Latest member
DonnellTol

Latest Threads

Top