Programming challenge: wildcard exclusion in Cartesian products

wkehowski

"And it doesn't really matter what language you use to implement this
algorithm, it's the idea that counts. Notation aside, all
implementations will be quite similar."

I guess I'll get out my Turing tape. ;)

What are some good references for finite state machines? Minimization?
 
Dirk Thierbach

What are some good references for finite state machines? Minimization?

A classic is "Introduction to Automata Theory, Languages, and Computation"
by Hopcroft and Ullman. But any other book about finite state machines
should cover these topics, too. Chances are good that you can just
google for a detailed explanation.

- Dirk
 
wkehowski

Hello,

The solution that would have the most utility would be one where the
elements are generated one by one, loop-like, so that they can be used
in the body of a loop. This would also sidestep the fact that, even
with exclusion, the cardinality of the target set EX^n could be in the
millions, even with a full list of wc's, that is, a list containing at
least one wc of every length in 2..(n-1). I don't know enough Lisp,
Haskell or Qi/Prolog to know whether the solutions so far can be
modified to do this. The Python program is too slow for large sets.

Walter Kehowski
 
Dinko Tenev

Dirk said:
If more time during preprocessing is allowed, another idea is to
treat the wildcard expressions as regular expressions, convert
each into a finite state machine, construct the "intersection" of
all these state machines, minimize it and then swap final and non-final
states.

Given the requirements, did you mean taking the *union* and swapping
states? Or maybe swapping states first, and then taking the
intersection?
Then you can use the resulting automaton to efficiently
enumerate S^n - W. In the above case, the resulting FSM would have just
three states.

I don't see immediately how exactly this is going to work. Unless I'm
very much mistaken, a FSA in the classical sense will accept or reject
only after the whole sequence has been consumed, and this spells
exponential time. For improved asymptotic complexity in this case,
you need to be able to at least reject in mid-sequence, and that calls
for a slightly different concept of a FSA -- is this what you meant?

Cheers,

Dinko
 
Dr.Ruud

(e-mail address removed) wrote:
The solution that would have the most utility would be one where the
elements are generated one by one, loop-like, so that they can be used
in the body of a loop. This would also sidestep the fact that, even
with exclusion, the cardinality of the target set EX^n could be in the
millions, even with a full list of wc's, that is, a list containing at
least one wc of every length in 2..(n-1). I don't know enough Lisp,
Haskell or Qi/Prolog to know whether the solutions so far can be
modified to do this. The Python program is too slow for large sets.

Use a bitmapping, see also (link removed).

Detect the exclusions with a bitwise AND.
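
One plausible reading of this in Python (my interpretation; the post
does not spell out the encoding): map each letter to one bit, summarize
a string by the set of letters it contains, and use a bitwise AND as a
cheap necessary condition before running a full wc match.

    BIT = {c: 1 << i for i, c in enumerate('ab')}

    def letter_mask(s):
        # Bitmask of the letters occurring in s (the '*' is ignored).
        return sum(BIT[c] for c in set(s) if c != '*')

    def might_match(word, wc):
        # A word can only match a wc if it contains every letter the
        # wc mentions; detect failures with a single AND.
        pm = letter_mask(wc)
        return letter_mask(word) & pm == pm

    print(might_match('aaa', '*b*a'))   # False: no b, skip the full match
    print(might_match('aab', '*b*a'))   # True: run the real matcher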
 
Dirk Thierbach

Given the requirements, did you mean taking the *union* and swapping
states? Or maybe swapping states first, and then taking the
intersection?

Whatever the requirements were. Take your pick. :)
I don't see immediately how exactly this is going to work. Unless I'm
very much mistaken, a FSA in the classical sense will accept or reject
only after the whole sequence has been consumed, and this spells
exponential time.

Exponential with respect to what? You just recursively walk the
automaton for every word of length n, so at most you'll have to check
every word in S^n once. Hence, the complexity is not worse than the
"generate and test" approach (it's in fact better, since testing is
trivial).

However, if the result set is simple (as in the example), then the
result FSM will have states that won't have transitions for every
letter, so I guess the average case will be a lot better.
For improved asymptotic complexity in this case,
you need to be able to at least reject in mid-sequence,

One can do that if there is no valid transition out of some state.

I guess one could even optimize on that: In the minimal automaton,
every state has some transition sequence that will end up in a final
state. Annotate each state with the minimum number of steps for
such a sequence. While walking the automaton, keep the maximum
number of letters still to produce in a variable. If this number is
smaller than the number in the annotation, stop and backtrack.

This should guarantee that only those states are visited that really
produce some output.
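
For concreteness, a small Python sketch of this annotate-and-prune walk
(the dict encoding, the names, and the final check are mine, not from
the thread): a reverse BFS computes each state's minimum distance to a
final state, and the walk backtracks whenever the remaining letter
budget falls below that distance.

    from collections import deque

    def min_steps_to_final(delta, finals):
        # Annotation pass: BFS over reversed transitions gives each
        # state's minimum number of steps to a final state.
        rev = {}
        for q, edges in delta.items():
            for c, r in edges.items():
                rev.setdefault(r, []).append(q)
        dist = {q: 0 for q in finals}
        queue = deque(finals)
        while queue:
            r = queue.popleft()
            for q in rev.get(r, []):
                if q not in dist:
                    dist[q] = dist[r] + 1
                    queue.append(q)
        return dist

    def words(delta, start, finals, n):
        # Yield all accepted words of length exactly n.
        dist = min_steps_to_final(delta, finals)
        def walk(q, remaining, prefix):
            if dist.get(q, n + 1) > remaining:
                return                        # stop and backtrack
            if remaining == 0:
                if q in finals:
                    yield prefix
                return
            for c, r in sorted(delta.get(q, {}).items()):
                yield from walk(r, remaining - 1, prefix + c)
        yield from walk(start, n, '')

    # Complement DFA for the running example S = {a, b}, W = {*a*b, *b*a};
    # the annotation keeps the walk out of the "stuck" state F entirely.
    delta = {'S': {'a': 'A', 'b': 'B'},
             'A': {'a': 'A', 'b': 'F'},
             'B': {'a': 'F', 'b': 'B'},
             'F': {'a': 'F', 'b': 'F'}}
    print(list(words(delta, 'S', {'S', 'A', 'B'}, 4)))   # ['aaaa', 'bbbb']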

- Dirk
 
Dinko Tenev

Dirk said:
Whatever the requirements were. Take your pick. :)

OK :)
Exponential with respect to what?

With respect to sequence length. In the example above (S = {a, b}, W =
{*a*b, *b*a}) you have to enumerate |S^n| potential sequences to get to
the pair (a^n, b^n) that is of particular interest.
You just recursively walk the
automaton for every word of length n, so at most you'll have to check
every word in S^n once.

That's right -- the sky is the limit ;)
Hence, the complexity is not worse than the
"generate and test" approach (it's in fact better, since testing is
trivial).

Testing per sequence will be faster, yes, but the overall running time
will still be a disaster, just like with the "generate and test"
solutions. The Python program tries to do better than that, and it
succeeds some of the time.
However, if the result set is simple (as in the example), then the
result FSM will have states that won't have transitions for every
letter, so I guess the average case will be a lot better.

I don't believe this case should ever arise out of the current
definition of the problem: labels are specified explicitly only for the
excluded subset, and you have to accept everything else by default.
One can do that if there is no valid transition out of some state.

One possibly can, but such states should never be encountered (see
above.)

Looking at the minimal DFA for the above example, it may be more
appropriate to detect being in a state from which there's no path to a
final state:

S: a -> A, b -> B
A: a -> A, b -> F
B: a -> F, b -> B
F: a -> F, b -> F

Here, S is the starting state, {S, A, B} are final, and F is non-final
(after swapping.) Obviously, the smart thing to do is to bail out as
soon as you're in F. The point is, though, are things guaranteed to be
as simple in the general case? :)
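
As a quick sanity check of that DFA (hypothetical code; the dict
encoding is mine), bailing out the moment the walk hits F agrees with
naive regex filtering on every word up to length 4:

    import re
    from itertools import product

    delta = {'S': {'a': 'A', 'b': 'B'},
             'A': {'a': 'A', 'b': 'F'},
             'B': {'a': 'F', 'b': 'B'},
             'F': {'a': 'F', 'b': 'F'}}
    finals = {'S', 'A', 'B'}

    def accepts(word):
        q = 'S'
        for c in word:
            q = delta[q][c]
            if q == 'F':              # bail out as soon as we reach F
                return False
        return q in finals

    w = re.compile(r'.*a.*b|.*b.*a')
    for n in range(5):
        for t in product('ab', repeat=n):
            word = ''.join(t)
            assert accepts(word) == (w.fullmatch(word) is None)
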
I guess one could even optimize on that: In the minimal automaton,
every state has some transition sequence that will end up in a final
state. Annotate each state with the minimum number of steps for
such a sequence. While walking the automaton, keep the maximum
number of letters still to produce in a variable. If this number is
smaller than the number in the annotation, stop and backtrack.

I don't think this could cut efficiently for patterns like *a*b.


Cheers,

Dinko
 
Aaron Denney

The solution that would have the most utility would be one where the
elements are generated one by one, loop-like, so that they can be used
in the body of a loop. This would also sidestep the fact that, even
with exclusion, the cardinality of the target set EX^n could be in the
millions, even with a full list of wc's, that is, a list containing at
least one wc of every length in 2..(n-1). I don't know enough Lisp,
Haskell or Qi/Prolog to know whether the solutions so far can be
modified to do this. The Python program is too slow for large sets.

In Haskell you can get this essentially for free, due to its laziness.
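
Python generators give the same one-by-one, loop-driving behaviour,
though this naive version still walks all of S^n; a sketch, with
fnmatchcase standing in for the wc matcher:

    from fnmatch import fnmatchcase
    from itertools import product

    def excluded_product(S, n, wcs):
        # Lazily yield the words of S^n that match none of the wc's.
        for letters in product(S, repeat=n):
            word = ''.join(letters)
            if not any(fnmatchcase(word, wc) for wc in wcs):
                yield word

    for word in excluded_product('ab', 3, ['*a*b', '*b*a']):
        print(word)                            # aaa, bbb

    # Laziness for free: the first element of a huge space, immediately.
    print(next(excluded_product('ab', 30, ['*a*b', '*b*a'])))   # 30 a's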
 
wkehowski

Call a wc 'free' if it satisfies the property that every letter 'a' in
it appears only in the form '*a*', and 'anchored' otherwise. What if
all wc's are free? How does this affect the DFA? Does it minimize
nontrivially? Keep in mind I'm new to DFA theory.

Walter Kehowski
 
Dirk Thierbach

With respect to sequence length.

But you cannot get rid of this. Consider S = {a, b}, W = {a}.
Then there are |S|^n result elements for n > 1, and you have to enumerate
all of them.

The best thing one can hope for is to just actually process those
elements that are really in the result set. With the FSM, you're
getting at least close to that.
I don't believe this case should ever arise out of the current
definition of the problem: labels are specified explicitly only for the
excluded subset, and you have to accept everything else by default.

If the result set is {a^n, b^n | n \in N}, then you have an automaton
where exactly this happens. Since minimal automata are unique
up to renaming of states, that's the result you'll get.
One possibly can, but such states should never be encountered (see
above.)

The automaton is:

S: a -> A, b -> B
A: a -> A
B: b -> B

All the states are final. A and B have just one transition, so
you'll be able to generate either a^n or b^n efficiently.
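
(A tiny illustrative snippet, assuming the dict encoding used in the
earlier sketches: A and B each have exactly one outgoing transition, so
generating a word of length n costs O(n) steps with no backtracking.)

    delta = {'S': {'a': 'A', 'b': 'B'}, 'A': {'a': 'A'}, 'B': {'b': 'B'}}

    def follow(first, n):
        q, word = delta['S'][first], first
        while len(word) < n:
            (c, q), = delta[q].items()   # exactly one choice each step
            word += c
        return word

    print(follow('a', 5), follow('b', 5))   # aaaaa bbbbb
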
Looking at the minimal DFA for the above example, it may be more
appropriate to detect being in a state from which there's no path to a
final state:

This cannot happen, because minimizing removes all those states.
(Or, in case you have a different definition of automaton in mind:
There'll be just one "stuck" state, where all transitions that never
go to a final state end up. Remove this state, and you'll end up
with the other definition).
I don't think this could cut efficiently for patterns like *a*b.

The point is not to "cut efficiently", the point is to enumerate
only those words that are actually in the result set. If you are
enumerating all words shorter than a given length, the above method
should guarantee this.

- Dirk
 
Dinko Tenev

Dirk said:
But you cannot get rid of this. Consider S = {a, b}, W = {a}.
Then there are |S|^n result elements for n > 1, and you have to enumerate
all of them.

Yes, but then, they are in the target set. The point here is whether
you can generate S^n - W in Theta( n * |S^n - W| ), which may be
dramatically different from Theta( n * |S^n| ).
The best thing one can hope for is to just actually process those
elements that are really in the result set. With the FSM, you're
getting at least close to that.

If the FSA has to consume the whole sequence, you're only improving
performance by a constant factor.
The automaton is:

S: a -> A, b -> B
A: a -> A
B: b -> B

The target set is specified as S^n - W, where W is everything matching
(.*a.*b|.*b.*a). Following the construction procedure in question, this
exclusion set is matched exactly by my DFA with S initial and F final.
Then, swapping final and non-final states makes {S, A, B} final, and F
non-final. Your DFA above may be equivalent, but to me it is far from
clear exactly what algorithm would build it from the given data.
This cannot happen, because minimizing removes all those states.
(Or, in case you have a different definition of automaton in mind:
There'll be just one "stuck" state, where all transitions that never
go to a final state end up. Remove this state, and you'll end up
with the other definition).

I have to admit that I've never considered this before, but as I'm
thinking of it now, a block of such states will never split by
minimization, so they're bound to end up together as a single state,
which can then be eliminated. I can see your point now.
The point is not to "cut efficiently", the point is to enumerate
only those words that are actually in the result set.

No, this is not the point. Naive filtering already does that. By
"cutting efficiently" I mean skipping over search sub-trees that don't
contain any results from the target set. If you can do this, you can
enumerate the target set in time proportional to its size, not to the
size of the search space, which is the point.


Cheers,

Dinko
 
Dinko Tenev

Call a wc 'free' if it satisfies the property that every letter 'a' in
it appears only in the form '*a*', and 'anchored' otherwise. What if
all wc's are free? How does this affect the DFA? Does it minimize
nontrivially? Keep in mind I'm new to DFA theory.

There would be no difference for single patterns, but I'm not sure into
how large a DFA a set of those would combine.


Cheers,

Dinko
 
Dirk Thierbach

Dinko Tenev said:
Dirk Thierbach wrote:

[One cannot escape exponential behaviour]
Yes, but then, they are in the target set.

Which is the point. If they are in the target set, you have to enumerate
them. If the target set is of exponential size with respect to n,
then you'll need exponential time to do that.
The point here is whether
you can generate S^n - W in Theta( n * |S^n - W| ), which may be
dramatically different from Theta( n * |S^n| ).

Exactly. Hence, you use a construction that guarantees that the time
needed is proportional to n*|S^n - W|: Every step you do will be
necessary to produce at least one word in the output set. Now, if
|S^n - W| is still exponential, then you'll still need exponential
time. But nevertheless, that's the best you can hope for.
The target set is specified as S^n - W, where W is everything matching
(.*a.*b|.*b.*a). Following the construction procedure in question, this
exclusion set is matched exactly by my DFA with S initial and F final.
Then, swapping final and non-final states makes {S, A, B} final, and F
non-final. Your DFA above may be equivalent, but to me it is far from
clear exactly what algorithm would build it from the given data.

Well, it's just the result from the minimization algorithm, where my
variant of the algorithm just prunes away the "stuck" state which can
never produce any output.
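
That pruning step can be made concrete (a sketch in my own encoding;
the minimization itself is omitted): a reverse BFS from the final
states finds every state that can still reach acceptance, and the rest
are dropped.

    from collections import deque

    def prune_stuck(delta, finals):
        # Remove states that cannot reach any final state, together
        # with all transitions into them.
        rev = {}
        for q, edges in delta.items():
            for c, r in edges.items():
                rev.setdefault(r, set()).add(q)
        alive = set(finals)
        queue = deque(finals)
        while queue:
            r = queue.popleft()
            for q in rev.get(r, ()):
                if q not in alive:
                    alive.add(q)
                    queue.append(q)
        return {q: {c: r for c, r in edges.items() if r in alive}
                for q, edges in delta.items() if q in alive}

    # Dinko's automaton with the dead state F; pruning yields Dirk's.
    delta = {'S': {'a': 'A', 'b': 'B'},
             'A': {'a': 'A', 'b': 'F'},
             'B': {'a': 'F', 'b': 'B'},
             'F': {'a': 'F', 'b': 'F'}}
    print(prune_stuck(delta, {'S', 'A', 'B'}))
    # {'S': {'a': 'A', 'b': 'B'}, 'A': {'a': 'A'}, 'B': {'b': 'B'}}
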
No, this is not the point. Naive filtering already does that.

No, it doesn't. Naive filtering will always look at the complete
input set, so, no matter what size |S^n - W| actually is, it will
always take time in proportion to |S^n|.
By "cutting efficiently" I mean skipping over search sub-trees that
don't contain any results from the target set.

Yes. Consider a different example: With the wildcard expressions
W = { b*, aa*, ab* }, you'll get S^* - W = { a }. The resulting
minimum FSM will just accept 'a' (start state, one final state, and
the "stuck" state if you insist on it), so you skip over every
other subtree when enumerating results from that automaton.

And for the previous example, you'll need something like 2*n time
to enumerate the output set instead of 2^n, because once you're in
the "a-branch", you're producing only a's, and you're pruning
away all the subtrees that start with a "b". Similarly in the
"b-branch".

Now clearer?

- Dirk
 
Dirk Thierbach

Call a wc 'free' if it satisfies the property that every letter 'a' in
it appears only in the form '*a*', and 'anchored' otherwise.

Would '*ab*' be free or anchored?
What if all wc's are free? How does this affect the DFA?

I don't know. The important point here is the interaction of all
the wc's. I don't think properties like this reduce the complexity
of interaction in an obvious way.
Does it minimize nontrivially?

I am not sure what you mean by this. Every DFA minimizes to some
other DFA, which is unique up to renaming of states. Now the
question is if that minimization reduces the complexity enough
to be noticeable (maybe that's what you mean by "nontrivially").
In general, I don't think one can say much about that without
looking at concrete examples.

- Dirk
 
Dinko Tenev

Dirk Thierbach wrote:

[A lot of stuff]
Now clearer?

- Dirk

Actually, it's getting messier and messier, and we seem to be running
around in circles. I've already lost track of the point you're trying
to make, and it seems that you're missing most of my points.

Let's leave it there, and take a break.
 
funkyj

Going in a slightly different direction ...

There has been lots of published work on how to create efficient FSMs
from regexps. Generally these FSMs are used for pattern matching (i.e.
"does string 's' match regexp 'e'?").

Is there any corresponding literature on the topic addressed by the
OP's challenge of generating the language defined by a regexp (or the
complement of that regexp)?

--jfc
 
Chris F Clark

Yes, there is literature on the generating side of the regular
expression/FSM model. In fact, the matching problem and the
generating problems are exactly equivalent. A slight variation of the
definition of how a matcher works, turns it into a generator and vice
versa. To directly generate (rather than match) from an FSM, one
simply does a walk (e.g. depth first search or breath first search)
over the machine writing out (rather than matching) the symbols used
as labels. Thus, all the theorems about matching apply to generating
and also in reverse.
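
A rough Python rendition of such a walk (the encoding is mine; an
explicit stack maintains the search context):

    def generate(delta, start, finals, max_len):
        # Write out, rather than match, the labels along each path.
        stack = [(start, '')]
        while stack:
            q, word = stack.pop()
            if q in finals:
                yield word
            if len(word) < max_len:
                for c, r in sorted(delta.get(q, {}).items(), reverse=True):
                    stack.append((r, word + c))

    delta = {'S': {'a': 'A', 'b': 'B'}, 'A': {'a': 'A'}, 'B': {'b': 'B'}}
    print(list(generate(delta, 'S', {'S', 'A', 'B'}, 3)))
    # ['', 'a', 'aa', 'aaa', 'b', 'bb', 'bbb']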

You can see this to some extent from the problem posed. If one
generates Sigma* and subtracts out the elements from some regular
language (say by matching them), that is exactly equivalent (in
strings generated) to generating the complement of the regular
language.

In fact, it is quite easy (with the correct regular expression tool,
i.e. one that handles regular expression differences) to take the
problems posed and generate provably minimal (i.e. provably maximally
efficient) generation (or matching) programs as FSMs. The provably
minimal FSM won't go down any paths that don't have some sequence that
generates an "accept" value. It is worth noting that regular languages
are closed under the difference operator, so the language resulting
from subtracting one regular expression from another is still a
regular language, which can be used to prove that the machine is
minimal.

Therefore, while the output can be exponentially larger than the
input, one should expect implementations to be able to
generate the output in linear time (relative to the size of the
output), since FSMs run in linear time relative to the string they are
processing (whether generating or matching). Under a reasonable model
of computation, one cannot do better than linear in the size of string
processed.

I'm sure if you asked under comp.theory, you would get tons of other
relevant facts from someone who understands automata theory better
than I. Note, if one does ask there, one should correct the notation.
The "*" symbol was used as in globbing, not as commonly used in
regular expressions. The notation ".*" (as someone corrected in one
of their replies) is the normal notation for what the original poster
wanted by "*".

Hope this helps,
-Chris

*****************************************************************************
Chris Clark                  Internet : (e-mail address removed)
Compiler Resources, Inc.     Web Site : http://world.std.com/~compres
23 Bailey Rd                 voice    : (508) 435-5016
Berlin, MA 01503 USA         fax      : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------
 
funkyj

Chris said:
Yes, there is literature on the generating side of the regular
expression/FSM model. In fact, the matching problem and the
generating problems are exactly equivalent. A slight variation of the
definition of how a matcher works turns it into a generator and vice
versa. To directly generate (rather than match) from an FSM, one
simply does a walk (e.g. depth-first search or breadth-first search)
over the machine writing out (rather than matching) the symbols used
as labels. Thus, all the theorems about matching apply to generating
and also in reverse.

If the language is Sigma* (rather than Sigma^n in the original post)
then doing a depth first search over the FSM requires a stack to
maintain context, right? I find it interesting that you suggest a PDA
(push-down automaton) is required to enumerate the language accepted by
an FSM since PDAs are strongly associated with CFGs (context free
grammars). Does it follow then that a Turing machine is required to
enumerate the language defined by a CFG? (that was a joke, I think).

It is worth noting that regular languages
are closed under the difference operator, so the language resulting
from subtracting one regular expression from another is still a
regular language,

I was pretty sure this was the case but it has been more than a decade
since I studied computational models in school so I wasn't absolutely
certain. While I do remember homework that called for creating FSMs
that accepted languages defined by a regexp, I don't recall having
solved the "enumeration" problem.

Your clear and concise comments did a nice job of jogging my memory.

===

It seems to me that the purpose of the original challenge was to
compare how different languages implement the solution to the problem.
For such a well understood problem as this the goal of the challenge
(to contrast the languages) is best served when participants have a
good understanding of the 'state of the art' abstract solutions (e.g.
using a stack to traverse a FSM in DFS fashion).
 
