Slow ruby regexes

R

Robert Dober

<snip>

Its funny, back in the 80..83, the courses I enjoyed the most were
computability, parallel processing, compilers and assembly. Now, when I think
about anything in those areas, I just have vague recollections and no real
facts. Since those days, work has been about databases, dull user interface
programming, 4GLs, web technology etc. Up until 4 months ago, it was an increaseing deluge of
project management methodologies, bullshitp bingo management speak, HR, budgets
and sales. Now I'm getting back to bare boned technology and its scary how much
I've forgotten! Still, at least I'm now looking at Ruby (amongst many other
things!).
Amen, but I think one shall be optimist ;)
Robert
 
R

Robert Klemme

It's important not to confuse NFAs with backtracking. For instance, Thompson's algorithm uses NFAs, but it evaluates by rewriting them on-the-fly as DFAs, not by performing a backtracking search. Again, NFAs and DFAs are equivalent, even if the naive algorithm (backtracking) for evaluating an NFA is less efficient than the naive algorithm for evaluating a DFA.

On the other hand, those modern features which really do require backtracking (surprisingly few) also exceed the capabilities of finite automata (i.e. those features correspond to non-regular languages), so at that point the regular expression engines cannot be said to be using FAs at all, but a more general type of automaton.

Strinctly speaking yes. But AFAIK the basis is usually a NFA.

Btw, because of the nature of the engine there are some more differences
between DFA and NFA, for example typically order of expressions in an
alternative matters with a NFA engine. I prefer that often, because it
allows to control the way the RX engine matches to a certain extent,
i.e. you can prioritize alternatives which can be pretty handy at times.
Again, though we must obviously keep backtracking in reserve for those patterns that need it, it is very wasteful not to use a pure FA approach for those patterns where non-regular features aren't being used.


"Mastering Regular Expressions" is very good for learning the pragmatics of working with contemporary regular expression implementations, but it is not a good source for theory.

Since I had the theory at university and at the moment I'm more
interested in using them than writing them this is good enough for me. :)

Kind regards

robert
 
M

MenTaLguY

--=-8LFfgDJxfwIo2B4Jq26r
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Btw, because of the nature of the engine there are some more differences=20
between DFA and NFA, for example typically order of expressions in an=20
alternative matters with a NFA engine. I prefer that often, because it=20
allows to control the way the RX engine matches to a certain extent,=20
i.e. you can prioritize alternatives which can be pretty handy at times.

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches. NFAs and DFAs _really
are_ equivalent in expressiveness.

-mental

--=-8LFfgDJxfwIo2B4Jq26r
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQBGH+HBSuZBmZzm14ERAvxvAJoDwNX5QGU3c0N26SRDIUcWSp7yNwCfe7mN
W1HagW3eooNPKyn94RhqzRw=
=i2SD
-----END PGP SIGNATURE-----

--=-8LFfgDJxfwIo2B4Jq26r--
 
R

Robert Klemme

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches. NFAs and DFAs _really
are_ equivalent in expressiveness.

How do you do that?

Kind regards

robert
 
M

MenTaLguY

--=-eLdkntk2YfmG8l9RGh49
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

=20
How do you do that?

Let's use a real simple example: /^(?:(ab)|(..))$/

Here, we want capturing subexpression 1 to have priority over capturing
subexpression 2. This time, I'll use a [] notation to annotate
transitions with capture boundaries and priorities.

First, the NFA:

start: ('' [start $1]->group_1 | '' [start $2]->group_2),
group_1: ('a' ->group_1b),
group_1b: ('b' [end $1,priority 1]->final),
group_2: (any ->group_2b),
group_2b: (any [end $2,priority 0]->final)

Now, the equivalent DFA:

{start}: ('' [start $1,start $2]->{group_1,group_2}),
{group_1,group_2}: (
'a' ->{group_1b,group_2b} |
(any-'a') ->{group_2b}
),
{group_1b,group_2b}: (
'b' [end $1]->{final} |
(any-'b') [end $2]->{final}
)
{group_2b}: (any [end $2]->{final})

See how that works?

More complex examples can be attacked using the same principles, though
it gets painful to work by hand pretty fast.

-mental

--=-eLdkntk2YfmG8l9RGh49
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQBGIFmpSuZBmZzm14ERAomIAKDT4xgwzMN9kLSaXGFiIy7Jwte3RgCglEYC
XEpQtL86jjUDI8Dk80pxZ9E=
=sPqJ
-----END PGP SIGNATURE-----

--=-eLdkntk2YfmG8l9RGh49--
 
M

MenTaLguY

--=-ibs43dWNRBfuPlWuP071
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

To be fair, I oversimplified a bit because I couldn't think of a better
example until just now. Though you can prune the DFA with priorities in
the NFA, an expression like /^(?:(abc)|(a)).*$/ still requires more
sophisticated handling, explicitly tracking alternative matches.

Here's an implementation of a Thompson NFA which does Perl-style
capture:

http://swtch.com/~rsc/regexp/nfa-perl.y.txt

The same approach works if you construct the DFA in advance or via
memoization of the Thompson NFA evaluation (which is probably the better
of the two for reasons already discussed).

-mental

--=-ibs43dWNRBfuPlWuP071
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQBGINWxSuZBmZzm14ERArnWAJwJPaT22kAGq5/AlFDUMYR7vv8gUACfYUBS
p1eMFB5yypVM0tCyQENNq7A=
=WhvQ
-----END PGP SIGNATURE-----

--=-ibs43dWNRBfuPlWuP071--
 
R

Robert Klemme

To be fair, I oversimplified a bit because I couldn't think of a better
example until just now. Though you can prune the DFA with priorities in
the NFA, an expression like /^(?:(abc)|(a)).*$/ still requires more
sophisticated handling, explicitly tracking alternative matches.

Here's an implementation of a Thompson NFA which does Perl-style
capture:

http://swtch.com/~rsc/regexp/nfa-perl.y.txt

The same approach works if you construct the DFA in advance or via
memoization of the Thompson NFA evaluation (which is probably the better
of the two for reasons already discussed).

Hmm... I was more after how you would notate the RX for a DFA based
engine to do the prioritization. I do not know a DFA based engine that
would actually allow to write a RX pattern to do that.

Of course, when building engines manually you can do a lot - and
probably a lot more than what actual RX engines allow. But since I'm
interested in *using* them vs. writing RX engines your solution was not
exactly what I expected. :)

Kind regards

robert
 
M

MenTaLguY

Hmm... I was more after how you would notate the RX for a DFA based
engine to do the prioritization.

We note choice points in our tags, and also tag those subsequent transitions which would disprove a particular choice (for instance, for /^(?:(abc)|(a)).*$/, the second transition in "axy" disproves the first alternative, and the third transition in "abc" disproves the second alternative).

Referring to those tags as we encounter them during FA evaluation, we can then accumulate sets of plausible captures and cull those which are disproven as we go.
I do not know a DFA based engine that would actually allow to write a RX
pattern to do that.

A non-backtracking NFA simulation (i.e. Thompson NFA) operates by constructing and evaluating the corresponding DFA on-the-fly (each unique set of NFA states corresponds to a DFA state). If you memoize the construction of DFA states, then you've got a straight DFA engine (the DFA is constructed lazily). Some RX engines have historically built the entire DFA up front, but as previously discussed, that approach is wasteful (many of the DFA states so constructed may never get visited).

The same techniques (using tagged transitions to drive other code as the FA is evaluated) are applicable in all three cases (non-backtracking NFA simulation, dynamic DFA construction, and static DFA construction).
Of course, when building engines manually you can do a lot - and
probably a lot more than what actual RX engines allow.

Right, so I thought it appropriate to give an example of an actual RX engine which allowed the same (and more) as my hand-worked examples by extending the same class of techniques I had illustrated so far.
But since I'm interested in *using* them vs. writing RX engines your
solution was not exactly what I expected. :)

I'll admit to being a little confused -- you're interested in using a RX engine rather than implementing one, but you'd rather see a description and/or hand-worked example of an implementation technique rather than a link to a finished implementation using the same techniques under discussion?

-mental
 
R

Robert Klemme

I'll admit to being a little confused -- you're interested in using a RX engine rather than implementing one, but you'd rather see a description and/or hand-worked example of an implementation technique rather than a link to a finished implementation using the same techniques under discussion?

LOL - now I'm confused, too. :) I was looking for something like (made
up fake): "To prioritize branches of an alternative in sed you do sed -e
's/foo(!2bar|!1baz)/yummy\1/g'." Of course this does not work with sed
(at least no implementation of sed that I know). Are there other tools
that allow to embed something into a RX that will do this prioritization?

Kind regards

robert
 
M

MenTaLguY

LOL - now I'm confused, too. :) I was looking for something like (made
up fake): "To prioritize branches of an alternative in sed you do sed -e
's/foo(!2bar|!1baz)/yummy\1/g'." Of course this does not work with sed
(at least no implementation of sed that I know). Are there other tools
that allow to embed something into a RX that will do this prioritization?

OHHHHHHHH.

| is a deterministic choice operator, so the branches are prioritized by default -- earlier alternatives take precedence over later ones. Just write your RX alternatives from highest to lowest priority.

Does that answer your question?

-mental
 
R

Robert Klemme

OHHHHHHHH.

I hope it doesn't hurt too much. ;-)
| is a deterministic choice operator, so the branches are prioritized by default -- earlier alternatives take precedence over later ones. Just write your RX alternatives from highest to lowest priority.

Does that answer your question?

Maybe. :)

I was under the impression that this precisely is a point of difference
between NFA and DFA based engines: sed is DFA based and for all I know
order in an alternative does not matter while it does matter for the
typical NFA engine. Did I get this wrong?

Kind regards

robert
 
M

MenTaLguY

I was under the impression that this precisely is a point of difference
between NFA and DFA based engines: sed is DFA based and for all I know
order in an alternative does not matter while it does matter for the
typical NFA engine. Did I get this wrong?

Well, for historical reasons, sed implements only "basic" (versus "extended") regular expressions, so it does not have explicit alternation via the | operator. However, we can still look at it for an example of how a DFA-based engine can handle choosing between alternatives.

Consider:

echo "aaaaa" | sed -e 's/^\(a*\)\(a*\)$/\1,\2/'

There are six possible ways to match this regular expression:

aaaaa,
aaaa,a
aaa,aa
aa,aaa
a,aaaa
,aaaaa

sed always chooses the first alternative. Choice between alternatives for explicit alternation would be handled the same way.

-mental
 
R

Robert Klemme

Well, for historical reasons, sed implements only "basic" (versus "extended") regular expressions, so it does not have explicit alternation via the | operator. However, we can still look at it for an example of how a DFA-based engine can handle choosing between alternatives.

Ah! I reckon this a specialty of GNU sed:

09:45:46 [~]: sed -ne 's#a\|b#A#gp' <<X
aaaa
bbbb
X
AAAA
AAAA
09:46:02 [~]:

On an old Solaris box I get

bash-2.03# sed -ne 's#a|b#A#gp' <<X
aaaaa
bbbbb
X
bash-2.03#

Of course GNU sed is more capable than "standard" sed...
Consider:

echo "aaaaa" | sed -e 's/^\(a*\)\(a*\)$/\1,\2/'

There are six possible ways to match this regular expression:

aaaaa,
aaaa,a
aaa,aa
aa,aaa
a,aaaa
,aaaaa

sed always chooses the first alternative. Choice between alternatives for explicit alternation would be handled the same way.

Kind regards

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,240
Messages
2,571,211
Members
47,845
Latest member
vojosay

Latest Threads

Top