if VAR =~ one of a list?

H

healyzh

I'm in the process of learning Ruby, and just finished my first script that
is intended to do actual work. I doing so I came across an interesting
question.

How would I efficiently write the following statement?

if my_variable =~ /^some/i || my_variable =~ /^soome/i || my_variable
=~ /^save/i || my_variable =~ /^Catch/i || my_variable =~ /^BLARG/i ||
my_variable =~ /^safe/i || my_variable =~ /^blarg/i || my_variable =~
/^wrest/i ||my_variable =~ /^template/

Zane
 
B

Bit Twiddler

Jeffrey Schwab said:
Because the OP asked that the statement be written "efficiently." Text
between parentheses ordinarily is captured, i.e. stored in a special
variable, after each match. The ?: tells the regular expression parser
that the parentheses are being used only for grouping, and thereby avoids
the overhead of capturing text.

Ahh... Thanks!!
 
R

Robert Klemme

When talking about efficiency then the pattern can be made even better
(manual, probably incomplete optimization):

/^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

I.e. build a Trie and convert that into a regexp.

Cheers

robert
 
R

Robert Klemme

Jeffrey said:
Robert said:
When talking about efficiency then the pattern can be made even better
(manual, probably incomplete optimization):

/^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

True, since Ruby has only an NFA (rather than a DFA) regex engine. I
find this equivalent code a little bit more readable, though:

/^(?:s(?:eek:o?m|a[vf])e|catch|blarg|wrest|template)/i

Probably. But readable != efficient. There's definitively more
backtracking in this RX than in the other one.

Cheers

robert
 
R

Robert Klemme

Jeffrey said:
Robert said:
Jeffrey said:
Robert Klemme wrote:
Bit Twiddler wrote:
Bit Twiddler wrote:
if my_variable =~
/^(?:some|soome|save|catch|blarg|safe|wrest|template)/i
Why is the "?:" required?
Because the OP asked that the statement be written "efficiently."
Text between parentheses ordinarily is captured, i.e. stored in a
special variable, after each match. The ?: tells the regular
expression parser that the parentheses are being used only for
grouping, and thereby avoids the overhead of capturing text.

When talking about efficiency then the pattern can be made even
better (manual, probably incomplete optimization):

/^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

True, since Ruby has only an NFA (rather than a DFA) regex engine. I
find this equivalent code a little bit more readable, though:

/^(?:s(?:eek:o?m|a[vf])e|catch|blarg|wrest|template)/i

Probably. But readable != efficient. There's definitively more
backtracking in this RX than in the other one.

Where? I don't see it. If anything, it looks like there is less
potential backtracking, since the e after the first alternation is not
duplicated.

Every "?", "*", "+" and "|" can cause backtracking. The plain tree
approach (my RX) immediately fails on the first character of every
branch if it doesn't match while IMHO your RX needs to test more
characters before it can fail. That's where the backtracking occurs.

Do you know the regexp coach? It makes some aspects of RX more visible
- unfortunately I could not find a way that highlights backtracking
but the tree view is quite informative.

http://www.weitz.de/regex-coach/

If you test both RX against "soomt" on the "Step" tab you'll notice that
you need 16 clicks on "next" for my RX and 18 for yours until the match
finally failed.

Kind regards

robert
 
R

Robert Klemme

Jeffrey said:
Robert said:
Jeffrey said:
Robert Klemme wrote:
Jeffrey Schwab wrote:
Robert Klemme wrote:
Bit Twiddler wrote:
Bit Twiddler wrote:
if my_variable =~
/^(?:some|soome|save|catch|blarg|safe|wrest|template)/i
Why is the "?:" required?
Because the OP asked that the statement be written
"efficiently." Text between parentheses ordinarily is captured,
i.e. stored in a special variable, after each match. The ?:
tells the regular expression parser that the parentheses are
being used only for grouping, and thereby avoids the overhead of
capturing text.

When talking about efficiency then the pattern can be made even
better (manual, probably incomplete optimization):

/^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

True, since Ruby has only an NFA (rather than a DFA) regex engine.
I find this equivalent code a little bit more readable, though:

/^(?:s(?:eek:o?m|a[vf])e|catch|blarg|wrest|template)/i

Probably. But readable != efficient. There's definitively more
backtracking in this RX than in the other one.

Where? I don't see it. If anything, it looks like there is less
potential backtracking, since the e after the first alternation is
not duplicated.
Do you know the regexp coach? It makes some aspects of RX more
visible - unfortunately I could not find a way that highlights
backtracking but the tree view is quite informative.

http://www.weitz.de/regex-coach/

If you test both RX against "soomt" on the "Step" tab you'll notice
that you need 16 clicks on "next" for my RX and 18 for yours until the
match finally failed.

Thanks, I didn't know about that site. I'll check it out when I have a
little more time. I don't think you're counting backtracks, though; I
count 3 backtracked characters for either regex applied to "soomt": the
two o's and the m, since both patterns fail immediately at the t. In
fact, I can't think of any case off the top of my head for which either
regex does more backtracking than the other.

If you watch for the sequence currently matched you'll see this

^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)
s
so
s
<empty>
....

^(?:s(?:eek:o?m|a[vf])e|catch|blarg|wrest|template)
s
so
soo
soom
so
s
<empty>
....

I'd say, more backtracking for the second RX.

I think I've also seen a RX engine that can be instrumented with
callbacks somewhere - that would also

Kind regards

robert
 
R

Robert Klemme

Jeffrey said:
Robert said:
/^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

Jeffrey said:
/^(?:s(?:eek:o?m|a[vf])e|catch|blarg|wrest|template)/i

Robert said:
http://www.weitz.de/regex-coach/

If you test both RX against "soomt" on the "Step" tab you'll notice
that you need 16 clicks on "next" for my RX and 18 for yours until
the match finally failed.
If you watch for the sequence currently matched you'll see this

^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)
s
so
s

That's because the tool isn't showing you that it tried to match "me",
failed, matched "om", failed at the "e", and backtracked the second "o"
and the "m". It still happened; it's just that the engine tried to
match the "ome" sequence all at once, so it considers the failure
atomic, and ignores the fact that its string-matching loop got to "e"
before realizing there was no match. Most DFAs probably use the same
technique, so your regex will run faster on those, and probably require
less memory too.
<empty>
....

^(?:s(?:eek:o?m|a[vf])e|catch|blarg|wrest|template)
s
so
soo
soom
so
s
<empty>
....

I'd say, more backtracking for the second RX.

The same characters were backtracked.

But with a different number of branches tested.
Which pattern is more efficient
will depend on the specifics of the regex engine.

Well, this statement is never wrong. In this case we are talking about
a NFA and I don't see any optimization a RX engine could employ. Of
course I may be missing something but I'd say the "tree" version (below)
is generally better on an NFA.
We could always try
timeit, since we're in Ruby...

17:39:31 [Temp]: ./rx-bm.rb
Rehearsal ----------------------------------------
tree 9.485000 0.000000 9.485000 ( 9.534000)
long 9.984000 0.000000 9.984000 ( 9.858000)
------------------------------ total: 19.469000sec

user system total real
tree 9.516000 0.000000 9.516000 ( 9.545000)
long 9.765000 0.000000 9.765000 ( 9.797000)
17:40:24 [Temp]: ruby -e 'p 9.765000 / 9.516000'
1.02616645649433

Kind regards

robert


#!/usr/bin/env ruby

require 'benchmark'

ITER = 10_000_000
STR = "soomt"

Benchmark.bmbm do |x|
x.report("tree") do
ITER.times { /^(?:s(?:eek:(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/ =~ STR }
end

x.report("long") do
ITER.times { /^(?:s(?:eek:o?m|a[vf])e|catch|blarg|wrest|template)/ =~ STR }
end
end
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,206
Messages
2,571,069
Members
47,678
Latest member
Aniruddha Das

Latest Threads

Top