if VAR =~ one of a list?

healyzh · Jun 7, 2006

I'm in the process of learning Ruby, and just finished my first script that
is intended to do actual work. I doing so I came across an interesting
question.

How would I efficiently write the following statement?

if my_variable =~ /^some/i || my_variable =~ /^soome/i || my_variable
=~ /^save/i || my_variable =~ /^Catch/i || my_variable =~ /^BLARG/i ||
my_variable =~ /^safe/i || my_variable =~ /^blarg/i || my_variable =~
/^wrest/i ||my_variable =~ /^template/

Zane

Bit Twiddler · Jun 7, 2006

if my_variable =~

/^(?:some|soome|save|catch|blarg|safe|wrest|template)/i

Why is the "?:" required?

Thanks,
BT

Bit Twiddler · Jun 8, 2006

Jeffrey Schwab said:
Because the OP asked that the statement be written "efficiently." Text
between parentheses ordinarily is captured, i.e. stored in a special
variable, after each match. The ?: tells the regular expression parser
that the parentheses are being used only for grouping, and thereby avoids
the overhead of capturing text.

Ahh... Thanks!!

Robert Klemme · Jun 8, 2006

When talking about efficiency then the pattern can be made even better
(manual, probably incomplete optimization):

/^(?:s(?

(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

I.e. build a Trie and convert that into a regexp.

Cheers

robert

Robert Klemme · Jun 12, 2006

Jeffrey said:
Robert said:

When talking about efficiency then the pattern can be made even better
(manual, probably incomplete optimization):

/^(?:s(?(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

Click to expand...

True, since Ruby has only an NFA (rather than a DFA) regex engine. I
find this equivalent code a little bit more readable, though:

/^(?:s(?o?m|a[vf])e|catch|blarg|wrest|template)/i

Probably. But readable != efficient. There's definitively more
backtracking in this RX than in the other one.

Cheers

robert

Robert Klemme · Jun 13, 2006

Jeffrey said:
Robert said:

Jeffrey said:

Robert Klemme wrote:
Bit Twiddler wrote:
Bit Twiddler wrote:
if my_variable =~
/^(?:some|soome|save|catch|blarg|safe|wrest|template)/i
Why is the "?:" required?
Because the OP asked that the statement be written "efficiently."
Text between parentheses ordinarily is captured, i.e. stored in a
special variable, after each match. The ?: tells the regular
expression parser that the parentheses are being used only for
grouping, and thereby avoids the overhead of capturing text.

When talking about efficiency then the pattern can be made even
better (manual, probably incomplete optimization):

/^(?:s(?(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

True, since Ruby has only an NFA (rather than a DFA) regex engine. I
find this equivalent code a little bit more readable, though:

/^(?:s(?o?m|a[vf])e|catch|blarg|wrest|template)/i

Click to expand...

Probably. But readable != efficient. There's definitively more
backtracking in this RX than in the other one.

Click to expand...

Where? I don't see it. If anything, it looks like there is less
potential backtracking, since the e after the first alternation is not
duplicated.

Every "?", "*", "+" and "|" can cause backtracking. The plain tree
approach (my RX) immediately fails on the first character of every
branch if it doesn't match while IMHO your RX needs to test more
characters before it can fail. That's where the backtracking occurs.

Do you know the regexp coach? It makes some aspects of RX more visible
- unfortunately I could not find a way that highlights backtracking
but the tree view is quite informative.

http://www.weitz.de/regex-coach/

If you test both RX against "soomt" on the "Step" tab you'll notice that
you need 16 clicks on "next" for my RX and 18 for yours until the match
finally failed.

Kind regards

robert

Robert Klemme · Jun 13, 2006

Jeffrey said:
Robert said:

Jeffrey said:

Robert Klemme wrote:
Jeffrey Schwab wrote:
Robert Klemme wrote:
Bit Twiddler wrote:
Bit Twiddler wrote:
if my_variable =~
/^(?:some|soome|save|catch|blarg|safe|wrest|template)/i
Why is the "?:" required?
Because the OP asked that the statement be written
"efficiently." Text between parentheses ordinarily is captured,
i.e. stored in a special variable, after each match. The ?:
tells the regular expression parser that the parentheses are
being used only for grouping, and thereby avoids the overhead of
capturing text.

When talking about efficiency then the pattern can be made even
better (manual, probably incomplete optimization):

/^(?:s(?(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

True, since Ruby has only an NFA (rather than a DFA) regex engine.
I find this equivalent code a little bit more readable, though:

/^(?:s(?o?m|a[vf])e|catch|blarg|wrest|template)/i

Probably. But readable != efficient. There's definitively more
backtracking in this RX than in the other one.

Where? I don't see it. If anything, it looks like there is less
potential backtracking, since the e after the first alternation is
not duplicated.

Click to expand...

Click to expand...

Do you know the regexp coach? It makes some aspects of RX more
visible - unfortunately I could not find a way that highlights
backtracking but the tree view is quite informative.

http://www.weitz.de/regex-coach/

If you test both RX against "soomt" on the "Step" tab you'll notice
that you need 16 clicks on "next" for my RX and 18 for yours until the
match finally failed.

Click to expand...

Thanks, I didn't know about that site. I'll check it out when I have a
little more time. I don't think you're counting backtracks, though; I
count 3 backtracked characters for either regex applied to "soomt": the
two o's and the m, since both patterns fail immediately at the t. In
fact, I can't think of any case off the top of my head for which either
regex does more backtracking than the other.

If you watch for the sequence currently matched you'll see this

^(?:s(?

(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)
s
so
s
<empty>
....

^(?:s(?

Robert Klemme · Jun 13, 2006

Jeffrey said:
Robert said:

/^(?:s(?(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/i

Click to expand...

Jeffrey said:

/^(?:s(?o?m|a[vf])e|catch|blarg|wrest|template)/i

Click to expand...

Robert said:

http://www.weitz.de/regex-coach/

If you test both RX against "soomt" on the "Step" tab you'll notice
that you need 16 clicks on "next" for my RX and 18 for yours until
the match finally failed.

Click to expand...

If you watch for the sequence currently matched you'll see this

^(?:s(?(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)
s
so
s

Click to expand...

That's because the tool isn't showing you that it tried to match "me",
failed, matched "om", failed at the "e", and backtracked the second "o"
and the "m". It still happened; it's just that the engine tried to
match the "ome" sequence all at once, so it considers the failure
atomic, and ignores the fact that its string-matching loop got to "e"
before realizing there was no match. Most DFAs probably use the same
technique, so your regex will run faster on those, and probably require
less memory too.

<empty>
....

^(?:s(?o?m|a[vf])e|catch|blarg|wrest|template)
s
so
soo
soom
so
s
<empty>
....

I'd say, more backtracking for the second RX.

Click to expand...

The same characters were backtracked.

But with a different number of branches tested.

Which pattern is more efficient
will depend on the specifics of the regex engine.

Well, this statement is never wrong. In this case we are talking about
a NFA and I don't see any optimization a RX engine could employ. Of
course I may be missing something but I'd say the "tree" version (below)
is generally better on an NFA.

We could always try
timeit, since we're in Ruby...

17:39:31 [Temp]: ./rx-bm.rb
Rehearsal ----------------------------------------
tree 9.485000 0.000000 9.485000 ( 9.534000)
long 9.984000 0.000000 9.984000 ( 9.858000)
------------------------------ total: 19.469000sec

user system total real
tree 9.516000 0.000000 9.516000 ( 9.545000)
long 9.765000 0.000000 9.765000 ( 9.797000)
17:40:24 [Temp]: ruby -e 'p 9.765000 / 9.516000'
1.02616645649433

Kind regards

robert

#!/usr/bin/env ruby

require 'benchmark'

ITER = 10_000_000
STR = "soomt"

Benchmark.bmbm do |x|
x.report("tree") do
ITER.times { /^(?:s(?

(?:me|ome)|a(?:ve|fe))|catch|blarg|wrest|template)/ =~ STR }
end

x.report("long") do
ITER.times { /^(?:s(?

o?m|a[vf])e|catch|blarg|wrest|template)/ =~ STR }
end
end

Getting to one attribute of my marshalled array of objects	8	Sep 18, 2009
How to download a list of PDF files?	2	Apr 28, 2008
How to access elemenst in a list of lists?	6	May 9, 2011
How to check if any item from a list of strings is in a big string?	13	Jul 10, 2009
How to list the statement of all var ?	34	Dec 26, 2005
How to get the list of all my open file(descriptor)s and locks?	0	Sep 19, 2012
comp.lang.c FAQ list Table of Contents	0	Jan 12, 2008
is list comprehension necessary?	15	Oct 26, 2010

if VAR =~ one of a list?

healyzh

Bit Twiddler

Bit Twiddler

Robert Klemme

Robert Klemme

Robert Klemme

Robert Klemme

Robert Klemme

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads