a regular expression problem

cppasm · Mar 5, 2007

hi,it seems that there's some problem with the regeular expression
processing in ruby,when I ran the code below, it happened that it loop
for ever and would not get out.
$value = "20053301831294544645645645634556354"
if $value =~ /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?

ics|pictures|
images|photos|movies)/i
puts "hit"
else
puts "miss"
end
any idea? Thank u!

Bill Kelly · Mar 5, 2007

From: said:
hi,it seems that there's some problem with the regeular expression
processing in ruby,when I ran the code below, it happened that it loop
for ever and would not get out.
$value = "20053301831294544645645645634556354"
if $value =~ /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|
images|photos|movies)/i
puts "hit"
else
puts "miss"
end
any idea? Thank u!

You can try this in irb, like:

value = "20053301831294544645645645634556354" => "20053301831294544645645645634556354"
value =~ /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|images|photos|movies)/i

Click to expand...

(takes a long time)

If you shorten value a bit:

value = "2005330183129454464564" => "2005330183129454464564"
value =~ /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|images|photos|movies)/i

Click to expand...

=> nil

You'll see that it takes a second or two, then finishes.

The problem is, the regexp involves nested variable-length matches,
so when no match is found, the engine keeps backtracking and trying
to find other possible matches, across all combinations of
(?:\d{1,3}[,\.]?)+\d{3}.{0,20}
...of which there are exponential possibilities within a string of
digits...

Hopefully, it may be possible to rewrite your regexp to fail more
quickly on bad data.If you could tell us more about the exact
specification of the data you're trying to match, people might
be able to help further.

Regards,

Bill

cppasm · Mar 5, 2007

From: said:
From: said:

hi,it seems that there's some problem with the regeular expression
processing in ruby,when I ran the code below, it happened that it loop
for ever and would not get out.
$value = "20053301831294544645645645634556354"
if $value =~ /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|
images|photos|movies)/i
puts "hit"
else
puts "miss"
end
any idea? Thank u!

Click to expand...

You can try this in irb, like:

=> "20053301831294544645645645634556354">> value =~ /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|images|photos|movies)/i

(takes a long time)

If you shorten value a bit:

=> "2005330183129454464564">> value =~ /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|images|photos|movies)/i

=> nil

You'll see that it takes a second or two, then finishes.

The problem is, the regexp involves nested variable-length matches,
so when no match is found, the engine keeps backtracking and trying
to find other possible matches, across all combinations of
(?:\d{1,3}[,\.]?)+\d{3}.{0,20}
..of which there are exponential possibilities within a string of
digits...

Hopefully, it may be possible to rewrite your regexp to fail more
quickly on bad data.If you could tell us more about the exact
specification of the data you're trying to match, people might
be able to help further.

Regards,

Bill

Thank you so much!
I'm trying to rewrite SpamAssassin in ruby just for fun

. The
regexp mentioned is just a rule in SpamAssassin's rule files.It's used
to match against every line of a mail message.In fact I tested using
the value and regexp mentioned above in Java(JDK, oro and regexp of
Apache),perl,python and ruby.It finished very soon in perl and
python,while the same thing happened in Java(whichever library I
use,jdk,oro or regexp) as in ruby,so I guess maybe the underlying
matchings mechanism between them are quite different?

Bill Kelly · Mar 5, 2007

From: <[email protected]>

The problem is, the regexp involves nested variable-length matches,
so when no match is found, the engine keeps backtracking and trying
to find other possible matches, across all combinations of
(?:\d{1,3}[,\.]?)+\d{3}.{0,20}
..of which there are exponential possibilities within a string of
digits...

Hopefully, it may be possible to rewrite your regexp to fail more
quickly on bad data.If you could tell us more about the exact
specification of the data you're trying to match, people might
be able to help further.

Click to expand...

Thank you so much!
I'm trying to rewrite SpamAssassin in ruby just for fun . The
regexp mentioned is just a rule in SpamAssassin's rule files.It's used
to match against every line of a mail message.In fact I tested using
the value and regexp mentioned above in Java(JDK, oro and regexp of
Apache),perl,python and ruby.It finished very soon in perl and
python,while the same thing happened in Java(whichever library I
use,jdk,oro or regexp) as in ruby,so I guess maybe the underlying
matchings mechanism between them are quite different?

Yes, although not so much the underlying engine, but some
regex matchers are optimized to try to short-circuit certain
patterns to fail fast. From what I've read, a lot of code has
been added to the Perl regexp matcher over the years to make it
smarter at failing fast on various nested variable-length
expressions.

Still, it is often possible to translate an expression into
something that will fail fast without requiring engine-specific
optimizations.

If you HAVE to parse SpamAssassin rule files without modification,
then, you will likely run into more situations like this where a
pattern that failed fast due to optimizations in the Perl matcher,
runs too slowly in a regexp engine without the equivalent shortcut
optimization. :-(

Hope this helps,

Bill

Robert Klemme · Mar 5, 2007

From: <[email protected]>

The problem is, the regexp involves nested variable-length matches,
so when no match is found, the engine keeps backtracking and trying
to find other possible matches, across all combinations of
(?:\d{1,3}[,\.]?)+\d{3}.{0,20}
..of which there are exponential possibilities within a string of
digits...

Hopefully, it may be possible to rewrite your regexp to fail more
quickly on bad data.If you could tell us more about the exact
specification of the data you're trying to match, people might
be able to help further.

Click to expand...

Thank you so much!
I'm trying to rewrite SpamAssassin in ruby just for fun . The
regexp mentioned is just a rule in SpamAssassin's rule files.It's used
to match against every line of a mail message.In fact I tested using
the value and regexp mentioned above in Java(JDK, oro and regexp of
Apache),perl,python and ruby.It finished very soon in perl and
python,while the same thing happened in Java(whichever library I
use,jdk,oro or regexp) as in ruby,so I guess maybe the underlying
matchings mechanism between them are quite different?

Click to expand...

Yes, although not so much the underlying engine, but some
regex matchers are optimized to try to short-circuit certain
patterns to fail fast. From what I've read, a lot of code has
been added to the Perl regexp matcher over the years to make it
smarter at failing fast on various nested variable-length
expressions.

Still, it is often possible to translate an expression into
something that will fail fast without requiring engine-specific
optimizations.

If you HAVE to parse SpamAssassin rule files without modification,
then, you will likely run into more situations like this where a
pattern that failed fast due to optimizations in the Perl matcher,
runs too slowly in a regexp engine without the equivalent shortcut
optimization. :-(

Hm, one quick manual optimization is to use

value =~ /\b(?

ics|pictures|images|photos|movies)/i &&
value =~
/\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?

ics|pictures|images|photos|movies)/i

i.e. precheck the texts and use short circuit evaluation. That at least
reduces the number of patterns dramatically where the exponential
runtime hits you.

Ah, I found a better one:

value =~
/\b\d{1,3}(?:[,.]?\d{3})*.{0,20}\b(?

ics|pictures|images|photos|movies)/i

That one's really fast on your value because it involves much less
backtracking.

I don't have my "Mastering Regular Expressions" handy but there you can
also find techniques for "loop unrolling" which should help in this case.

Btw, you don't need to escape the dot in the character class. It does
not have special meaning there.

Kind regards

robert

cppasm · Mar 5, 2007

From: <[email protected]>

The problem is, the regexp involves nested variable-length matches,
so when no match is found, the engine keeps backtracking and trying
to find other possible matches, across all combinations of
(?:\d{1,3}[,\.]?)+\d{3}.{0,20}
..of which there are exponential possibilities within a string of
digits...
Hopefully, it may be possible to rewrite your regexp to fail more
quickly on bad data.If you could tell us more about the exact
specification of the data you're trying to match, people might
be able to help further.
Thank you so much!
I'm trying to rewrite SpamAssassin in ruby just for fun . The
regexp mentioned is just a rule in SpamAssassin's rule files.It's used
to match against every line of a mail message.In fact I tested using
the value and regexp mentioned above in Java(JDK, oro and regexp of
Apache),perl,python and ruby.It finished very soon in perl and
python,while the same thing happened in Java(whichever library I
use,jdk,oro or regexp) as in ruby,so I guess maybe the underlying
matchings mechanism between them are quite different?

Click to expand...

Click to expand...

Yes, although not so much the underlying engine, but some
regex matchers are optimized to try to short-circuit certain
patterns to fail fast. From what I've read, a lot of code has
been added to the Perl regexp matcher over the years to make it
smarter at failing fast on various nested variable-length
expressions.

Click to expand...

Still, it is often possible to translate an expression into
something that will fail fast without requiring engine-specific
optimizations.

Click to expand...

If you HAVE to parse SpamAssassin rule files without modification,
then, you will likely run into more situations like this where a
pattern that failed fast due to optimizations in the Perl matcher,
runs too slowly in a regexp engine without the equivalent shortcut
optimization. :-(

Click to expand...

Hm, one quick manual optimization is to use

value =~ /\b(?ics|pictures|images|photos|movies)/i &&
value =~
/\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|images|photos|movies)/i

i.e. precheck the texts and use short circuit evaluation. That at least
reduces the number of patterns dramatically where the exponential
runtime hits you.

Ah, I found a better one:

value =~
/\b\d{1,3}(?:[,.]?\d{3})*.{0,20}\b(?ics|pictures|images|photos|movies)/i

That one's really fast on your value because it involves much less
backtracking.

I don't have my "Mastering Regular Expressions" handy but there you can
also find techniques for "loop unrolling" which should help in this case.

Btw, you don't need to escape the dot in the character class. It does
not have special meaning there.

Kind regards

robert

Thank you for your optimization,but I think there's a little problem
with it .For example,when the value is
'2.efgabc pics' it won't match the original one but will match the
optimized one.I think adding '\d{3}' after the asteroid would be ok.

Regards,
cppasm

Robert Klemme · Mar 5, 2007

From: <[email protected]>
The problem is, the regexp involves nested variable-length matches,
so when no match is found, the engine keeps backtracking and trying
to find other possible matches, across all combinations of
(?:\d{1,3}[,\.]?)+\d{3}.{0,20}
..of which there are exponential possibilities within a string of
digits...
Hopefully, it may be possible to rewrite your regexp to fail more
quickly on bad data.If you could tell us more about the exact
specification of the data you're trying to match, people might
be able to help further.
Thank you so much!
I'm trying to rewrite SpamAssassin in ruby just for fun . The
regexp mentioned is just a rule in SpamAssassin's rule files.It's used
to match against every line of a mail message.In fact I tested using
the value and regexp mentioned above in Java(JDK, oro and regexp of
Apache),perl,python and ruby.It finished very soon in perl and
python,while the same thing happened in Java(whichever library I
use,jdk,oro or regexp) as in ruby,so I guess maybe the underlying
matchings mechanism between them are quite different?
Yes, although not so much the underlying engine, but some
regex matchers are optimized to try to short-circuit certain
patterns to fail fast. From what I've read, a lot of code has
been added to the Perl regexp matcher over the years to make it
smarter at failing fast on various nested variable-length
expressions.
Still, it is often possible to translate an expression into
something that will fail fast without requiring engine-specific
optimizations.
If you HAVE to parse SpamAssassin rule files without modification,
then, you will likely run into more situations like this where a
pattern that failed fast due to optimizations in the Perl matcher,
runs too slowly in a regexp engine without the equivalent shortcut
optimization. :-(

Click to expand...

Hm, one quick manual optimization is to use

value =~ /\b(?ics|pictures|images|photos|movies)/i &&
value =~
/\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|images|photos|movies)/i

i.e. precheck the texts and use short circuit evaluation. That at least
reduces the number of patterns dramatically where the exponential
runtime hits you.

Ah, I found a better one:

value =~
/\b\d{1,3}(?:[,.]?\d{3})*.{0,20}\b(?ics|pictures|images|photos|movies)/i

That one's really fast on your value because it involves much less
backtracking.

I don't have my "Mastering Regular Expressions" handy but there you can
also find techniques for "loop unrolling" which should help in this case.

Btw, you don't need to escape the dot in the character class. It does
not have special meaning there.

Kind regards

robert

Click to expand...

Thank you for your optimization,but I think there's a little problem
with it .For example,when the value is
'2.efgabc pics' it won't match the original one but will match the
optimized one.I think adding '\d{3}' after the asteroid would be ok.

What asteroid? Is an impact imminent? Do we have to duck and cover?

SCNR

Seriously: I thought with your initial portion of the RX you wanted to
match something like "1,000,000", "1,000" and "10" (plus "." as
separator as well as no separator) although I realize that your original
does also match "1,0,0,00,0" etc. Your suggested "fix" won't make the
two RX more similar with regard to matching because mine always matches
exactly three digits between a separator. Since I don't know what you
are actually trying to match I'm out of suggestions at the moment.

Kind regards

robert

cppasm · Mar 5, 2007

Well, just behind the only asteroid in your optimization regex

.:
your version:
/\b\d{1,3}(?:[,.]?\d{3})*.{0,20}\b(?

ics|pictures|images|photos|
movies)/i
revision to your version:
/\b\d{1,3}(?:[,.]?\d{3})*\d{3}.{0,20}\b(?

ics|pictures|images|photos|
movies)/i

In fact the original regex appeared in one of
SpamAssassin(spamassassin.apache.org)'s rule files(20_porn.cf).Here is
the part of that file describing this regex:
body LOTS_OF_STUFF /\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?

ics|
pictures|images|photos|movies)/i
describe LOTS_OF_STUFF Thousands or millions of pictures, movies,
etc.
It's used to match against every line of a mail message. I think it's
created based on the analysis of many mail messages.
"Robert Klemme Ð´µÀ£º
"

On 05.03.2007 08:08, Bill Kelly wrote:

From: <[email protected]>
The problem is, the regexp involves nested variable-length matches,
so when no match is found, the engine keeps backtracking and trying
to find other possible matches, across all combinations of
(?:\d{1,3}[,\.]?)+\d{3}.{0,20}
..of which there are exponential possibilities within a string of
digits...
Hopefully, it may be possible to rewrite your regexp to fail more
quickly on bad data.If you could tell us more about the exact
specification of the data you're trying to match, people might
be able to help further.
Thank you so much!
I'm trying to rewrite SpamAssassin in ruby just for fun . The
regexp mentioned is just a rule in SpamAssassin's rule files.It's used
to match against every line of a mail message.In fact I tested using
the value and regexp mentioned above in Java(JDK, oro and regexp of
Apache),perl,python and ruby.It finished very soon in perl and
python,while the same thing happened in Java(whichever library I
use,jdk,oro or regexp) as in ruby,so I guess maybe the underlying
matchings mechanism between them are quite different?
Yes, although not so much the underlying engine, but some
regex matchers are optimized to try to short-circuit certain
patterns to fail fast. From what I've read, a lot of code has
been added to the Perl regexp matcher over the years to make it
smarter at failing fast on various nested variable-length
expressions.
Still, it is often possible to translate an expression into
something that will fail fast without requiring engine-specific
optimizations.
If you HAVE to parse SpamAssassin rule files without modification,
then, you will likely run into more situations like this where a
pattern that failed fast due to optimizations in the Perl matcher,
runs too slowly in a regexp engine without the equivalent shortcut
optimization. :-(
Hm, one quick manual optimization is to use

value =~ /\b(?ics|pictures|images|photos|movies)/i &&
value =~
/\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?ics|pictures|images|photos|movies)/i

i.e. precheck the texts and use short circuit evaluation. That at least
reduces the number of patterns dramatically where the exponential
runtime hits you.

Ah, I found a better one:

value =~
/\b\d{1,3}(?:[,.]?\d{3})*.{0,20}\b(?ics|pictures|images|photos|movies)/i

That one's really fast on your value because it involves much less
backtracking.

I don't have my "Mastering Regular Expressions" handy but there you can
also find techniques for "loop unrolling" which should help in this case.

Btw, you don't need to escape the dot in the character class. It does
not have special meaning there.

Kind regards

robert

Click to expand...

Thank you for your optimization,but I think there's a little problem
with it .For example,when the value is
'2.efgabc pics' it won't match the original one but will match the
optimized one.I think adding '\d{3}' after the asteroid would be ok.

Click to expand...

What asteroid? Is an impact imminent? Do we have to duck and cover?

SCNR

Seriously: I thought with your initial portion of the RX you wanted to
match something like "1,000,000", "1,000" and "10" (plus "." as
separator as well as no separator) although I realize that your original
does also match "1,0,0,00,0" etc. Your suggested "fix" won't make the
two RX more similar with regard to matching because mine always matches
exactly three digits between a separator. Since I don't know what you
are actually trying to match I'm out of suggestions at the moment.

Kind regards

robert

how to solve this regular expression problem	6	Mar 31, 2010
Regular Expression help	5	Mar 29, 2008
a regular expression problem	1	Oct 10, 2008
Picture Comparison Code Not Working Properly	1	Jul 24, 2021
FAQ 6.20 What good is "\G" in a regular expression?	0	Mar 3, 2011
Regular expression problem	1	Jun 22, 2008
Record-separator is a regular expression	8	Nov 29, 2005
Regular expression problem	7	Mar 24, 2006

a regular expression problem

cppasm

Bill Kelly

cppasm

Bill Kelly

Robert Klemme

cppasm

Robert Klemme

cppasm

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads