ruby script hangs on regex match

  • Thread starter Adrian Petru Dimulescu
  • Start date
A

Adrian Petru Dimulescu

--------------010804070103040002070800
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Hello,

I have a regex infinte loop kind of problem. I use ruby 1.8.2. The
regular expression I used was:

[tT]he\s+(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*((\s*(\,|and|or)\s*)*[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))\s+((\w+\s+){0,3}\s*(proteins|genes|protein|gene))

I try to match this regex against the string, without the quotes (the
following should be a whole single line):

"to this end , NK_CTL clones derived from four donors ( KK , GG , GF ,
and DP ) were tested for their ability to lyse the TAP2_deficient
RMA_S\HLA_E cell_line incubated with serial_dilutions of the VMAPRTLIL ,
VMAPRTLVL , VMAPRTLLL , and VMAPRALLL peptides ."

Normally the regex is not supposed to match against this particular
string. What it does, it hangs with 100% CPU consumption, while for lots
of other lines of text it works ok. Am I doing something wrong?

I tried doing the same regex in perl, it complained about the string
having the \H unknown control sequence (it appears indeed in the text at
some point). If I replace the "\H" by, let's say, "-H", the regular
expression passes through without finding anything in Perl -- which is
normal, as I said. Ruby hangs even if I change the "\" in "-".

Needless to say I would much appreciate some help on this one. Feel free
to ask for explanation of that complicated regex if needed or any other
information.

As attachment, the ruby script that hangs.

Best regards,
Adrian Dimulescu.


--------------010804070103040002070800
Content-Type: text/plain;
name="regex-hangs.rb"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
filename="regex-hangs.rb"

IyEgL3Vzci9iaW4vcnVieSAtdwoKc3RyaW5nID0gJ3RvIHRoaXMgZW5kICwgTktfQ1RMIGNs
b25lcyBkZXJpdmVkIGZyb20gZm91ciBkb25vcnMgKCBLSyAsIEdHICwgR0YgLCBhbmQgRFAg
KSB3ZXJlIHRlc3RlZCBmb3IgdGhlaXIgYWJpbGl0eSB0byBseXNlIHRoZSBUQVAyX2RlZmlj
aWVudCBSTUFfU1xITEFfRSBjZWxsX2xpbmUgaW5jdWJhdGVkIHdpdGggc2VyaWFsX2RpbHV0
aW9ucyBvZiB0aGUgVk1BUFJUTElMICwgVk1BUFJUTFZMICwgVk1BUFJUTExMICwgYW5kIFZN
QVBSQUxMTCBwZXB0aWRlcyAuJwpzdHJpbmcgPX4gL1t0VF1oZVxzKygoW1x3XGRcX10rKD86
W2EtekEtWl1bXGRBLVpdfFtcZEEtWl1bYS16QS1aXSlbXHdcZFxfXSooKFxzKihcLHxhbmR8
b3IpXHMqKSpbXHdcZFxfXSsoPzpbYS16QS1aXVtcZEEtWl18W1xkQS1aXVthLXpBLVpdKVtc
d1xkXF9dKikqKSlccysoKFx3K1xzKyl7MCwzfVxzKihwcm90ZWluc3xnZW5lc3xwcm90ZWlu
fGdlbmUpKS8KcHJpbnQgImkgZm91bmQ6ICIgKyAkJg==
--------------010804070103040002070800--
 
D

Daniel Brockman

Hi Adrian,
I have a regex infinte loop kind of problem. I use ruby
1.8.2. The regular expression I used was:

[tT]he\s+(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])
[\w\d\_]*((\s*(\,|and|or)\s*)*[\w\d\_]+(?:[a-zA-Z][\dA-Z]
|[\dA-Z][a-zA-Z])[\w\d\_]*)*))\s+((\w+\s+){0,3}\s*
(proteins|genes|protein|gene))

I just wanted to point out that if you add the /x flag to
the regexp, you can insert whitespace and comments at will.
That can sometimes make the expression pseudo-readable.
 
W

Wolfgang Nádasi-Donner

snip >>>>>
..
((\s*(\,|and|or)\s*)*
..
..
..
(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))
..
When I look at this parts of your Regex, I see constructs like (a*|b*)* - These constructs consume some
billion of years if applied to a string where no match is possible. Take a closer look to "the Friedl" for
avoiding situations like this.

Maybe I'm wrong, but there are severe problems to read the Regex.

Best Regards, Wolfgang Nadasi-Donner
 
A

Adrian Petru Dimulescu

Daniel said:
I just wanted to point out that if you add the /x flag to
the regexp, you can insert whitespace and comments at will.
That can sometimes make the expression pseudo-readable.
Hello Daniel,

thanks for pointing that out, I didn't know it.

I will quickly explain this regex: it's supposed to match strings like:

"the hek2p , a2p , and b3p multidomain proteins".

A protein usually contains a digit inside or an uppercase.

Here is the whole regex:

[tT]he => the
\s+

(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]* => a protein
((\s*(\,|and|or)\s*)* => a comma-separator, "and", "or"
[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*)) => another protein (this with the preceding make up a protein comma-separated sequence)

\s+

((\w+\s+){0,3}\s* => some adjectives (at most 3)

(proteins|genes|protein|gene))



I'll also detail the protein regex which occurs 2 times in the big regex:

([\w\d\_]+ => some letters or digits
(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z]) -> necessarily a digit or uppercase but with at least a letter before or after
[\w\d\_]* => some other letters or digits

the complexity of the second part of the protein regex is in order to avoid matching a simple number (only digits)


Here it is, I hope it's clearer. Please tell me if I should file a bug.


Best regards,
Adrian.
 
A

Adrian Petru Dimulescu

Wolfgang said:
When I look at this parts of your Regex, I see constructs like (a*|b*)*=
- These constructs consume some
billion of years if applied to a string where no match is possible. Take=
a closer look to "the Friedl" for
avoiding situations like this.
=20
Hello,

I am sure this regex is far from being optimized as I am really no
expert in regular expressions. It is also more like the kind of script
that only gets executed once.

But I do pass the regex onto thousands of lines and many of those do not
match. The not-matching happens very fast (hundreds of phrases treated
per second) - until that specific one.

Best regards,
Adrian.
 
W

William James

Adrian said:
Hello,

I have a regex infinte loop kind of problem. I use ruby 1.8.2. The
regular expression I used was:

[tT]he\s+(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*((\s*(\,|and|or)\s*)*[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))\s+((\w+\s+){0,3}\s*(proteins|genes|protein|gene))

I try to match this regex against the string, without the quotes (the
following should be a whole single line):

"to this end , NK_CTL clones derived from four donors ( KK , GG , GF ,
and DP ) were tested for their ability to lyse the TAP2_deficient
RMA_S\HLA_E cell_line incubated with serial_dilutions of the VMAPRTLIL ,
VMAPRTLVL , VMAPRTLLL , and VMAPRALLL peptides ."



regexp =
/ # "The " or "the "
[Tt]he\s+
# Protein consists of letters & digits. Must not be only digits.
(?!\d+\s)
[A-Za-z\d]+\s+
(
# ", "
,\s+
# Optional "and "
(and\s+)?
# Protein consists of letters & digits. Must not be only digits.
(?!\d+\s)
[A-Za-z\d]+\s+
)*
# 0--3 adjectives
([A-Za-z]+\s+){0,3}
(protein|gene)s?
/x
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,817
Latest member
DicWeils

Latest Threads

Top