catastrophic regexp, help!

cirfu · Jun 11, 2008

pat = re.compile("(\w* *)*")
this matches all sentences.
if fed the string "are you crazy? i am" it will return "are you
crazy".

i want to find, in a big string, a sentence containing Zlatan
Ibrahimovic and some other text,
i.e. return the first sentence containing the name Zlatan Ibrahimovic.

patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")
should do this according to regexcoach, but it pegs my computer at
100% CPU and the process cannot be closed.

Maric Michaud · Jun 11, 2008

This kind of regexp is quite often harmful. It is perfectly valid, and it would
return if you gave it the time, but it checks too many things to be practical.

Read it sequentially to make sense of it: for each sequence of word + space,
trying the longest first, does the string 'zlatan' follow?

'this is zlatan example.'
compare with 'this is zlatan example', 'z'=='.', false
compare with 'this is zlatan ', 'z'=='e', false
compare with 'this is zlatan', 'z'==' ', false
compare with 'this is ', "zlatan"=="zlatan", true
compare with 'this is', 'z'==' ', false
compare with 'this ', 'z'=='i', false
compare with 'this', 'z'==' ', false
...

ouch !

The simpler your regexes are, the better; two short regexes are better
than one big one, etc.
Don't do premature optimization (especially with regexps).

In [161]: s="""pat = re.compile("(\w* *)*")
this matches all sentences.
if fed the string "are you crazy? i am" it will return "are you
crazy".
i want to find a in a big string a sentence containing Zlatan
Ibrahimovic and some other text.
ie return the first sentence containing the name Zlatan Ibrahimovic.
patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")
should do this according to regexcoach but it seems to send my
computer into 100%CPU-power and not closable.
"""

In [172]: list(e[0] for e in re.findall("((\w+\s*)+)", s, re.M)
               if re.findall('zlatan\s+ibrahimovic', e[0], re.I))
Out[172]:
['i want to find a in a big string a sentence containing Zlatan\nIbrahimovic and some other text',
 'ie return the first sentence containing the name Zlatan Ibrahimovic',
 'zlatan ibrahimovic ']
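
The same two-step idea (first cut the text into sentences, then test each one with a small pattern) can be sketched as a function. This is only an illustration under stated assumptions: it targets Python 3, the name `sentences_containing` is made up for this example, and treating `.`, `?` and `!` followed by whitespace as a sentence boundary is a crude guess that will mis-split abbreviations.

```python
import re

def sentences_containing(text, phrase):
    """Return every sentence of `text` that contains `phrase` (case-insensitive).

    Hypothetical helper, not from the thread. A "sentence" here is anything
    that ends with '.', '?' or '!' followed by whitespace.
    """
    # Step 1: split after sentence-ending punctuation.
    sentences = re.split(r"(?<=[.?!])\s+", text)
    # Step 2: a short pattern where any run of whitespace (including a
    # newline) may separate the words of the phrase.
    words = [re.escape(word) for word in phrase.split()]
    needle = re.compile(r"\s+".join(words), re.IGNORECASE)
    return [sentence for sentence in sentences if needle.search(sentence)]

sample = "Are you crazy? I am. I saw Zlatan\nIbrahimovic score twice. Nothing here."
print(sentences_containing(sample, "zlatan ibrahimovic"))
```

Neither pattern nests one quantifier inside another, so a sentence that does not contain the name is rejected after a single scan instead of after many backtracking attempts.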

Maric Michaud · Jun 11, 2008

On Wednesday 11 June 2008 09:08:53, Maric Michaud wrote:

'this is zlatan example.'
compare with 'this is zlatan example', 'z'=='.', false
compare with 'this is zlatan ', 'z'=='e', false
compare with 'this is zlatan', 'z'==' ', false
compare with 'this is ', "zlatan"=="zlatan", true

Ah, no! It stops here, but it would have continued over the entire string, down to
the empty string, if the string didn't contain zlatan at all.
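
The failing case is the expensive one, and a small timing experiment can make that visible. The sketch below assumes Python 3; the input (one long word with no 'z' in it) and the helper name `time_failed_match` are invented for the illustration, and the trailing `z` stands in for the ' zlatan ibrahimovic ' part of the original pattern. Actual timings depend on the machine and Python version, so none are quoted here.

```python
import re
import time

# The nested-quantifier shape from the thread, followed by a letter
# that never appears in the test input.
NESTED = re.compile(r"(\w* *)*z")
# A flat character class that accepts the same characters, with no
# repetition nested inside another repetition.
FLAT = re.compile(r"[\w ]*z")

def time_failed_match(pattern, text):
    """Return (match result, seconds spent) for one anchored match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    text = "a" * n  # hypothetical input: one long word, no 'z' anywhere
    _, nested_secs = time_failed_match(NESTED, text)
    _, flat_secs = time_failed_match(FLAT, text)
    print(f"n={n:2d}  nested={nested_secs:.5f}s  flat={flat_secs:.6f}s")
```

With the nested form, the engine may try every way of cutting the word into chunks before giving up, so the time for a failed match should grow much faster than the input length. Keep `n` small when experimenting.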

Chris · Jun 11, 2008

Maybe something like this would be of use...

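def sentence_locator(s, sub):
    # Count case-insensitive occurrences of sub; nothing to do if there are none.
    cnt = s.upper().count(sub.upper())
    if not cnt:
        return None
    tmp = []
    idx = -1
    while cnt:
        # Position of the next occurrence of sub.
        idx = s.upper().find(sub.upper(), (idx+1))
        a = -1
        while True:
            # Walk the periods before idx; a ends up at the last one.
            b = s.find('.', (a+1), idx)
            if b == -1:
                # No more periods before idx: look for the closing one.
                b = s.find('.', idx)
                if b == -1:
                    tmp.append(s[a+1:])
                    break
                tmp.append(s[a+1:b+1])
                break
            a = b
        cnt -= 1
    return tmp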
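
If only the first matching sentence is needed, the same idea can be reduced to one `find` and one `rfind`. This is a sketch under assumptions, not a drop-in replacement for the function above: it targets Python 3, the name `first_sentence_containing` is invented here, only '.' counts as a sentence end (as in the code above), and it assumes lowercasing does not change the length of the text, which holds for ASCII but not for every Unicode character.

```python
def first_sentence_containing(text, phrase):
    """Return the first '.'-delimited sentence containing `phrase`, or None.

    Hypothetical helper. The comparison is case-insensitive.
    """
    idx = text.lower().find(phrase.lower())
    if idx == -1:
        return None
    # Last period before the phrase; rfind returns -1 if there is none,
    # so start becomes 0 (the beginning of the text).
    start = text.rfind(".", 0, idx) + 1
    # First period at or after the phrase; fall back to the end of the text.
    end = text.find(".", idx)
    end = len(text) if end == -1 else end + 1
    return text[start:end].strip()

sample = "Are you crazy. I think Zlatan Ibrahimovic is great. Bye."
print(first_sentence_containing(sample, "zlatan ibrahimovic"))
```

Each call scans the text a fixed number of times, so the cost stays proportional to the length of the text.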

TheSaint · Jun 11, 2008

patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")

I think you shouldn't put anything around the phrase you want to find.

patzln = re.compile(r'.*(zlatan ibrahimovic){1,1}.*')

This should do it for you, unless you need to search at a particular position.

On the other hand, I'd like to understand how I can substitute a variable
inside a pattern.

If I do:
import os, re
EOL = os.linesep

re_EOL = re.compile(r'[?P<EOL>\s+2\t]')

for line in open('myfile', 'r').readlines():
    print(re_EOL.sub('', line))

will it remove tabs, spaces and end-of-line?
It removes the tabs and spaces, but not the EOL.
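
On putting a variable into a pattern: the `(?P<name>...)` syntax names a group and does not read a Python variable, and inside square brackets its characters are taken literally. One common approach is to build the pattern string with ordinary concatenation and pass the variable through `re.escape`. The sketch below assumes Python 3, and the name `make_stripper` is invented for the illustration. It also assumes the file is read in text mode, where line ends normally arrive as '\n' whatever `os.linesep` is, which is why '\n' is the default.

```python
import re

def make_stripper(eol="\n"):
    """Build a function that removes spaces, tabs and `eol` from a line.

    Hypothetical helper. `eol` is inserted into the pattern via re.escape
    so that any special characters in it are matched literally.
    """
    pattern = re.compile(r"[ \t]+|" + re.escape(eol))
    return lambda line: pattern.sub("", line)

strip_line = make_stripper("\n")
print(repr(strip_line("a \tb c\n")))
```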

cirfu · Jun 12, 2008

patzln = re.compile(r'.*(zlatan ibrahimovic){1,1}.*')

it returns all the sentences. i just want the one containing zlatan
ibrahimovic.
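
A likely reason for this behaviour is that `.` matches every character except a newline, so `.*` on either side of the name swallows the whole line, sentence boundaries included. Replacing `.` with a class that excludes sentence-ending punctuation keeps the match inside one sentence. This is a minimal sketch assuming Python 3; the sample text is made up, and treating only '.', '?' and '!' as sentence ends is an assumption.

```python
import re

text = "are you crazy? i think zlatan ibrahimovic is great. bye."

# '.*' runs across sentence boundaries: the match is the entire line.
greedy = re.search(r".*zlatan ibrahimovic.*", text).group(0)

# '[^.?!]*' stops at sentence-ending punctuation on both sides.
bounded = re.search(r"[^.?!]*zlatan ibrahimovic[^.?!]*", text).group(0).strip()

print(repr(greedy))
print(repr(bounded))
```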

cirfu · Jun 12, 2008

Maybe something like this would be of use...

yes, seems very unpythonic though

there must be a simpler way that isn't slow as hell.

alfasub000 · Jun 12, 2008

Why wouldn't you use character classes instead of groups? i.e.:

pat = re.compile(r'([ \w]*Zlatan Ibrahimovic[ \w]*)')
# search() rather than match(), since the sentence need not start the text
sentence = pat.search(text).groups()

As has been mentioned earlier, certain evil combinations of regular
expressions and groups will cause python's regular expression engine
to go (righteously) crazy as they require the internal state machine
to branch out exponentially.
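
A slightly fuller sketch of the character-class approach, assuming Python 3. The function name `find_sentence` is invented here, and case-insensitive matching is added because the text in the thread spells the name both ways. It also handles the case where nothing matches, since calling `.groups()` on a failed search would raise `AttributeError`.

```python
import re

def find_sentence(text, name):
    """Return the first run of words and spaces around `name`, or None.

    Hypothetical helper. Any character other than a word character or a
    space (such as '.' or '?') ends the match on either side.
    """
    pattern = re.compile(r"[ \w]*" + re.escape(name) + r"[ \w]*", re.IGNORECASE)
    found = pattern.search(text)
    return found.group(0).strip() if found else None

sample = "are you crazy? i think Zlatan Ibrahimovic is great. bye"
print(find_sentence(sample, "zlatan ibrahimovic"))
print(find_sentence(sample, "nobody"))
```

A single character class gives the engine only one way to consume a given run of words, so a failed attempt from any starting position costs at most one pass over the rest of the text. The search as a whole can still be quadratic in the worst case, but not exponential.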
