regexp search question

Paul Rubin · Oct 23, 2003

I have a string s, possibly megabytes in size, and two regexps, p and q.

I want to find the first occurrence of q that occurs after the first
occurrence of p.

Is there a reasonable way to do it?

g1 = re.search(p, s)
g2 = re.search(q, s[g1.end():])
q_offset = g1.end() + g2.start()

is not a reasonable way, since it copies a ton of data around
(slicing an arbitrary sized chunk off s into a new temporary string).

Most regexp libs I know of have a way to start the search at a
specified offset. Python's string.find and string.index methods
have a similar optional arg. But I don't see it described in the
re module docs.

Am I missing something?

Thanks.

Francis Avila · Oct 23, 2003

Paul Rubin said:
I have a string s, possibly megabytes in size, and two regexps, p and q.

I want to find the first occurrence of q that occurs after the first
occurrence of p.

Is there a reasonable way to do it?

g1 = re.search(p, s)
g2 = re.search(q, s[g1.end():])
q_offset = g1.end() + g2.start()

is not a reasonable way, since it copies a ton of data around
(slicing an arbitrary sized chunk off s into a new temporary string).

Most regexp libs I know of have a way to start the search at a
specified offset. Python's string.find and string.index methods
have a similar optional arg. But I don't see it described in the
re module docs.

Am I missing something?

Yes: you can specify an offset, but only in the search METHOD of compiled
re objects, not in the module-level re.search function (with the function,
all you can do is slice the string, see?)

Alternative 1:
Instead of slicing the string, make a buffer object that refers to a
slice of the string (using the buffer() builtin).
NOTE: Don't do this!

Alternative 2:
Compile regular expression objects for p and q, instead of calling the
module-level functions. Since I don't know the implementation details of re,
I don't know if the start/end args to REOBJECT.search will copy the string or
use a buffer, so that may not be different from what you're doing. However,
compiling the re will certainly be faster if you do this search more than once.
(NOTE: untested code!)

p = re.compile(ppattern)
q = re.compile(qpattern)
matchp = p.search(somestring)
pend = matchp.end()
matchq = q.search(somestring, pend)
qstart = matchq.start()

Now I'm not sure if matchq.start() returns index from the substring or the
whole string. You'll just have to try it and see...

if counts from substring:
    offset = matchq.pos + matchq.start()  # == matchp.end() + matchq.start()
else:
    offset = matchq.start()
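
To settle the question above: in Python's re module, the start() and end() of a match returned by a compiled pattern's search(string, pos) are indices into the whole string, so the "else" branch applies. A minimal sketch, using a made-up sample string and literal patterns as stand-ins for the real p and q:

```python
import re

s = "aaass_abcsd_stuqwer_stu"   # sample text, made up for illustration
p = re.compile(r"abc")          # stand-in for the real p
q = re.compile(r"stu")          # stand-in for the real q

matchp = p.search(s)
matchq = q.search(s, matchp.end())   # second argument is pos, the start offset

# start() is an index into s itself, not into s[pos:]
q_offset = matchq.start()
print(q_offset, s[q_offset:q_offset + 3])   # 12 stu
```

One documented difference from slicing: with pos, '^' still matches only at the real beginning of the string (or after a newline in MULTILINE mode), not necessarily at the index where the search starts.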

Alternative 3:
You could probably combine p and q into a single regexp specifying that you
match p, then q, with anything in between. Using groups (p is grp 1, q is
grp 2), get your offset with matchpq.end(1) + matchpq.start(2)

There are probably many other ways.

Thanks.

No problem.

Francis Avila · Oct 23, 2003

Francis Avila said:
Alternative 3:
You could probably combine p and q into a single regexp specifying that you
match p, then q, with anything inbetween. Using groups (p is grp 1, q is
grp 2), get your offset with matchpq.end(1) + matchpq.start(2)

Gah, that's wrong: the offset of q will be in matchpq.start(2).
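
A rough sketch of the combined approach, not tested against the real patterns. The function name and the group names "p" and "q" are made up for illustration; named groups are used so the lookup does not depend on how many groups p itself contains. It assumes p and q do not define groups with those names and do not use numbered backreferences (the wrapper groups would shift the numbering).

```python
import re

def q_offset_combined(p, q, s):
    """Offset of q in the first 'p ... q' match, or None if there is none."""
    # Wrapping each pattern in a group keeps alternations such as 'a|b'
    # contained. DOTALL lets '.' cross newlines in the gap between p and q.
    combined = re.compile(r"(?P<p>%s).*?(?P<q>%s)" % (p, q), re.DOTALL)
    m = combined.search(s)
    if m is None:
        return None
    return m.start("q")   # already an index into s

print(q_offset_combined(r"abc", r"stu", "aaass_abcsd_stuqwer_stu"))
```

This is not always identical to searching twice: in a combined pattern the engine may backtrack into p and settle on a different p match in order to let q succeed, whereas the two-step search fixes the p match first.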

Paul Rubin · Oct 23, 2003

Francis Avila said:
Yes: you can specify an offset, but only in the search METHOD (of re
objects), not the search function (for that, you just use slicing of the
string, see?)

Thanks, this is what I wanted. I missed it when first looking at the
doc. I just need to compile the regexp separately. Slight nuisance
but no big deal.
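
For what it's worth, the compile-then-search steps fit in a small helper. This is only a sketch: the name find_q_after_p is made up, and returning None when either pattern fails to match is an assumption about the desired behaviour.

```python
import re

def find_q_after_p(p, q, s):
    """Offset of the first q match at or after the end of the first p match.

    Returns None if p does not match, or if q does not match after it.
    """
    p_re = re.compile(p)
    q_re = re.compile(q)
    g1 = p_re.search(s)
    if g1 is None:
        return None
    g2 = q_re.search(s, g1.end())   # pos argument, so no slicing of s
    if g2 is None:
        return None
    return g2.start()               # index into s, not into a slice

print(find_q_after_p(r"abc", r"stu", "aaass_abcsd_stuqwer_stu"))
```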

Donald 'Paddy' McCarthy · Oct 23, 2003

Paul said:
I have a string s, possibly megabytes in size, and two regexps, p and q.

I want to find the first occurrence of q that occurs after the first
occurrence of p.

Is there a reasonable way to do it?

g1 = re.search(p, s)
g2 = re.search(q, s[g1.end():])
q_offset = g1.end() + g2.start()

is not a reasonable way, since it copies a ton of data around
(slicing an arbitrary sized chunk off s into a new temporary string).

Most regexp libs I know of have a way to start the search at a
specified offset. Python's string.find and string.index methods
have a similar optional arg. But I don't see it described in the
re module docs.

Am I missing something?

Thanks.

Can't you just combine the two regexps? For example, if p='abc' and
q='stu', can't you compile and match against something like the following:
import re
pq=re.compile(r'abc.*?(stu)')
s=pq.search('aaass_abcsd_stuqwer_stu')
s.start(1)  # 12, the offset of the first 'stu' after 'abc'
Notice I used .*?, the non-greedy match, to return the first occurrence
of q after p.
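
One caveat if the megabyte string spans several lines: by default '.' does not match a newline, so .*? cannot bridge a line break between p and q. A small sketch with a made-up two-line sample, showing the effect of the re.DOTALL flag:

```python
import re

text = "abc\nsd_stu"   # made-up sample with a newline between p and q

plain = re.compile(r"abc.*?(stu)")
dotall = re.compile(r"abc.*?(stu)", re.DOTALL)

print(plain.search(text))             # None: '.' stops at the newline
print(dotall.search(text).start(1))   # 7
```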
