regexp search question

Paul Rubin

I have a string s, possibly megabytes in size, and two regexps, p and q.

I want to find the first occurrence of q that occurs after the first
occurrence of p.

Is there a reasonable way to do it?

g1 = re.search(p, s)
g2 = re.search(q, s[g1.end():])
q_offset = g1.end() + g2.start()

is not a reasonable way, since it copies a ton of data around
(slicing an arbitrary sized chunk off s into a new temporary string).

Most regexp libs I know of have a way to start the search at a
specified offset. Python's string.find and string.index methods
have a similar optional arg. But I don't see it described in the
re module docs.

Am I missing something?

Thanks.
 
Francis Avila

Paul Rubin said:
I have a string s, possibly megabytes in size, and two regexps, p and q.

I want to find the first occurrence of q that occurs after the first
occurrence of p.

Is there a reasonable way to do it?

g1 = re.search(p, s)
g2 = re.search(q, s[g1.end():])
q_offset = g1.end() + g2.start()


is not a reasonable way, since it copies a ton of data around
(slicing an arbitrary sized chunk off s into a new temporary string).

Most regexps libs I know of have a way to start the search at a
specified offset. Python's string.find and string.index methods
have a similar optional arg. But I don't see it described in the
re module docs.

Am I missing something?

Yes: you can specify an offset, but only in the search METHOD of compiled
re objects, not in the module-level search function (with that one, all you
can do is slice the string, see?)


Alternative 1:
Instead of slicing the string, make a buffer object that refers to a
slice of the string (using the buffer() builtin).
NOTE: Don't do this!

Alternative 2:
Compile regular expression objects for p and q, instead of calling the
module-level functions. Since I don't know the implementation details of re,
I don't know if the start/end args to REOBJECT.search will copy the string or
use a buffer--so that may not be different from what you're doing. However,
compiling the re should also be faster if you do this search more than once.
(NOTE: untested code!)

p = re.compile(ppattern)
q = re.compile(qpattern)
matchp = p.search(somestring)
pend = matchp.end()
matchq = q.search(somestring, pend)
qstart = matchq.start()

Now I'm not sure if matchq.start() returns index from the substring or the
whole string. You'll just have to try it and see...

if it counts from the substring:
    offset = matchq.pos + matchq.start()  # == matchp.end() + matchq.start()
else:
    offset = matchq.start()
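
For what it's worth, here is a minimal sketch of Alternative 2. The function name `first_q_after_p` and its parameter names are made up for illustration. It relies on the documented behaviour of the optional `pos` argument to a compiled pattern's `search` method: the scan starts at that index without slicing the string, and `start()` on the resulting match is an index into the whole string, so no adjustment should be needed.

```python
import re

def first_q_after_p(p_pattern, q_pattern, s):
    """Return the offset in s of the first match of q_pattern that begins
    at or after the end of the first match of p_pattern, or None."""
    p = re.compile(p_pattern)
    q = re.compile(q_pattern)

    match_p = p.search(s)
    if match_p is None:
        return None

    # The second argument (pos) starts the scan at that index; s is not sliced.
    match_q = q.search(s, match_p.end())
    if match_q is None:
        return None

    # start() is relative to the whole string, not to pos.
    return match_q.start()

s = "aaass_abcsd_stuqwer_stu"
offset = first_q_after_p(r"abc", r"stu", s)
```

One difference from slicing that may matter: with `pos`, a `^` in q still anchors at the real beginning of the string (or after a newline in multiline mode), not at the index where the search starts.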

Alternative 3:
You could probably combine p and q into a single regexp specifying that you
match p, then q, with anything in between. Using groups (p is grp 1, q is
grp 2), get your offset with matchpq.end(1) + matchpq.start(2)

There are probably many other ways.


No problem.
 
Francis Avila

Francis Avila said:
Alternative 3:
You could probably combine p and q into a single regexp specifying that you
match p, then q, with anything in between. Using groups (p is grp 1, q is
grp 2), get your offset with matchpq.end(1) + matchpq.start(2)

Gah, that's wrong: the offset of q will be in matchpq.start(2).
 
Paul Rubin

Francis Avila said:
Yes: you can specify an offset, but only in the search METHOD (of re
objects), not the search function (for that, you just use slicing of the
string, see?)

Thanks, this is what I wanted. I missed it when first looking at the
doc. I just need to compile the regexp separately. Slight nuisance
but no big deal.
 
Donald 'Paddy' McCarthy

Paul said:
I have a string s, possibly megabytes in size, and two regexps, p and q.

I want to find the first occurrence of q that occurs after the first
occurrence of p.

Is there a reasonable way to do it?

g1 = re.search(p, s)
g2 = re.search(q, s[g1.end():])
q_offset = g1.end() + g2.start()

is not a reasonable way, since it copies a ton of data around
(slicing an arbitrary sized chunk off s into a new temporary string).

Most regexps libs I know of have a way to start the search at a
specified offset. Python's string.find and string.index methods
have a similar optional arg. But I don't see it described in the
re module docs.

Am I missing something?

Thanks.

Can't you just combine the two regexps? For example, if p='abc' and
q='stu', can't you compile and match against something like the following:
import re
pq=re.compile(r'abc.*?(stu)')
s=pq.search('aaass_abcsd_stuqwer_stu')
s.start(1)
Notice I used .*?, the non-greedy match, to return the first occurrence
of q after p.
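
One caveat with the combined pattern: by default `.` does not match a newline, so on a multi-line string the gap between p and q cannot cross a line break unless `re.DOTALL` is set. Here is a small sketch using the same toy patterns as above, with a newline added to the sample string purely for illustration.

```python
import re

# Same toy patterns as above; re.DOTALL lets ".*?" span line breaks.
pq = re.compile(r"abc.*?(stu)", re.DOTALL)

text = "aaass_abcsd\n_stuqwer_stu"  # made-up sample containing a newline
m = pq.search(text)
q_offset = m.start(1)  # offset of group 1 (q) within the whole string

# Without DOTALL, the same search finds nothing on this input,
# because ".*?" stops at the newline between "abc" and "stu".
no_dotall = re.search(r"abc.*?(stu)", text)
```

Whether this is preferable to two separate searches probably depends on the patterns; the two-step approach keeps p and q independent, while the combined pattern needs p and q to be safe to embed in one expression.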
 
