re.match -- not greedy?

E

EXI-Andrews, Jack

the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:
('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching

in real life, i am trying to match #defines:
re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.

so what's the definition of greedy?

(pls copy me on responses)

ta,


jack
 
C

Carl Banks

the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:

('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching

It's called backtracking. If a match fails, the regexp engine
recursively unwinds some of the greedy matching and tries again.
in real life, i am trying to match #defines:
re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.

Probably the easiest thing to do is to attempt to match zero or one
opening parentheses, and bail out if it did end up matching the
parenthesis.

m = re.match(r"#define ([A-Za-z0-9_]+)(\(?)","#define abc(d)")
if m and m.groups(2) != "(":
...

(BTW, it's good practice to use a raw strings for the regular
expressions as I did.)
so what's the definition of greedy?

It means match as much you can *while getting a match*.

Just remember regexp parsing is not a one way street: if it hits a dead
end, it'll go back and try another path. In fact, greediness usually
doesn't affect *whether* a regexp matches; it only affects the
groupings. I'm suddenly curious if there are any cases at all where
greediness changes whether it finds a match.
(pls copy me on responses)

Ah, too bad you won't see it then....


Carl Banks
 
G

Guest

* Carl Banks said:
I'm suddenly curious if there are any cases at all where
greediness changes whether it finds a match.

nope, per definitionem ;-)

greedy := longest leftmost match

nd
 
N

Noah Rawlins

Carl said:
the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:

('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching

It's called backtracking. If a match fails, the regexp engine
recursively unwinds some of the greedy matching and tries again.
in real life, i am trying to match #defines:
re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.

Probably the easiest thing to do is to attempt to match zero or one
opening parentheses, and bail out if it did end up matching the
parenthesis.

m = re.match(r"#define ([A-Za-z0-9_]+)(\(?)","#define abc(d)")
if m and m.groups(2) != "(":
...

(BTW, it's good practice to use a raw strings for the regular
expressions as I did.)

Another way that seems clearer to me is to use negative lookahead
assertions.
('foo_BarBaz',)

In general '\w' is the same as [A-Za-z0-9_]

There are other considerations too... I don't know if

#define abc (x)

But the main thing here is the use of '\b' to require there to be a word
boundary at the end your match and a (?!\() to require that the match is
not followed by a '('

see http://docs.python.org/lib/re-syntax.html

noah
 
P

Paul McGuire

in real life, i am trying to match #defines:
re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.
Here's a pyparsing grammar that matches your #defines as you describe (skip
over #defines with parameters), and also skips over ones that are commented
out. It also parses the rest of the definition line (definitions with
end-of-line '\' continuations not included in this sample):

from pyparsing import *
defWithoutParen = "#define" + \
Word(alphas,alphanums+"_") + \
~Literal("(").leaveWhitespace() + \
Optional(restOfLine)
defWithoutParen.ignore(cStyleComment)

Here's sample code showing how to use this grammar:

sample = """
#define PI 3.141592
#define SUCCESS (href==1)
#define MIN(a,b) a<b ? a : b
#define E 2.71828
/*
#define lockMutex mockLockMutex
*/
#define WIN32
"""
matches = defWithoutParen.searchString(sample)
for m in matches:
print m

Prints out:
['#define', 'PI', '3.141592']
['#define', 'SUCCESS', '(href==1)']
['#define', 'E', '2.71828']
['#define', 'WIN32', '']

-- Paul
 
F

Fredrik Lundh

Carl said:
> I'm suddenly curious if there are any cases at all where greediness
> changes whether it finds a match.

the "|" operator picks the leftmost subpattern that matches in any way,
so the *overall* greediness of a Python RE can vary. e.g.

re.match("(ab?a|a.a+)", "abaaaaaaaa")

matches three characters, not ten.

</F>
 
A

Ant

the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:

('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching

They are greedy - they consume as much as possible *without*
sacrificing the match.

You are thinking I think of possessive quantifiers, such as found in
Java, which *will* match as much as possible even if doing so causes
the match to fail. This would be written:

re.match('(a++)(ab)','aaab')

Python doesn't support these possessive quantifiers.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,989
Messages
2,570,207
Members
46,783
Latest member
RickeyDort

Latest Threads

Top