re question - finiding matching ()

M

Miki Tebeka

Hello All,

To all of you regexp gurus out there...

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2), 100)"),

Currently I can only get the "addr((2 *2)" using re.compile("\w+\([^\)]*\)").
To solve the problem a hand crafted search is used :-(

Is there a better way?

Thanks.
Miki
 
M

Mark McEahern

Hello All,

To all of you regexp gurus out there...

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2), 100)"),

The pattern you have is so simple and the necessary regex (it seems to
me) so unnecessarily complex (for a 100% regex solution) that I think
using only regex's for this problem is overkill. Anyway, here's a
test-based approach:

#!/usr/bin/env python

import re
import unittest

def findOperands(verb, text):
"""Find operands in text for verb.

E.g., findOperands('add', 'add(1, 2, 3)') == ['1', '2', '3']

"""
raw = '%s\((.*)\)' % (verb,)
pat = re.compile(raw)
matches = pat.findall(text)
if not matches:
raise RuntimeError('No matches found for pattern %s.' % (raw,))
assert len(matches) == 1
match = matches[0]
operands = match.split(',')
for i, item in enumerate(operands):
operands = item.strip()
return operands

class test(unittest.TestCase):

def test(self):
text = 'add(2, 3)'
operands = findOperands('add', text)
expected = ['2', '3']
self.assertEquals(operands, expected)

text = 'add((2 * 2), 100)'
operands = findOperands('add', text)
expected = ['(2 * 2)', '100']
self.assertEquals(operands, expected)

text = 'add(1, 2, 3)'
operands = findOperands('add', text)
expected = ['1', '2', '3']
self.assertEquals(operands, expected)

text = 'multiply(1, 2, 3, 4)'
operands = findOperands('multiply', text)
expected = ['1', '2', '3', '4']
self.assertEquals(operands, expected)

text = 'add 2, 3'
self.assertRaises(RuntimeError, findOperands, 'add', text)

unittest.main()
 
J

Jeff Epler

Regular Expressions cannot perform the simple task of matching an
arbitrary number of parentheses. You could write an expression that
will work as long as the nesting isn't more than N levels, but the
expression quickly becomes very painful.

Instead, you can use some method to split the string into parts: one
part for "(", one for ")" and one for any other sequence of characters:

tokenize_re = re.compile("\(|\)|[^()]*")

Now, use this expression to tokenize your input string:
['add', '(', '(', '2 * 2', ')', ', 100', ')']
To match your language, the first token must be 'add':
if tokens[0] != 'add': # no match
The second token must be '(':
if tokens[1] != '(': # no match
Now, start scanning and counting parens, until you get back to 0
nesting_level = 0
for t in tokens[1:]:
if t == ')':
nesting_level -= 1
if nesting_level == 0:
# End of desired string
else if t == '(':
nesting_level += 1
# if you end the loop without reaching nesting_level 0
# then there were not enough close-parens
# no match
You could also implement this part recursively (as you likely would if
you were actually compiling the string into bytecode).

Jeff
 
J

Jeff Epler

Perhaps you'd like to enhance your test-suite:
# Do improperly-nested parens cause an exception? (part 1)
text = 'add((1,2,3)'
self.assertRaises(RuntimeError, findOperands, 'add', text)

# Do improperly-nested parens cause an exception? (part 2)
text = 'add(1,2,3))'
self.assertRaises(RuntimeError, findOperands, 'add', text)

# Do multiple statements on one line behave as expected?
text = 'add(1,2); add(1,2)'
expected = ['2', '3']
self.assertEquals(operands, expected)

Jeff
 
C

Christophe Delord

Hello All,

To all of you regexp gurus out there...

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2),
100)"),

Currently I can only get the "addr((2 *2)" using
re.compile("\w+\([^\)]*\)"). To solve the problem a hand crafted
search is used :-(

Is there a better way?

Thanks.
Miki


Hello,

You may need "recursive patterns" to do this but regular expressions
cannot handle this. You can simulate recursive pattern by limiting the
recursivity level. For example the expression inside () should be
[^\(\)]+ at level 0. At level 1, you can match only zero or one pair:
(?:\([^\(\)]+\)|[^\(\)]+)* and so on.

You can build such an expression recursively:

def make_par_re(level=6):
if level < 1:
return r'[^\(\)]+'
else:
return r'(?:\(%s\)|%s)*'%(make_par_re(level-1), make_par_re(0))

par_re = re.compile(r"\w+\(%s\)"%make_par_re())

But in this case you are limited to 6 levels.

Now you can try this :

for m in par_re.findall("add((2*2), 100) some text sub(a, b*(10-c),
f(g(a,b), h(c, d)))"):
print m

I don't really like this solution because the expressions are ugly (try
print make_par_re(6)).

Anyway a better solution would be to use a syntactic parser. You can
write your own by hand or make your choice here:
http://www.python.org/sigs/parser-sig/



Best regards,
Christophe.
 
P

Peter Otten

Miki said:
To all of you regexp gurus out there...

So not asking me, but anyway...
I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2),
100)"),

Currently I can only get the "addr((2 *2)" using
re.compile("\w+\([^\)]*\)"). To solve the problem a hand crafted search is
used :-(

Is there a better way?

Iff you are looking into Python code, you could also use the compiler
module. The following script scans itself for occurences of add() function
calls.

<example.py>
import compiler

def sample():
""" add(xxx) <- will not be found """
x.add(1) # <- will not be found
add(a*(b+c))
add(a, b)
add()
add((1*1),2)+add((2))

class Visitor:
def visitCallFunc(self, node):
if getattr(node.getChildren()[0], "name", None) == "add":
print node

tree = compiler.parseFile(__file__)

compiler.visitor.walk(tree, Visitor())
</example.py>

Peter
 
P

Paul Rubin

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2), 100)"),

You can't do that with classic regexps. You need a parser, not a scanner.
 
D

Dan Bishop

Jeff Epler said:
Regular Expressions cannot perform the simple task of matching an
arbitrary number of parentheses. You could write an expression that
will work as long as the nesting isn't more than N levels, but the
expression quickly becomes very painful.

Instead, you can use some method to split the string into parts: one
part for "(", one for ")" and one for any other sequence of characters: ....
Now, start scanning and counting parens, until you get back to 0

That's the way I'd recommend. However, it doesn't generalize to cases
where there are multiple types of parentheses. For that situation,
you can use:

LEFT_PARENS = '([{'
RIGHT_PARENS = ')]}'
PAREN_MATCHES = dict(zip(RIGHT_PARENS, LEFT_PARENS))

def balancedParens(tokens):
parenStack = []
for token in tokens:
if token in LEFT_PARENS:
parenStack.append(token)
elif token in RIGHT_PARENS:
if parenStack:
correspondingLeftParen = parenStack.pop()
if PAREN_MATCHES[token] != correspondingLeftParen:
return False
else:
return False
return True

(There's probably a simpler way, but I can't think of one right now.)
 
P

Paul McGuire

S

Samuel Walters

| Paul Rubin said |
You can't do that with classic regexps. You need a parser, not a scanner.

Or, to put it slightly differently:

Regexps are a type of system that is not mathematically powerful enough to
handle this type of identification. The next step up in power are
parsers.

Sam Walters.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,175
Messages
2,570,946
Members
47,497
Latest member
PilarLumpk

Latest Threads

Top