re question - finiding matching ()

Miki Tebeka · Jan 18, 2004

Hello All,

To all of you regexp gurus out there...

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2), 100)"),

Currently I can only get the "addr((2 *2)" using re.compile("\w+\([^\)]*\)").
To solve the problem a hand crafted search is used :-(

Is there a better way?

Thanks.
Miki

Mark McEahern · Jan 18, 2004

Hello All,

To all of you regexp gurus out there...

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2), 100)"),

The pattern you have is so simple and the necessary regex (it seems to
me) so unnecessarily complex (for a 100% regex solution) that I think
using only regex's for this problem is overkill. Anyway, here's a
test-based approach:

#!/usr/bin/env python

import re
import unittest

def findOperands(verb, text):
"""Find operands in text for verb.

E.g., findOperands('add', 'add(1, 2, 3)') == ['1', '2', '3']

"""
raw = '%s\((.*)\)' % (verb,)
pat = re.compile(raw)
matches = pat.findall(text)
if not matches:
raise RuntimeError('No matches found for pattern %s.' % (raw,))
assert len(matches) == 1
match = matches[0]
operands = match.split(',')
for i, item in enumerate(operands):
operands = item.strip()
return operands

class test(unittest.TestCase):

def test(self):
text = 'add(2, 3)'
operands = findOperands('add', text)
expected = ['2', '3']
self.assertEquals(operands, expected)

text = 'add((2 * 2), 100)'
operands = findOperands('add', text)
expected = ['(2 * 2)', '100']
self.assertEquals(operands, expected)

text = 'add(1, 2, 3)'
operands = findOperands('add', text)
expected = ['1', '2', '3']
self.assertEquals(operands, expected)

text = 'multiply(1, 2, 3, 4)'
operands = findOperands('multiply', text)
expected = ['1', '2', '3', '4']
self.assertEquals(operands, expected)

text = 'add 2, 3'
self.assertRaises(RuntimeError, findOperands, 'add', text)

unittest.main()

Jeff Epler · Jan 18, 2004

Regular Expressions cannot perform the simple task of matching an
arbitrary number of parentheses. You could write an expression that
will work as long as the nesting isn't more than N levels, but the
expression quickly becomes very painful.

Instead, you can use some method to split the string into parts: one
part for "(", one for ")" and one for any other sequence of characters:

tokenize_re = re.compile("\(|\)|[^()]*")

Now, use this expression to tokenize your input string:
['add', '(', '(', '2 * 2', ')', ', 100', ')']
To match your language, the first token must be 'add':
if tokens[0] != 'add': # no match
The second token must be '(':
if tokens[1] != '(': # no match
Now, start scanning and counting parens, until you get back to 0
nesting_level = 0
for t in tokens[1:]:
if t == ')':
nesting_level -= 1
if nesting_level == 0:
# End of desired string
else if t == '(':
nesting_level += 1
# if you end the loop without reaching nesting_level 0
# then there were not enough close-parens
# no match
You could also implement this part recursively (as you likely would if
you were actually compiling the string into bytecode).

Jeff

Jeff Epler · Jan 18, 2004

Perhaps you'd like to enhance your test-suite:
# Do improperly-nested parens cause an exception? (part 1)
text = 'add((1,2,3)'
self.assertRaises(RuntimeError, findOperands, 'add', text)

# Do improperly-nested parens cause an exception? (part 2)
text = 'add(1,2,3))'
self.assertRaises(RuntimeError, findOperands, 'add', text)

# Do multiple statements on one line behave as expected?
text = 'add(1,2); add(1,2)'
expected = ['2', '3']
self.assertEquals(operands, expected)

Jeff

Christophe Delord · Jan 18, 2004

Hello All,

To all of you regexp gurus out there...

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2),
100)"),

Currently I can only get the "addr((2 *2)" using
re.compile("\w+\([^\)]*\)"). To solve the problem a hand crafted
search is used :-(

Is there a better way?

Thanks.
Miki

Hello,

You may need "recursive patterns" to do this but regular expressions
cannot handle this. You can simulate recursive pattern by limiting the
recursivity level. For example the expression inside () should be
[^\(\)]+ at level 0. At level 1, you can match only zero or one pair:
(?:\([^\(\)]+\)|[^\(\)]+)* and so on.

You can build such an expression recursively:

def make_par_re(level=6):
if level < 1:
return r'[^\(\)]+'
else:
return r'(?:\(%s\)|%s)*'%(make_par_re(level-1), make_par_re(0))

par_re = re.compile(r"\w+\(%s\)"%make_par_re())

But in this case you are limited to 6 levels.

Now you can try this :

for m in par_re.findall("add((2*2), 100) some text sub(a, b*(10-c),
f(g(a,b), h(c, d)))"):
print m

I don't really like this solution because the expressions are ugly (try
print make_par_re(6)).

Anyway a better solution would be to use a syntactic parser. You can
write your own by hand or make your choice here:
http://www.python.org/sigs/parser-sig/

Best regards,
Christophe.

Peter Otten · Jan 18, 2004

Miki said:
To all of you regexp gurus out there...

So not asking me, but anyway...

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2),
100)"),

Currently I can only get the "addr((2 *2)" using
re.compile("\w+\([^\)]*\)"). To solve the problem a hand crafted search is
used :-(

Is there a better way?

Iff you are looking into Python code, you could also use the compiler
module. The following script scans itself for occurences of add() function
calls.

<example.py>
import compiler

def sample():
""" add(xxx) <- will not be found """
x.add(1) # <- will not be found
add(a*(b+c))
add(a, b)
add()
add((1*1),2)+add((2))

class Visitor:
def visitCallFunc(self, node):
if getattr(node.getChildren()[0], "name", None) == "add":
print node

tree = compiler.parseFile(__file__)

compiler.visitor.walk(tree, Visitor())
</example.py>

Peter

Paul Rubin · Jan 18, 2004

I'd like to find all of the sub strings in the form "add(.*)"
The catch is that I might have () in the string (e.g. "add((2 * 2), 100)"),

You can't do that with classic regexps. You need a parser, not a scanner.

Dan Bishop · Jan 18, 2004

Jeff Epler said:
Regular Expressions cannot perform the simple task of matching an
arbitrary number of parentheses. You could write an expression that
will work as long as the nesting isn't more than N levels, but the
expression quickly becomes very painful.

Instead, you can use some method to split the string into parts: one
part for "(", one for ")" and one for any other sequence of characters: ....
Now, start scanning and counting parens, until you get back to 0

That's the way I'd recommend. However, it doesn't generalize to cases
where there are multiple types of parentheses. For that situation,
you can use:

LEFT_PARENS = '([{'
RIGHT_PARENS = ')]}'
PAREN_MATCHES = dict(zip(RIGHT_PARENS, LEFT_PARENS))

def balancedParens(tokens):
parenStack = []
for token in tokens:
if token in LEFT_PARENS:
parenStack.append(token)
elif token in RIGHT_PARENS:
if parenStack:
correspondingLeftParen = parenStack.pop()
if PAREN_MATCHES[token] != correspondingLeftParen:
return False
else:
return False
return True

(There's probably a simpler way, but I can't think of one right now.)

Paul McGuire · Jan 19, 2004

Christophe Delord said:
On 18 Jan 2004 07:51:38 -0800, Miki Tebeka wrote:

Anyway a better solution would be to use a syntactic parser. You can
write your own by hand or make your choice here:
http://www.python.org/sigs/parser-sig/

Best regards,
Christophe.

Another parser not listed on this page can be found at
http://pyparsing.sourceforge.net. The download includes an algebraic
expression parser, that handles parenthesis nesting.

-- Paul

Samuel Walters · Jan 19, 2004

| Paul Rubin said |

You can't do that with classic regexps. You need a parser, not a scanner.

Or, to put it slightly differently:

Regexps are a type of system that is not mathematically powerful enough to
handle this type of identification. The next step up in power are
parsers.

Sam Walters.

Using re.VERBOSE, and re-using components of regex?	1	Apr 17, 2013
Unicode: matching a	0	Nov 15, 2007
Weird problem matching with REs	11	May 29, 2011
Question about my projects	3	Jul 23, 2021
re question	2	Feb 7, 2008
Tasks	1	Nov 29, 2022
Unicode: matching a word and unaccenting characters	2	Nov 15, 2007
Get all strings matching given RegExp	3	Apr 3, 2008

re question - finiding matching ()

Miki Tebeka

Mark McEahern

Jeff Epler

Jeff Epler

Christophe Delord

Peter Otten

Paul Rubin

Dan Bishop

Paul McGuire

Samuel Walters

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads