Matching XML Tag Contents with Regex

Chris

I'm trying to find the contents of an XML tag. Nothing fancy. I don't
care about parsing child tags or anything. I just want to get the raw
text. Here's my script:

import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text!
</div>
<div class='default'>
here&apos;s some text!
</div>
<div class='default'>
here&apos;s some text!
</div>
</body>
"""

tagName = 'div'
pattern = re.compile('<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*' % dict(tagName=tagName))

matches = pattern.finditer(data)
for m in matches:
    contents = data[m.start():m.end()]
    print(repr(contents))
    assert tagName not in contents

The problem I'm running into is that the [^(%(tagName)s)]* portion of my
regex is being ignored, so only one match is being returned, starting
at the first <div> and ending at the end of the text, when it should
end at the first </div>. For this example, it should return three
matches, one for each div.

Is what I'm trying to do possible with Python's Regex library? Is
there an error in my Regex?

Thanks,
Chris
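
A small sketch of why the pattern in the question behaves this way, assuming Python 3 and made-up sample strings (the names not_div_chars, anything and sample are only for illustration). It is not a fix, just a way to see the two pieces in isolation: a bracket expression is a set of single characters rather than a word, and a class that contains both \d and \D matches every character.

```python
import re

# A bracket expression is a set of single characters, not a word, so
# [^(div)] means "any one character except '(', 'd', 'i', 'v' or ')'".
not_div_chars = re.compile(r"[^(div)]*")
stopped_at = not_div_chars.match("text</div>").group()
print(repr(stopped_at))  # stops before the first 'd', not before '</div>'

# Putting \d and \D in the same class covers every character, so
# the greedy * after it runs to the end of the input on its own.
sample = "<div>a</div>\n<div>b</div>"
anything = re.compile(r"[\d\D]*")
print(anything.match(sample).end() == len(sample))
```

So the negated class never gets a chance to end the match: the greedy class before it has already consumed the rest of the text.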
 
garage

Chris said:
Is what I'm trying to do possible with Python's Regex library? Is
there an error in my Regex?

Search for '*?' on http://docs.python.org/lib/re-syntax.html.

To get around the greedy single match, you can add a question mark
after the asterisk in the 'content' portion of the markup. This
causes it to take the shortest match instead of the longest, e.g.

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*

There's still some funkiness in the regex and logic, but this gives
you the three matches.
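
To see the greedy versus non-greedy difference on its own, here is a minimal sketch, assuming Python 3, a shortened two-div sample, and a simplified pattern that ends on a literal </div> (none of this is the pattern from the post above):

```python
import re

data = """
<body>
<div class='default'>
first
</div>
<div class='default'>
second
</div>
</body>
"""

# Greedy: .* runs to the last </div>, so both divs collapse into one match.
greedy = re.findall(r"<div\s[^>]*>(.*)</div>", data, re.DOTALL)

# Lazy: .*? stops at the first </div> it can reach, giving one match per div.
lazy = re.findall(r"<div\s[^>]*>(.*?)</div>", data, re.DOTALL)

print(len(greedy))
print([text.strip() for text in lazy])
```

re.DOTALL is there so that the dot also matches newlines, which takes the place of the catch-all character class. This sketch does not handle nested tags of the same name.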
 
harvey.thomas

print(re.findall(r'<%s(?=[\s/>])[^>]*>' % 'div', data))

["<div class='default'>", "<div class='default'>", "<div class='default'>"]

HTH

Harvey
 
Chris

garage said:
To get around the greedy single match, you can add a question mark
after the asterisk in the 'content' portion of the markup.

Thanks, that's pretty close to what I was looking for. How would I
filter out tags that don't have certain text in the contents? I'm
running into the same issue again. For instance, if I use the regex:

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*

each match will include "targettext". However, some matches will still
include </%(tagName)s>, presumably from the tags which didn't contain
targettext.
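
One way to keep a match from running past a closing tag, sketched under the assumption of Python 3 and flat (non-nested) tags, is a negative lookahead that refuses to step over </div>. The names tag, target and not_close, and the sample data, are hypothetical:

```python
import re

data = """
<div class='a'>nothing here</div>
<div class='b'>has targettext inside</div>
"""

tag = "div"            # hypothetical tag name
target = "targettext"  # hypothetical text to look for

# Match one character at a time, but only while the closing tag
# does not start at the current position.
not_close = r"(?:(?!</%s>).)*" % tag

pattern = re.compile(
    r"<%s\b[^>]*>(%s%s%s)</%s>" % (tag, not_close, re.escape(target), not_close, tag),
    re.DOTALL,
)

contents = pattern.findall(data)
print(contents)
```

Because the body can never cross a </div>, a div without the target text simply fails to match instead of borrowing text from the next one. Nested divs would still confuse it.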
 
Diez B. Roggisch

Chris said:
How would I filter out tags that don't have certain text in the contents?

Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
access HTML.

Diez
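
For comparison, a rough sketch of the parser route. lxml and BeautifulSoup are third-party packages, so this swaps in the standard library's xml.etree.ElementTree instead, assuming Python 3 and well-formed XML; the sample data and the "some text" filter are made up for illustration:

```python
import xml.etree.ElementTree as ET

# The XML declaration has to be the very first thing in the string.
data = """<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text!
</div>
<div class='default'>
nothing relevant
</div>
</body>
"""

root = ET.fromstring(data)

# div.text is the text before the first child element (None if there is none).
texts = [(div.text or "").strip() for div in root.iter("div")]

# Keep only the divs whose text contains the wanted string.
matching = [text for text in texts if "some text" in text]

print(texts)
print(matching)
```

The parser also decodes entities such as &apos; along the way, which the regex approach leaves untouched.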
 
Chris

Diez B. Roggisch said:
Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
access HTML.

I was hoping a simple pattern like <tag>.*text.*</tag> wouldn't be too
complicated for Regex, but now I'm starting to agree with you. Parsing
the entire XML DOM would probably be a lot easier.
 
Diez B. Roggisch

Chris said:
I was hoping a simple pattern like <tag>.*text.*</tag> wouldn't be too
complicated for Regex, but now I'm starting to agree with you. Parsing
the entire XML DOM would probably be a lot easier.

That's one of the common problems with regexes and XML/HTML. They start
out fast and easy, but at some point they blow up, or fail to fulfill
the task.

Diez
 
