Matching XML Tag Contents with Regex

Chris · Dec 11, 2007

I'm trying to find the contents of an XML tag. Nothing fancy. I don't
care about parsing child tags or anything. I just want to get the raw
text. Here's my script:

import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
</body>
"""

tagName = 'div'
pattern = re.compile('<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*' % dict(tagName=tagName))

matches = pattern.finditer(data)
for m in matches:
    contents = data[m.start():m.end()]
    print(repr(contents))
    assert tagName not in contents

The problem I'm running into is that the [^(%(tagName)s)]* portion of my
regex is being ignored, so only one match is being returned, starting
at the first <div> and ending at the end of the text, when it should
end at the first </div>. For this example, it should return three
matches, one for each div.

Is what I'm trying to do possible with Python's Regex library? Is
there an error in my Regex?

Thanks,
Chris

garage · Dec 11, 2007

Chris said:
Is what I'm trying to do possible with Python's Regex library? Is
there an error in my Regex?

Search for '*?' on http://docs.python.org/lib/re-syntax.html.

To get around the greedy single match, you can add a question mark
after the asterisk in the 'content' portion of the markup. This
causes it to take the shortest match instead of the longest, e.g.

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*

There's still some funkiness in the regex and logic, but this gives
you the three matches.
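
As an illustration of the point above, here is a minimal sketch comparing a greedy and a non-greedy quantifier on a cut-down version of the sample data. The simplified `.*` / `.*?` patterns with `re.DOTALL` are an assumption of this sketch, not the pattern from the thread. The last part shows one likely source of the "funkiness": `[^(div)]` is a character class that excludes single characters, not the word "div".

```python
import re

# Cut-down sample data (an assumption for this sketch).
data = """
<body>
<div class='default'>
first
</div>
<div class='default'>
second
</div>
</body>
"""

# Greedy: .* runs to the LAST </div>, so both divs collapse into one match.
greedy = re.findall(r"<div\s[^>]*>.*</div>", data, re.DOTALL)

# Non-greedy: .*? stops at the first </div> after each opening tag.
lazy = re.findall(r"<div\s[^>]*>.*?</div>", data, re.DOTALL)

# [^(div)] excludes the characters '(', 'd', 'i', 'v' and ')',
# so matching stops at the first 'd', not at the word "div".
stops_at = re.match(r"[^(div)]*", "some text</div>").group(0)

print(len(greedy), len(lazy))
print(repr(stops_at))
```

With the non-greedy form each div is matched on its own, which is the behaviour the original question asks for.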

harvey.thomas · Dec 11, 2007

print(re.findall(r'<%s(?=[\s/>])[^>]*>' % 'div', data))

["<div class='default'>", "<div class='default'>", "<div class='default'>"]

HTH

Harvey
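
The pattern above returns the opening tags only. A possible extension that returns the raw text between the tags is sketched below. The helper name `tag_contents` is hypothetical, and the sketch assumes the tags are neither nested nor self-closing, so the lookahead is narrowed to `[\s>]`.

```python
import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
</body>
"""


def tag_contents(tag_name, text):
    """Return the raw text between each <tag_name ...> and its closing tag.

    Hypothetical helper: assumes no nested or self-closing tag_name elements.
    """
    name = re.escape(tag_name)
    pattern = re.compile(
        # The lookahead keeps <div ...> from also matching e.g. <divider>.
        r"<%s(?=[\s>])[^>]*>(.*?)</%s\s*>" % (name, name),
        re.DOTALL,  # let . match newlines inside the element
    )
    # With one capture group, findall returns just the captured contents.
    return pattern.findall(text)


contents = tag_contents("div", data)
print(contents)
```

Capturing the contents in a group avoids slicing the match by hand, and `re.escape` guards against tag names that contain regex metacharacters.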

Chris · Dec 11, 2007

Thanks, that's pretty close to what I was looking for. How would I
filter out tags that don't have certain text in the contents? I'm
running into the same issue again. For instance, if I use the regex:

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*

each match will include "targettext". However, some matches will still
include </%(tagName)s>, presumably from the tags which didn't contain
targettext.
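
A rough sketch of why this happens, and one way around it while staying with regexes: a non-greedy `.*?` still expands past a closing tag if that is the only way to reach the target text. Splitting the job into two steps avoids this: isolate each element's contents first, then filter in plain Python. The sample data and variable names below are assumptions for illustration.

```python
import re

data = """
<body>
<div class='default'>
nothing to see
</div>
<div class='default'>
has targettext inside
</div>
</body>
"""

tag_name = "div"
name = re.escape(tag_name)

# One-step attempt: the match starts at the first <div> and the lazy .*?
# keeps growing, across </div>, until it finally reaches "targettext".
crossing = re.findall(r"<%s\s[^>]*>.*?targettext" % name, data, re.DOTALL)

# Two-step approach: capture each element's contents, then filter.
contents = re.findall(r"<%s\s[^>]*>(.*?)</%s>" % (name, name), data, re.DOTALL)
wanted = [c for c in contents if "targettext" in c]

print("</div>" in crossing[0])
print(wanted)
```

Like the earlier patterns, this assumes the elements are not nested.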

Diez B. Roggisch · Dec 11, 2007

Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
access HTML.

Diez
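
For comparison, a minimal sketch of the parser-based approach. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml or BeautifulSoup, which are the tools actually recommended above; lxml.etree offers a largely compatible API. The sample data is an assumption, and it starts directly with the XML declaration because the parser rejects a declaration that follows leading whitespace.

```python
import xml.etree.ElementTree as ET

# The XML declaration must be the very first thing in the document.
data = """<?xml version='1.0'?>
<body>
<div class='default'>
here's some text!
</div>
<div class='default'>
has targettext inside
</div>
</body>
"""

root = ET.fromstring(data)

# .text holds the text that precedes the element's first child (if any).
texts = [div.text for div in root.iter("div")]

# Filtering by content is an ordinary list comprehension.
wanted = [t for t in texts if t and "targettext" in t]

print(texts)
print(wanted)
```

If the elements contain child tags, `"".join(div.itertext())` collects all of the nested text instead of only the leading part.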

Chris · Dec 11, 2007

I was hoping a simple pattern like <tag>.*text.*</tag> wouldn't be too
complicated for a regex, but now I'm starting to agree with you. Parsing
the entire XML DOM would probably be a lot easier.

Diez B. Roggisch · Dec 11, 2007

That's one of the common problems with regexes and XML/HTML. They start
out fast and easy, but at some point they blow up or fail to fulfill
the task.

Diez
