RegExp Help

Sean DiZazzo · Dec 14, 2007

Hi group,

I'm wrapping up a command line util that returns xml in Python. The
util is flaky, and gives me back poorly formed xml with different
problems in different cases. Anyway I'm making progress. I'm not
very good at regular expressions though and was wondering if someone
could help with initially splitting the tags from the stdout returned
from the util.

I have the following example string, and am simply trying to split it
into two xml tags...

simplified = """2007-12-13 <tag1 attr1="text1" attr2="text2" /tag1>
\n2007-12-13 <tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>
\n"""

Basically I want the two tags, and to discard anything in between
using a reg exp. Like this:

tags = ['<tag1 attr1="text1" attr2="text2" /tag1>',
        '<tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>']

I've tried several approaches, some of which got close, but the
newline in the middle of one of the tags screwed it up. The closest
I've been is something like this:

retag = re.compile(r'<.+>*') # tried here with re.DOTALL as well
tags = retag.findall(simplified)

Can anyone help me?

~Sean

Sean DiZazzo · Dec 14, 2007

I found something that works, although I couldn't tell you why it
works.

retag = re.compile(r'<.+?>', re.DOTALL)
tags = retag.findall(simplified)

Why does that work?

~Sean

Marc 'BlackJack' Rintsch · Dec 14, 2007

I'm wrapping up a command line util that returns xml in Python. The
util is flaky, and gives me back poorly formed xml with different
problems in different cases. Anyway I'm making progress. I'm not
very good at regular expressions though and was wondering if someone
could help with initially splitting the tags from the stdout returned
from the util.

[…]

Can anyone help me?

Flaky XML is often produced by programs that treat XML as ordinary text
files. If you are starting to parse XML with regular expressions you are
making the very same mistake. XML may look somewhat simple but
producing correct XML and parsing it isn't. Sooner or later you stumble
across something that breaks producing or parsing the "naive" way.

Ciao,
Marc 'BlackJack' Rintsch

Sean DiZazzo · Dec 14, 2007

It's not really complicated xml so far, just tags with attributes.
Still, using different queries against the program sometimes offers
differing results...a few examples:

<id 123456 />
<tag name="foo" />
<tag2 name="foo" moreattrs="..." /tag2>
<tag3 name="foo" moreattrs="..." tag3/>

It's consistent (at least) in that the same query always returns the
same tag style. It's returned to stdout with some extra
useless information, so the original question was to help get to just
the tags. After getting the tags, I'm running them through some
functions to fix them, and then using elementtree to parse them and
get all the rest of the info.
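
The "fix, then parse" step described above could look roughly like the following. This is only a sketch: `fix_closing` is a hypothetical helper, not the function actually used in this thread, and it assumes every tag ends in one of the three closing styles shown (` />`, ` /name>`, ` name/>`). It does not repair the `<id 123456 />` form, which has a bare value where an attribute is expected.

```python
import re
import xml.etree.ElementTree as ET


def fix_closing(tag):
    """Rewrite a trailing ' /name>', ' name/>' or ' />' as a plain ' />'.

    Hypothetical helper: assumes the tag starts with '<name' and ends
    with one of those three closing styles.
    """
    name = re.match(r'<(\w+)', tag).group(1)
    ending = r'\s*(?:/%s|%s/|/)>\Z' % (re.escape(name), re.escape(name))
    return re.sub(ending, '', tag) + ' />'


raw_tags = [
    '<tag name="foo" />',
    '<tag2 name="foo" moreattrs="..." /tag2>',
    '<tag3 name="foo" moreattrs="..." tag3/>',
]

# Normalise each tag, then let ElementTree do the real parsing.
elements = [ET.fromstring(fix_closing(t)) for t in raw_tags]
for element in elements:
    print(element.tag, element.attrib)
```

Keeping the regular expression limited to the closing delimiter, and leaving attribute parsing to ElementTree, keeps the fragile part of the code small.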

There is no api, so this is what I have to work with. Is there a
better solution?

Thanks for your ideas.

~Sean

Gabriel Genellina · Dec 14, 2007

It's not really complicated xml so far, just tags with attributes.
Still, using different queries against the program sometimes offers
differing results...a few examples:

<id 123456 />
<tag name="foo" />
<tag2 name="foo" moreattrs="..." /tag2>
<tag3 name="foo" moreattrs="..." tag3/>

Ouch... only the second is well-formed XML. Most tools require at least a
well-formed document. You may try using BeautifulStoneSoup, included with
BeautifulSoup: http://crummy.com/software/BeautifulSoup/
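
One way to check that claim is to feed each sample to a strict parser. The sketch below uses the standard library's `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The four sample tags quoted above.
samples = [
    '<id 123456 />',
    '<tag name="foo" />',
    '<tag2 name="foo" moreattrs="..." /tag2>',
    '<tag3 name="foo" moreattrs="..." tag3/>',
]


def is_well_formed(fragment):
    """Return True if ElementTree can parse the fragment on its own."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


results = [is_well_formed(s) for s in samples]
print(results)
```

Only the second sample parses; the other three raise `ParseError`.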

I found something that works, although I couldn't tell you why it
works.
retag = re.compile(r'<.+?>', re.DOTALL)
tags = retag.findall(simplified)
Why does that work?

That means: "look for a less-than sign (<), followed by the shortest
sequence of (?) one or more (+) arbitrary characters (.), followed by a
greater-than sign (>)"
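
To see the difference in practice, here is a small sketch that runs the greedy and non-greedy patterns against the sample string from the original question. The variable names are illustrative. Note that `re.DOTALL` is what lets `.` match the newline inside the second tag.

```python
import re

# Sample input from the original question, written as adjacent literals.
simplified = (
    '2007-12-13 <tag1 attr1="text1" attr2="text2" /tag1>\n'
    '\n2007-12-13 <tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>\n'
    '\n'
)

greedy = re.compile(r'<.+>', re.DOTALL)    # longest possible match
lazy = re.compile(r'<.+?>', re.DOTALL)     # shortest possible match
lazy_no_dotall = re.compile(r'<.+?>')      # '.' stops at newlines

greedy_tags = greedy.findall(simplified)
lazy_tags = lazy.findall(simplified)
no_dotall_tags = lazy_no_dotall.findall(simplified)

print(len(greedy_tags))     # one match running from the first '<' to the last '>'
print(lazy_tags)            # both tags, each as its own match
print(no_dotall_tags)       # only the first tag; the second contains a newline
```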

If you never get nested tags, and never have a ">" inside an attribute,
that expression *might* work. But please try BeautifulStoneSoup; it uses a
lot of heuristics to guess the right structure. It doesn't always work,
but given your input, there isn't much one can do...
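
The ">" caveat is easy to reproduce. The sketch below uses a made-up input string, and the quote-aware pattern is only one possible workaround, not something proposed in this thread. It assumes attribute values always use balanced double quotes, and it still does nothing for nested tags.

```python
import re

# Hypothetical input with a '>' inside an attribute value.
sample = 'noise <tag name="a > b" /> more <id 123456 />'

lazy = re.compile(r'<.+?>', re.DOTALL)

# Treat a double-quoted string as one unit, so a '>' inside it
# cannot end the match early.
quote_aware = re.compile(r'<(?:"[^"]*"|[^>"])+>')

lazy_tags = lazy.findall(sample)
quote_aware_tags = quote_aware.findall(sample)

print(lazy_tags)         # first tag is cut off at the '>' inside the attribute
print(quote_aware_tags)  # both tags come back whole
```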

Sean DiZazzo · Dec 14, 2007

Thanks! I'll take a look at BeautifulStoneSoup today and see what I
get.

~Sean
