RegExp Help

Sean DiZazzo

Hi group,

I'm writing a Python wrapper around a command-line util that returns
XML. The util is flaky and gives me back poorly formed XML, with
different problems in different cases. Anyway, I'm making progress.
I'm not very good at regular expressions, though, and was wondering if
someone could help with the first step: splitting the tags out of the
stdout returned by the util.

I have the following example string, and am simply trying to split it
into two xml tags...

simplified = """2007-12-13 <tag1 attr1="text1" attr2="text2" /tag1>
\n2007-12-13 <tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>
\n"""

Basically I want the two tags, and to discard anything in between,
using a regular expression. Like this:

tags = ['<tag1 attr1="text1" attr2="text2" /tag1>',
        '<tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>']

I've tried several approaches, some of which got close, but the
newline in the middle of one of the tags screwed it up. The closest
I've been is something like this:

retag = re.compile(r'<.+>*') # tried here with re.DOTALL as well
tags = retag.findall(simplified)

Can anyone help me?

~Sean
 
Sean DiZazzo

I found something that works, although I couldn't tell you why it
works. :)

retag = re.compile(r'<.+?>', re.DOTALL)
tags = retag.findall(simplified)

Why does that work?

~Sean
 
Marc 'BlackJack' Rintsch

Flaky XML is often produced by programs that treat XML as ordinary text
files. If you are starting to parse XML with regular expressions you are
making the very same mistake. XML may look somewhat simple but
producing correct XML and parsing it isn't. Sooner or later you stumble
across something that breaks producing or parsing the "naive" way.

Ciao,
Marc 'BlackJack' Rintsch
 
Sean DiZazzo

It's not really complicated xml so far, just tags with attributes.
Still, using different queries against the program sometimes offers
differing results...a few examples:

<id 123456 />
<tag name="foo" />
<tag2 name="foo" moreattrs="..." /tag2>
<tag3 name="foo" moreattrs="..." tag3/>

It's consistent (at least) in that the same query always returns the
same tag style. The output goes to stdout along with some extra
useless information, so the original question was about getting down
to just the tags. After getting the tags, I run them through some
functions to fix them, and then use ElementTree to parse them and
get all the rest of the info.
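
A minimal sketch of that fix-then-parse step, covering only the four tag styles listed above. The helper name `fix_tag`, the pattern `TAG_RE`, and the attribute name `value` used for the bare `123456` are all assumptions made for illustration; the real fixing functions are not shown in this thread.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Tag name, then everything up to one of the closers seen in the examples:
# "/name>", "name/>" or a plain "/>".
TAG_RE = re.compile(r'^<(\w+)\s*(.*?)\s*(?:/\1|\1/|/)>$', re.DOTALL)

def fix_tag(raw):
    """Rewrite one malformed tag as a self-closing element, or return None."""
    m = TAG_RE.match(raw.strip())
    if m is None:
        return None
    name, rest = m.groups()
    if rest and '=' not in rest:
        # Bare value such as <id 123456 />; "value" is a made-up attribute name.
        rest = 'value=' + quoteattr(rest)
    return '<%s %s/>' % (name, rest) if rest else '<%s/>' % name

samples = [
    '<id 123456 />',
    '<tag name="foo" />',
    '<tag2 name="foo" moreattrs="..." /tag2>',
    '<tag3 name="foo" moreattrs="..." tag3/>',
]

fixed = [fix_tag(s) for s in samples]
parsed = [ET.fromstring(f) for f in fixed]
results = [(el.tag, el.attrib) for el in parsed]
```

Anything the pattern does not recognise comes back as `None` rather than being guessed at, so new tag styles show up as failures instead of silently wrong data.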

There is no api, so this is what I have to work with. Is there a
better solution?

Thanks for your ideas.

~Sean
 
Gabriel Genellina

Ouch... only the second is valid XML. Most tools require at least a
well-formed document. You may try using BeautifulStoneSoup, included with
BeautifulSoup http://crummy.com/software/BeautifulSoup/

Sean DiZazzo said:
I found something that works, although I couldn't tell you why it
works. :)

retag = re.compile(r'<.+?>', re.DOTALL)
tags = retag.findall(simplified)

Why does that work?

That means: "look for a less-than sign (<), followed by the shortest
sequence of (?) one or more (+) arbitrary characters (.), followed by a
greater-than sign (>)"
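
To see the difference in practice, here is a small sketch that runs three variants of the pattern over the sample string from the first post. The names `greedy`, `lazy` and `lazy_no_dotall` are just labels chosen for this illustration, and the sample is rewritten as one explicit literal on the assumption that the line breaks in the original post are real newlines.

```python
import re

# Sample from the first post, written out as a single literal.
simplified = (
    '2007-12-13 <tag1 attr1="text1" attr2="text2" /tag1>\n'
    '\n2007-12-13 <tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>\n'
    '\n'
)

greedy = re.compile(r'<.+>', re.DOTALL)    # longest possible run
lazy = re.compile(r'<.+?>', re.DOTALL)     # shortest possible run
lazy_no_dotall = re.compile(r'<.+?>')      # "." stops at a newline

# Greedy swallows everything from the first "<" to the last ">".
greedy_tags = greedy.findall(simplified)

# Lazy stops at the first ">" after each "<", giving one match per tag.
lazy_tags = lazy.findall(simplified)

# Without DOTALL the newline inside attr3 prevents the second match.
no_dotall_tags = lazy_no_dotall.findall(simplified)
```

Here `greedy_tags` holds a single string spanning both tags, `lazy_tags` holds the two tags separately, and `no_dotall_tags` holds only the first tag.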

If you never get nested tags, and never have a ">" inside an attribute,
that expression *might* work. But please try BeautifulStoneSoup; it uses a
lot of heuristics to guess the right structure. It doesn't always
work, but given your input, there isn't much one can do...
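
The ">" caveat is easy to reproduce. The sketch below uses a made-up input (`tricky`) and, purely as an assumption, one possible quote-aware pattern (`quote_aware`) that treats a double-quoted run as a single unit. It is not a substitute for a real parser: single-quoted values and unbalanced quotes are not handled.

```python
import re

lazy = re.compile(r'<.+?>', re.DOTALL)

# Hypothetical alternative: inside the tag, consume either a whole
# double-quoted string or any single character that is not '"' or '>'.
quote_aware = re.compile(r'<(?:"[^"]*"|[^">])+>')

# Made-up input with a ">" inside an attribute value.
tricky = 'noise <tag name="a>b" /> more noise'

lazy_result = lazy.findall(tricky)                 # cut short at the inner ">"
quote_aware_result = quote_aware.findall(tricky)   # keeps the whole tag
```

With this input `lazy_result` is `['<tag name="a>']`, while `quote_aware_result` is `['<tag name="a>b" />']`.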
 
Sean DiZazzo

Thanks! I'll take a look at BeautifulStoneSoup today and see what I
get.

~Sean
 
