ElementTree oddities

Brian Cole

I'm trying to extract the text from some XML. I figured this
convenient Python two-liner would do it for me:
from xml.etree.ElementTree import parse
from io import StringIO
# xml is a string holding the document
root = parse(StringIO(xml)).getroot()
' '.join([n.text for n in root.iter() if n.text is not None])

However, it's missing some of the text. For example, the following
XML:
Returns me an empty string. Seems the "<sp />" tag is borking it.


Also, for the following XML:
I only get "Bar". It's missing the trailing colon.

I'm not that experienced with XML so perhaps I am just missing
something here. Please enlighten me.

Thanks,
Brian
 
Fredrik Lundh

Brian said:
However, it's missing some of the text. For example, the following
XML:

Returns me an empty string. Seems the "<sp />" tag is borking it.


Also, for the following XML:

I only get "Bar". It's missing the trailing colon.

I'm not that experienced with XML so perhaps I am just missing
something here. Please enlighten me.

you're missing the "tail" attribute, which contains text that follows
directly *after* the element's end tag. it's not exactly a one-liner,
but I usually use the one on this page:

http://effbot.org/zone/element-bits-and-pieces.htm#gettext
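
The code from that page is not reproduced here. As a rough sketch of the same idea, assuming Python 3 and the standard-library ElementTree, a recursive helper can collect each element's text, then each child's content followed by that child's tail. The name `gettext` and the implementation below are illustrative only, and the sample XML is the one used later in this thread:

```python
import xml.etree.ElementTree as ET


def gettext(elem):
    """Return all text inside elem in document order (illustrative helper)."""
    # Text between elem's start tag and its first child (may be None).
    parts = [elem.text or ""]
    for child in elem:
        # Everything inside the child, gathered recursively.
        parts.append(gettext(child))
        # Text that follows the child's end tag, still inside elem.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring("<highlight><ref>Bar</ref>:</highlight>")
result = gettext(root)
print(result)
```

The tail of the element passed in is deliberately skipped, since that text belongs to its parent rather than to the element itself.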

</F>
 
gomesjas

I'm not sure, but I think your document is not well-formed...

Anyway, as the name of the module suggests, you have to think about
XML not as a flat document but as a tree. That's the only way I
managed to parse XML.

 
Mark Thomas

Fredrik is correct: an element's text attribute only contains the text
before its first child element; text that follows a child's end tag is
stored in that child's tail attribute. It is indeed rather odd. For
comparison, here's how you would do it in lxml
(http://codespeak.net/lxml/index.html), a library which supports XPath:

from lxml import etree
tree = etree.fromstring('<highlight><ref>Bar</ref>:</highlight>')
# //text() selects every text node, including tail text such as the colon
print(' '.join(tree.xpath('//text()')))
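
To make the text/tail split concrete, here is a small sketch that inspects where the standard-library ElementTree stores each piece of that same document. It assumes Python 3; the variable names are only for illustration:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<highlight><ref>Bar</ref>:</highlight>")
ref = root[0]  # the <ref> child

# No text sits between <highlight> and <ref>, so root.text is None.
print(repr(root.text))
# "Bar" is inside <ref>, so it is ref.text.
print(repr(ref.text))
# ":" follows </ref>, so it is stored as ref.tail, not on <highlight>.
print(repr(ref.tail))

# Looking only at .text, as in the original question, drops the colon.
text_only = " ".join(n.text for n in root.iter() if n.text is not None)
print(text_only)
```

This is why the original two-liner sees "Bar" but never the colon: the colon is attached to the child element as its tail.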
 
Stefan Behnel

Mark said:
here's how you would do it in lxml
(http://codespeak.net/lxml/index.html), a library which supports XPath:

from lxml import etree
tree = etree.fromstring('<highlight><ref>Bar</ref>:</highlight>')
print(' '.join(tree.xpath('//text()')))

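If you want to use XPath, try this:

print(tree.xpath('string()'))

or if you want to use it in real code:

# compile the expression once and reuse it
get_tree_text = etree.XPath('string()')
print(get_tree_text(tree))

or just use

# encoding="unicode" returns a str rather than bytes
print(etree.tostring(tree, method="text", encoding="unicode"))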
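
For anyone without lxml installed, the standard-library ElementTree offers what I believe are close equivalents in Python 3: `Element.itertext()` yields text and tail pieces in document order, and `tostring()` accepts `method="text"`. A minimal sketch on the same sample document, with illustrative variable names:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<highlight><ref>Bar</ref>:</highlight>")

# itertext() walks the subtree and yields each text and tail string in order.
joined = "".join(root.itertext())

# method="text" serializes only the text content; encoding="unicode"
# asks for a str instead of bytes.
serialized = ET.tostring(root, method="text", encoding="unicode")

print(joined)
print(serialized)
```

Joining with an empty string keeps the pieces exactly as they appear in the document; joining with a space instead would separate "Bar" from the colon.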

Stefan
 
