ElementTree oddities

Brian Cole · Sep 15, 2008

I'm trying to extract the text from some XML. I figured this
convenient Python two-liner would do it for me:

from xml.etree.ElementTree import *
from cStringIO import StringIO
root = parse(StringIO(xml)).getroot()
' '.join([n.text for n in root.getiterator() if n.text is not None])

However, it's missing some of the text. For example, the following
XML:
Returns an empty string. Seems the "<sp />" tag is borking it.

Also, for the following XML:
I only get "Bar". It's missing the trailing colon.

I'm not that experienced with XML so perhaps I am just missing
something here. Please enlighten me.

Thanks,
Brian

Fredrik Lundh · Sep 15, 2008

Brian said:
However, it's missing some of the text. For example, the following
XML:

Returns an empty string. Seems the "<sp />" tag is borking it.

Also, for the following XML:

I only get "Bar". It's missing the trailing colon.

I'm not that experienced with XML so perhaps I am just missing
something here. Please enlighten me.

you're missing the "tail" attribute, which contains text that follows
directly *after* the element's end tag. it's not exactly a one-liner,
but I usually use the one on this page:

http://effbot.org/zone/element-bits-and-pieces.htm#gettext
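
For reference, here is a minimal sketch of that idea, assuming the standard-library xml.etree.ElementTree. It is not the code from the linked page: the helper name collect_text and the <line> sample are made up for illustration. Each element contributes its own text, then each child's content followed by that child's tail:

```python
import xml.etree.ElementTree as ET


def collect_text(elem):
    """Return all character data inside elem, in document order."""
    # Text between elem's start tag and its first child (None if absent).
    parts = [elem.text or ""]
    for child in elem:
        # Everything inside the child, collected recursively.
        parts.append(collect_text(child))
        # Text between the child's end tag and the next sibling
        # (or the parent's end tag) is stored on the child's .tail.
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical sample: the empty <sp /> element has no .text,
# so "Foo" ends up as its tail.
line = ET.fromstring("<line><sp />Foo</line>")
# Sample that also appears later in this thread.
highlight = ET.fromstring("<highlight><ref>Bar</ref>:</highlight>")

print(repr(collect_text(line)))
print(repr(collect_text(highlight)))
```

With these inputs the two calls should print 'Foo' and 'Bar:', whereas joining only the .text attributes gives '' and 'Bar'.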

</F>

gomesjas · Sep 15, 2008

I'm not sure, but I think your document is not well-formed...

Anyway, as the name of the module suggests, you must think about XML
not as a flat document but as a tree; that's the only way I managed to
parse XML.

Mark Thomas · Sep 15, 2008

Fredrik is correct: the text attribute only contains the text before the
first child element; the text after each child's end tag is stored in
that child's tail attribute. It is indeed rather odd. For comparison,
here's how you would do it in lxml (http://codespeak.net/lxml/index.html),
a library which supports XPath:

from lxml import etree
tree = etree.fromstring('<highlight><ref>Bar</ref>:</highlight>')
print(' '.join(tree.xpath('//text()')))
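
To see where each piece of that same document ends up in the standard-library ElementTree, a small inspection sketch may help (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<highlight><ref>Bar</ref>:</highlight>")
ref = root[0]

# No character data sits between <highlight> and <ref>, so root.text is None.
print(repr(root.text))
# "Bar" is inside <ref>, so it is stored as ref.text.
print(repr(ref.text))
# ":" follows </ref>, so it is stored on ref.tail, not on root.
print(repr(ref.tail))
```

This is why a join over .text alone finds "Bar" but never sees the colon.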

Stefan Behnel · Sep 16, 2008

Mark said:
here's how you would do it in lxml (http://codespeak.net/lxml/index.html),
a library which supports XPath:

from lxml import etree
tree = etree.fromstring('<highlight><ref>Bar</ref>:</highlight>')
print(' '.join(tree.xpath('//text()')))

If you want to use XPath, try this:

print(tree.xpath('string()'))

or if you want to use it in real code:

get_tree_text = etree.XPath('string()')
print(get_tree_text(tree))

or just use

print(etree.tostring(tree, method="text"))
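
If lxml is not available, the standard-library module offers rough counterparts. The sketch below assumes a reasonably recent Python 3, since Element.itertext() and encoding="unicode" are not present in the oldest ElementTree releases:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<highlight><ref>Bar</ref>:</highlight>")

# itertext() walks the subtree and yields every text and tail string
# in document order.
joined = "".join(root.itertext())

# method="text" serializes only the character data; encoding="unicode"
# asks for a str result rather than bytes.
serialized = ET.tostring(root, encoding="unicode", method="text")

print(repr(joined))
print(repr(serialized))
```

Both lines should print 'Bar:' for this document.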

Stefan
