lxml: traverse xml tree and retrieve element based on an attribute

byron · May 21, 2009

I am using the lxml.etree library to validate an xml instance file
with a specified schema that contains the data types of each element.
This is some of the internals of a function that extracts the
elements:

    schema_doc = etree.parse(schema_fn)
    schema = etree.XMLSchema(schema_doc)

    context = etree.iterparse(xml_fn, events=('start', 'end'),
                              schema=schema)

    # get root (the first event is the 'start' of the root element)
    event, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag == self.tag:
            yield elem
            root.clear()

I retrieve a list of elements from this... and do further processing
to represent them in different ways. I need to be able to capture the
data type from the schema definition for each field in the element.
i.e.

    <xsd:element name="concept">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element ref="foo"/>
          <xsd:element name="concept_id" type="xsd:string"/>
          <xsd:element name="line" type="xsd:integer"/>
          <xsd:element name="concept_value" type="xsd:string"/>
          <xsd:element ref="some_date"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>

My thought is to recursively traverse the schema definition, match
the `name` attribute (since each name is unique to a `type`), and
return that element. But I can't seem to make it quite work. All the
xml is valid, validation works, etc. This is what I have:

    def find_node(tree, name):
        for c in tree:
            if c.attrib.get('name') == name:
                return c
            if len(c) > 0:
                return find_node(c, name)
        return 0

I may have been staring at this too long, but when something is
returned, it should be returned completely, no? That is what happens
with `return find_node(c, name)` when it returns 0. `return c` works
(I used pdb to verify that), but the recursion continues and ends up
returning 0.

Thoughts and/or a different approach are welcome. Thanks

MRAB · May 21, 2009

byron said:

    def find_node(tree, name):
        for c in tree:
            if c.attrib.get('name') == name:
                return c
            if len(c) > 0:
                return find_node(c, name)
        return 0

You're searching the first child and then returning the result, but what
you're looking for might not be in the first child; if it's not then you
need to search the next child:

    def find_node(tree, name):
        for c in tree:
            if c.attrib.get('name') == name:
                return c
            if len(c) > 0:
                r = find_node(c, name)
                # Compare with None: an element without children is falsy
                if r is not None:
                    return r
        return None
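
The same search can also be written without explicit recursion. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies (lxml.etree elements offer the same `iter()` and `get()` methods), and the small schema string is a made-up sample modelled on the one in the question. Note that `iter()` also visits `tree` itself, and that the result is compared with `is not None` because an element with no children evaluates as false.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample schema, trimmed from the one in the question.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="concept">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="concept_id" type="xsd:string"/>
        <xsd:element name="line" type="xsd:integer"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""


def find_node(tree, name):
    """Return the first element (in document order) whose `name`
    attribute equals `name`, or None if there is no such element."""
    for elem in tree.iter():  # walks tree and all of its descendants
        if elem.get("name") == name:
            return elem
    return None


root = ET.fromstring(SCHEMA)
node = find_node(root, "line")
# Leaf elements are falsy, so test against None rather than truthiness.
field_type = node.get("type") if node is not None else None
print(field_type)
```

Once the element is found, its declared data type is simply the value of its `type` attribute.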

byron · May 22, 2009

Thanks. Yes, I tried something like this, but I think I overwrote `c`
when I wrote it, as in:

    if len(c) > 0:
        c = find_node(c, name)
        if c is not None:
            return c

Thanks for your help.

MRAB · May 22, 2009

FYI, doing that won't actually matter in this case; 'c' will still be
bound to the next value on the next iteration of the loop, because it's
just a name bound to the current item. 'Assigning' to it won't affect
the iterator, as it would in some other languages.
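
A small illustration of that point, as a sketch with made-up values: rebinding the loop variable inside the body changes neither the sequence nor what the next iteration sees.

```python
values = ["a", "b", "c"]
seen = []

for c in values:
    seen.append(c)
    c = None  # rebinds the local name only; the list and its iterator are untouched

print(seen)    # ['a', 'b', 'c']
print(values)  # ['a', 'b', 'c']
```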

byron · May 22, 2009

Good to know. Thanks.

Stefan Behnel · May 30, 2009

byron said:

    schema_doc = etree.parse(schema_fn)
    schema = etree.XMLSchema(schema_doc)

    context = etree.iterparse(xml_fn, events=('start', 'end'),
                              schema=schema)

    # get root (the first event is the 'start' of the root element)
    event, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag == self.tag:
            yield elem
            root.clear()

Note that you cannot modify the root element during iterparse() in
lxml.etree. It seems to work for you here, but it's not safe. Here's a
better way to do this.

http://www.ibm.com/developerworks/xml/library/x-hiperfparse/#N100FF

Stefan
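
For readers who cannot follow the link, here is a minimal sketch of one commonly used pattern for freeing memory during iterparse() without touching the root: clear each element once it has been processed, then delete the siblings that precede it. The helper name `fast_iter`, the sample XML, and the callback are assumptions made for illustration, not code taken from this thread.

```python
import io

from lxml import etree

# Hypothetical sample input; in the thread this would be the file xml_fn.
XML = b"<root><item>1</item><item>2</item><item>3</item></root>"


def fast_iter(context, func):
    """Call func on each element from context, then release its memory."""
    for event, elem in context:
        func(elem)
        # Drop the element's own text, attributes and children.
        elem.clear()
        # Remove already-processed preceding siblings from the parent,
        # leaving the ancestors of the current element in place.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


texts = []
context = etree.iterparse(io.BytesIO(XML), events=("end",), tag="item")
fast_iter(context, lambda elem: texts.append(elem.text))
print(texts)
```

Because the callback runs before the element is cleared, anything that must outlive the loop (here the text) has to be copied out inside the callback.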
