Ben Temperton
Hi there, I am parsing some huge XML files (1.8 GB) that look like this:
<scan num='1'>
<peaks>some data</peaks>
<scan num='2'>
<peaks>some data</peaks>
</scan>
<scan num='3'>
<peaks>some data</peaks>
</scan>
</scan>
What I am trying to do is build up a dictionary of lists where the key is the parent scan num and the members of the list are the child scan nums.
I have created an iterator:
for event, elem in cElementTree.iterparse(filename):
    if elem.tag == self.XML_SPACE + "scan":
        parentId = int(elem.get('num'))
        for child in elem.findall(self.XML_SPACE + 'scan'):
            try:
                indexes = scans[parentId]
            except KeyError:
                indexes = []
                scans[parentId] = indexes
            childId = int(child.get('num'))
            indexes.append(childId)
            # choice 1 - child.clear()
        # choice 2 - elem.clear()
    # choice 3 - elem.clear()
If I don't use any of the clear functions, the method works fine but is very slow (presumably because nothing is getting cleared from memory). But if I implement any of the clear functions shown, then
childId = int(child.get('num'))
fails because child.get('num') returns None. If you dump the child element using cElementTree.dump(child), all of the attributes on the child scans are missing, even though the clear() calls are made after the assignment of the childId.
What I don't understand is why the assignment fails when clear() is called after it, yet succeeds when clear() is not called at all.
When should I be calling clear() in this case to maximize speed?
Many thanks,
Ben
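A note on the likely cause, with a sketch of one possible workaround. By default iterparse reports only "end" events, and an inner scan's end tag is seen before its parent's. So an elem.clear() issued at the inner scan's "end" event has already removed its attributes by the time the parent's findall() loop reads child.get('num'). The sketch below avoids looking at children altogether: it listens for "start" events too, keeps a stack of the currently open scan nums, and clears each element at its "end" event. It uses xml.etree.ElementTree rather than cElementTree (an assumption about the Python version in use), and the function name child_scans, the scan_tag parameter and the in-memory SAMPLE are hypothetical, for illustration only.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Small in-memory stand-in for the real 1.8 GB file (illustration only).
SAMPLE = b"""<scan num='1'>
<peaks>some data</peaks>
<scan num='2'>
<peaks>some data</peaks>
</scan>
<scan num='3'>
<peaks>some data</peaks>
</scan>
</scan>"""


def child_scans(source, scan_tag="scan"):
    """Map each parent scan num to the list of its direct child scan nums.

    scan_tag is a hypothetical parameter; for namespaced files pass the
    fully qualified tag (e.g. XML_SPACE + "scan").
    """
    scans = defaultdict(list)
    open_scans = []  # nums of the <scan> elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available when "start" is reported.
            if elem.tag == scan_tag:
                num = int(elem.get("num"))
                if open_scans:
                    # The innermost open scan is this scan's parent.
                    scans[open_scans[-1]].append(num)
                open_scans.append(num)
        else:
            if elem.tag == scan_tag:
                open_scans.pop()
            # Safe to clear here: nothing is read from the element later.
            elem.clear()
    return dict(scans)


result = child_scans(io.BytesIO(SAMPLE))
print(result)  # {1: [2, 3]}
```

With a real file, a filename can be passed in place of the io.BytesIO object. Note that clear() empties an element but leaves the emptied element attached to its parent, so some memory is still held until the parent itself is cleared.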