Andrew Robinson
Good day,
I've been exploring XML parsers in Python, particularly
xml.etree.cElementTree, and I'm trying to figure out how to parse
incrementally, for very large XML files -- although I don't think the
problems are restricted to incremental parsing.
First problem:
I've come across an issue where etree appears to drop text silently,
without raising any error.
I am under the impression that XHTML is a subset of XML (e.g. defined
tags), and that once an HTML file is converted to XHTML, the body of the
document can be handled entirely as XML.
If I convert a (partial/contrived) html file like:
<html>
<div>
<p> This is example <b>bold</b> text.
</div>
</html>
to XHTML, I might do --right or wrong-- (1):
<html>
<div>
<p /> This is example <b>bold</b> text.
</div>
</html>
or, as an alternative, (2): "<p> This is example <b>bold</b> text. </p>"
But when I parse with etree, in example (1) both "This is example"
and "text." are dropped.
The missing text does not show up on the elements delivered with the
start or end events when parsing incrementally.
Likewise, in example (2), only "text." gets dropped.
So etree appears to be silently dropping all text that follows a close
tag but comes before the next tag.
Q:
Isn't an XML parser supposed to error out when invalid XML is parsed?
Is there a way in etree to recover/access the dropped text?
If not -- is this a Python library issue, or an issue in the underlying
expat.so, etc. library?
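A minimal sketch to reproduce the behaviour described above. It assumes xml.etree.ElementTree rather than cElementTree (same API; cElementTree was deprecated and later removed from the standard library). In the ElementTree data model, text that follows an element's end tag is stored on that element's tail attribute rather than on text, so that is where the sketch looks:

```python
import xml.etree.ElementTree as ET

# Example (2): the paragraph is closed explicitly.
doc2 = "<html><div><p> This is example <b>bold</b> text. </p></div></html>"
p = ET.fromstring(doc2).find("div/p")
b = p.find("b")

print(repr(p.text))   # text between <p> and <b>
print(repr(b.text))   # text inside <b>
print(repr(b.tail))   # text between </b> and </p>

# itertext() walks text and tail values in document order.
full_text = "".join(p.itertext())
print(repr(full_text))

# Example (1): <p /> is empty, so the sentence begins in its tail.
doc1 = "<html><div><p /> This is example <b>bold</b> text.</div></html>"
div = ET.fromstring(doc1).find("div")
empty_p = div.find("p")
print(repr(empty_p.tail))
print(repr(div.find("b").tail))
```

Both inputs are well-formed XML, which would explain why no error is raised.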
Second problem:
I have an XML file which will grow larger than memory on a target
machine, so here's what I want to do:
Given a source XML file, and a destination file:
1) Iteratively scan part of the source tree.
2) Optionally modify some of the scanned tree.
3) Write the partial scan/tree out to the destination file.
4) Free the memory of the no-longer-needed (partial) source XML.
5) Continue scanning a new section of the source file, i.e. go to step 1
until the source file is exhausted.
But, I don't see a way to write portions of an XML tree, or iteratively
write a tree to disk.
How can this be done?
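One possible approach, as a rough sketch only: use iterparse() from xml.etree.ElementTree, serialize each direct child of the root with tostring() as soon as its end event arrives, and then clear the root so the processed children can be freed. The function name stream_copy, the modify callback, and the sample <root>/<item> document are all hypothetical. The sketch also assumes the root carries no attributes and that only whitespace separates its children, since it writes the root tags by hand and discards tail text.

```python
import io
import xml.etree.ElementTree as ET

def stream_copy(src, dst, modify=None):
    """Copy src to dst one top-level child at a time (hypothetical helper)."""
    context = iter(ET.iterparse(src, events=("start", "end")))
    _, root = next(context)            # first event: start of the root
    dst.write("<%s>\n" % root.tag)     # assumption: root has no attributes
    depth = 0
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:                 # a direct child of the root is complete
            if modify is not None:
                modify(elem)           # step 2: optional modification
            elem.tail = None           # tail may be incomplete here; drop it
            dst.write(ET.tostring(elem, encoding="unicode"))  # step 3
            dst.write("\n")
            root.clear()               # step 4: release processed children
    dst.write("</%s>\n" % root.tag)

def upper_text(elem):
    # Hypothetical modification: upper-case the element's own text.
    if elem.text:
        elem.text = elem.text.upper()

source = io.BytesIO(b"<root><item>a</item><item>b</item></root>")
dest = io.StringIO()
stream_copy(source, dest, modify=upper_text)
result = dest.getvalue()
print(result)
```

With real files, src could be a file name or a file opened in binary mode, and dst a file opened in text mode.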
Thanks!