[jog]
> I want to get text out of some nodes of a huge xml file (1,5 GB). The
> architecture of the xml file is something like this
[snip]
> I want to combine the text out of page:title and page:revision:text
> for every single page element. One by one I want to index these
> combined texts (so for each page one index)
> What is the most efficient API for that?:
> SAX ( I don´t thonk so)
SAX is perfect for the job. See code below.
If your XML file is 1.5G, you'll need *lots* of RAM and virtual memory
to load it into a DOM.
Not sure how pulldom does its pull "optimizations", but I think it
still builds an in-memory object structure for your document, which will
still take buckets of memory for such a big document. I could be wrong
though.
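For comparison, here's a minimal pulldom sketch (the one-page document
string is made up for illustration). pulldom streams SAX events and only
materialises a DOM subtree when you call expandNode() on a node, so
per-page memory should stay bounded - though plain SAX is still the
leaner option for your case:

```python
from xml.dom import pulldom

doc = ("<parent><page><title>Page number 1</title>"
       "<revision><text>revision one</text></revision></page></parent>")

pages = []
events = pulldom.parseString(doc)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == 'page':
        # Builds a DOM subtree for this one page element only;
        # the rest of the document is never held in memory.
        events.expandNode(node)
        title = node.getElementsByTagName('title')[0].firstChild.data
        text = node.getElementsByTagName('text')[0].firstChild.data
        pages.append((title, text))
```

After the loop, pages holds one (title, text) tuple per page element.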
> Or should I just use Xpath somehow.
Using XPath normally requires building a (D)OM, which will consume
*lots* of memory for your document, regardless of how efficient the OM is.
Best to use SAX with XPath-style expressions.
You can get a limited subset of XPath using a SAX handler and a stack.
Your problem is particularly well suited to that kind of solution. Code
that does a basic job of this for your specific problem is given below.
Note that there are a couple of caveats with this code:
1. The characters() handler may get called multiple times for a single
xml text() node. This is permitted by the SAX spec, and is basically a
consequence of using buffered IO to read the contents of the xml file,
e.g. the start of a text node falls at the end of one buffer read, and
the rest of the text node is at the beginning of the next buffer.
2. This code assumes that your "revision/text" nodes do not contain
mixed content, i.e. a mixture of elements and text, e.g.
"<revision><text>This is a piece of <b>revision</b>
text</text></revision>". The code below will fail to extract all
character data in that case.
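You can see caveat 1 in action by feeding the parser a document in two
pieces, so that a text node straddles the buffer boundary (the two
feed() calls below simulate a small IO buffer):

```python
import xml.sax

class Collector(xml.sax.handler.ContentHandler):
    """Records every chunk that characters() delivers."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.pieces = []
    def characters(self, data):
        self.pieces.append(data)

collector = Collector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)
parser.feed("<doc>hel")   # text node split across two buffers
parser.feed("lo</doc>")
parser.close()

# characters() typically fires once per chunk here, but only the
# concatenation is guaranteed by the SAX spec:
assert "".join(collector.pieces) == "hello"
```

That's why the code below accumulates character data per field instead
of assuming one callback per text node.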
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax

class Page:
    """Accumulates named text fields, appending on repeated calls."""
    def append(self, field_name, new_value):
        old_value = ""
        if hasattr(self, field_name):
            old_value = getattr(self, field_name)
        setattr(self, field_name, "%s%s" % (old_value, new_value))

class page_matcher(xml.sax.handler.ContentHandler):
    def __init__(self, page_handler=None):
        xml.sax.handler.ContentHandler.__init__(self)
        self.page_handler = page_handler
        self.stack = []
        self.page = None
        self.chardata = ""

    def check_stack(self):
        # Build an XPath-style expression from the element stack and
        # match it against the paths we care about.
        stack_expr = "/" + "/".join(self.stack)
        if '/parent/page' == stack_expr:
            self.page = Page()
        elif '/parent/page/title/text()' == stack_expr:
            self.page.append('title', self.chardata)
        elif '/parent/page/revision/id/text()' == stack_expr:
            self.page.append('revision_id', self.chardata)
        elif '/parent/page/revision/text/text()' == stack_expr:
            self.page.append('revision_text', self.chardata)
        else:
            pass

    def startElement(self, elemname, attrs):
        self.stack.append(elemname)
        self.check_stack()

    def endElement(self, elemname):
        if elemname == 'page' and self.page_handler:
            self.page_handler(self.page)
            self.page = None
        self.stack.pop()

    def characters(self, data):
        # May be called several times for one text node (caveat 1);
        # Page.append takes care of accumulating the pieces.
        self.chardata = data
        self.stack.append('text()')
        self.check_stack()
        self.stack.pop()
testdoc = """
<parent>
  <page>
    <title>Page number 1</title>
    <id>p1</id>
    <revision>
      <id>r1</id>
      <text>revision one</text>
    </revision>
  </page>
  <page>
    <title>Page number 2</title>
    <id>p2</id>
    <revision>
      <id>r2</id>
      <text>revision two</text>
    </revision>
  </page>
</parent>
"""
def page_handler(new_page):
    print "New page"
    print "title\t\t%s" % new_page.title
    print "revision_id\t%s" % new_page.revision_id
    print "revision_text\t%s" % new_page.revision_text
    print
if __name__ == "__main__":
    parser = xml.sax.make_parser()
    parser.setContentHandler(page_matcher(page_handler))
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    parser.close()  # flush any buffered events
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
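One last note: for the real 1.5 GB file you wouldn't build a giant
string and feed() it - point the parser at the file and let it stream
from disk, so memory use stays flat regardless of file size. A
self-contained sketch, with a tiny temp file standing in for the dump
and a made-up PageCounter handler (with the page_matcher above, you'd
pass your indexing callback as page_handler instead):

```python
import os
import tempfile
import xml.sax

class PageCounter(xml.sax.handler.ContentHandler):
    """Stand-in handler: just counts page elements."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.pages = 0
    def endElement(self, name):
        if name == 'page':
            self.pages += 1

# Write a tiny stand-in for the real dump file.
fd, path = tempfile.mkstemp(suffix=".xml")
os.write(fd, b"<parent><page/><page/></parent>")
os.close(fd)

handler = PageCounter()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(path)  # streams from disk; no full-document tree
os.remove(path)
```

After parsing, handler.pages is 2 for this two-page stand-in.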
HTH,