python xml DOM? pulldom? SAX?

jog

Hi,
I want to get text out of some nodes of a huge XML file (1.5 GB). The
structure of the XML file is something like this:
<parent>
<page>
<title>bla</title>
<id></id>
<revision>
<id></id>
<text>blablabla</text>
</revision>
</page>
<page>
</page>
....
</parent>
I want to combine the text out of page:title and page:revision:text for
every single page element. One by one I want to index these combined
texts (so for each page one index)
What is the most efficient API for that: SAX (I don't think so), DOM,
or pulldom?
Or should I just use XPath somehow?
I don't want to do anything else with this XML file afterwards.
I hope someone will understand me.....
Thank you very much
Jog
 
Fredrik Lundh

jog said:
I want to get text out of some nodes of a huge XML file (1.5 GB). The
structure of the XML file is something like this
I want to combine the text out of page:title and page:revision:text for
every single page element. One by one I want to index these combined
texts (so for each page one index)

here's one way to do it:

try:
    import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET

for event, elem in ET.iterparse(file):
    if elem.tag == "page":
        title = elem.findtext("title")
        revision = elem.findtext("revision/text")
        print title, revision
        elem.clear() # won't need this any more

references:

http://effbot.org/zone/element-index.htm
http://effbot.org/zone/celementtree.htm (for best performance)
http://effbot.org/zone/element-iterparse.htm

</F>
 
tooper

Hi,

I'd advocate using SAX, as DOM-related methods imply loading the
complete XML content into memory, whereas SAX grabs things on the fly.
The SAX method should therefore be faster and less memory-consuming...

By the way, if your goal is just to "combine the text out of page:title
and page:revision:text for every single page element", maybe you should
also consider an XSLT filter.
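
(To make the XSLT idea concrete, here is a minimal sketch, assuming the
lxml package is available; the stylesheet and the file name "file.xml"
are illustrative. Caveat: an ordinary XSLT pass builds the whole input
tree in memory, so on a 1.5 GB file it would have to be combined with
some form of splitting or streaming.)

from lxml import etree

# Stylesheet that keeps only title and revision/text for each page.
xslt = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/parent">
    <pages>
      <xsl:for-each select="page">
        <combined>
          <xsl:value-of select="title"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="revision/text"/>
        </combined>
      </xsl:for-each>
    </pages>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt)
result = transform(etree.parse("file.xml"))
for combined in result.getroot():
    print combined.text # one combined title + revision text per page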

Regards,
Thierry
 
Michael Ekstrand

I want to get text out of some nodes of a huge XML file (1.5 GB). The
structure of the XML file is something like this
[structure snipped]
I want to combine the text out of page:title and page:revision:text
for every single page element. One by one I want to index these
combined texts (so for each page one index)
What is the most efficient API for that: SAX (I don't think so), DOM
or pulldom?

Definitely SAX IMHO, or xml.parsers.expat. For what you're doing, an
event-driven interface is ideal. DOM parses the *entire* XML tree into
memory at once, before you can do anything - highly inefficient for a
large data set like this. I've never used pulldom; it might have
potential, but from my (limited and flawed) understanding of it, I
think it may also wind up loading most of the file into memory by the
time you're done.

SAX will not build any memory structures other than the ones you
explicitly create (SAX is commonly used to build DOM trees). With SAX,
you can just watch for any tags of interest (and perhaps some
surrounding tags to provide context), extract the desired data, and do
all of that very efficiently.

It took me a bit to get the hang of SAX, but once I did, I never
looked back. Event-driven parsing is a brilliant solution to this
problem domain.
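
(A minimal sketch of that pattern with xml.parsers.expat, assuming the
element structure from the original post; the file name and the
PageExtractor class are illustrative.)

import xml.parsers.expat

class PageExtractor:
    def __init__(self):
        self.stack = [] # path of currently open elements
        self.buffer = None # text chunks of the node being captured

    def start(self, name, attrs):
        self.stack.append(name)
        # start capturing inside page/title and page/revision/text
        if self.stack[-2:] == ["page", "title"] or \
           self.stack[-3:] == ["page", "revision", "text"]:
            self.buffer = []

    def end(self, name):
        if self.buffer is not None:
            text = "".join(self.buffer) # chunks may arrive piecemeal
            # ... hand `text` over to the indexer here ...
            self.buffer = None
        self.stack.pop()

    def chardata(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

extractor = PageExtractor()
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = extractor.start
parser.EndElementHandler = extractor.end
parser.CharacterDataHandler = extractor.chardata
parser.ParseFile(open("file.xml", "rb")) # streams; never builds a tree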
Or should I just use XPath somehow?

XPath usually requires a DOM tree on which it can operate. The Python
XPath implementation (in PyXML) requires DOM objects. I see this as a
highly inefficient solution.

Another potential solution, if the data file has extraneous
information: run the source file through an XSLT transform that strips
it down to only the data you need, and then apply SAX to parse it.

- Michael
 
Alan Kennedy

[jog]
> I want to get text out of some nodes of a huge XML file (1.5 GB). The
> structure of the XML file is something like this
[snip]

> I want to combine the text out of page:title and page:revision:text
> for every single page element. One by one I want to index these
> combined texts (so for each page one index)
> What is the most efficient API for that:
> SAX (I don't think so)

SAX is perfect for the job. See code below.

If your XML file is 1.5G, you'll need *lots* of RAM and virtual memory
to load it into a DOM.
> or pulldom?

Not sure how pulldom does its pull "optimizations", but I think it
still builds an in-memory object structure for your document, which will
still take buckets of memory for such a big document. I could be wrong
though.
> Or should I just use XPath somehow?

Using XPath normally requires building a (D)OM, which will consume
*lots* of memory for your document, regardless of how efficient the OM is.

Best to use SAX and XPath-style expressions.

You can get a limited subset of xpath using a SAX handler and a stack.
Your problem is particularly well suited to that kind of solution. Code
that does a basic job of this for your specific problem is given below.

Note that there are a number of caveats with this code:

1. Character data handlers may get called multiple times for a single
XML text() node. This is permitted in the SAX spec, and is basically a
consequence of using buffered IO to read the contents of the XML file,
e.g. the start of a text node is at the end of one buffer read, and
the rest of the text node is at the beginning of the next buffer.

2. This code assumes that your "revision/text" nodes do not contain
mixed content, i.e. a mixture of elements and text, e.g.
"<revision><text>This is a piece of <b>revision</b>
text</text></revision>". The code below will fail to extract all
character data in that case.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax

class Page:

    def append(self, field_name, new_value):
        old_value = ""
        if hasattr(self, field_name):
            old_value = getattr(self, field_name)
        setattr(self, field_name, "%s%s" % (old_value, new_value))

class page_matcher(xml.sax.handler.ContentHandler):

    def __init__(self, page_handler=None):
        xml.sax.handler.ContentHandler.__init__(self)
        self.page_handler = page_handler
        self.stack = []

    def check_stack(self):
        stack_expr = "/" + "/".join(self.stack)
        if '/parent/page' == stack_expr:
            self.page = Page()
        elif '/parent/page/title/text()' == stack_expr:
            self.page.append('title', self.chardata)
        elif '/parent/page/revision/id/text()' == stack_expr:
            self.page.append('revision_id', self.chardata)
        elif '/parent/page/revision/text/text()' == stack_expr:
            self.page.append('revision_text', self.chardata)
        else:
            pass

    def startElement(self, elemname, attrs):
        self.stack.append(elemname)
        self.check_stack()

    def endElement(self, elemname):
        if elemname == 'page' and self.page_handler:
            self.page_handler(self.page)
            self.page = None
        self.stack.pop()

    def characters(self, data):
        self.chardata = data
        self.stack.append('text()')
        self.check_stack()
        self.stack.pop()

testdoc = """
<parent>
<page>
<title>Page number 1</title>
<id>p1</id>
<revision>
<id>r1</id>
<text>revision one</text>
</revision>
</page>
<page>
<title>Page number 2</title>
<id>p2</id>
<revision>
<id>r2</id>
<text>revision two</text>
</revision>
</page>
</parent>
"""

def page_handler(new_page):
    print "New page"
    print "title\t\t%s" % new_page.title
    print "revision_id\t%s" % new_page.revision_id
    print "revision_text\t%s" % new_page.revision_text
    print

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    parser.setContentHandler(page_matcher(page_handler))
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
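
(For the real 1.5 GB file you would parse a stream rather than feed a
string: parser.parse("file.xml") accepts a file name or file object and
reads it incrementally.)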

HTH,
 
Fredrik Lundh

Alan said:
SAX is perfect for the job. See code below.

depends on your definition of perfect...

using a 20 MB version of jog's sample, and having replaced
the print statements with local variable assignments, I get the
following timings:

5 lines of cElementTree code: 7.2 seconds
60+ lines of xml.sax code: 63 seconds

(Python 2.4.1, Windows XP, Pentium 3 GHz)

</F>
 
Alan Kennedy

[Alan Kennedy]
> SAX is perfect for the job. See code below.

[Fredrik Lundh]
> depends on your definition of perfect...

Obviously, perfect is the eye of the beholder ;-)

[Fredrik Lundh]
> using a 20 MB version of jog's sample, and having replaced
> the print statements with local variable assignments, I get the
> following timings:
>
> 5 lines of cElementTree code: 7.2 seconds
> 60+ lines of xml.sax code: 63 seconds
>
> (Python 2.4.1, Windows XP, Pentium 3 GHz)

Impressive!

At first, I thought your code sample was building a tree for the entire
document, so I checked the API docs. It appeared to me that an event
processing model *couldn't* obtain the text for the node when notified
of the node: the text node is still in the future.

That's when I understood the nature of iterparse, which must generate an
event *after* the node is complete, with its subdocument reified. That's
also when I understood the meaning of the "elem.clear()" call at the
end. Only the required section of the tree is modelled in memory at any
given time. Nice.
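
(One further refinement, sketched here with the same imports as
Fredrik's example: elem.clear() empties each finished <page>, but the
emptied page elements still accumulate as children of the root. The
usual trick is to grab the root from the first "start" event and clear
it as each page is finished; the file name is illustrative.)

try:
    import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET

context = iter(ET.iterparse("file.xml", events=("start", "end")))
event, root = context.next() # first event: start of <parent>
for event, elem in context:
    if event == "end" and elem.tag == "page":
        title = elem.findtext("title")
        revision = elem.findtext("revision/text")
        # ... index (title, revision) here ...
        root.clear() # also drops the finished <page> from the root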

There are some minor inefficiencies in my pure python sax code, e.g.
building the stack expression for every evaluation, but I left them in
for didactic reasons. But even if every possible speed optimisation were
added to my python code, I doubt it would be able to match your code.

I'm guessing that a lot of the reason why cElementTree performs best is
because the model-building is primarily implemented in C: Both of our
solutions run python code for every node in the tree, i.e. are O(N). But
yours also avoids the overhead of having function-calls/stack-frames for
every single node event, by processing all events inside a single function.

If the SAX algorithm were implemented in C (or Java, for that matter), I
wonder if it might give comparable performance to the cElementTree code,
primarily because the data structures it is building are simpler,
compared to the tree subsections being reified and discarded by
cElementTree. But that's not relevant here, because we're looking for
python solutions. (Aside: I can't wait to run my solution on a
fully-optimising PyPy :)

That's another nice thing I didn't know (c)ElementTree could do.

enlightened-ly'yrs,
 
William Park

jog said:
Hi,
I want to get text out of some nodes of a huge XML file (1.5 GB). The
structure of the XML file is something like this
<parent>
<page>
<title>bla</title>
<id></id>
<revision>
<id></id>
<text>blablabla</text>
</revision>
</page>
<page>
</page>
....
</parent>
I want to combine the text out of page:title and page:revision:text for
every single page element. One by one I want to index these combined
texts (so for each page one index)
What is the most efficient API for that: SAX (I don't think so), DOM,
or pulldom?
Or should I just use XPath somehow?
I don't want to do anything else with this XML file afterwards.
I hope someone will understand me.....
Thank you very much
Jog

I would use the Expat interface from Python, Awk, or even the Bash
shell. I'm most familiar with the shell interface to Expat, which would
go something like:

start() # Usage: start tag att=value ...
{
    case $1 in
        page) unset title text ;;
    esac
}
data() # Usage: data text
{
    case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
        title.page.*) title=$1 ;;
        text.revision.page) text=$1 ;;
    esac
}
end() # Usage: end tag
{
    case $1 in
        page) echo "title=$title text=$text" ;;
    esac
}
expat -s start -d data -e end < file.xml

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
jog

Thanks a lot for all your replies, they were really great and helpful.
Now I have some problems with the indexing: it takes too much memory
and takes too long. I have to look into it.....
 
