Parsing XML streams

P

Peter Scott

I have a program that listens on an IRC channel and logs everything to
XML on standard output. The format of the XML is pretty
straightforward, looking like this:

<channel name='#sandbox'>
<message user='PeterScott'>Hello, my bot</message>
<message user='PeterScott'>This is a message</message>
<nickchange>
<oldnick>PeterScott</oldnick>
<newnick>PeterSc</newnick>
</nickchange>
</channel>

I'm writing another program that should parse that sort of XML on its
stdin, printing out a more user-friendly representation. For this, I
need to parse the XML as it comes in, not all at once.

I wrote a parser using xml.sax, and it works well---provided that I
read in the whole document. However, I want to be able to just read
the XML piece by piece, calling event handlers whenever something
happens and waiting for more to happen.

Is there some way to do this with the standard python xml parsers?
Will I need to use PyXML? Or what?

Thanks,
-Peter
 
J

Jeremy Bowers

Is there some way to do this with the standard python xml parsers?
Will I need to use PyXML? Or what?

xml.parsers.expat can parse things in pieces. It shouldn't be *too* much
work to convert over.
 
A

Alan Kennedy

Peter said:
I'm writing another program that should parse that sort of XML on its
stdin, printing out a more user-friendly representation. For this, I
need to parse the XML as it comes in, not all at once.

Peter,

Check out the IncrementalParser class in the library module

Lib/xml/sax/xmlreader.py

This extension of the standard XMLReader class acts just like a SAX
parser, in that it delivers SAX2 events to your ContentHandler as it
processes the tokens from the source XML document.

But rather than the parser itself controlling when and how it gets its
input, you control that through the use of the .feed() method. So you
can "drip feed" the parser with input if you wish.

Not all XML parsers support an IncrementalParser interface. In order
for an XML parser to support incremental parsing, it must have been
coded specifically to do so. Fortunately, the expat wrapper supplied
with the base distribution of python does support incremental parsing.

Which I think should solve your problem quite nicely. When you start
up your process for the first time, feed() the IncrementalParser a
document element (all XML document must have one and only one document
element). Then simply feed the output of your logging stream directly
to the IncrementalParser, as and when you receive it.

You should not have any problems with XML tokens being split over two
different .feed() calls either. For example, this should work just
fine

ip = IncrementalParser()
ip.feed('<docu')
ip.feed('ment')
ip.feed('/>')

When your logging stream is closing, simply feed a close tag for your
document element to your IncrementalParser, and everything will clean
up nicely.

Here is some sample code:

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.handler import ContentHandler

logentry = """
<channel name='#sandbox'>
<message user='PeterScott'>Hello, my bot</message>
<message user='PeterScott'>This is a message</message>
<nickchange>
<oldnick>PeterScott</oldnick>
<newnick>PeterSc</newnick>
</nickchange>
</channel>
"""

incr_parser = xml.sax.make_parser('xml.sax.expatreader')
incr_parser.setContentHandler(ContentHandler())
incr_parser.feed('<mylogstream>')
incr_parser.feed(logentry)
incr_parser.feed('</mylogstream>')
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

regards,
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,163
Messages
2,570,898
Members
47,436
Latest member
MaxD

Latest Threads

Top