XML parsing: SAX/expat & yield

kj · Aug 4, 2010

I want to write code that parses a file that is far bigger than
the amount of memory I can count on. Therefore, I want to stay as
far away as possible from anything that produces a memory-resident
DOM tree.

The top-level structure of this xml is very simple: it's just a
very long list of "records". All the complexity of the data is at
the level of the individual records, but these records are tiny in
size (relative to the size of the entire file).

So the ideal would be a "parser-iterator", which parses just enough
of the file to "yield" (in the generator sense) the next record,
thereby returning control to the caller; the caller can process
the record, delete it from memory, and return control to the
parser-iterator; once parser-iterator regains control, it repeats
this sequence starting where it left off.

The problem, as I see it, is that SAX-type parsers like expat want
to do everything with callbacks, which is not readily compatible
with the generator paradigm I just described.

Is there a way to get an xml.parsers.expat parser (or any other
SAX-type parser) to stop at a particular point to yield a value?

The only approach I can think of is to have the appropriate parser
callbacks throw an exception wherever a yield would have been.
The exception-handling code would have the actual yield statement,
followed by code that restarts the parser where it left off.
Additional logic would be necessary to implement the piecemeal
reading of the input file into memory.

But I'm not very conversant with SAX parsers, and even less with
generators, so all this may be unnecessary, or way off.

Any other tricks/suggestions for turning a SAX parsers into a
generator, please let me know.

~K

Peter Otten · Aug 4, 2010

kj said:
I want to write code that parses a file that is far bigger than
the amount of memory I can count on. Therefore, I want to stay as
far away as possible from anything that produces a memory-resident
DOM tree.

The top-level structure of this xml is very simple: it's just a
very long list of "records". All the complexity of the data is at
the level of the individual records, but these records are tiny in
size (relative to the size of the entire file).

So the ideal would be a "parser-iterator", which parses just enough
of the file to "yield" (in the generator sense) the next record,
thereby returning control to the caller; the caller can process
the record, delete it from memory, and return control to the
parser-iterator; once parser-iterator regains control, it repeats
this sequence starting where it left off.

How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Peter

kj · Aug 4, 2010

In said:
How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Exactly!

Thanks!

~K

Parsing XML: SAX, DOM, Expat, or Something Else?	2	Jan 23, 2009
SAX unicode and ascii parsing problem	4	Nov 30, 2010
A question about yield	5	Nov 7, 2010
XML file parsing with SAX	3	Apr 23, 2005
sax EntityResolver problem (expat?)	1	Jun 10, 2004
ElementTree XML parsing problem	8	Apr 27, 2011
XML SAX parser bug?	4	Jan 19, 2006
XML: SAX and XInclude	2	Sep 10, 2007

XML parsing: SAX/expat & yield

kj

Peter Otten

kj

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads