XML parsing: SAX/expat & yield


kj

I want to write code that parses a file that is far bigger than
the amount of memory I can count on. Therefore, I want to stay as
far away as possible from anything that produces a memory-resident
DOM tree.

The top-level structure of this XML file is very simple: it's just a
very long list of "records". All the complexity of the data is at
the level of the individual records, but these records are tiny in
size (relative to the size of the entire file).

So the ideal would be a "parser-iterator", which parses just enough
of the file to "yield" (in the generator sense) the next record,
thereby returning control to the caller; the caller can process
the record, delete it from memory, and return control to the
parser-iterator; once parser-iterator regains control, it repeats
this sequence starting where it left off.
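
In other words, the usage I'm picturing is something like this
(parse_records is a made-up name for the generator I'm after):

    for record in parse_records("huge.xml"):
        process(record)   # only one record held in memory at a time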

The problem, as I see it, is that SAX-type parsers like expat want
to do everything with callbacks, which is not readily compatible
with the generator paradigm I just described.
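
For example, with xml.parsers.expat you register handlers up front
and the parser then drives them; there's no obvious place to put a
yield:

    import xml.parsers.expat

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = lambda name, attrs: print("start", name)
    p.EndElementHandler = lambda name: print("end", name)
    # Parse() runs to completion (or error), invoking the handlers
    # as it goes; control never returns to me between records.
    p.Parse("<records><record/><record/></records>", True)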

Is there a way to get an xml.parsers.expat parser (or any other
SAX-type parser) to stop at a particular point to yield a value?

The only approach I can think of is to have the appropriate parser
callbacks throw an exception wherever a yield would have been.
The exception-handling code would have the actual yield statement,
followed by code that restarts the parser where it left off.
Additional logic would be necessary to implement the piecemeal
reading of the input file into memory.
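
Or maybe the exception business is avoidable: the callbacks could
append each finished record to a list, and the generator could drain
that list between chunk-sized Parse() calls. A rough, untested sketch
of what I mean (the function name, the "record" tag, and the flat
field-per-child record layout are all guesses at my data's shape):

    import xml.parsers.expat

    def iter_records(fileobj, record_tag="record", chunk_size=64 * 1024):
        pending = []    # records finished since the last yield
        current = None  # fields of the record being built
        field = None    # child element whose text we're inside
        parts = []      # character-data fragments (expat may split text)

        def start(name, attrs):
            nonlocal current, field, parts
            if name == record_tag:
                current = {}
            elif current is not None:
                field, parts = name, []

        def chars(data):
            if field is not None:
                parts.append(data)

        def end(name):
            nonlocal current, field
            if name == record_tag:
                pending.append(current)
                current = None
            elif name == field:
                current[field] = "".join(parts)
                field = None

        p = xml.parsers.expat.ParserCreate()
        p.StartElementHandler = start
        p.EndElementHandler = end
        p.CharacterDataHandler = chars

        while True:
            chunk = fileobj.read(chunk_size)
            p.Parse(chunk, not chunk)   # final=True at EOF
            for record in pending:
                yield record            # caller gets control here
            del pending[:]
            if not chunk:
                break

The file would be opened in binary mode and passed in as fileobj.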

But I'm not very conversant with SAX parsers, and even less with
generators, so all this may be unnecessary, or way off.

If you have any other tricks/suggestions for turning a SAX parser
into a generator, please let me know.

~K
 

Peter Otten

kj said:
I want to write code that parses a file that is far bigger than
the amount of memory I can count on. Therefore, I want to stay as
far away as possible from anything that produces a memory-resident
DOM tree.

The top-level structure of this XML file is very simple: it's just a
very long list of "records". All the complexity of the data is at
the level of the individual records, but these records are tiny in
size (relative to the size of the entire file).

So the ideal would be a "parser-iterator", which parses just enough
of the file to "yield" (in the generator sense) the next record,
thereby returning control to the caller; the caller can process
the record, delete it from memory, and return control to the
parser-iterator; once parser-iterator regains control, it repeats
this sequence starting where it left off.

How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing
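
Roughly (untested; I'm guessing the records are <record> elements):

    from xml.etree.ElementTree import iterparse

    def iter_records(path, tag="record"):
        context = iterparse(path, events=("start", "end"))
        event, root = next(context)   # first event: start of the root
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                yield elem            # caller processes the record...
                root.clear()          # ...then we drop it before parsing on

iterparse builds the tree incrementally, so clearing the root after
each record is what keeps memory bounded.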

Peter
 
