lisp is winner in DOM parsing contest! 8-]

A

Alex Mizrahi

Hello, All!

i have 3mb long XML document with about 150000 lines (i think it has about
200000 elements there) which i want to parse to DOM to work with.
first i thought there will be no problems, but there were..
first i tried Python.. there's special interest group that wants to "make
Python become the premier language for XML processing" so i thought there
will be no problems with this document..
i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
it - i don't want such processing.. i think this is because of that fat
class impementation - possibly Python had some significant overhead for each
object instance, or something like this..

then i asdf-installed s-xml package and tried it with it. it ate only 25
megs for lxml representation. i think interning element names helped a lot..
it was CLISP that has unicode inside, so i think it could be even less
without unicode..

then i tried C++ - TinyXML. it was fast, but ate 65 megs.. ye, looks like
interning helps a lot 8-]

then i tried Perl XML::DOM.. it was better than python - about 180megs, but
it was slowest.. at least it consumed mem slower than python 8-]

and java.. with default parser it took 45mbs.. maybe it interned strings,
but there was overhead from classes - storing trees is definitely what's
lisp optimized for 8-]

so lisp is winner.. but it has not standard way (even no non-standard but
simple) way to write binary IEEE floating point representation, so common
lisp suck and i will use c++ for my task.. 8-]]]

With best regards, Alex 'killer_storm' Mizrahi.
 
H

Hans Nowak

Alex said:
i have 3mb long XML document with about 150000 lines (i think it has about
200000 elements there) which i want to parse to DOM to work with.
first i thought there will be no problems, but there were..
first i tried Python.. there's special interest group that wants to "make
Python become the premier language for XML processing" so i thought there
will be no problems with this document..
i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
it - i don't want such processing.. i think this is because of that fat
class impementation - possibly Python had some significant overhead for each
object instance, or something like this..

Have you tried ElementTree?

http://effbot.org/zone/element-index.htm

HTH,
 
A

Alex Mizrahi

(message (Hello 'Hans)
(you :wrote :eek:n '(Sun, 11 Jul 2004 21:32:11 -0400))
(

HN> Have you tried ElementTree?

no..
just tried it - it eats 133 megs and parses for quite a long time, however
it works..
i'll consider using it because processing xml in c++ appears to be pain in
ass..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
 
P

Peter Hansen

Alex said:
i have 3mb long XML document with about 150000 lines (i think it has about
200000 elements there) which i want to parse to DOM to work with.

Often, problems with performance come down the using the
wrong algorithm, or using the wrong architecture for the
problem at hand.

Are you absolutely certain that using a full in-memory DOM
representation is the best for your problem? It seems
very unlikely to me that it really is...

For example, there are approaches which can read in the
document incrementally (and I'm not just talking SAX here),
rather than read the whole thing at once.

I found your analysis fairly simplistic, on the whole...

-Peter
 
P

Paul Rubin

Peter Hansen said:
For example, there are approaches which can read in the
document incrementally (and I'm not just talking SAX here),
rather than read the whole thing at once.

Rather than either reading incrementally or else slurping in the
entire document in many-noded glory, I wonder if anyone's implemented
a parser that scans over the XML doc and makes a compact sequential
representation of the tree structure, and then provides access methods
that let you traverse the tree as if it were a real DOM, by fetching
the appropriate strings from the (probably mmap'ed) disk file as you
walk around in the tree.
 
J

John Lenton

Hello, All!

i have 3mb long XML document with about 150000 lines (i think it has about
200000 elements there) which i want to parse to DOM to work with.
first i thought there will be no problems, but there were..
first i tried Python.. there's special interest group that wants to "make
Python become the premier language for XML processing" so i thought there
will be no problems with this document..
i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
it - i don't want such processing.. i think this is because of that fat
class impementation - possibly Python had some significant overhead for each
object instance, or something like this..

in my experience xml.dom.minidom is a hog; there are several other DOM
and tree-building (i.e., DOMish) parsers out there, most of them
better from a performance point of view; my own personal favourite is
libxml2, but google for the issue and you'll even come across people
who have compared the different things.
 
T

Tim Bradshaw

Paul Rubin said:
Rather than either reading incrementally or else slurping in the
entire document in many-noded glory, I wonder if anyone's implemented
a parser that scans over the XML doc and makes a compact sequential
representation of the tree structure, and then provides access methods
that let you traverse the tree as if it were a real DOM, by fetching
the appropriate strings from the (probably mmap'ed) disk file as you
walk around in the tree.

I dunno if this has been done recently, but this is the sort of thing
that people used to do for very large SGML documents. I forget the
details, but I remember things that were some hundreds of MB (parsed
or unparsed I'm not sure) which would be written out in some parsed
form into large files, which could then be manipulated as if the whole
object was there. Of course no one would care about a few hundred MB
of memory now, but they did then (this was 91-92 I think).

I had a theory of doing this all lazily, so you wouldn't have to do
the (slow) parsing step up-front but would just lie and say `OK, I
parsed it', then actually doing the work only on demand.

--tim
 
A

Alex Mizrahi

(message (Hello 'Peter)
(you :wrote :eek:n '(Sun, 11 Jul 2004 22:15:50 -0400))
(

PH> Often, problems with performance come down the using the wrong
PH> algorithm, or using the wrong architecture for the problem at hand.

i see nothing wrong in loading 3 mb data into RAM. however, implementation
details made it 100 times larger and it was the problem..

PH> Are you absolutely certain that using a full in-memory DOM
PH> representation is the best for your problem? It seems very unlikely
PH> to me that it really is...

format i'm dealing with is quite chaotic and i'm going to work with it
interactively - track down myself where data i need lie and see how can i
extract data..
it's only a small part of task and it's needed only temporarily, so i don't
need best thing possible - i need something that just works..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
 
A

Alex Mizrahi

(message (Hello 'Paul)
(you :wrote :eek:n '(11 Jul 2004 20:13:42 -0700))
(

PR> Rather than either reading incrementally or else slurping in the
PR> entire document in many-noded glory, I wonder if anyone's
PR> implemented a parser that scans over the XML doc and makes a compact
PR> sequential representation of the tree structure, and then provides
PR> access methods that let you traverse the tree as if it were a real
PR> DOM, by fetching the appropriate strings from the (probably mmap'ed)
PR> disk file as you walk around in the tree.

that would be nice.. i remember i've did something like this for one binary
chunky format - thingie avoided allocating new memory as long as possible..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
 
R

Richie Hindle

[Paul]
Rather than either reading incrementally or else slurping in the
entire document in many-noded glory, I wonder if anyone's implemented
a parser that scans over the XML doc and makes a compact sequential
representation of the tree structure, and then provides access methods
that let you traverse the tree as if it were a real DOM, by fetching
the appropriate strings from the (probably mmap'ed) disk file as you
walk around in the tree.

It's not exactly what you describe here, but xml.dom.pulldom is roughly
this. You access the higher-level nodes by SAX, thus not using massive
amounts of memory, but you can access the children of those higher-level
elements using DOM. The canonical example is processing a large number of
XML records - the XML document is arbitrarily large but the individual
records aren't. Pulldom passes each record to you SAX-style, and you use
DOM to process the record.

Uche Ogbuji has a short article on xml.dom.pulldom here:
http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html

Here's the example from that article. Line 16 is the key to it - that's the
point at which you switch from SAX to DOM:

1 #Get the first line in Act IV, scene II
2
3 from xml.dom import pulldom
4
5 hamlet_file = open("hamlet.xml")
6
7 events = pulldom.parse(hamlet_file)
8 act_counter = 0
9 for (event, node) in events:
10 if event == pulldom.START_ELEMENT:
11 if node.tagName == "ACT":
12 act_counter += 1
13 scene_counter = 1
14 if node.tagName == "SCENE":
15 if act_counter == 4 and scene_counter == 2:
16 events.expandNode(node)
17 #Traditional DOM processing starts here
18 #Get all descendant elements named "LINE"
19 line_nodes = node.getElementsByTagName("LINE")
20 #Print the text data of the text node
21 #of the first LINE element
22 print line_nodes[0].firstChild.data
23 scene_counter += 1
 
R

R. Mattes

Hello, All!

i have 3mb long XML document with about 150000 lines (i think it has
about 200000 elements there) which i want to parse to DOM to work with.
first i thought there will be no problems, but there were.. first i
tried Python.. there's special interest group that wants to "make Python
become the premier language for XML processing" so i thought there will
be no problems with this document.. i used xml.dom.minidom to parse it..
after it ate 400 meg of RAM i killed it - i don't want such processing..
i think this is because of that fat class impementation - possibly
Python had some significant overhead for each object instance, or
something like this..

First of all: which parser did you actually use? There are quite a number
of XML parsers for python. I personally use the libxml2 one and never had
memory proplems like you describe.
then i asdf-installed s-xml package and tried it with it. it ate only 25
megs for lxml representation. i think interning element names helped a
lot.. it was CLISP that has unicode inside, so i think it could be even
less without unicode..

Hmmm. Hmmm ... i guess you know that you compare apples with pears?
S-XML is a nice, small parser but nowhere near a standard conformant
XML parser. Have a closer look at the webpage: no handling of character
encoding, no CDATA, can't handle what the author calls "special tags"
(like processing instruction), no schema/DTD support, and, most
important, no namespace support!
then i tried C++ - TinyXML. it was fast, but ate 65 megs.. ye, looks
like interning helps a lot 8-]

Interning is _much_ easier without namespaces.
then i tried Perl XML::DOM.. it was better than python - about 180megs,
but it was slowest.. at least it consumed mem slower than python 8-]

and java.. with default parser it took 45mbs.. maybe it interned
strings, but there was overhead from classes - storing trees is
definitely what's lisp optimized for 8-]

But you never got to a _full_ DOM with you lxml parsing. What you got was
a list-of-lists. There's no 'parent' implementation for your lxml
elements (which means that you might need to path the whole thing
arround all the time).

If you want a serious comparison you either need to compare s-xml with
similar "lightweight" parsers in Perl/Python/Ruby etc. or write your own
fully DOM compliant parser in LISP (or is there one allready? I'm still
looking for a good one).

Just my 0.02 $

Ralf Mattes

so lisp is winner.. but it has not standard way (even no non-standard
but simple) way to write binary IEEE floating point representation, so
common lisp suck and i will use c++ for my task.. 8-]]]

With best regards, Alex 'killer_storm' Mizrahi.
 
F

Frank Buss

Alex Mizrahi said:
so lisp is winner.. but it has not standard way (even no non-standard
but simple) way to write binary IEEE floating point representation, so
common lisp suck and i will use c++ for my task.. 8-]]]

This is a silly reason. You can integrate a C program for this, if you
like, but Common Lisp has the function INTEGER-DECODE-FLOAT, which you can
use for this, see for example this listing:

http://www.ai.mit.edu/people/cvince/date/src/code/java/serialization.lisp

But you need not to use this little complicated function, because CLISP has
the non-standard function EXT:WRITE-FLOAT, which "writes a floating-point
number in IEEE 754 binary representation":

http://clisp.cons.org/impnotes/stream-dict.html#bin-output

I assume other Lisp implementations have something similiar.
 
P

Peter Hansen

Alex said:
(message (Hello 'Peter)

PH> Often, problems with performance come down the using the wrong
PH> algorithm, or using the wrong architecture for the problem at hand.

i see nothing wrong in loading 3 mb data into RAM. however, implementation
details made it 100 times larger and it was the problem..

What is problematic about that for you?
PH> Are you absolutely certain that using a full in-memory DOM
PH> representation is the best for your problem? It seems very unlikely
PH> to me that it really is...

format i'm dealing with is quite chaotic and i'm going to work with it
interactively - track down myself where data i need lie and see how can i
extract data..

You didn't mention this before. If you're doing it interactively,
which I assume means with you actually typing lines of code that
will be executed in real-time, as you hit ENTER, then why the heck
are you concerned about the RAM footprint (i.e. there's nothing wrong
with loading 100MB of data into RAM for such a case either) or
even performance (since clearly you are going to be spending many
times more time working with the data than it takes to parse it,
even with some of the slower methods)?

Like I said, pick the right architecture for your problem domain...

-Peter
 
P

Peter Hansen

R. Mattes said:
But you never got to a _full_ DOM with you lxml parsing. What you got was
a list-of-lists. There's no 'parent' implementation for your lxml
elements (which means that you might need to path the whole thing
arround all the time).

If you want a serious comparison you either need to compare s-xml with
similar "lightweight" parsers in Perl/Python/Ruby etc.

If that's what he wants, it would be called PyRXP in the Python world,
a Python wrapper around the RXP library, available from
http://www.reportlab.com/ . Has a much smaller footprint than the
DOM representations, as you would expect, and lightning fast.

-Peter
 
A

Alex Mizrahi

(message (Hello 'Peter)
(you :wrote :eek:n '(Mon, 12 Jul 2004 09:17:58 -0400))
(

PH> What is problematic about that for you?

i have only 500 mb of RAM. i've killed parser when it ate 400 MBs.. i don't
know for sure how much it needs but it's too much anyway..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
 
C

Cameron Laird

Rather than either reading incrementally or else slurping in the
entire document in many-noded glory, I wonder if anyone's implemented
a parser that scans over the XML doc and makes a compact sequential
representation of the tree structure, and then provides access methods
that let you traverse the tree as if it were a real DOM, by fetching
the appropriate strings from the (probably mmap'ed) disk file as you
walk around in the tree.

While I don't yet follow all the places this thread has gone,
tDOM <URL: http://wiki.tcl.tk/tdom > is where I turn when I
want *fast* DOMish handling. Although its author favors Tcl,
there's no particular reason not to use it with Python.
 
C

Cameron Laird

.
.
.
so lisp is winner.. but it has not standard way (even no non-standard but
simple) way to write binary IEEE floating point representation, so common
lisp suck and i will use c++ for my task.. 8-]]]
.
.
.
I'm having trouble following this thread. In isolation, though,
this claim *can't* be true. Is there a particular Lisp implemen-
tation you have in mind?
 
P

Paul Rubin

[email protected] (Tim Bradshaw) said:
I had a theory of doing this all lazily, so you wouldn't have to do
the (slow) parsing step up-front but would just lie and say `OK, I
parsed it', then actually doing the work only on demand.

1) I think you do need to do the parsing up front, since you have to
provide basically random access to the contents of the tree.

2) The parsing needn't be slow. It should just make a very fast
linear scan over the document, emitting an in-memory record every time
it sees a tag that makes a new tree node. It would remember on a
stack where it is in the tree structure as it scans, and whenever it
found a tag that closes off a subtree (e.g. </table>), it could emit
pointers to where that subtree's parent and previously seen sibling
are, and update the parent to know where the last-seen child is.
Finally, it could make another pass over the in-memory structure to
vectorize the list of children at each node. This could be done with
even less memory by making two passes over the document (one to count
the children for each node and one to build the tree with a vector of
children at each node, allocated to the now-known size), at the
expense of some speed. The point is, not much processing needs to be
done during the scan. It should certainly be possible to parse a 3 MB
document in under a second (I assume we're talking about a C
extension). It just baffles me why the libraries that are out there
are so much slower.
 
J

John J. Lee

While I don't yet follow all the places this thread has gone,
tDOM <URL: http://wiki.tcl.tk/tdom > is where I turn when I
want *fast* DOMish handling. Although its author favors Tcl,
there's no particular reason not to use it with Python.

How would one use it with Python?


John
 
U

Uche Ogbuji

Alex Mizrahi said:
Hello, All!

i have 3mb long XML document with about 150000 lines (i think it has about
200000 elements there) which i want to parse to DOM to work with.
first i thought there will be no problems, but there were..
first i tried Python.. there's special interest group that wants to "make
Python become the premier language for XML processing" so i thought there
will be no problems with this document..
i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
it - i don't want such processing.. i think this is because of that fat
class impementation - possibly Python had some significant overhead for each
object instance, or something like this..

minidom has about a 60X - 80X load factor on average (comparing XML
file size to memory working set). You're claiming you saw a 130X load
factor. That sounds odd Are there special characteristics of your
document you're not mentioning?

cDomlette, part of 4Suite, only has a 10X load factor, so I'd guess
your example would end up with a 30MB memory working set. cDomlette
does use string interning, as one example of optimization techniques.
4Suite also provides you XSLT, XPath, RELAX NG and some other
processing goodies.

See:

http://4suite.org/
http://www.xml.com/pub/a/2002/10/16/py-xml.html
http://uche.ogbuji.net/akara/nodes/2003-01-01/domlettes?xslt=/akara/akara.xslt

As you can see, cDomlette is as DOM-like as minidom, so very easy to
use (for the record neither is a compliant W3C DOM implementation).

Also, in my next XML.com article I cover a technique that uses SAX to
break large XML documents into series of small DOMs, one after the
other, so that the memory penalty is *very* low, depending on your
document structure. It works with any DOM implementation that meets
the Python DOM binding, including minidom and cDomlette.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top