Writing big XML files where beginning depends on end.

M

Magnus Lycka

We're using DOM to create XML files that describes fairly
complex calculations. The XML is structured as a big tree,
where elements in the beginning have values that depend on
other values further down in the tree. Imagine something
like below, but much bigger and much more complex:

<node sum="15">
<node sum="10">
<leaf>7</leaf'>
<node sum="3">
<leaf>2</leaf'>
<leaf>1</leaf>
</node>
</node>
<node sum="5">
<leaf>5</leaf>
</node>
</node>

We have to stick with this XML structure for now.

In some cases, building up a DOM tree in memory takes up
several GB of RAM, which is a real showstopper. The actual
file is maybe a magnitute smaller than the DOM tree. The
app is using libxml2. It's actually written in C++. Some
library that used much less memory overhead could be
sufficient.

We've thought of writing a file that looks like this...

<node sum="#1">
<node sum="#1.1">
<leaf>7</leaf'>
<node sum="#1.1.1">
<leaf>2</leaf'>
<leaf>1</leaf>
</node>
</node>
<node sum="#1.2">
<leaf>5</leaf>
</node>
</node>

....and store {"#": "15", "#1.1", "10" ... } in a map
and then read in a piece at a time and performs some
simple change and replace to get the correct values in.
Then we need something that allows parts of the XML file
to be written to file and purged from RAM to avoid the
memory problem.

Suggestions for solutions are appreciated.
 
N

Neil Benn

Magnus said:
<snip>

In some cases, building up a DOM tree in memory takes up
several GB of RAM, which is a real showstopper. The actual
file is maybe a magnitute smaller than the DOM tree. The
app is using libxml2. It's actually written in C++. Some
library that used much less memory overhead could be
sufficient.

<snip>
Hello,

Regardless of the wisdom of having an XML file that big or
sttructred in that way (you stated that you are forced to use this), if
you need to run through such a large amount of data then use SAX rather
than DOM, this runs on a stream based implementation so it can
comfortbaly scael up to large amnounts of data such as yours. To keep
hold of the state between the start you'll have to do that manually
(storing only the state stuff that matters) - to be honest I'd take a
step back and look at using a different data representation.

Cheers,

Neil

--

Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 46
D-01307
Dresden
Germany

Tel : +49 (0)351 4173 154
e-mail : (e-mail address removed)
Cenix Website : http://www.cenix-bioscience.com
 
F

Fredrik Lundh

Magnus said:
We're using DOM to create XML files that describes fairly
complex calculations. The XML is structured as a big tree,
where elements in the beginning have values that depend on
other values further down in the tree. Imagine something
like below, but much bigger and much more complex:

<node sum="15">
<node sum="10">
<leaf>7</leaf'>
<node sum="3">
<leaf>2</leaf'>
<leaf>1</leaf>
</node>
</node>
<node sum="5">
<leaf>5</leaf>
</node>
</node>

what's the typical overall structure for this tree ? is it short and wide, or tall and
narrow ?

or in other words, if you would extract all /node/node elements (that is, node
elements just under the document root) from a large file, would you typically get
a long list with small nodes, or a short list with a few large nodes ?

</F>
 
B

Ben Sizer

output = []

def write_tree(node):
output.append("</node>")
sum = 0
for child in reversed(node.children):
if child.type = leaf:
output.extend(["</leaf>", "%d" % node.value, "<leaf>"])
sum += node.value
else:
sum += write_tree(node)
output.append("<node sum=%d>" % sum)
return sum

write_tree(rootNode)
output.reverse()
print output

:)
(There is probably a cleaner way to do this than the quick hack above.)
 
M

Magnus Lycka

Fredrik said:
what's the typical overall structure for this tree ? is it short and wide, or tall and
narrow ?

That varies quite a bit. It's basically a call graph for a
functional language, where also aggregated results are stored
in the XML tree. In other words, the shape depends on the
program we're analyzing. With a loop in the top level, we
could get plenty of nodes in the root, but we need to handle
different cases. Typically, the language in question isn't
used to build very complex things, so this hasn't really
been an issue before...

I tried to look at a 100MB xml file in a browser, to see the
shape, but Mozilla died after growing to several GB in size...
 
M

Magnus Lycka

Ben said:
output = []

def write_tree(node):
output.append("</node>")
sum = 0
for child in reversed(node.children):
if child.type = leaf:
output.extend(["</leaf>", "%d" % node.value, "<leaf>"])
sum += node.value
else:
sum += write_tree(node)
output.append("<node sum=%d>" % sum)
return sum

write_tree(rootNode)
output.reverse()
print output

This won't help if we have problems keeping the whole
structure / call graph in memory at one time. Except
for the aggregated results that appear in the beginning
of each element, the xml structure basically reflects
what's happening over time, so it seems that it would
be easy to write the xml to file a chunk at a time if
we could just fill in those values in the already
written part of the file conveniently.
 
B

Ben Sizer

Magnus said:
This won't help if we have problems keeping the whole
structure / call graph in memory at one time.

Ah, I had assumed that the main problem was just that the resulting DOM
was too big. Still, I don't think there would be a problem if you just
constructed a very small tree and then wrote out XML based upon that,
instead of using an external library for this. For example, if your
only elements are node and leaf, you can store that distinction in 1
bit, as opposed to a library's naive implementation using the strings
"node" and "leaf", probably 40 bits in each case. Similarly you can
calculate the aggregate attributes on the fly when it comes to
outputting the tree instead of storing them.
 
G

Gerard Flanagan

Magnus said:
We're using DOM to create XML files that describes fairly
complex calculations. The XML is structured as a big tree,
where elements in the beginning have values that depend on
other values further down in the tree. Imagine something
like below, but much bigger and much more complex:

<node sum="15">
<node sum="10">
<leaf>7</leaf'>
<node sum="3">
<leaf>2</leaf'>
<leaf>1</leaf>
</node>
</node>
<node sum="5">
<leaf>5</leaf>
</node>
</node>

We have to stick with this XML structure for now.
We've thought of writing a file that looks like this...

<node sum="#1">
<node sum="#1.1">
<leaf>7</leaf'>
<node sum="#1.1.1">
<leaf>2</leaf'>
<leaf>1</leaf>
</node>
</node>
<node sum="#1.2">
<leaf>5</leaf>
</node>
</node>

...and store {"#": "15", "#1.1", "10" ... } in a map
and then read in a piece at a time and performs some
simple change and replace to get the correct values in.
Then we need something that allows parts of the XML file
to be written to file and purged from RAM to avoid the
memory problem.

Suggestions for solutions are appreciated.

Magnus

what about multiple xml files? Then any given node could have a value
"2.3.xml" which is loaded on demand.
Or, IANA Computer Scientist, but isn't every tree a list? Then maybe
flatten/unflatten nodes to a string representation

eg. { "#1" : 15, "#1.1" : "[ 10 [7] [3 [2][1] ] ]", "#1.2" : "[ [5] ]"
}

Hope this isn't too naive.

Gerard
 
M

Magnus Lycka

Gerard said:
what about multiple xml files?

We have deployed code that relies of the XML files
looking the way they do, and we'd prefer not to
change that.
 
L

Laurent Pointal

Magnus said:
We're using DOM to create XML files that describes fairly
complex calculations. The XML is structured as a big tree,
where elements in the beginning have values that depend on
other values further down in the tree. Imagine something
like below, but much bigger and much more complex:

<node sum="15">
<node sum="10">
<leaf>7</leaf'>
<node sum="3">
<leaf>2</leaf'>
<leaf>1</leaf>
</node>
</node>
<node sum="5">
<leaf>5</leaf>
</node>
</node>

We have to stick with this XML structure for now.

In some cases, building up a DOM tree in memory takes up
several GB of RAM, which is a real showstopper. The actual
file is maybe a magnitute smaller than the DOM tree. The
app is using libxml2. It's actually written in C++. Some
library that used much less memory overhead could be
sufficient.

We've thought of writing a file that looks like this...

<node sum="#1">
<node sum="#1.1">
<leaf>7</leaf'>
<node sum="#1.1.1">
<leaf>2</leaf'>
<leaf>1</leaf>
</node>
</node>
<node sum="#1.2">
<leaf>5</leaf>
</node>
</node>

...and store {"#": "15", "#1.1", "10" ... } in a map
and then read in a piece at a time and performs some
simple change and replace to get the correct values in.
Then we need something that allows parts of the XML file
to be written to file and purged from RAM to avoid the
memory problem.

Suggestions for solutions are appreciated.

An idea.

Put spaces in your names <node sum="#1 ">
Store { "#1": <its offset in the result file> }.

When you have collected all your final data, go throught your stored #,
seek at their position in the file, and replace the data (writting same
amount of chars than reserved).

A kind of direct access to nodes in an XML document file.

A+

Laurent.
 
U

uche.ogbuji

"""
Then we need something that allows parts of the XML file
to be written to file and purged from RAM to avoid the
memory problem.

Suggestions for solutions are appreciated.
"""

Multiple XML files is not an option, but what about general entities or
XInclude? That way you don't need to change your parsing code.

Using 4Suite's MarkupWriter [1] you could write the outer shell and
inner subtrees to separate streams, only filling in values for the
outer stream when the inner stream is complete, and your computations
are ready. You can then use the writer.xmlFragment method to stitch
the inner subtrees to the outer shell. MarkupWriter operates in
streaming mode, so you would not be holding much XML in memory at all.

http://www.xml.com/pub/a/2005/04/20/py-xml.html
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top