Highly optimized business-rule validation of very large XML documents possible?

Mike

Related to another topic I just posted, I wanted to discuss ways to optimize
the validation of very large (>100MB) XML documents.

First, I have no idea if something like this already exists; it may even be
the typical implementation for all I know.

At any rate, it occurs to me that the set of business rules to be validated
against an XML document involves only a limited set of nodes at any given
time (while parsing through the document). For example, if there is a
parent<->child node dependency, then only the pertinent information related
to those nodes needs to be kept in memory. Once the dependency has been
resolved (by validating the rule), the memory associated with those nodes
could be freed. In this way, large documents could be validated
efficiently, by storing only the information related to unresolved
dependencies and freeing it as soon as each dependency is resolved.
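
To make that concrete, here is a rough Java/SAX sketch of what I have in
mind (the rule itself is made up: every <order> element must contain at
least one <item> child). The state exists only while the <order> is open,
and is discarded as soon as the closing tag resolves the rule:

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Made-up rule: every <order> must contain at least one <item> child.
    // State lives only between startElement("order") and endElement("order").
    // (Nested <order> elements would need a stack of flags instead.)
    public class OrderRuleHandler extends DefaultHandler {
        private boolean inOrder = false;  // currently inside an <order>?
        private boolean sawItem = false;  // has this <order> produced an <item>?

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if ("order".equals(qName)) {
                inOrder = true;
                sawItem = false;          // fresh state for this dependency
            } else if (inOrder && "item".equals(qName)) {
                sawItem = true;
            }
        }

        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if ("order".equals(qName)) {
                if (!sawItem) {
                    throw new SAXException("rule violated: <order> without <item>");
                }
                inOrder = false;          // dependency resolved; state freed
            }
        }
    }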

I don't have a lot of practical XML experience. But I've read, for example,
that using a SAX parser can be difficult in cases where you need to maintain
a lot of "state" information. So, what I'm asking is whether there is a
general solution to this problem, rather than having application-specific
code to handle the "state" of dependencies?

It seems to me that rule dependencies could be represented by a "graph",
similar in some ways to the object graph the Java garbage collector traces.
And, like the garbage collector, memory could be freed once there are no
more "references" to a particular dependency. The dependencies themselves
would be something like "threads" connecting nodes; larger threads would
require more memory. Further optimization might be achieved by determining
whether the dependency threads are better suited to a depth-first or
breadth-first traversal, or some combination.
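
Roughly, I picture something like the following (all the names are
invented; it is only meant to show the reference-counting idea):

    import java.util.HashMap;
    import java.util.Map;

    // Invented sketch: a pending rule is freed as soon as the last node it
    // depends on has been seen, like an object with no remaining references.
    class PendingRule {
        final String name;
        int unresolvedNodes;              // "references" still outstanding

        PendingRule(String name, int unresolvedNodes) {
            this.name = name;
            this.unresolvedNodes = unresolvedNodes;
        }
    }

    class DependencyTracker {
        private final Map<String, PendingRule> pending =
                new HashMap<String, PendingRule>();

        void register(PendingRule rule) {
            pending.put(rule.name, rule);
        }

        // Called from the parser whenever a node some rule depends on is seen.
        void nodeSeen(String ruleName) {
            PendingRule rule = pending.get(ruleName);
            if (rule != null && --rule.unresolvedNodes == 0) {
                evaluate(rule);
                pending.remove(ruleName); // no references left: free the memory
            }
        }

        private void evaluate(PendingRule rule) {
            // ... run the actual business-rule check here ...
        }
    }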

In my other post, I ask about whether XML Schema can be used for validation
of rules like this, or if there are other solutions. In the context of this
post, does XML Schema or any other method support any of the concepts I talk
about above?

If XML Schema is unable to handle rules like this, and there is no other
available solution, does it make sense that something based on XPath might
work? I'm wondering if the XPath expressions could be used to represent the
dependencies (as in what to keep in memory), and then something else would
actually evaluate the dependency.
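
For instance, something like the following (using the standard Java XPath
API; the rule is again made up) can express such a dependency as an
assertion, though as far as I can tell it evaluates against an in-memory
DOM tree rather than a stream, which is exactly what I would like to avoid:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class XPathAssertion {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse("big.xml");   // builds the whole tree in memory

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Made-up dependency: flag any <order> that lacks an <item>.
            Boolean violated = (Boolean) xpath.evaluate(
                    "count(//order[not(item)]) > 0", doc,
                    XPathConstants.BOOLEAN);

            System.out.println(violated ? "rule violated" : "rule holds");
        }
    }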

Thanks for any help/suggestions/comments,

Mike
 
Patrick TJ McPhee

% I don't have a lot of practical XML experience. But I've read, for example,
% that using a SAX parser can be difficult in cases where you need to maintain
% a lot of "state" information. So, what I'm asking is whether there is a

It's not difficult, it's just that you have to maintain the state information.
The parser won't do it for you.

% general solution to this problem, rather than having application-specific
% code to handle the "state" of dependencies?

There are SAX parsers which validate against DTDs. They presumably have
general ways of doing this without saving unneeded nodes. If you find
a SAX parser which validates against some other schema mechanism, then
they probably do it reasonably memory-efficiently. I don't think there's
anything in any of the common schema languages which precludes this.

% If XML Schema is unable to handle rules like this, and there is no other
% available solution, does it make sense that something based on XPath might
% work?

XPath more-or-less implies that you build a tree of some sort in memory,
so it works against what you want.
 
Alain Frisch

Patrick TJ McPhee wrote in message (comp.text.xml:58709):
> There are SAX parsers which validate against DTDs. They presumably have
> general ways of doing this without saving unneeded nodes. If you find
> a SAX parser which validates against some other schema mechanism, then
> they probably do it reasonably memory-efficiently. I don't think there's
> anything in any of the common schema languages which precludes this.

DTD is particularly easy to validate on a SAX stream, because of the
determinism condition on content-model regular expressions (and the
ID/IDREF constraints are completely orthogonal, and also easy to check).
Concerning XML Schema, it is slightly more complex, but still, AFAICT, you
can implement validation without using complex tree automata (because the
regular expressions are actually over tag names, not types). I guess the
memory complexity is linear in the depth of the document.
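
A rough sketch of why (the automaton type here is schematic): a streaming
validator needs only one content-model state per currently open element,
so its storage is a stack whose size follows the nesting level:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Schematic: one automaton state per open element is all a streaming
    // validator needs, so memory is linear in the document depth.
    class ContentModelState {
        boolean accepts(String childTag) { return true; }            // stub
        ContentModelState advance(String childTag) { return this; }  // stub
        boolean isComplete() { return true; }                        // stub
    }

    class StreamingValidator {
        private final Deque<ContentModelState> open =
                new ArrayDeque<ContentModelState>();

        void startElement(String tag) {
            if (!open.isEmpty()) {
                // advance the parent's content model by one child
                ContentModelState parent = open.pop();
                if (!parent.accepts(tag)) {
                    throw new IllegalStateException("unexpected <" + tag + ">");
                }
                open.push(parent.advance(tag));
            }
            open.push(newStateFor(tag)); // start checking this element's children
        }

        void endElement(String tag) {
            if (!open.pop().isComplete()) {
                throw new IllegalStateException("incomplete content in <" + tag + ">");
            }
        }

        private ContentModelState newStateFor(String tag) {
            return new ContentModelState(); // look up the DFA for this element
        }
    }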

> XPath more-or-less implies that you build a tree of some sort in memory,
> so it works against what you want.

What are your arguments? Is this only an intuition? A large subset of
XPath can be rewritten to avoid backward axes, and can be evaluated with a
top-down, left-to-right strategy compatible with stream evaluation.
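
For instance (the expression is invented), a reverse-axis test such as
title[preceding-sibling::author] can be answered in a single forward pass
by remembering, for each open element, whether an <author> child has
already gone by:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Streaming equivalent of the reverse-axis test
    // title[preceding-sibling::author]: one flag per open element records
    // whether an <author> has already been seen among its children.
    class ForwardOnlyMatcher {
        private final Deque<boolean[]> sawAuthor = new ArrayDeque<boolean[]>();

        void startElement(String tag) {
            if (!sawAuthor.isEmpty()) {
                boolean[] parent = sawAuthor.peek();
                if ("title".equals(tag) && parent[0]) {
                    System.out.println("match: <title> after an <author> sibling");
                }
                if ("author".equals(tag)) {
                    parent[0] = true;
                }
            }
            sawAuthor.push(new boolean[] { false }); // state for this element's children
        }

        void endElement(String tag) {
            sawAuthor.pop(); // the element's state can be freed immediately
        }
    }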
 
Patrick TJ McPhee

% > XPath more-or-less implies that you build a tree of some sort in memory,
% > so it works against what you want.
%
% What are your arguments? Is this only an intuition?

Granted `more-or-less implies' sounds a bit like I'm stating a thesis, but
I'm not categorically saying that it has to be that way. I would not be
at all surprised to find that one could write an XPath implementation which
works on a stream. Presumably, you'd want to get all the expressions to
evaluate up front, so that you don't have to run over the stream once
for each query, but that's just housekeeping.

My statement is based a little on XPath's theoretical use of a tree as
its document representation, and more significantly on my ignorance of
any XPath implementation which works on anything but a DOM tree. Some
of the databases put the tree (of some sort) on disk rather than holding
it in RAM, but I don't know of an implementation that works against a
stream. The OP did not sound like someone who was interested in writing
an XPath implementation to solve his or her problems, so the available
implementations are an issue.
 
Bob Foster

Nobody could answer your question without knowing what kind of "business
rules" you want to check.

But if the answer is you want to be able to check an arbitrary set of
XPath-based assertions, then there is only one well-known schema language
that can handle it - Schematron - and AFAIK all Schematron implementations
keep the entire document tree in memory (for starters). This doesn't
necessarily mean you couldn't handle a 100MB document, but that you should
expect to use a lot of memory to do it.

If you determine Schematron would meet your needs, I'd recommend trying some
test cases with, at least, an XSL-based implementation using Saxon as the
XSL processor. Saxon works very hard (in its tiny tree mode) to keep a
compact representation of the document. The other consideration is execution
time, which might also be considerable.

To answer your question from a theoretical perspective, several authors
have claimed that all XPath 1.0 expressions can be evaluated in a single
top-down, left-to-right pass through a document. A Google search will turn
them up. If this is true, and if the context of every XPath expression in a
schema can be determined statically, it should be possible to do your
"business rule" checking in a single pass. Temporary storage would still be
required to validate assertions that call for global or regional knowledge
of the document, like IDs or identity constraints, but except for
pathological cases that should be relatively small compared to the entire
document. More importantly (memory is cheap), it should be fast.
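
To illustrate the kind of temporary storage I mean (a sketch, assuming
plain id/idref attributes), a one-pass ID/IDREF check only has to carry
two sets of strings for the whole document:

    import java.util.HashSet;
    import java.util.Set;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // One-pass ID/IDREF check: the only "global" memory is two string
    // sets, tiny compared with the document itself.
    public class IdRefChecker extends DefaultHandler {
        private final Set<String> ids = new HashSet<String>();
        private final Set<String> idrefs = new HashSet<String>();

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            String id = atts.getValue("id");       // assumes attributes are
            String idref = atts.getValue("idref"); // literally named id/idref
            if (id != null && !ids.add(id)) {
                throw new SAXException("duplicate ID: " + id);
            }
            if (idref != null) {
                idrefs.add(idref);
            }
        }

        public void endDocument() throws SAXException {
            idrefs.removeAll(ids);                 // whatever remains is dangling
            if (!idrefs.isEmpty()) {
                throw new SAXException("unresolved IDREFs: " + idrefs);
            }
        }
    }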

AFAIK there aren't any of those bad puppies (streaming XPath
implementations) out there, but the expert on this subject is Rick
Jelliffe, whom you can reach through
http://www.ascc.net/xml/resource/schematron/schematron.html.

Bob Foster
 
