Parsing multiple XML trees?

David Svoboda · Dec 15, 2005

I have a server program that takes commands and acts on them. The
server program can also take these commands from an input file or
standard input (mainly for testing purposes). As such, I often have
files full of input commands to feed to the server.

Right now the commands that the server takes are well-defined, but not
in XML. Since the commands are not self-delimiting, I have to prefix
each command with a 'length' number indicating how many characters the
command takes.

I would like to change the server to accept XML commands, and provide
a DTD (or Schema or RelaxNG or ...) to ensure that the server only
receives valid commands.

My question is this: Can I take the length number out of my input
files and network commands? Since XML is self-delimiting (tags must
balance), this should be possible. However, every time I try to run a
Xerces (Java) parser on a file full of XML commands (with no length
info), it silently discards all but the first command.

I guess what I want to know is: can Xerces take an input stream full
of multiple XML trees and give me each XML tree in turn without
discarding any of them? (I can use SAX, SAX2, or DOM to accomplish this.)

Several friends have suggested that I wrap the entire input file
in a <root> element, which would make the series of commands into one
big giant happy XML file. I suppose that could work, but that has
several problems: (1) it requires a different DTD to handle multiple
commands than it does to handle one command. (2) as a server it
precludes me from using DOM, since I need to act on each command before
the entire stream has been parsed.

Maybe this is the wrong forum to ask, but it's not clear what the
right forum would be. Is this feature covered in SAX? DOM? Is it
specific to Xerces?

~David Svoboda
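
On point (1) above, a DTD only declares element types; the document's DOCTYPE line names which one is the root. So one set of declarations may be able to serve both the single-command and the wrapped case. Here is a minimal sketch using the JDK's built-in validating parser. The element names (commands, command), the name attribute, and the DtdCheck class are made up for illustration, and the declarations are inlined as an internal subset only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdCheck {
    // Hypothetical declarations covering both a single command and a wrapper.
    static final String DECLS =
        "<!ELEMENT commands (command*)>"
        + "<!ELEMENT command EMPTY>"
        + "<!ATTLIST command name CDATA #REQUIRED>";

    // Validates body against DECLS, with rootName as the declared root element.
    static boolean isValid(String rootName, String body) {
        String xml = "<!DOCTYPE " + rootName + " [" + DECLS + "]>" + body;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are only reported to the handler, so rethrow them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("command", "<command name='start'/>"));
        System.out.println(isValid("commands",
            "<commands><command name='a'/><command name='b'/></commands>"));
    }
}
```

In a real setup the declarations would presumably live in an external .dtd file referenced from the DOCTYPE line rather than being inlined like this.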

Martin Honnen · Dec 15, 2005

David Svoboda wrote:

However, every time I try to run a
Xerces (Java) parser on a file full of XML commands (with no length
info), it silently discards all but the first command.

Several friends have suggested that I wrap the entire input file
in a <root> element, which would make the series of commands into one
big giant happy XML file. I suppose that could work, but that has
several problems: (1) it requires a different DTD to handle multiple
commands than it does to handle one command. (2) as a server it
precludes me from using DOM, since I need to act on each command before
the entire stream has been parsed.

One of the requirements for markup to be called XML is a single root
element. So if you want to process some markup with XML tools, you
need to have a single root element, e.g.
<commands>
<command />
<command />
</commands>
If you have, e.g.,
<command />
<command />
then that is not XML, because it is not well-formed markup.
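
The single-root requirement does not force the input file itself to change, and it does not force whole-document DOM parsing either. A minimal sketch, assuming the file holds bare command elements with no XML declaration or DOCTYPE of their own and is in an encoding compatible with the UTF-8 wrapper bytes: the wrapper tags are supplied at read time, and a SAX handler acts on each command as soon as its end tag is seen. The class name WrappedCommands and the wrapper element name commands are hypothetical, and the JDK's built-in SAX parser stands in for a standalone Xerces.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class WrappedCommands {
    // Parses a stream of concatenated command elements and returns the
    // element name of each command, in the order the commands were completed.
    static List<String> run(InputStream commands) throws Exception {
        // Supply the missing root element around the raw input at read time.
        InputStream wrapped = new SequenceInputStream(Collections.enumeration(
            Arrays.asList(bytes("<commands>"), commands, bytes("</commands>"))));

        List<String> handled = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            int depth = 0;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                depth++;
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                depth--;
                // Back at depth 1 means a direct child of the wrapper just closed,
                // i.e. one complete command; a real server would act on it here.
                if (depth == 1) {
                    handled.add(qName);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(wrapped, handler);
        return handled;
    }

    static InputStream bytes(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(bytes("<start/>\n<stop id='1'><arg/></stop>")));
    }
}
```

Each command is handled before the rest of the stream has been read, which is the part a whole-document DOM parse cannot do.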

David Svoboda · Dec 15, 2005

Martin Honnen wrote:

One of the requirements for markup to be called XML is a single root
element. So if you want to process some markup with XML tools, you
need to have a single root element, e.g.
<commands>
<command />
<command />
</commands>
If you have, e.g.,
<command />
<command />
then that is not XML, because it is not well-formed markup.

So does that mean that if I'm running a server, a client can only send
it one XML command? That would make sending multiple XML commands invalid.

What if a client sends two XML commands really quickly, and my server
'forgets' the second one? How does my server 'pop' exactly one XML
command off the socket?
~Dave
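
One possible way to 'pop' one command at a time, sketched here under the assumption that the client opens a single root element when it connects and then sends each command as a child of that root: use a pull parser and stop reading each time a direct child of the root closes. This uses the StAX API (javax.xml.stream) bundled with current JDKs rather than Xerces SAX or DOM. CommandStream, nextCommand, and the element names in main are hypothetical. Note that the parser may buffer ahead on the underlying stream, so the same reader has to be kept for the whole connection.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CommandStream {
    private final XMLStreamReader reader;
    private int depth = 0; // element nesting depth; 0 means outside the root

    CommandStream(Reader in) throws XMLStreamException {
        reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
    }

    // Pulls events until one complete command (a direct child of the root)
    // has been read, and returns its element name. Returns null once the
    // root element has closed and the document is finished.
    String nextCommand() throws XMLStreamException {
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
                if (depth == 1) {
                    return reader.getLocalName();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws XMLStreamException {
        CommandStream stream = new CommandStream(new StringReader(
            "<session><ping/><run job='7'><arg/></run></session>"));
        String name;
        while ((name = stream.nextCommand()) != null) {
            System.out.println("got command: " + name);
        }
    }
}
```

A real server would collect the attributes and child content of each command instead of just its name, but the depth bookkeeping stays the same.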

Andrew Schorr · Dec 16, 2005

David said:
Maybe this is the wrong forum to ask, but it's not clear what the
right forum would be. Is this feature covered in SAX? DOM? Is it
specific to Xerces?

I'm not sure this will be at all helpful, but we confronted this same
issue when designing an XML parsing extension to gawk. If XMLMODE is
positive, we allow only a single XML document to be parsed. But if
XMLMODE is negative, we parse a stream of concatenated documents
(issuing an "ENDDOCUMENT" event between documents).

We do this using the expat parser. The basic approach is to keep
parsing until an error is encountered. When we get a parse error, we
check to see whether the current parse depth is 0 and more than 0
elements have been parsed already. If so, we infer that we are done
parsing a single XML document, so we issue the "ENDDOCUMENT" event and
try to proceed with the next document. We do that by calling the
XML_GetCurrentByteIndex() function to determine where in the input the
error occurred. We use that offset value to identify where in the input
to attempt to start parsing a new document.

If that's of any interest, you can take a look at the code here:
http://sourceforge.net/projects/xmlgawk
This could be directly useful (if you want to use xgawk's XML
extension), or the code may serve as a guide for how to implement this
in your environment.

Regards,
Andy
