Tom Anderson
Afternoon all,
Call me mad, but I'm interested in writing an XML validator. Not as part
of a parser, but operating on DOM-like objects in a program. Basically, I
want to write a function createElement that looks a bit like:
Node a, b, c; // create these somehow
Element list = createElement("xhtml", new Node[] {a, b, c});
Where createElement is able to determine whether {a, b, c} is a valid
sequence of child elements for an xhtml element, and so throw an
exception or something if it isn't.
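For concreteness, here's a minimal sketch of that createElement in Python rather than Java. The CONTENT_MODELS table and the Node class are stand-ins I've invented for illustration - the interesting part would be filling that table in from a parsed DTD rather than by hand:

```python
# Hypothetical sketch: CONTENT_MODELS and Node are made-up stand-ins,
# not part of any real DOM library.

class Node:
    def __init__(self, nodeName):
        self.nodeName = nodeName
        self.childNodes = []

# maps an element name to a function that checks its child sequence
CONTENT_MODELS = {
    # toy entry: a "list" element may contain only "li" children
    "list": lambda children: all(c.nodeName == "li" for c in children),
}

def createElement(name, children):
    validator = CONTENT_MODELS.get(name)
    if validator is not None and not validator(children):
        raise ValueError("invalid children for <%s>" % name)
    element = Node(name)
    element.childNodes = list(children)
    return element
```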
The idea would be to parse a DTD in order to create objects representing
the content model, then use those to validate the nodes.
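As a sketch of what those content-model objects might look like (the class names and the matches protocol are my invention, not from any real library): each object reports which input positions are reachable after it has consumed some children, which gives a simple brute-force nondeterministic matcher - the automaton construction the spec mentions below is the efficient way to do the same thing.

```python
# Hypothetical content-model classes; a DTD parser would build a
# tree of these from a declaration like (b, (i | em)*).

class Name:
    """Matches one child element with the given name."""
    def __init__(self, name):
        self.name = name
    def matches(self, names, i):
        # yield each position reachable after matching from position i
        if i < len(names) and names[i] == self.name:
            yield i + 1

class Seq:
    """Matches its parts in order: (a, b, c)."""
    def __init__(self, *parts):
        self.parts = parts
    def matches(self, names, i):
        positions = [i]
        for part in self.parts:
            positions = [j for p in positions for j in part.matches(names, p)]
        yield from positions

class Choice:
    """Matches any one of its parts: (a | b | c)."""
    def __init__(self, *parts):
        self.parts = parts
    def matches(self, names, i):
        for part in self.parts:
            yield from part.matches(names, i)

class Star:
    """Matches zero or more repetitions: a*."""
    def __init__(self, part):
        self.part = part
    def matches(self, names, i):
        seen, stack = {i}, [i]
        while stack:
            j = stack.pop()
            yield j
            for k in self.part.matches(names, j):
                if k not in seen:
                    seen.add(k)
                    stack.append(k)

def validate(model, childNames):
    # valid iff some match consumes the whole child list
    return len(childNames) in model.matches(childNames, 0)
```

So (b, (i | em)*) becomes Seq(Name("b"), Star(Choice(Name("i"), Name("em")))), and validate(model, ["b", "i", "em"]) returns True.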
The XML spec says:
More formally: a finite state automaton may be constructed from the
content model using the standard algorithms, e.g. algorithm 3.5 in
section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such
algorithms, a follow set is constructed for each position in the regular
expression (i.e., each leaf node in the syntax tree for the regular
expression); if any position has a follow set in which more than one
following position is labeled with the same element type name, then the
content model is in error and may be reported as an error.
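The follow-set construction the spec is alluding to is Glushkov's position automaton. Here's a rough sketch of just the determinism check, with the content model encoded as nested tuples (an ad-hoc encoding of my own, purely for illustration): every leaf is a numbered position, and the model is deterministic iff no first or follow set contains two positions carrying the same element name.

```python
# Rough sketch of the follow-set computation (Glushkov-style).
# Content models are nested tuples, e.g. ("seq", ("name", "a"), ...);
# this encoding is invented here for illustration.

def analyse(node, follow, names):
    """Return (nullable, first, last) for node, filling in follow.
    Positions are integers; names[p] is the element name at p."""
    kind = node[0]
    if kind == "name":
        p = len(names)
        names.append(node[1])
        follow[p] = set()
        return False, {p}, {p}
    if kind == "seq":
        nullable, first, last = True, set(), set()
        for child in node[1:]:
            n, f, l = analyse(child, follow, names)
            for p in last:      # last of the prefix can be followed by first of child
                follow[p] |= f
            if nullable:
                first |= f
            last = (last | l) if n else l
            nullable = nullable and n
        return nullable, first, last
    if kind == "choice":
        nullable, first, last = False, set(), set()
        for child in node[1:]:
            n, f, l = analyse(child, follow, names)
            nullable, first, last = nullable or n, first | f, last | l
        return nullable, first, last
    if kind == "star":
        n, f, l = analyse(node[1], follow, names)
        for p in l:             # looping back: a last position can be followed by a first
            follow[p] |= f
        return True, f, l
    raise ValueError("unknown node kind: %r" % kind)

def is_deterministic(model):
    follow, names = {}, []
    nullable, first, last = analyse(model, follow, names)
    for posset in [first] + list(follow.values()):
        labels = [names[p] for p in posset]
        if len(labels) != len(set(labels)):
            return False
    return True
```

For example, (a, b) | (a, c) is the classic offender: both branches start with an a-labelled position, so there's a duplicate name and the model gets rejected, which is exactly the situation the spec describes.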
Firstly, roughly how hard is this? Expressed in, say,
milli-Dijkstra's-algorithms - 5000? 20 000? 100 000?
Secondly, I'm not keen to rush out and buy Aho et al's no doubt wonderful
book on compilers just so I can do this. Can anyone direct me to anything
I can read online where I can learn about this? That could be in English
or source code - presumably, there are numerous open-source projects which
have implemented XML validators, right?
It occurs to me that I could avoid having to write the validator myself by
using a grotesque hack - if I can map node types to strings, I can express
a node sequence as a string, and a content model as a regular expression,
and then just let a standard regexp library do the heavy lifting. In
Python, operating on standard DOM objects:
import re

def validateAsParagraph(nodelist):
    # serialise the children's names, e.g. [em, br] -> "<em><br>"
    nodeString = "".join("<" + node.nodeName + ">" for node in nodelist)
    # the %Inline content model of XHTML's p element, as a regexp
    pPattern = re.compile("(?:<(?:#PCDATA|br|span|bdo|map|tt|i|b|big|small|em|strong|dfn|code|q|samp|kbd|var|cite|abbr|acronym|sub|sup|input|select|textarea|label|button|ins|del|script)>)*")
    m = pPattern.match(nodeString)
    # valid iff the pattern consumes the whole string
    return (m is not None) and (m.end() == len(nodeString))
I can't decide if this is brilliant or revolting, or both.
tom