Tom Anderson
Afternoon all,
Call me mad, but I'm interested in writing an XML validator. Not as part
of a parser, but operating on DOM-like objects in a program. Basically, I
want to write a function createElement that looks a bit like:
Node a, b, c; // create these somehow
Element list = createElement("xhtml", new Node[] {a, b, c});
Where createElement is able to determine whether {a, b, c} is a valid
sequence of child elements for an xhtml element, and so throw an
exception or something if it isn't.
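For concreteness, here's a minimal sketch of that createElement in Python rather than Java. The CONTENT_MODELS table and the Node class are stand-ins I've invented for illustration - the interesting part would be filling that table in from a parsed DTD rather than by hand:

```python
# Hypothetical sketch: CONTENT_MODELS and Node are made-up stand-ins,
# not part of any real DOM library.

class Node:
    def __init__(self, nodeName):
        self.nodeName = nodeName
        self.childNodes = []

# maps an element name to a function that checks its child sequence
CONTENT_MODELS = {
    # toy entry: a "list" element may contain only "li" children
    "list": lambda children: all(c.nodeName == "li" for c in children),
}

def createElement(name, children):
    validator = CONTENT_MODELS.get(name)
    if validator is not None and not validator(children):
        raise ValueError("invalid children for <%s>" % name)
    element = Node(name)
    element.childNodes = list(children)
    return element
```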
The idea would be to parse a DTD in order to create objects representing
the content model, then use those to validate the nodes.
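As a sketch of what those content-model objects might look like (the class names and the matches protocol are my invention, not from any real library): each object reports which input positions are reachable after it has consumed some children, which gives a simple brute-force nondeterministic matcher - the automaton construction the spec mentions below is the efficient way to do the same thing.

```python
# Hypothetical content-model classes; a DTD parser would build a
# tree of these from a declaration like (b, (i | em)*).

class Name:
    """Matches one child element with the given name."""
    def __init__(self, name):
        self.name = name
    def matches(self, names, i):
        # yield each position reachable after matching from position i
        if i < len(names) and names[i] == self.name:
            yield i + 1

class Seq:
    """Matches its parts in order: (a, b, c)."""
    def __init__(self, *parts):
        self.parts = parts
    def matches(self, names, i):
        positions = [i]
        for part in self.parts:
            positions = [j for p in positions for j in part.matches(names, p)]
        yield from positions

class Choice:
    """Matches any one of its parts: (a | b | c)."""
    def __init__(self, *parts):
        self.parts = parts
    def matches(self, names, i):
        for part in self.parts:
            yield from part.matches(names, i)

class Star:
    """Matches zero or more repetitions: a*."""
    def __init__(self, part):
        self.part = part
    def matches(self, names, i):
        seen, stack = {i}, [i]
        while stack:
            j = stack.pop()
            yield j
            for k in self.part.matches(names, j):
                if k not in seen:
                    seen.add(k)
                    stack.append(k)

def validate(model, childNames):
    # valid iff some match consumes the whole child list
    return len(childNames) in model.matches(childNames, 0)
```

So (b, (i | em)*) becomes Seq(Name("b"), Star(Choice(Name("i"), Name("em")))), and validate(model, ["b", "i", "em"]) returns True.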
The XML spec says:
More formally: a finite state automaton may be constructed from the
content model using the standard algorithms, e.g. algorithm 3.5 in
section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such
algorithms, a follow set is constructed for each position in the regular
expression (i.e., each leaf node in the syntax tree for the regular
expression); if any position has a follow set in which more than one
following position is labeled with the same element type name, then the
content model is in error and may be reported as an error.
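The follow-set construction the spec is alluding to is Glushkov's position automaton. Here's a rough sketch of just the determinism check, with the content model encoded as nested tuples (an ad-hoc encoding of my own, purely for illustration): every leaf is a numbered position, and the model is deterministic iff no first or follow set contains two positions carrying the same element name.

```python
# Rough sketch of the follow-set computation (Glushkov-style).
# Content models are nested tuples, e.g. ("seq", ("name", "a"), ...);
# this encoding is invented here for illustration.

def analyse(node, follow, names):
    """Return (nullable, first, last) for node, filling in follow.
    Positions are integers; names[p] is the element name at p."""
    kind = node[0]
    if kind == "name":
        p = len(names)
        names.append(node[1])
        follow[p] = set()
        return False, {p}, {p}
    if kind == "seq":
        nullable, first, last = True, set(), set()
        for child in node[1:]:
            n, f, l = analyse(child, follow, names)
            for p in last:      # last of the prefix can be followed by first of child
                follow[p] |= f
            if nullable:
                first |= f
            last = (last | l) if n else l
            nullable = nullable and n
        return nullable, first, last
    if kind == "choice":
        nullable, first, last = False, set(), set()
        for child in node[1:]:
            n, f, l = analyse(child, follow, names)
            nullable, first, last = nullable or n, first | f, last | l
        return nullable, first, last
    if kind == "star":
        n, f, l = analyse(node[1], follow, names)
        for p in l:             # looping back: a last position can be followed by a first
            follow[p] |= f
        return True, f, l
    raise ValueError("unknown node kind: %r" % kind)

def is_deterministic(model):
    follow, names = {}, []
    nullable, first, last = analyse(model, follow, names)
    for posset in [first] + list(follow.values()):
        labels = [names[p] for p in posset]
        if len(labels) != len(set(labels)):
            return False
    return True
```

For example, (a, b) | (a, c) is the classic offender: both branches start with an a-labelled position, so there's a duplicate name and the model gets rejected, which is exactly the situation the spec describes.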
Firstly, roughly how hard is this? Expressed in, say,
milli-Dijkstra's-algorithms - 5000? 20 000? 100 000?
Secondly, I'm not keen to rush out and buy Aho et al's no doubt wonderful
book on compilers just so I can do this. Can anyone direct me to anything
I can read online where I can learn about this? That could be in English
or source code - presumably, there are numerous open-source projects which
have implemented XML validators, right?
It occurs to me that I could avoid having to write the validator myself by
using a grotesque hack - if I can map node types to strings, I can express
a node sequence as a string, and a content model as a regular expression,
and then just let a standard regexp library do the heavy lifting. In
Python, operating on standard DOM objects:
import re

def validateAsParagraph(nodelist):
    # serialise the children's names, e.g. [em, br] -> "<em><br>"
    nodeString = "".join("<" + node.nodeName + ">" for node in nodelist)
    # the %Inline content model of XHTML's p element, as a regexp
    pPattern = re.compile("(?:<(?:#PCDATA|br|span|bdo|map|tt|i|b|big|small|em|strong|dfn|code|q|samp|kbd|var|cite|abbr|acronym|sub|sup|input|select|textarea|label|button|ins|del|script)>)*")
    m = pPattern.match(nodeString)
    # valid iff the pattern consumes the whole string
    return (m is not None) and (m.end() == len(nodeString))
I can't decide if this is brilliant or revolting, or both.
tom