Updating a DTD to agree with its use in documents

christopher.c.brewster

A few years ago my department defined a DTD for a projected class of
documents. Like the US Constitution, this DTD has details that are
never actually used, so I want to clean it up. Is there any tool that
looks at existing documents and compares with the DTD they use?

[I can think of other possible uses for such a tool, so I thought
someone might have invented it. I have XML Spy but do not see a feature
that would do this.]

Christopher Brewster
 
Jürgen Kahrs

> A few years ago my department defined a DTD for a projected class of
> documents. Like the US Constitution, this DTD has details that are
> never actually used, so I want to clean it up. Is there any tool that
> looks at existing documents and compares with the DTD they use?

I have written a tool that reads an XML file
and produces a DTD. The DTD covers only those
parts that are actually used in the original
XML file.

http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Generating-a-DTD-from-a-sample-file

It should not be too hard to change the script
so that it reads an arbitrary number of example
files and accumulates information about their structure,
finally producing a DTD that covers all files.

If you don't find another tool and you really
need one, I could write the script for
you. But you should be aware that the language
it is written in (XMLgawk) is currently only
experimental.
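
For readers without XMLgawk, the same idea can be sketched in Python. This is not the gawk script linked above, only a minimal illustration under stated assumptions: the function name `infer_dtd` is hypothetical, every content model is emitted as a loose starred choice, every attribute is guessed to be `CDATA #IMPLIED`, and namespaces are not handled.

```python
import sys
import xml.etree.ElementTree as ET


def infer_dtd(roots):
    """Return DTD declaration lines covering every element in `roots`."""
    order = []         # element names in order of first appearance
    children = {}      # element name -> set of child element names
    attrs = {}         # element name -> set of attribute names
    with_text = set()  # elements seen with non-whitespace character data

    for root in roots:
        for elem in root.iter():
            if elem.tag not in children:
                order.append(elem.tag)
                children[elem.tag] = set()
                attrs[elem.tag] = set()
            children[elem.tag].update(child.tag for child in elem)
            attrs[elem.tag].update(elem.attrib)
            # Character data sits in .text and in the .tail of each child.
            pieces = [elem.text] + [child.tail for child in elem]
            if any(p and p.strip() for p in pieces):
                with_text.add(elem.tag)

    lines = []
    for name in order:
        kids = sorted(children[name])
        if name in with_text:
            model = "( " + " | ".join(["#PCDATA"] + kids) + " )"
            if kids:
                model += "*"  # mixed content must be a starred choice
        elif kids:
            model = "( " + " | ".join(kids) + " )*"
        else:
            model = "EMPTY"
        lines.append(f"<!ELEMENT {name} {model} >")
        for attr in sorted(attrs[name]):
            # Assumption: CDATA and #IMPLIED are the safest guesses.
            lines.append(f"<!ATTLIST {name} {attr} CDATA #IMPLIED>")
    return lines


if __name__ == "__main__":
    # Usage: python infer_dtd.py sample1.xml sample2.xml ...
    parsed = (ET.parse(path).getroot() for path in sys.argv[1:])
    print("\n".join(infer_dtd(parsed)))
```

Because the structure is accumulated across all files before anything is printed, an element that appears in only one sample still ends up in the result.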
 
christopher.c.brewster

Juergen --

A script to do this would be amazing, if you're interested in doing it.
Here is a further question: I followed the link from the gawk page to
Saxon's site, which led me to a front-end for the program at HiT
Software:

http://www.hitsw.com/xml_utilites/

This utility does not work, however, for a reason that seems to
contradict its purpose: it wants to open the file's DTD! One would
think that this utility, of all utilities, would not need the DTD. It
also wants to pull in all the external entities, but again this seems
pointless for the utility's purpose. Any idea how to get around this?
Thanks for your information.

Chris Brewster
 
christopher.c.brewster

OK, I got this working by omitting the reference to the DTD, deleting
entity references, and deleting strings such as &text. But maybe this
utility should ignore these things. Thanks very much for the
information.
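
For anyone who would rather not do this cleanup by hand, here is a rough Python sketch of the same workaround. It is an assumption-laden illustration, not a feature of the utility: the name `strip_dtd_dependencies` is hypothetical, the regular expressions do not understand comments or CDATA sections, and a DOCTYPE whose system identifier contains `>` or `[` would defeat them.

```python
import re

# DOCTYPE declaration, with an optional internal subset in [...].
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^\[>]*(?:\[.*?\]\s*)?>", re.DOTALL)

# General entity references, except the five predefined ones.
# Numeric character references (&#169;) are left alone because the
# pattern requires a name-start character right after the ampersand.
ENTITY_RE = re.compile(r"&(?!(?:amp|lt|gt|apos|quot);)[A-Za-z_:][\w.:-]*;")


def strip_dtd_dependencies(xml_text):
    """Remove the DOCTYPE declaration and all DTD-defined entity references."""
    text = DOCTYPE_RE.sub("", xml_text, count=1)
    return ENTITY_RE.sub("", text)


if __name__ == "__main__":
    import sys
    # Usage: python strip_dtd.py < input.xml > output.xml
    sys.stdout.write(strip_dtd_dependencies(sys.stdin.read()))
```

Deleting the references loses whatever text the entities stood for, which is acceptable when only the element structure matters.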

Other utilities that would help (which I might make my own versions
of): printing DTDs in structured formats for analysis (such as in table
form), and ways to compare and/or combine related DTDs.
Thanks again...
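
As a starting point for the table idea, the following Python sketch lists each declared element with its content model and attribute declarations. The names `dtd_table` and `format_table` are hypothetical, and the approach is deliberately naive: it uses regular expressions rather than a real DTD parser, does not expand parameter entities, and assumes no `>` appears inside attribute default values.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)\s*>", re.DOTALL)
ATTLIST_RE = re.compile(r"<!ATTLIST\s+(\S+)\s+(.*?)\s*>", re.DOTALL)


def dtd_table(dtd_text):
    """Return rows of (element, content model, attributes), header first."""
    text = COMMENT_RE.sub("", dtd_text)  # ignore commented-out declarations
    attlists = {}
    for name, body in ATTLIST_RE.findall(text):
        # Collapse internal whitespace so each declaration fits on one line.
        attlists.setdefault(name, []).append(" ".join(body.split()))
    rows = [("element", "content model", "attributes")]
    for name, model in ELEMENT_RE.findall(text):
        rows.append((name, " ".join(model.split()),
                     "; ".join(attlists.get(name, []))))
    return rows


def format_table(rows):
    """Pad each column to its widest cell and join rows with newlines."""
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


if __name__ == "__main__":
    import sys
    # Usage: python dtd_table.py my.dtd
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(format_table(dtd_table(handle.read())))
```

Two such tables, one per DTD, can then be compared with any ordinary text diff tool.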

Chris Brewster
 
Jürgen Kahrs

> A script to do this would be amazing, if you're interested in doing it.

I just had a look at the DTD generator script again.
It looks like the script already does what you want.
On my RedHat Linux for example, I did this to generate
a DTD which covers all the files whose names are passed
on the command line:

gawk -f dtd_generator.awk /usr/share/doc/libxml2-devel-2.6.10/examples/test*.xml

<!ELEMENT doc ( dest | src | parent )* >
<!ELEMENT dest ( #PCDATA ) >
<!ATTLIST dest id CDATA #REQUIRED>
<!ELEMENT src ( #PCDATA ) >
<!ATTLIST src ref CDATA #REQUIRED>
<!ELEMENT parent ( discarded | preserved )* >
<!ELEMENT discarded ( discarded )* >
<!ELEMENT preserved ( child2 | preserved | child1 )* >
<!ELEMENT child2 ( #PCDATA ) >
<!ELEMENT child1 ( #PCDATA ) >

I guess that's what you wanted.
Such a DTD is far from perfect, of course.
You should take it as a starting point, rearrange
the sequence of lines, and insert comments from your
original (much larger) DTD.
 
Peter Flynn

> Juergen --
>
> A script to do this would be amazing, if you're interested in doing it.

I did this as part of a migration from TEI SGML to XML. Basically:

a) run nsgmls over the documents and produce ESIS
b) use awk to extract the element type names
c) sort and uniq them
d) use Perl::SGML to read the DTD and list the element type names
e) sort them
f) caseless join the two lists with -a to spit out the non-matches

If you're not using a Unix-based system, I think Cygwin can run these tools.
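
For XML documents, steps a) to f) can also be approximated in a single Python script. This is only a sketch under assumptions, not the nsgmls pipeline itself: the function names are hypothetical, declared names are pulled from the DTD with a regular expression (so element names supplied through parameter entities are not expanded), and documents that depend on DTD-defined entities may fail to parse. The `caseless` flag mirrors the case-insensitive join in step f); leave it off for XML, where names are case-sensitive.

```python
import re
import sys
import xml.etree.ElementTree as ET


def declared_elements(dtd_text):
    """Element type names declared in the DTD (comments ignored)."""
    text = re.sub(r"<!--.*?-->", "", dtd_text, flags=re.DOTALL)
    return set(re.findall(r"<!ELEMENT\s+(\S+)", text))


def used_elements(roots):
    """Element type names that actually occur in the parsed documents."""
    return {elem.tag for root in roots for elem in root.iter()}


def unused_declarations(dtd_text, roots, caseless=False):
    """Sorted list of declared element names that no document uses."""
    fold = str.lower if caseless else (lambda name: name)
    used = {fold(name) for name in used_elements(roots)}
    return sorted(name for name in declared_elements(dtd_text)
                  if fold(name) not in used)


if __name__ == "__main__":
    # Usage: python unused.py my.dtd doc1.xml doc2.xml ...
    with open(sys.argv[1], encoding="utf-8") as handle:
        dtd = handle.read()
    documents = [ET.parse(path).getroot() for path in sys.argv[2:]]
    for name in unused_declarations(dtd, documents):
        print(name)
```

The printed names are candidates for removal from the DTD; each should still be checked by hand, since a sample set may simply not exercise every legitimate element.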

///Peter
 
