partial DTD?

Rainer Gerhards · Jun 21, 2010

Hi All,

please forgive me if this question is too basic. I am an XML beginner (at
best). For my open source project rsyslog [1] I am trying to find a
better configuration file format. One of the candidates is an XML-based
format [2]. If we take that route, I'd like to have the ability to at least
partially verify a configuration file.

However, in rsyslog nothing is static. Instead, functionality is loaded via
modules, which can be written by third parties. These modules have (and
need) the ability to add configuration parameters to the base set. So I
never know exactly which parameters are valid. This makes it somewhat hard
for me to define a DTD. I understand that probably the best option would be to
have a mechanism that permits a plugin to modify the DTD before it is
used. However, this sounds like a scary amount of work for which there is no
other justification.

So I wonder if it is possible to specify a DTD in a way that says "these are
the rules for the elements specified inside this DTD, but additional
containers may be added and are expected to be valid".

Any advice on this topic would be most welcome.

Thanks,
Rainer

[1] http://www.rsyslog.com
[2] http://lists.adiscon.net/pipermail/rsyslog/2010-June/003749.html

Martin Honnen · Jun 21, 2010

Rainer said:
So I wonder if it is possible to specify a DTD in a way that says "these
are the rules for the elements specified inside this DTD, but additional
containers may be added and are expected to be valid".

I am not aware of any such features for DTDs. The W3C XML Schema
specification, however, allows wildcards
http://www.w3.org/TR/xmlschema-0/#any and schema composition
http://www.w3.org/TR/xmlschema-0/#import so you could consider using
schemas instead of a DTD.

Peter Flynn · Jun 21, 2010

Manuel said:
Not sure what your problem really is:

(1) Open set of valid parameter values
(2) Open set of module/parameter names

If (1), the usual answer is to not constrain the set of valid values at
the XML markup level - implement validation checks at the application
level.

If (2), do not use parameter/module names as tag names. Use attribute or
element values instead:
<param name="xxx">value</param>

I'd agree very much with this: it makes it extensible to almost any case.

If your application follows the conventional pattern, there are probably
some base-level settings which apply globally, some which may be
customised on (perhaps) a per-user or per-group basis, and some which
apply to specific modules. This usually means a structure something like
this:

<?xml version="1.0"?>
<!DOCTYPE config SYSTEM "config-v00.dtd">
<config application="rsyslog" version="00" YYYY-MM-DD="2010-06-21">
  <base>
    <param name="verbosity">full</param>
  </base>
  <groups>
    <group type="user" name="rainer">
      <param name="autostart">no</param>
    </group>
    <group type="app" name="Google">
      <param name="domain">reverse-lookup</param>
    </group>
  </groups>
  <modules>
    <module name="gui">
      <param name="window-system">X</param>
    </module>
  </modules>
</config>

with config-v00.dtd:

<!ELEMENT config (base,groups,modules)>
<!ATTLIST config application CDATA #FIXED "rsyslog"
                 version     CDATA #REQUIRED
                 YYYY-MM-DD  CDATA #REQUIRED>
<!ELEMENT base (param)+>
<!ELEMENT param (#PCDATA)>
<!ATTLIST param name NMTOKEN #REQUIRED>
<!ELEMENT groups (group)+>
<!ELEMENT group (param)+>
<!ATTLIST group type (user|app|call) #REQUIRED
                name CDATA #REQUIRED>
<!ELEMENT modules (module)+>
<!ELEMENT module (param)+>
<!ATTLIST module name NMTOKEN #REQUIRED>

If it's possible to constrain module authors to make their module names
and parameter names stick with A-Za-z0-9\.\-\_ then it makes checking a
lot easier, but if not, make the attribute types CDATA.
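
To illustrate checking names at the application level rather than in the
DTD, here is a minimal Python sketch. It assumes the <module>/<param> layout
from the example above. The KNOWN_PARAMS table, the check_config function and
the sample parameter names are invented for illustration and are not part of
rsyslog.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical registry: which parameter names each module accepts.
KNOWN_PARAMS = {"gui": {"window-system"}}

# The restricted character set suggested above for module/parameter names.
NAME_RE = re.compile(r"[A-Za-z0-9._-]+")

CONFIG = """\
<config application="rsyslog" version="00">
  <modules>
    <module name="gui">
      <param name="window-system">X</param>
      <param name="colour depth">24</param>
    </module>
  </modules>
</config>
"""


def check_config(xml_text):
    """Return a list of human-readable problems found in the config."""
    problems = []
    root = ET.fromstring(xml_text)
    for module in root.iter("module"):
        mod_name = module.get("name", "")
        allowed = KNOWN_PARAMS.get(mod_name)
        if allowed is None:
            problems.append(f"unknown module: {mod_name!r}")
            continue
        for param in module.findall("param"):
            p_name = param.get("name", "")
            if not NAME_RE.fullmatch(p_name):
                problems.append(f"{mod_name}: bad parameter name {p_name!r}")
            elif p_name not in allowed:
                problems.append(f"{mod_name}: unknown parameter {p_name!r}")
    return problems


problems = check_config(CONFIG)
print(problems)
```

The point of the design is that the DTD stays fixed while the set of
accepted names lives in a table that modules could extend at load time.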

///Peter

Rainer Gerhards · Jun 22, 2010

Hello everyone,

many thanks for the good advice, this is very useful for me.

I also have a related question. Probably this should have been the first
question, but I wasn't smart enough to realize that. Is there any
documentation on best practices for XML-based config files available? I tried
to find such things, but I failed. Maybe I used the wrong search words, but
in the majority of cases I got information on .NET but nothing that applies
to XML config files in general.

If you happen to know useful links, I would appreciate it if you could tell me.

Thanks again,
Rainer

Peter Flynn · Jun 22, 2010

Rainer said:
I also have a related question. Probably this should have been the
first question, but I wasn't smart enough to realize that. Is there
any documentation on best practices for XML-based config files
available?

There is plenty on best practice for XML in general, but I have never
seen anything specifically about XML for config files.

Please let us know if you find any (or perhaps when you have finished
the project, write some).

///Peter

Rainer Gerhards · Jun 23, 2010

Peter Flynn said:
There is plenty on best practice for XML in general, but I have never
seen anything specifically about XML for config files.

OK, at least I seem not to be too dumb to Google.

Please let us know if you find any (or perhaps when you have finished
the project, write some).

Will do when I find one. I am unsure, though, if a single solution can
become a "best practice". Anyhow, we had a very interesting discussion
yesterday on the rsyslog mailing list. It started with this post:

http://lists.adiscon.net/pipermail/rsyslog/2010-June/003764.html

which suggests a format that I personally find highly readable, is valid XML
and seems to be quite compact. Together with a SAX interface, it may even
provide a solution to my initial question (even though the solution is
different from the exact question).

Thanks again for all help!
Rainer

Peter Flynn · Jun 23, 2010

Rainer said:
OK, at least I seem not to be too dumb to Google.

Will do when I find one. I am unsure, though, if a single solution can
become a "best practice". Anyhow, we had a very interesting discussion
yesterday on the rsyslog mailing list. It started with this post:

http://lists.adiscon.net/pipermail/rsyslog/2010-June/003764.html

which suggests a format that I personally find highly readable, is valid
XML and seems to be quite compact. Together with a SAX interface, it may
even provide a solution to my initial question (even though the solution
is different from the exact question).

David suggests some good points, although ultimately it is always a
trade-off between conciseness and extensibility. Manuel suggested:

do not use parameter/module names as tag names. Use attribute or
element values instead

and in general I agree -- for a config file format -- because when you
come to extend or modify the software, you will find the hard-wired
tagnames become an obstacle to extensibility, and you then need to start
maintaining code to read obsolescent versions of config files. In the
long term, the flexibility of using type and value attributes will make
your life much easier, but I can understand the initial attraction that
David expresses of matching the tagnames to the settings you want to
configure. Have a look at the config files for a large system like
Apache Cocoon, where (IMHO) they have achieved a reasonable balance
between conciseness and flexibility.

David also says:

note that with this approach everything important is in a tag, as
such you can allow arbitrary text to be in the file outside of tags
and just ignore it. This allows such text to be used as comments.

This is very dangerous. It makes the use of an XML editor for managing
the config files extremely difficult, and introduces a number of
unexpected side-effects, including the danger of pernicious mixed
content. Again, the concept of allowing arbitrary text is attractive,
but it will cause serious problems for parsing and validation further
down the line. I strongly recommend against it unless the config file is
going to be extremely simple (in which case XML is probably the wrong
choice anyway).

///Peter

Joe Kesselman · Jun 24, 2010

Peter said:
Again, the concept of allowing arbitrary text is attractive,
but it will cause serious problems for parsing and validation further
down the line. I strongly recommend against it unless the config file is
going to be extremely simple (in which case XML is probably the wrong
choice anyway).

Since you can always drop in <!-- a comment --> wherever needed, using
text content for commenting isn't really all that much more convenient,
and as Peter says it *is* more fragile. I second his recommendation:
using XML semantics the way they're intended to be used ("say what you
mean") makes for a much better design.

--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."

Rainer Gerhards · Jun 24, 2010

Peter Flynn said:
David suggests some good points, although ultimately it is always a
trade-off between conciseness and extensibility. Manuel suggested:

and in general I agree -- for a config file format -- because when you
come to extend or modify the software, you will find the hard-wired
tagnames become an obstacle to extensibility, and you then need to start
maintaining code to read obsolescent versions of config files.

Actually, this is a problem I have in rsyslog all the time. The system is
heavily based on a plug-in architecture. Each plug-in brings in its own
entities, and the config file needs to tie all these entities together.

So far, my idea is that each plugin, during load, registers XML entity names
(or even a partial DTD) with the rsyslog core. Then the core can merge a DTD
from these registrations. More importantly, I can register the
module-specific entity names in a list of valid entity names.

My idea is that I can either read the DOM without validation and do the
validation when building the actual config AST. I see some value in this
approach as I need to do a number of semantic checks that go beyond the
ability of DTDs or schemas (probably involving checking out some system
features via API calls).

Or I can parse the configs with a SAX-type of interface and my callback can
use the core entity registrations while I go along.

In both cases, I can identify the entity based on its name, and use the
rsyslog core table of entity registrations to pass the entity down to the
module in question. While doing so, I can also process some generic
attributes that are based on the module type (we have several types of
modules in rsyslog, each type being something like a superclass, e.g. types
for input and output of messages). The rsyslog core will build an AST node
based on the module type and the module entry point will add module-specific
information it extracts from the attribute values (which is stored as an
opaque block inside the generic AST node).
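
A rough Python sketch of the SAX variant described above, to make the idea
concrete. Everything named here is an assumption made for illustration: the
Registry class, the "omexample" element and its "path" attribute are invented,
and the tuples merely stand in for AST nodes. It is not how rsyslog does or
will do it.

```python
import xml.sax


class Registry:
    """Hypothetical core table mapping element names to module callbacks."""

    def __init__(self):
        self._handlers = {}

    def register(self, element_name, callback):
        self._handlers[element_name] = callback

    def lookup(self, element_name):
        return self._handlers.get(element_name)


class ConfigHandler(xml.sax.ContentHandler):
    """Dispatches each start tag to whichever module registered its name."""

    def __init__(self, registry):
        super().__init__()
        self.registry = registry
        self.nodes = []    # stand-in for the config AST
        self.unknown = []  # element names nobody registered

    def startElement(self, name, attrs):
        callback = self.registry.lookup(name)
        if callback is None:
            self.unknown.append(name)
            return
        # Hand the attributes to the module; it returns its own node.
        self.nodes.append(callback(dict(attrs.items())))


def omexample_config(attrs):
    # A module-specific entry point (invented for this sketch).
    return ("omexample", attrs.get("path"))


registry = Registry()
registry.register("config", lambda attrs: ("config", None))
registry.register("omexample", omexample_config)

handler = ConfigHandler(registry)
xml.sax.parseString(
    b'<config><omexample path="/var/log/all.log"/><mystery/></config>',
    handler,
)
print(handler.nodes)
print(handler.unknown)
```

Names that no module registered end up in handler.unknown, so the core can
report them without needing a DTD at all.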

So this *seems* to work for me without the problems you mention. HOWEVER,
this is my first time ever at doing such a thing with XML, and my idea is
purely based on reading up XML and library specs. I am not sure if it is a
good idea from the POV of someone with practical experience. So I'd
appreciate learning whether you think this could work - or not...

In the
long term, the flexibility of using type and value attributes will make
your life much easier, but I can understand the initial attraction that
David expresses of matching the tagnames to the settings you want to
configure. Have a look at the config files for a large system like
Apache Cocoon, where (IMHO) they have achieved a reasonable balance
between conciseness and flexibility.

Will do!

David also says:

This is very dangerous. It makes the use of an XML editor for managing
the config files extremely difficult, and introduces a number of
unexpected side-effects, including the danger of pernicious mixed
content. Again, the concept of allowing arbitrary text is attractive,
but it will cause serious problems for parsing and validation further
down the line. I strongly recommend against it unless the config file is
going to be extremely simple (in which case XML is probably the wrong
choice anyway).

Point taken and noted. If I go for that format, I'll NOT promote that option
(but I will not expressly forbid users to handle it that way at their own
risk, aka "I don't care if they use it and it breaks somewhere down the
line").

Thanks again,
Rainer

David Lang · Jun 24, 2010

David also says:

This is very dangerous. It makes the use of an XML editor for managing
the config files extremely difficult, and introduces a number of
unexpected side-effects, including the danger of pernicious mixed
content. Again, the concept of allowing arbitrary text is attractive,
but it will cause serious problems for parsing and validation further
down the line. I strongly recommend against it unless the config file is
going to be extremely simple (in which case XML is probably the wrong
choice anyway).

how is allowing text that's not part of a tag to be treated as a
comment (i.e. ignored by the application) dangerous? it seems to me
that it's just a matter of having the application ignore anything
that's not tags.

you have to be aware of illegal XML characters, but don't you need to
watch for those inside a comment tag anyway?

In this case, the difficulty with using 'normal' config file formats is
the need to express pretty arbitrary nesting of things and most config
formats are really only set up for one level of nesting

David Lang

David Lang · Jun 24, 2010

One key thing to remember here. modules are not created by random
people, they are part of rsyslog itself.

this should mean that there is not as much worry about what some
module author is going to try and do.

each module should be adding relatively little to the available
configuration

1. it adds things to configure the module (which could be tags or
elements depending on if they can be specified more than once)

2. it adds actions that can be used in many places. each action will
have its configuration (which I think will always be attributes, i
can't think of any case where an action would need to specify anything
more than once)

the problem space in rsyslog is the following

message processing

define inputs (includes defining one or more parsers that convert data
arriving to a standard data structure; the definition of the parser
itself is not part of the config file)

define filters
  filters can involve
    nesting
    if-then-else
    discard this message (don't waste time having anything else
      process it)
    sets of filters/actions that can be specified separately so that
      you can have a complex set and then have other things say if
      <simplecondition> do <complex set> without needing to specify
      <complexset> more than once

define outputs (or sets of outputs)

it's the nesting and grouping of things that is complex and makes most
config languages not really suitable for the task
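
To make the nesting and reuse described above a bit more concrete, here is a
small Python sketch. The element names (rules, ruleset, if, then, else, call,
action, discard) and the "facts" set are all made up for this illustration.
They are not a proposed rsyslog format, just one way the shape of the problem
could look.

```python
import xml.etree.ElementTree as ET

CONFIG = """\
<rules>
  <ruleset name="archive">
    <action type="file"/>
    <action type="forward"/>
  </ruleset>
  <if test="severity-error">
    <then><call ruleset="archive"/></then>
    <else><discard/></else>
  </if>
</rules>
"""


def actions_for(node, rulesets, facts):
    """Return the action types triggered for a message with these facts."""
    result = []
    for child in node:
        if child.tag == "action":
            result.append(child.get("type"))
        elif child.tag == "discard":
            result.append("discard")
        elif child.tag == "call":
            # Reuse a named set instead of repeating it inline.
            target = rulesets[child.get("ruleset")]
            result.extend(actions_for(target, rulesets, facts))
        elif child.tag == "if":
            branch = "then" if child.get("test") in facts else "else"
            chosen = child.find(branch)
            if chosen is not None:
                result.extend(actions_for(chosen, rulesets, facts))
    return result


root = ET.fromstring(CONFIG)
rulesets = {rs.get("name"): rs for rs in root.findall("ruleset")}

print(actions_for(root, rulesets, {"severity-error"}))
print(actions_for(root, rulesets, set()))
```

The named ruleset is defined once and referenced by name, which is the
"specify the complex set only once" requirement from the list above.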

David Lang

Joe Kesselman · Jun 24, 2010

David said:
how is allowing text that's not part of a tag to be treated as a
comment (i.e. ignored by the application) dangerous?

In the long term, it's fragile; it will cause confusion and/or breakage
if you later want to put text inside elements rather than in attribute
values. It's also more likely to cause users grief if they want to write
tooling to manipulate those files.

So I would *not* consider relying on ignoring text content to be a "best
practice". That doesn't mean you can't get away with it, just that I
think you're going to discover later that it wasn't the best choice.

Peter Flynn · Jun 25, 2010

Rainer said:
Actually, this is a problem I have in rsyslog all the time. The system
is heavily based on a plug-in architecture. Each plug-in brings in its
own entities, and the config file needs to tie all these entities together.

So far, my idea is that each plugin, during load, registers XML entity
names (or even a partial DTD) with the rsyslog core. Then the core can
merge a DTD from these registrations. More importantly, I can register
the module-specific entity names in a list of valid entity names.

My idea is that I can either read the DOM without validation and do the
validation when building the actual config AST. I see some value in this
approach as I need to do a number of semantic checks that go beyond the
ability of DTDs or schemas (probably involving checking out some system
features via API calls).

Or I can parse the configs with a SAX-type of interface and my callback
can use the core entity registrations while I go along.

In both cases, I can identify the entity based on its name, and use the
rsyslog core table of entity registrations to pass the entity down to
the module in question. While doing so, I can also process some generic
attributes that are based on the module type (we have several types of
modules in rsyslog, each type being something like a superclass, e.g.
types for input and output of messages). The rsyslog core will build an
AST node based on the module type and the module entry point will add
module-specific information it extracts from the attribute values (which
is stored as an opaque block inside the generic AST node).

So this *seems* to work for me without the problems you mention.

That's because you haven't encountered them yet

HOWEVER, this is my first time ever at doing such a thing with XML, and
my idea is purely based on reading up XML and library specs. I am not
sure if it is a good idea from the POV of someone with practical
experience. So I'd appreciate learning whether you think this could work -
or not...

What are you using to create/edit the config files? A "dumb" text-editor
(eg Notepad)? A "smart" text-editor with XML (eg Emacs/psgml/nxml)? Or a
multi-pane XML editor (eg oXygen, XML Spy, etc)? Or are you creating
them programmatically from within your code? And how will the module
authors create them?

My point was that if you start to do unexpected things with XML, like
allowing random text in places where it's unexpected even if permitted,
people will eventually run up against limitations in their software
which they may not appreciate or understand.

I just noticed that there is a whole chapter on XML in config files in
Benoît Marchal's book "Applied XML Solutions" (Sams, 2000, 0672320541),
which is probably worth reading.

///Peter

Peter Flynn · Jun 25, 2010

David Lang wrote:
[...]

how is allowing text that's not part of a tag to be treated as a
comment (i.e. ignored by the application) dangerous? it seems to me
that it's just a matter of having the application ignore anything
that's not tags.

But XML is *all* tags. What I think you mean is you want to ignore all
text nodes which have sibling element nodes. Is that correct?

It's not so much a question of having your application "ignore" them:
it's specifying accurately which bits of the parse tree to omit; and
earlier, specifying to the editing application how to signal to the user
that text in certain places is significant but in others not.

You should understand that the markup community has been down this road
a thousand times before, from the late 1980s onwards. I don't know of
any application of XML (or SGML, for that matter) which has ever adopted
this as a matter of practice -- if it has been done, it certainly has
not survived AFAIK. That's not to say you can't; but you would need to
examine what you are proposing *very* carefully before going down that path.

If you *do* manage to make it work, please consider submitting a paper
describing it to the Balisage conference, which is where markup people
love to hear about these things (www.balisage.net).

you have to be aware of illegal XML characters, but don't you need to
watch for those inside a comment tag anyway?

You shouldn't need to: if you are using the proper software (an XML
editor), it won't let you generate such characters in the first place.

I can't emphasize this strongly enough: USE AN XML EDITOR. I know it's
very tempting, especially for the expert programmer, to do it all in
Notepad or whatever, but in the end it will result in tears and
recriminations. You wouldn't write your C or Java in Notepad (at least,
I hope not), so you shouldn't expect to be able to do so with XML: the
syntax is at least as arcane as a programming language, and IMHO a
syntax-directed editor is essential.

In this case, the difficulty with using 'normal' config file formats is
the need to express pretty arbitrary nesting of things and most config
formats are really only setup for one level of nesting

That's an argument for getting the document type design right, not an
argument for allowing arbitrary character data between element nodes in
element content.

I don't think anyone has suggested using what you call "normal" config
file formats (by which I think you mean two-level representations of
java.properties or X resources files) -- my earlier example specifically
avoided doing that, and Benoît Marchal's chapter I just referred to
explicitly makes the same point. XML is *designed* to handle arbitrarily
deep nesting -- have a look at any standard application like DocBook or TEI.

///Peter

Rainer Gerhards · Jun 25, 2010

David Lang said:
One key thing to remember here. modules are not created by random
people, they are part of rsyslog itself.

this should mean that there is not as much worry about what some
module author is going to try and do.

Ah, that's not really right. While most of the modules originated from the
project, there are some (omoracle for example) that are just distributed for
convenience. There most probably also exist modules the rsyslog team has
never heard about. One reason to introduce a plugin architecture was to
enable third parties to add functionality.

Rainer
PS: I know the comment is a bit off-topic here, but I thought this is
important for the overall picture.

Rainer Gerhards · Jun 25, 2010

Peter Flynn said:
That's because you haven't encountered them yet

That's why I asked (with 0 implementations, you always have 0 problems).

What are you using to create/edit the config files? A "dumb" text-editor
(eg Notepad)? A "smart" text-editor with XML (eg Emacs/psgml/nxml)? Or a
multi-pane XML editor (eg oXygen, XML Spy, etc)? Or are you creating
them programmatically from within your code? And how will the module
authors create them?

We must assume that one common case is a sysadmin on a stripped-down system
with just plain old vi at his hands.

My point was that if you start to do unexpected things with XML, like
allowing random text in places where it's unexpected even if permitted,
people will eventually run up against limitations in their software
which they may not appreciate or understand.

I already ruled that out...

I just noticed that there is a whole chapter on XML in config files in
Benoît Marchal's book "Applied XML Solutions" (Sams, 2000, 0672320541),
which is probably worth reading.

That's a good pointer. However, digesting all the information from this
thread, other discussions and adding a requirement I simply had forgotten
[1], it turns out that XML is probably not a solution for rsyslog config
files. That doesn't mean the discussion was useless. Quite the opposite is
true: without all your good comments, I'd probably not have been able to see
that XML is not right for this specific job, and I might have invested a lot
of time in unfruitful work.

If you are interested in more detail of the reasons, it requires a lot of
explanation. For those interested, I provide it in [1] and the follow-up
posts to it.

Thanks again,
Rainer

[1] http://lists.adiscon.net/pipermail/rsyslog/2010-June/003830.html

David Lang · Jun 25, 2010

What are you using to create/edit the config files? A "dumb" text-editor
(eg Notepad)? A "smart" text-editor with XML (eg Emacs/psgml/nxml)? Or a
multi-pane XML editor (eg oXygen, XML Spy, etc)? Or are you creating
them programmatically from within your code? And how will the module
authors create them?

the answer is 'all of the above' ;-)

I expect that most of the time they are going to be created by a dumb
text editor (vi), but it would be useful to have the config file
definition done in such a way that you could take an off-the-shelf
smart editor, point it at the DTD/schema and have it help the user get
the config correct.

My point was that if you start to do unexpected things with XML, like
allowing random text in places where it's unexpected even if permitted,
people will eventually run up against limitations in their software
which they may not appreciate or understand.

ok, my assumption was that with the definition of XML as a markup
language, all XML editors would handle mixed text and tags. since the
configs aren't expected to use anything but tags, the text portion
could be used for comments.

I just noticed that there is a whole chapter on XML in config files in
Benoît Marchal's book "Applied XML Solutions" (Sams, 2000, 0672320541),
which is probably worth reading.

I'll see if I can track down a copy

David Lang · Jun 25, 2010

David Lang wrote:

[...]

how is allowing text that's not part of a tag to be treated as a
comment (i.e. ignored by the application) dangerous? it seems to me
that it's just a matter of having the application ignore anything
that's not tags.

But XML is *all* tags. What I think you mean is you want to ignore all
text nodes which have sibling element nodes. Is that correct?

what I mean is the ability to do
<tag>
  <tag param="value">
    <tag/>
    comment, this is why I did this
  </tag>
</tag>

It's not so much a question of having your application "ignore" them:
it's specifying accurately which bits of the parse tree to omit; and
earlier, specifying to the editing application how to signal to the user
that text in certain places is significant but in others not.

if all text is ignored (i.e. not processed by the application in
defining its config) it's not a matter of ignoring text in some
places but not in others.

You should understand that the markup community has been down this road
a thousand times before, from the late 1980s onwards. I don't know of
any application of XML (or SGML, for that matter) which has ever adopted
this as a matter of practice -- if it has been done, it certainly has
not survived AFAIK. That's not to say you can't; but you would need to
examine what you are proposing *very* carefully before going down that path.

noted

If you *do* manage to make it work, please consider submitting a paper
describing it to the Balisage conference, which is where markup people
love to hear about these things (www.balisage.net).

well, it 'works' in that I've been doing this for several years, but
it seems such a trivial thing that I'm not sure how I would write it
up.

You shouldn't need to: if you are using the proper software (an XML
editor), it won't let you generate such characters in the first place.

I can't emphasize this strongly enough: USE AN XML EDITOR. I know it's
very tempting, especially for the expert programmer, to do it all in
Notepad or whatever, but in the end it will result in tears and
recriminations. You wouldn't write your C or Java in Notepad (at least,
I hope not), so you shouldn't expect to be able to do so with XML: the
syntax is at least as arcane as a programming language, and IMHO a
syntax-directed editor is essential.

for a system administration tool like syslog, this is not a
requirement that we can impose. the system may not _have_ an XML-aware
editor on it.

what rsyslog needs is a config file language that can be edited
without any special editor, but we were thinking that by using XML we
could benefit from the XML-aware editors that exist by defining a
DTD/schema that would effectively turn the generic XML editor into an
rsyslog-aware editor

That's an argument for getting the document type design right, not an
argument for allowing arbitrary character data between element nodes in
element content.

I don't think anyone has suggested using what you call "normal" config
file formats (by which I think you mean two-level representations of
java.properties or X resources files) -- my earlier example specifically
avoided doing that, and Benoît Marchal's chapter I just referred to
explicitly makes the same point. XML is *designed* to handle arbitrarily
deep nesting -- have a look at any standard application like DocBook or TEI.

the discussion on a config file format for rsyslog did not start with
XML, it wandered around and drifted towards XML because it could
handle the nesting well (overnight we identified the need to do
if-then-else which I don't see a good way to do in XML). There have been
suggestions that what we are trying to do is not a good fit for XML
and therefore we should just use a 'normal' config language (for
example the INI format)

David Lang

Peter Flynn · Jun 25, 2010

David said:
I expect that most of the time they are going to be created by a dumb
text editor (vi), but it would be useful to have the config file
definition done in such a way that you could take an off-the-shelf
smart editor, point it at the DTD/schema and have it help the user get
the config correct.

You don't even need an editor to do that; any standalone validating
parser can do it (onsgmls, rxp, ...)

ok, my assumption was that with the definition of XML as a markup
language, all XML editors would handle mixed text and tags.

They will, for some value of "handle".

since the configs aren't expected to use anything but tags, the text
portion could be used for comments.

I think there is a misunderstanding here. An element in XML usually
consists of two tags, a start-tag and an end-tag. Between them goes
either (a) just text, or (b) just other elements, or (c) a mixture
(Mixed Content, like paragraphs in HTML are made of). There is a special
case called an EMPTY Element which contains nothing at all, and is
allowed to use the special syntax of terminating the start-tag with />

"The text portion" you refer to is (c). In effect you want a content
model which is the inverse of the normal XML application, where nothing
is allowed to contain any text except the lowest points in the
hierarchy. What you are looking for would allow text (a) everywhere *or*
(b) everywhere *except* the lowest points in the hierarchy (if all your
elements were declared EMPTY and you used attributes for the data).

While every XML editor will accept this because it is perfectly within
the rules, it is definitely an extreme edge case, so support for it will
be minimal: see below for why.

(Note that some XML applications -- including OOXML -- are at the
opposite extreme and have no mixed content at all, not even inside
paragraphs. Again this is perfectly within the rules, just harder to
work with.)

Don't forget that newlines are generally insignificant in XML: with
other white-space they can be normalised to single spaces, under certain
conditions. You CANNOT therefore rely on all processes treating
beautifully-maintained "pretty-printed" XML as sacrosanct.

So why the minimal support?

If you allow PCDATA (text with no element markup) in between elements,
the rules of XML will not allow you to specify sequence: the mixture has
to allow elements and text *in any order*. So previously (using my
example of the other day) you might have a config outer (root) element
type, containing (in sequence) base, groups, and modules; if you allow
interspersed text, the content model becomes "any mixture of text, base,
groups, and modules, IN ANY ORDER" (XML Spec, 3.2.2, Mixed Content).

This makes it hard-to-impossible to maintain any kind of structure to a
document, which is why this kind of definition has been examined and
rejected. As I said, there's nothing to stop you, but eventually you'll
stop yourself.
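
The difference between the two content models can be sketched roughly as follows. This is not a validating parser, only an emulation in standard-library Python of what each model lets a validator check. The DTD declarations appear in the comments; the element names come from the config/base/groups/modules example, and the helper names are made up.

```python
import re
import xml.etree.ElementTree as ET

# Element-only model, as in <!ELEMENT config (base, groups, modules)>:
# the child names must match this exact sequence, and no text is allowed.
SEQUENCE = re.compile(r"base groups modules")

# Mixed model, as in <!ELEMENT config (#PCDATA | base | groups | modules)*>:
# only membership can be checked, not order and not count.
ALLOWED = {"base", "groups", "modules"}

def parse(xml_text):
    """Return (child element names, whether any non-blank text is present)."""
    root = ET.fromstring(xml_text)
    chunks = [root.text] + [child.tail for child in root]
    has_text = any(chunk and chunk.strip() for chunk in chunks)
    return [child.tag for child in root], has_text

def valid_as_sequence(xml_text):
    names, has_text = parse(xml_text)
    return not has_text and SEQUENCE.fullmatch(" ".join(names)) is not None

def valid_as_mixed(xml_text):
    names, _ = parse(xml_text)
    return all(name in ALLOWED for name in names)

ordered = "<config><base/><groups/><modules/></config>"
shuffled = "<config>note <modules/> note <modules/> <base/></config>"

print(valid_as_sequence(ordered), valid_as_sequence(shuffled))
print(valid_as_mixed(ordered), valid_as_mixed(shuffled))
```

The shuffled document, with modules twice, groups missing, and base last, passes the mixed model. Once text is allowed, that is all the model can express.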

You *could* opt to use SGML instead, where sequential mixed content
was permitted; but one of the reasons we removed it from XML was the
difficulty of creating, maintaining, and processing it.

///Peter

Peter Flynn · Jun 25, 2010

David said:
David Lang wrote:

[...]

how is allowing text that's not part of a tag to be treated as a
comment (i.e. ignored by the application) dangerous? it seems to me
that it's just a matter of having the application ignore anything
that's not tags.

But XML is *all* tags. What I think you mean is you want to ignore all
text nodes which have sibling element nodes. Is that correct?

what I mean is the ability to do
<tag>
<tag param=value>
<tag/>
comment, this is why I did this
</tag>
<tag>

I just replied to your earlier post explaining why this is so hard to
manage. Basically XML cannot be used to constrain the order in which
elements appear, if you permit arbitrary text to occur between them.
It's prohibited by the Spec, and for good reason (unmaintainability).

if all text is ignored (i.e. not processed by the application in
defining its config) it's not a matter of ignoring text in some
places but not in others.

That would work, but you cannot specify element order.
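
On the application side, "ignore all text" might look like the sketch below, written with Python's standard xml.etree.ElementTree. The element and attribute names (config, input, output, port, file) are hypothetical and not rsyslog's real vocabulary. Note that attribute values are quoted, as well-formed XML requires.

```python
import xml.etree.ElementTree as ET

def settings(xml_text):
    """Return (tag, attributes) for every element, ignoring all character data."""
    root = ET.fromstring(xml_text)
    # iter() walks elements only; .text and .tail are simply never read.
    return [(elem.tag, dict(elem.attrib)) for elem in root.iter()]

with_notes = """<config>
  <input port="514"/>
  listen on the standard syslog port
  <output file="/var/log/messages"/>
</config>"""

without_notes = (
    '<config><input port="514"/>'
    '<output file="/var/log/messages"/></config>'
)

print(settings(with_notes))
```

Both documents give the application the same settings. What is lost is only the ability of a DTD or schema to say that input must come before output.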

well, it 'works' in that I've been doing this for several years, but
it seems such a trivial thing that I'm not sure how I would write it
up.

It only "works" in the sense that you have to create the documents by
imposing a human-mediated constraint of order on the appearance of
elements. If you define the DTD or Schema to allow intervening text, an
XML editor will have to permit the elements to occur in any order.

for a system administration tool like syslog, this is not a
requirement that we can impose. the system may not _have_ a XML aware
editor on it.

In that case I recommend not using XML at all.

what rsyslog needs is a config file language that can be edited
without any special editor, but we were thinking that by using XML we
could benefit from the XML aware editors that exist by defining a DTD/
schema that would effectively turn the generic XML editor into a
rsyslog aware editor

That is precisely what it will do, but NOT for the case where you allow
text to occur arbitrarily between elements, IFF you need to preserve the
order in which elements can occur.

If order is not important (and it's arguable that in a config file, it
might well not be significant), then what you propose will work, but you
must be VERY careful not to make the application dependent on the
occurrence of newlines (see my previous post) because in those
circumstances, some editors opening your example:

<tag>
<tag param=value>
<tag/>
comment, this is why I did this
</tag>
<tag>

will save it as

<tag> <tag param=value> <tag/> comment, this is why I did this
</tag> <tag>

(that's all on one line: some newsreaders may break it up). This may not
be what you want.
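
One way to keep the application independent of such reflowing is to collapse white-space before looking at any text, sketched here with Python's standard library. The two documents are adapted from the example above, with the attribute value quoted and the outer element closed so that they are well-formed; normalized_text is a made-up helper name.

```python
import xml.etree.ElementTree as ET

def normalized_text(elem):
    """Collapse every run of white-space in an element's character data to one space."""
    return " ".join("".join(elem.itertext()).split())

pretty = """<tag>
<tag param="value">
<tag/>
comment, this is why I did this
</tag>
</tag>"""

flattened = (
    '<tag> <tag param="value"> <tag/> '
    "comment, this is why I did this </tag> </tag>"
)

# The raw character data differs between the two forms...
raw_pretty = "".join(ET.fromstring(pretty).itertext())
raw_flattened = "".join(ET.fromstring(flattened).itertext())
print(raw_pretty == raw_flattened)

# ...but the normalized text is identical.
print(normalized_text(ET.fromstring(pretty)))
print(normalized_text(ET.fromstring(flattened)))
```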

the discussion on a config file format for rsyslog did not start with
XML, they wandered around and drifted towards XML because it could
handle the nesting well (overnight we identified the need to do if-
then-else which I don't see a good way to do in XML). There have been
suggestions that what we are trying to do is not a good fit for XML
and therefore we should just use a 'normal' config language (for
example the INI format)

Yep, that might be easier to do. But you could consider the alternative
of allowing rsyslog comments where you want them, but using XML comment
syntax:

<tag>
<tag param=value>
<tag/>
<!-- comment, this is why I did this -->
</tag>
<tag>

That will work just fine, because then you can use properly constrained
content models; but the sysadmin with only vi available must remember
that the XML comment syntax would be compulsory.
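
As a rough check of the difference, the sketch below compares a remark written in XML comment syntax with the same remark written as bare text, using Python's standard xml.etree.ElementTree, whose default parser discards comments. The documents are adapted from the example in this thread, with the attribute value quoted and the outer element closed so that they parse; stray_text is a made-up helper name.

```python
import xml.etree.ElementTree as ET

commented = """<tag>
<tag param="value">
<tag/>
<!-- comment, this is why I did this -->
</tag>
</tag>"""

plain = """<tag>
<tag param="value">
<tag/>
comment, this is why I did this
</tag>
</tag>"""

def stray_text(xml_text):
    """Return the non-blank character data left inside the inner element."""
    inner = ET.fromstring(xml_text)[0]
    return "".join(inner.itertext()).strip()

# With comment syntax the inner element holds one child and no text,
# so an element-only content model still fits it.
children = [child.tag for child in ET.fromstring(commented)[0]]
leftover = stray_text(commented)
print(children, repr(leftover))

# Written as bare text, the same remark becomes character data
# and forces a mixed content model.
print(repr(stray_text(plain)))
```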

///Peter
