X
Xamle Eng
One of the things I find most unnatural about most XML APIs is that
they try to abstract both elements and text into some kind of "node"
object when they have virtually nothing in common. The reason these
APIs do it is to make it possible for both text and elements to be
children of elements.
But there is another way.
The XPath/XQuery data model does not allow two consecutive text nodes.
As far as I can tell, most XML processing software automatically merges
consecutive text nodes. This means that the number of text segments
directly under an element is bounded by the number of sub-elements plus 1
(PIs and comments may be treated as "pseudo-elements" for this
purpose). As a result, it is always possible to associate each text
segment with the element immediately preceding it within the parent and
associate the first text segment with the parent itself.
No more text nodes.
The only API I know that uses this trick is the ElementTree API for
Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
Each Element object has a text and tail property for the text
immediately inside the element and text following it within its parent
element. Elements always have a tag, attributes and zero or more
children - which are always other elements. No mixed types. The text
and tail attributes are always strings (or None when there is no
text). This model should be very
convenient for statically-typed languages like Java or C++. I find it
ironic that this idea is probably used only in Python, a dynamically
typed language that is much more comfortable with mixed data types.
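To make the text/tail split concrete, here is a small sketch using the ElementTree module from the Python standard library. The sample document is made up for illustration. Note that in practice text and tail may be None rather than an empty string when there is no text at that position.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with mixed content, invented for this example.
root = ET.fromstring("<p>Hello <b>brave</b> new <i>world</i>!</p>")

# Text directly inside <p>, before its first child element.
print(repr(root.text))  # 'Hello '

# Children are always elements. Each carries its own inner text (.text)
# and the text that follows it inside the parent (.tail).
# Either attribute can be None when there is no text there.
for child in root:
    print(child.tag, repr(child.text), repr(child.tail))
# b 'brave' ' new '
# i 'world' '!'
```

Three text segments, two child elements: one segment belongs to the parent and one follows each child, so no separate text node type is needed.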
This form of API is very suitable for data-oriented XML applications
that don't use mixed elements: for leaf elements just use the .text
attribute and ignore everything else. For container elements, use the
element's children, which are always other elements. The text attribute
of an element can be ignored if it has children. No need to explicitly
skip it. Tails are always ignored, unless used to indent the output,
which can be done easily without disturbing the rest of the data.
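As a rough illustration of the data-oriented case, the sketch below converts a small record into nested Python data. The document and the helper name to_data are assumptions made up for this example. Leaves read only .text, containers read only their children, and the whitespace used for indentation (which lands in .text and .tail) is never looked at.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-oriented document: no mixed content, indented for humans.
DOC = """<order>
  <id>42</id>
  <customer>Ada</customer>
  <items>
    <item>widget</item>
    <item>gadget</item>
  </items>
</order>"""


def to_data(elem):
    """Convert a data-oriented element into nested Python data (illustrative helper)."""
    children = list(elem)
    if not children:
        # Leaf element: only .text matters; None means an empty element.
        return elem.text or ""
    # Container element: its .text and the children's .tail hold only
    # indentation whitespace here, so they are simply never read.
    return [(child.tag, to_data(child)) for child in children]


order = to_data(ET.fromstring(DOC))
print(order)
# [('id', '42'), ('customer', 'Ada'), ('items', [('item', 'widget'), ('item', 'gadget')])]
```

There is no "skip whitespace text nodes" step anywhere, which is the convenience described above.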
For document-oriented XML it may be slightly awkward to look at both
the text and tail but I don't think it should be any more difficult
than dealing with mixed data types.
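For the document-oriented case, a minimal sketch of what "looking at both the text and tail" amounts to: a recursive walk that emits an element's text, then each child followed by that child's tail. The function name flatten and the sample paragraph are made up for this example; ElementTree's own itertext() method does the same job.

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Return all character data under elem in document order (illustrative helper)."""
    parts = [elem.text or ""]          # text before the first child
    for child in elem:
        parts.append(flatten(child))   # everything inside the child
        parts.append(child.tail or "") # text after the child, still inside elem
    return "".join(parts)


# Hypothetical mixed-content paragraph.
para = ET.fromstring("<p>Hello <b>brave <i>new</i></b> world!</p>")
plain = flatten(para)
print(plain)  # Hello brave new world!
```

The walk never has to test whether a child is text or an element, since children are always elements.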
The only real downside seems to be that this API is non-standard. But
the advantages can easily compensate for that.
Would you like to see an API like this in Java? Do you know of any
implementations of this idea in any language other than Python?