Nobody said:
Dealing with whitespace may be trivial (unless the underlying I/O code
is line-oriented, as XML allows linefeeds within tags), but it's
frequently omitted.
The implementation details of the IO part of a parser are irrelevant.
Whether the IO is line-oriented or not, the IO code should never insert or
omit information, which means that a parser only handles the information
provided by a stream.
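
As a minimal sketch of that point, the Python tokenizer below consumes
characters from whatever chunks the IO layer hands over and treats a
linefeed as ordinary whitespace. The function name `tokenize`, the
punctuation set and the quoting rule are assumptions made for this
illustration; they are not taken from any XML library.

```python
def tokenize(chunks):
    """Yield tokens from an iterable of text chunks (lines, blocks or characters)."""
    punctuation = "<>/="
    word = []       # characters of the name currently being read
    quoted = None   # characters of the quoted string being read, or None
    for chunk in chunks:
        for ch in chunk:
            if quoted is not None:
                if ch == '"':
                    # Closing quote: emit the string with its quotes kept.
                    yield '"' + "".join(quoted) + '"'
                    quoted = None
                else:
                    quoted.append(ch)
            elif ch == '"':
                if word:
                    yield "".join(word)
                    word = []
                quoted = []
            elif ch.isspace() or ch in punctuation:
                # Whitespace (linefeeds included) and punctuation end a name.
                if word:
                    yield "".join(word)
                    word = []
                if ch in punctuation:
                    yield ch
            else:
                word.append(ch)
    if quoted is not None:
        raise ValueError("unterminated quoted string")
    if word:
        yield "".join(word)


text = '<element\n  alpha="true"\n/>'
by_line = list(tokenize(text.splitlines(keepends=True)))  # line-oriented IO
by_block = list(tokenize([text]))                         # one big read
```

Both calls yield the same token sequence, so a linefeed inside the tag is
invisible to whatever consumes the tokens.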
It's less trivial to deal with the fact that attributes may appear in
any order.
I don't believe that constitutes a real problem. For example, consider an
XML-based file format which consists of a single element "element" which
may have a set of attributes labelled "alpha", "beta" and "gamma". For
that language, a valid document could be something like:
<element alpha="true" />
If the language accepts repeated attributes then a possible (and crude)
production[1] would be something like:
<example>
document = "<" "element" *tag "/" ">"
tag = "alpha" "=" text_string
    / "beta" "=" text_string
    / "gamma" "=" text_string
</example>
The support for the tags specified in the above production in an LL parser,
ignoring error handling, may be around 3 states (6, if we count a "ghost"
state to push the attribute values into a data structure).
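
A rough sketch of such a parser for the repeated-attribute production
above, written in Python over an already tokenized input. The names
`parse_element`, `peek`, `expect` and `text_string` are hypothetical, and
since the production leaves text_string undefined, the sketch assumes it
is a double-quoted token.

```python
ATTRIBUTE_NAMES = {"alpha", "beta", "gamma"}


def parse_element(tokens):
    """Parse "<" "element" followed by any number of attributes and "/" ">"."""
    pos = 0

    def peek():
        # One token of lookahead; None marks the end of the input.
        return tokens[pos] if pos < len(tokens) else None

    def expect(expected):
        nonlocal pos
        if peek() != expected:
            raise ValueError(f"expected {expected!r}, found {peek()!r}")
        pos += 1

    def text_string():
        # Assumption: a text_string is a token wrapped in double quotes.
        nonlocal pos
        token = peek()
        if token is None or len(token) < 2 or token[0] != '"' or token[-1] != '"':
            raise ValueError(f"expected a quoted string, found {token!r}")
        pos += 1
        return token[1:-1]

    expect("<")
    expect("element")
    attributes = []
    # The lookahead token decides whether another attribute follows.
    while peek() in ATTRIBUTE_NAMES:
        name = peek()
        pos += 1
        expect("=")
        # The "ghost" step: push the value into a data structure.
        attributes.append((name, text_string()))
    expect("/")
    expect(">")
    if peek() is not None:
        raise ValueError(f"unexpected trailing token {peek()!r}")
    return attributes
```

Attributes are collected as (name, value) pairs in the order they appear,
so repeated and reordered attributes are both accepted.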
If, instead, the attributes must follow a specific order (alpha, beta,
gamma) where:
- each attribute can either be present or not
- an attribute appearing out of its rightful place is considered an error
then, the following production applies:
<example>
document = "<" "element" *1alpha_tag *1beta_tag *1gamma_tag "/" ">"
alpha_tag = "alpha" "=" text_string
beta_tag = "beta" "=" text_string
gamma_tag = "gamma" "=" text_string
</example>
The support for the tags specified in the above production in an LL parser,
ignoring error handling, is yet again achieved by adding 3 states (6, with
the "ghost" states).
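
The fixed-order production can be sketched in much the same way. The
Python below is an illustration only: `parse_ordered_element` and its
helpers are hypothetical names, the input is assumed to be tokenized
already, and text_string is again assumed to be a double-quoted token.

```python
ATTRIBUTE_ORDER = ("alpha", "beta", "gamma")


def parse_ordered_element(tokens):
    """Parse "<" "element" [alpha] [beta] [gamma] "/" ">" in that fixed order."""
    pos = 0

    def peek():
        # One token of lookahead; None marks the end of the input.
        return tokens[pos] if pos < len(tokens) else None

    def expect(expected):
        nonlocal pos
        if peek() != expected:
            raise ValueError(f"expected {expected!r}, found {peek()!r}")
        pos += 1

    def text_string():
        # Assumption: a text_string is a token wrapped in double quotes.
        nonlocal pos
        token = peek()
        if token is None or len(token) < 2 or token[0] != '"' or token[-1] != '"':
            raise ValueError(f"expected a quoted string, found {token!r}")
        pos += 1
        return token[1:-1]

    expect("<")
    expect("element")
    attributes = {}
    # Each attribute gets exactly one chance, in its rightful place.
    for name in ATTRIBUTE_ORDER:
        if peek() == name:
            pos += 1
            expect("=")
            attributes[name] = text_string()
    # An attribute that is repeated or out of order is rejected here.
    expect("/")
    expect(">")
    if peek() is not None:
        raise ValueError(f"unexpected trailing token {peek()!r}")
    return attributes
```

Because every attribute is tried once and in sequence, a misplaced
attribute is simply left in the input and trips the check for "/".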
If your language accepts any possible attribute combination then the
production starts to become a bit more demanding. Yet, you only need to
deal with this if you specifically want your grammar to accept attributes
in any order, which means that you are creating your own problem.
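
To put a rough number on "more demanding": assuming that "any possible
attribute combination" means each attribute may appear at most once, in
any order (one reading of the sentence above, not the only one), a
production that spells out every accepted sequence has to cover all
orderings of all subsets. The hypothetical helper `attribute_sequences`
below enumerates them in Python.

```python
from itertools import permutations


def attribute_sequences(names):
    """List every ordering of every subset of names, the empty one included."""
    return [
        sequence
        for length in range(len(names) + 1)
        for sequence in permutations(names, length)
    ]


sequences = attribute_sequences(("alpha", "beta", "gamma"))
print(len(sequences))  # 1 + 3 + 6 + 6 = 16
```

For three attributes that is 16 sequences, and the count grows
factorially as attributes are added.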
Nonetheless, notice that you will be faced with the exact same problem if
you wish to rely on a generic parser instead of one which you develop
yourself. In that case, you will be faced with a more demanding problem,
as you are forced to deal with nodes in a tree structure instead of a
simple stream of terminal tokens.
Rui Maciel
[1] http://tools.ietf.org/html/rfc5234