PI's content depending on document encoding?

Christian Roth

Hello,

I am merely asking this for my own understanding:

A processing instruction's data part is not entity-aware, i.e. character
and numerical entities are not resolved at parsing time. E.g.,

<?mypi &lt;par/&gt; ?>

delivers as data part the String(!) "&lt;par/&gt;".
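
The behaviour described above can be checked with a small sketch. It assumes Python's standard `xml.dom.minidom` (which is built on expat); the helper names `pi_data` and `element_text` are made up for illustration. It parses the same PI and, for contrast, an element whose text contains the same entity references.

```python
from xml.dom import minidom


def pi_data(xml_text):
    """Return (target, data) of the first PI that is a child of the root element."""
    root = minidom.parseString(xml_text).documentElement
    for node in root.childNodes:
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            return node.target, node.data
    raise ValueError("no processing instruction found")


def element_text(xml_text):
    """Return the concatenated text children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    return "".join(
        node.data for node in root.childNodes if node.nodeType == node.TEXT_NODE
    )


if __name__ == "__main__":
    # In the PI the references are delivered as literal characters ...
    print(pi_data("<doc><?mypi &lt;par/&gt; ?></doc>"))
    # ... while in element content they are resolved by the parser.
    print(element_text("<doc>&lt;par/&gt;</doc>"))
```

With this parser the PI data should come back with `&lt;` and `&gt;` untouched, while the element text comes back as `<par/>`.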

This effectively means that the possible character content of a PI is
limited by the document's encoding, since numerical entities cannot be
used to express characters outside of this encoding.

Consequently, this means that writing a PI and using any character
outside the ASCII range is bound for trouble when submitting such a
document (originally, say, in UTF-8) to an unknown XML workflow, since
intermediary stages may decide to serialize the document to e.g. ASCII
and therefore will lose any characters outside that range within PIs.
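
A rough illustration of that risk, assuming Python's `xml.dom.minidom` stands in for the intermediary stage (the function name `roundtrip_ascii` and the sample document are hypothetical). When asked for an ASCII serialization, minidom appears to write unrepresentable characters as character references everywhere, including inside PIs, where the next parser will not resolve them. Other serializers may instead raise an error or substitute a placeholder.

```python
from xml.dom import minidom

# Hypothetical sample: the same euro sign appears in a PI and in element text.
SOURCE = "<doc><?mypi price in \u20ac?>price in \u20ac</doc>"


def roundtrip_ascii(xml_text):
    """Parse, re-serialize as US-ASCII, parse again; return (pi_data, text)."""
    ascii_bytes = minidom.parseString(xml_text).toxml(encoding="us-ascii")
    root = minidom.parseString(ascii_bytes).documentElement
    pi = next(
        node
        for node in root.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    )
    text = "".join(
        node.data for node in root.childNodes if node.nodeType == node.TEXT_NODE
    )
    return pi.data, text


if __name__ == "__main__":
    pi, text = roundtrip_ascii(SOURCE)
    print("PI data after round trip:", repr(pi))
    print("text after round trip:  ", repr(text))
```

The element text survives the round trip because the character reference is resolved on re-parsing; the PI data does not, because the reference stays a literal string there.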


Is my understanding correct?

Regards, Christian.
 
Richard Tobin

Christian Roth said:
This effectively means that the possible character content of a PI is
limited by the document's encoding, since numerical entities cannot be
used to express characters outside of this encoding.

Yes.

(Well, it is *possible* to put arbitrary characters in a PI by means
of an entity:

<!DOCTYPE foo [
<!ENTITY pi "<?pi here is a euro symbol: &#8364; ?>">
]>
<foo>
&pi;
</foo>

but this is not practical in most circumstances.)
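
A quick way to try this trick is a sketch with Python's `xml.parsers.expat`. It assumes the euro sign in the entity value is written as the character reference `&#8364;`, so that the document itself stays pure ASCII; the helper name `collect_pis` is invented here. Character references in an entity value are expanded when the entity is declared, so the PI that the parser reports should contain the real character.

```python
from xml.parsers import expat

# ASCII-only document: the euro sign exists only as a character reference
# inside the entity's literal value.
DOC = """<!DOCTYPE foo [
<!ENTITY pi "<?pi here is a euro symbol: &#8364; ?>">
]>
<foo>
&pi;
</foo>
"""


def collect_pis(xml_text):
    """Return a list of (target, data) for every PI the parser reports."""
    found = []
    parser = expat.ParserCreate()
    parser.ProcessingInstructionHandler = lambda target, data: found.append(
        (target, data)
    )
    parser.Parse(xml_text, True)
    return found


if __name__ == "__main__":
    for target, data in collect_pis(DOC):
        print(target, repr(data))
```
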

Christian Roth said:
Consequently, this means that writing a PI and using any character
outside the ASCII range is bound for trouble when submitting such a
document (originally, say, in UTF-8) to an unknown XML workflow, since
intermediary stages may decide to serialize the document to e.g. ASCII
and therefore will lose any characters outside that range within PIs.

This is equally true for element and attribute names, since character
references cannot be used there either.
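
The same effect for names can be sketched as follows, again assuming Python's `xml.dom.minidom` as the re-serializing stage (`survives_ascii` is a hypothetical helper). A non-ASCII character in text can be rescued by a character reference; in an element name minidom seems to emit the reference anyway, which yields output that is no longer well-formed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def survives_ascii(xml_text):
    """Serialize to US-ASCII and report whether the result still parses."""
    ascii_bytes = minidom.parseString(xml_text).toxml(encoding="us-ascii")
    try:
        minidom.parseString(ascii_bytes)
    except ExpatError:
        return False
    return True


if __name__ == "__main__":
    print(survives_ascii("<doc>caf\u00e9</doc>"))  # non-ASCII in text content
    print(survives_ascii("<caf\u00e9/>"))          # non-ASCII in an element name
```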

-- Richard
 
Christian Roth

Richard Tobin said:
This is equally true for element and attribute names, since character
references cannot be used there either.

Thank you very much for the detailed answer, Richard - highly
appreciated!

Do you know if there is a technical reason(ing) for not having the
parser resolve (at least) numerical entities in PI data, element and
attribute names (and I think comments as well) in XML? Would this
possibly create ambiguous states during parsing?

Regards, Christian.
 
Richard Tobin

Christian Roth said:
Do you know if there is a technical reason(ing) for not having the
parser resolve (at least) numerical entities in PI data, element and
attribute names (and I think comments as well) in XML?

All this is inherited from SGML. So the best I can do is "historical
reasons". I imagine it just wasn't considered important enough.

-- Richard
 
Joseph Kesselman

BTW, Tim Bray's "Annotated XML Specification" -- while a bit out of date
since it's based on the XML 1.0 Recommendation -- is a wonderful
resource for understanding the rationale behind some of the design
decisions, not to mention figuring out what some of the
less-than-obvious phrases actually mean.

A copy can be found at http://www.xml.com/axml/axml.html. It uses
Frames, but that shouldn't be a problem for most modern browsers.
 
Peter Flynn

Richard said:
All this is inherited from SGML. So the best I can do is "historical
reasons". I imagine it just wasn't considered important enough.

It's a little more important than that :) The content of a PI was
designed to be used to send "system-specific markup to an application
in its own language" (SGML: Clause 8, Goldfarb, p.339). Goldfarb also
recommended making them all declared entities so that the system data
would be confined to the prolog and not occur in the document body.

But line 4 is very specific: "No markup is recognized in system data
other than the delimiter that would terminate it." The point is that
PIs are *not* for SGML or XML data: they are for instructions in some
other language. Therefore they are not subject to markup recognition.

///Peter
 
Richard Tobin

Richard Tobin said:
All this is inherited from SGML. So the best I can do is "historical
reasons". I imagine it just wasn't considered important enough.

Peter Flynn said:
It's a little more important than that :)

I must say I find this argument completely unconvincing, even if it
is correct.

Peter Flynn said:
But line 4 is very specific: "No markup is recognized in system data
other than the delimiter that would terminate it."

Yes, but *why not*?

Peter Flynn said:
The point is that PIs are *not* for SGML or XML data: they are for
instructions in some other language. Therefore they are not subject to
markup recognition.

Could you explain the "therefore" in that? I see no reason why
instructions for some other language are in less need of a character
escaping mechanism than any other part of the document. That the
escaping mechanism is considered markup doesn't make any difference to
whether it is needed in PIs.

If, as you say, it's important, just what bad thing would happen if
it were allowed?

-- Richard
 
Joseph Kesselman

Richard said:
Yes, but *why not*?

For whys, see the annotated XML spec. If it isn't explained there, the
answer is probably "because that's the way the spec describes it, either
inherited from SGML or because it seemed as good an answer as any."

Remember, XML is very much software engineering rather than computer
science. The answer is often going to be "nobody suggested a better
alternative before the document went to REC."
 
Peter Flynn

Richard said:
Yes, but *why not*?

Because that's the way that SGML was designed. I suspect that if we
want further details, we'll have to ask Charles Goldfarb or one of
his collaborators personally.

Richard said:
Could you explain the "therefore" in that? I see no reason why
instructions for some other language are in less need of a character
escaping mechanism than any other part of the document.

Quite possibly, but there isn't any way that the design of the system
data rules for PIs could possibly account for every escaping mechanism
used by every target-device control language then and in the future.

Richard said:
That the escaping mechanism is considered markup

Only if it is the XML escaping mechanism using the & character.
I can use (for example) LaTeX's \"a to produce an &auml; and
there's no problem.
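
The idea that the target application can bring its own escaping into the PI data might look like the sketch below. Everything here is an assumption for illustration: the `\uXXXX` convention, the target name `mypi`, and the helper names `decode_pi_data` and `first_pi_data`. The XML parser (Python's `xml.dom.minidom`) hands over the data untouched, and the application decodes it afterwards.

```python
import re
from xml.dom import minidom

# Hypothetical application-level convention: \uXXXX inside PI data.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

# The document stays pure ASCII; the backslash escape is just PI data to XML.
ASCII_DOC = r"<doc><?mypi price in \u20AC?></doc>"


def decode_pi_data(data):
    """Replace each \\uXXXX escape with the character it names."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), data)


def first_pi_data(xml_text):
    """Return the raw data of the first PI child of the root element."""
    root = minidom.parseString(xml_text).documentElement
    for node in root.childNodes:
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            return node.data
    raise ValueError("no processing instruction found")


if __name__ == "__main__":
    raw = first_pi_data(ASCII_DOC)
    print(repr(raw), "->", repr(decode_pi_data(raw)))
```

Such a convention only works if every consumer of the PI agrees on it, which is the price of keeping the escaping outside XML.
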

Richard said:
doesn't make any difference to whether it is needed in PIs.

I agree it's suboptimal: suppose I wanted to pass through the control
characters for driving a printer direct (ESC style char escapes) and
wanted ESC NUL DC3 SYN ACK. In practice, no-one in their right minds
is going to try and hard-wire printer escapes into XML -- although
goddess knows how many have tried to insert ^L into HTML files hoping
to force a page throw :)

But the point is that system data is not XML markup, so it's not
subject to being parsed as such. For my 2¢, this ought to mean that
it shouldn't be subject to being restricted to XML Characters either,
but that's a much longer and more complex problem.

And in any case, the OP said:
<?mypi &lt;par/&gt; ?>

delivers as data part the String(!) "&lt;par/&gt;".

What's wrong with that?

The OP said:
This effectively means that the possible character content of a PI is
limited by the document's encoding, since numerical entities cannot be
used to express characters outside of this encoding.

If you restrict the document's encoded character data content, then you
restrict the document's character data content, PIs and all. I can't see
PIs being exempt from the rules which by definition must govern the
whole document.

The OP said:
Consequently, this means that writing a PI and using any character
outside the ASCII range is bound for trouble when submitting such a
document (originally, say, in UTF-8) to an unknown XML workflow, since
intermediary stages may decide to serialize the document to e.g. ASCII
and therefore will lose any characters outside that range within PIs.

An intermediate stage which breaks the rules by changing encoding is
just asking for trouble.

Richard said:
If, as you say, it's important, just what bad thing would happen if
it were allowed?

It would mean just more trouble in writing software. In effect you are
asking for parsers to pass the byte content of a PI untouched and
uninterpreted. Or are you asking for it to be parsed and have character
entity references expanded? All entity references? Markup?

I suspect this is such a can of worms that it's best left untouched:
XML was designed as a cut-down of SGML with all the groodles removed.
Maybe an XML 2.* can revisit this.

///Peter
 
Richard Tobin

Peter Flynn said:
Maybe an XML 2.* can revisit this.

There will never be an XML 2.*. The replacement for XML will be
something completely different, uninhibited by any SGML compatibility
considerations (but probably inhibited by being related to something
else).

-- Richard
 
Joseph Kesselman

Richard said:
There will never be an XML 2.*. The replacement for XML will be
something completely different, uninhibited by any SGML compatibility
considerations (but probably inhibited by being related to something
else).

"Never say never"... I suspect that XML 2.x will be more of a Standards
effort (recognize and reconcile best practices through the whole maze of
interconnected documents) than a Recommendation effort as the W3C now
approaches it, and I expect it won't happen soon, but I do think it'll
happen eventually. (Consider how long it took before the ANSI C Standard
came out!)

By then there may be another hot topic... but I think XML's here for the
duration. Again, consider how long GML and SGML were around, even though
most of you young whippersnappers didn't know about 'em.
 
