Suppressing character entity transformation

R. P. · Apr 17, 2007

I wonder how to indicate in a stylesheet that character entities in an
element are not to be transformed as would be the case in XML-to-XML
transforms. I want to keep those &amp; &quot; and other character
entities in the output as they are in the input. My stylesheet converts
them to '&' etc., making the output XML not formed properly.

R.P

Joe Kesselman · Apr 17, 2007

If you set the output to XML -- or HTML -- problematic characters should
get converted back to entity (or numeric-character-reference)
representation. If you set text output mode, you're on your own.

Want to provide an example that demonstrates the problem you're seeing?

p.lepin · Apr 17, 2007

I wonder how to indicate in a stylesheet that character
entities in an element are not to be transformed as would
be the case in XML-to-XML transforms. I want to keep
those &amp; &quot; and other character entities in the output as
they are in the input. My stylesheet converts them to
'&' etc., making the output XML not formed properly.

That's just plain wrong.

Possible causes:

1. Your transformation engine is broken.
2. Your transformation engine is just fine, thank you very
much, but you're telling it to do the wrong thing (e.g.,
<xsl:output method="text"/> while outputting
something other than text).
3. You're using disable-output-escaping.

Recommended solutions:

1. Get a slightly less broken transformation engine.
(xalan, saxon, libxslt are the Big Three I believe.)
2. Don't do that then.
3. Don't do that then.
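
To make cause 3 concrete, here is a minimal sketch of how disable-output-escaping lets a raw ampersand through. The element names `note`, `out`, `safe` and `raw` are made up for illustration and do not come from the poster's data. Support for disable-output-escaping is optional in XSLT 1.0, so a given processor may simply ignore it.

```xml
<?xml version="1.0"?>
<!-- Hypothetical input assumed: <note>errors &amp; updates</note> -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="/note">
    <out>
      <!-- Default behaviour: the serializer escapes the ampersand again,
           so this child should serialize as
           <safe>errors &amp; updates</safe>. -->
      <safe><xsl:value-of select="."/></safe>
      <!-- With escaping disabled, the serializer is asked to write the
           character raw, giving <raw>errors & updates</raw>,
           which is not well-formed XML. -->
      <raw><xsl:value-of select="." disable-output-escaping="yes"/></raw>
    </out>
  </xsl:template>
</xsl:stylesheet>
```

Dropping the disable-output-escaping attribute is the whole fix in this case.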

R. P. · Apr 18, 2007

Joe Kesselman said:
If you set the output to XML -- or HTML -- problematic characters
should get converted back to entity (or numeric-character-reference)
representation. If you set text output mode, you're on your own.

Want to provide an example that demonstrates the problem you're
seeing?

Well, the following is not the real case but a pretty close example
of it:

Here is the original XML file from which two elements (releaseStatement
and releaseCode) will be extracted and the element names changed in the
process:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="SVCM_Transform.xsl"?>
<svcm xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="SVCMEdition_V1.0.xsd">
<edition>
<editionid>E12A60Z_20060910</editionid>
<editionType>New</editionType>
<editionNumber>11-00</editionNumber>
</edition>
<publication>
<productid>E12A60Z</productid>
<productType>00-14</productType>
<title>Service Manual</title>
<issueLevel>17</issueLevel>
<issued>2006-09-10</issued>
<available>2006-09-10</available>
<releaseStatement>No effort was spared to present accurate information
in this
Service Manual at the time of issue. Any errors &amp; updates
that could
not wait till the next issue will be published in errata and
periodic
bulletins distributed to authorized
dealers.</releaseStatement>
<releaseCode scheme="DLR-1">Dealer Distribution</releaseCode>
<updateFreq scheme="YR">2</updateFreq>
</publication>
</svcm>

Here is the XSLT stylesheet I wrote to do the transform (I am pretty new
to XSLT, BTW):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="no" />

<xsl:template match="/">
<xsl:text>&lt;RELEASE_INFO&gt;</xsl:text>
<xsl:apply-templates select="*/publication/releaseStatement" />
<xsl:apply-templates select="*/publication/releaseCode" />
<xsl:text>&lt;/RELEASE_INFO&gt;</xsl:text>
</xsl:template>

<xsl:template match="releaseStatement">
<RELEASE_STATEMENT>
<xsl:value-of select="."/>
</RELEASE_STATEMENT>
</xsl:template>

<xsl:template match="releaseCode">
<xsl:text>&lt;RELEASE_CODE SCHEME="</xsl:text>
<xsl:value-of select="@scheme"/>
<xsl:text>"&gt;</xsl:text>
<xsl:value-of select="."/>&lt;/RELEASE_CODE&gt;
</xsl:template>

</xsl:stylesheet>

The output from the XSLT process is this:

<RELEASE_INFO>
<RELEASE_STATEMENT>
No effort was spared to present accurate information in this
Service Manual at the time of issue. Any errors
& updates that could
not wait till the next issue will be published
in errata and periodic
bulletins distributed to authorized dealers.
</RELEASE_STATEMENT>
<RELEASE_CODE SCHEME="DLR-1">Dealer Distribution</RELEASE_CODE>
</RELEASE_INFO>

Note the '&' in the RELEASE_STATEMENT.
Unfortunately, putting the method="xml" attribute into xsl:output is not
much help either, as it screws up some of the tag delimiters like so:

&lt;RELEASE_INFO&gt;
<RELEASE_STATEMENT>
No effort was spared to present accurate information in this
Service Manual at the time of issue. Any errors
&amp; updates that could
not wait till the next issue will be published
in errata and periodic
bulletins distributed to authorized dealers.
</RELEASE_STATEMENT>
&lt;RELEASE_CODE SCHEME="DLR-1"&gt;Dealer Distribution&lt;/RELEASE_CODE&gt;
&lt;/RELEASE_INFO&gt;

Perhaps Pavel is right and this Unix command line xml tool from Oracle
is broken and I should try something else.

Rudy

p.lepin · Apr 18, 2007

Warning: shell session transcripts. Some lines may be
longer than 78 characters.

Here is the original XML file from which two elements
(releaseStatement and releaseCode) will be extracted and
the element names changed in the process:

[snippety snip]

Hmm, lessee.

pavel@debian:~/dev/xslt$ xmllint nonwf.xml
nonwf.xml:19: parser error : xmlParseEntityRef: no name
Service Manual at the time of issue. Any errors & updates
^

What do you know, this is not well-formed and therefore is
not XML. Any standards-compliant XML tool should choke on
it (as xmllint did).

If this is, indeed, the document you're working with, you
should talk to whoever you got that from and tell them that
whatever they may think, they're not producing XML.

If this is, indeed, the document you're working with and
your XML toolkit accepts it, then, indeed, your toolkit is
broken. And if your toolkit is broken, all bets are off.
Report the problem to whoever you got your tools from. If
the makers of the toolkit claim standards-compliance, you
might want to talk to your company's lawyers as well. They
tend to like stuff like this, scavenger mentality and all
that.

Here is the XSLT stylesheet I wrote to do the transform
(I am pretty new to XSLT, BTW):

OH MY GAWD!

Sorry for shouting.

<xsl:template match="/">
<xsl:text><RELEASE_INFO></xsl:text>
<xsl:apply-templates select="*/publication/releaseStatement" />
<xsl:apply-templates select="*/publication/releaseCode" />
<xsl:text></RELEASE_INFO></xsl:text>
</xsl:template>

Yeah, right. That's not well-formed either. I heartily
recommend that you stop right now and go read a good XSLT
tutorial.

What you want is:

<xsl:template match="/">
<RELEASE_INFO>
<xsl:apply-templates
select="*/publication/releaseStatement"/>
<xsl:apply-templates
select="*/publication/releaseCode"/>
</RELEASE_INFO>
</xsl:template>

R. P. said:
<xsl:template match="releaseCode">
<xsl:text><RELEASE_CODE SCHEME="</xsl:text>
<xsl:value-of select="@scheme"/>
<xsl:text>"></xsl:text>
<xsl:value-of select="."/></RELEASE_CODE>
</xsl:template>

Nope, that doesn't work, that shouldn't work, and is just
plain wrong. Did I mention reading XSLT tutorials?

<xsl:template match="releaseCode">
<RELEASE_CODE>
<xsl:attribute name="SCHEME">
<xsl:value-of select="@scheme"/>
</xsl:attribute>
<xsl:value-of select="."/>
</RELEASE_CODE>
</xsl:template>

R. P. said:
as it screws up some of the tag delimiters as so:

Please, just forget the tags while you're working with XML.
Just... let them go, you know? An XML document is really a
tree of nodes. Some of the nodes are elements, some are
attributes and so on. There are no tags. Just a tree of
nodes. That's where the well-formedness constraints come
from. That text file with name ending in .xml is just a
serialization of a tree of nodes. The sooner you forget the
tags and See The Tree, the sooner you'll stop stumbling on
every other step and get some real work done.

Assuming your XML document actually *is* well-formed, and
with the fixes above:

pavel@debian:~/dev/xslt$ xsltproc nonwf.xsl nonwf.xml
<RELEASE_INFO><RELEASE_STATEMENT>No effort was spared to present
accurate information
in this
Service Manual at the time of issue. Any errors &amp;
updates
that could
not wait till the next issue will be published in errata
and
periodic
bulletins distributed to authorized
dealers.</RELEASE_STATEMENT><RELEASE_CODE SCHEME="DLR-1">Dealer
Distribution</RELEASE_CODE></RELEASE_INFO>
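
For reference, the fixed templates can be assembled into one complete stylesheet. This is a sketch rather than the poster's final code: it swaps the `xsl:attribute` form for an attribute value template (`{@scheme}`), which is standard XSLT 1.0 shorthand and should produce the same `SCHEME` attribute, and it sets `method="xml"` explicitly.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="no"/>

  <!-- Build the wrapper as a real element node, not as text that
       happens to look like a tag. -->
  <xsl:template match="/">
    <RELEASE_INFO>
      <xsl:apply-templates select="*/publication/releaseStatement"/>
      <xsl:apply-templates select="*/publication/releaseCode"/>
    </RELEASE_INFO>
  </xsl:template>

  <xsl:template match="releaseStatement">
    <RELEASE_STATEMENT><xsl:value-of select="."/></RELEASE_STATEMENT>
  </xsl:template>

  <!-- {@scheme} is an attribute value template: the XPath expression
       inside the braces is evaluated and its string value becomes the
       value of the SCHEME attribute. -->
  <xsl:template match="releaseCode">
    <RELEASE_CODE SCHEME="{@scheme}"><xsl:value-of select="."/></RELEASE_CODE>
  </xsl:template>
</xsl:stylesheet>
```

Because every delimiter now comes from the serializer rather than from `xsl:text`, escaping of `&` and `<` in the text content is handled by the processor.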

Joe Kesselman · Apr 18, 2007

Actually,
>R. P. wrote:

(Sigh. In the past I've mostly ignored this sort of editing glitch. But
after having been beaten up about it a few times, I'm gonna take my turn
at reminding folks to be especially careful when trimming attributions.)

p.lepin · Apr 18, 2007

(Sigh. In the past I've mostly ignored this sort of
editing glitch. But after having been beaten up about it
a few times, I'm gonna take my turn at reminding folks to
be especially careful when trimming attributions.)

I've noticed that myself and will you believe I was
actually tempted to commit seppuku over it, especially
since I'm probably the most intolerant jerk on the group
where sloppy postings are concerned. Then I just figured
no one would notice that little slip on my part.

Anyway, 'Sir, yes sir! It won't happen again sir!
Permission to drop dead out of sheer embarrassment, sir!'

Oh. And I believe my sig delimiter is broken. That's
posting through Grougle Goops for you.

R. P. · Apr 19, 2007

Joe Kesselman said:
(Sigh. In the past I've mostly ignored this sort of editing glitch.
But after having been beaten up about it a few times, I'm gonna take
my turn at reminding folks to be especially careful when trimming
attributions.)

I'm not sure what you're getting at.
The 1st sample XML, BTW, was well formed and parsed fine. The output XML
from the transform was also what I expected when the text did not
contain any of those special character entities and I did not use the
method="xml" attribute. I guess Pavel did not take into account some
format loss of those samples during transfer in the news group. All in
all, I think I made a mistake to post about the whole issue over here.

RP

Joe Kesselman · Apr 19, 2007

R. P. said:
I'm not sure what you're getting at.

I was gently chastising Pavel for a minor stylistic point. Nothing to
do with you.

I'm still not understanding what problem you're having. When XSLT reads
in a document, character references and entity references are expanded;
when it writes the document back out as either XML or HTML, it should
re-create character references where (and only where) they're necessary.

If your output mode is Text, other rules apply. If you're taking SAX or
DOM output from the XSLT processor and converting it to characters via
your own serializer code, then it's your responsibility to make sure
this is handled properly.
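
A small sketch of that round trip, assuming a made-up input document (`samples` and `s` are hypothetical names). The three spellings of the ampersand parse to the same character, so a plain copy cannot tell them apart on output; a conforming serializer would typically write each one back as `&amp;`, though a numeric reference would be equally correct.

```xml
<?xml version="1.0"?>
<!-- Hypothetical input; all three spellings denote the same character:
       <samples>
         <s>errors &amp; updates</s>
         <s>errors &#38; updates</s>
         <s>errors &#x26; updates</s>
       </samples>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Copy the parsed tree unchanged. The original spelling of each
       reference is not part of the tree, so it cannot be preserved;
       the serializer picks one valid escape for every ampersand. -->
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```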

> Note the '&' in the RELEASE_STATEMENT.

As Pavel said, your stylesheet is REALLY badly designed. You should
***NEVER*** be trying to hand-construct tags. Doing so will at best rob
you of some of the strength of XSLT and force you to do lots of
unnecessary work, and at worst may seriously confuse the next stage of
processing. You really want to fix that, along the lines he illustrated
(issuing actual elements rather than text that looks like tags) before
you do anything else.

In fact, it is precisely because you *have* forced the hand-construction
of tags that the & isn't getting converted back to &amp;. If you're
outputting text, the processor just writes out the characters. You need
to tell it that you're producing XML if you want it to convert
characters as necessary for XML.

In fact, that's precisely why setting output to XML was giving you &lt;
in place of < -- because < is a reserved character, and has to be
escaped. (&#60; is equivalent to &lt;, just as &#38; is equivalent to
&amp;).

Set output to XML mode. Replace your hand-constructed tags with real XML
structure. The result may not be *exactly* what you expect -- you may
get the &#38; rather than &amp; -- but it will be correct. Trying to make
the kluge you've got now do the right thing really is a lost cause; it
may not be possible, and it certainly isn't worth the effort.

You may or may not like that advice. But it's the correct advice.

p.lepin · Apr 19, 2007

[constructing serialized XML by hand in XSLT]

I guess Pavel did not take into account some format loss
of those samples during transfer in the news group.

There shouldn't be any format loss, but the Grougle Goops
seems to know better. For some reason they decided I want
my entities served expanded. I'd file a bug report, but
it's a well-known fact that's an exercise in futility where
google is concerned, unless the bug in question directly
affects their business.

The output XML from the transform was also what I
expected when the text did not contain any of those
special character entities and I did not use the
method="xml" attribute.

By using method="text" you're telling the serializer that
you're not outputting an XML document, and in that case if
you output an '&', you *want* to have it that way in the
resulting document. So it's *entirely* up to you to process
the character entities. You asked for it yourself, what
else did you expect?
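
As an illustration of what text mode does (a sketch of the anti-pattern, not a recommendation), the stylesheet below writes characters that merely look like markup. It assumes the `publication/releaseStatement` structure from the sample document earlier in the thread.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Text mode: string values are written as-is, with no XML escaping. -->
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- These are plain characters in the result, not element nodes. -->
    <xsl:text>&lt;RELEASE_STATEMENT&gt;</xsl:text>
    <!-- An ampersand in the source text is emitted as a bare "&". -->
    <xsl:value-of select="*/publication/releaseStatement"/>
    <xsl:text>&lt;/RELEASE_STATEMENT&gt;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The result reads like XML to a human, but any `&` or `<` in the statement text makes it ill-formed for a parser.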

So what did you do wrong? You're trying to use XSLT to
output a text that coincidentally may be interpreted as an
XML document. That's just plain wrong, wrong, wrong. What
you should be doing instead is constructing an XML result
tree, and letting it be serialized into a well-formed XML
document, using your XML parser's/XSLT processor's
understanding of XML serialization (which is likely far
superior to your understanding... or mine, for that
matter).

All in all, I think I made a mistake to post about the
whole issue over here.

Perhaps, since you don't seem to be listening to advice. I
explained not just what was wrong with your transformation,
but also how to fix it, and Joseph elaborated on that
giving you a good bit of background to understand what's
really going on under the covers.

Moreover, I told you what your biggest problem is: that you
don't understand how XSLT works and how it should be used.
I even told you what you should do about that: read a good
XSLT tutorial.

If you believe you know better, well, have fun.

Joe Kesselman · Apr 19, 2007

By the way, if Google is messing up posting of XML examples, that's a
really strong argument for not using Google when posting examples to
this newsgroup -- or for putting them on a website via some mechanism
that doesn't damage them, and just posting the URI here.

If we have to patch the question before answering it, people will
grumble at best and may decide it isn't worth the effort. Helping us
help you is a Good Thing.

Blunt but good advice on how to work effectively with newsgroups:
http://www.catb.org/~esr/faqs/smart-questions.html

Joseph Kesselman · Apr 19, 2007

Blunt but good advice on how to work effectively with newsgroups:

http://www.catb.org/~esr/faqs/smart-questions.html

Which, by the way, also has a section on giving good answers. Newbies by
definition are going to write unreasonable code; we should try to
remember that this is usually ignorance, not stupidity, and correct them
without abusing them more than necessary.

"Rule Two, no member of the faculty is to maltreat the Abos in any way
at all -- if there's anybody watching."
(http://www.adelaide.edu.au/library/guide/hum/philosophy/philos_bruce.html)

p.lepin · Apr 19, 2007

By the way, if Google is messing up posting of XML
examples, that's a really strong argument for not using
Google when posting examples to this newsgroup -- or for
putting them on a website via some mechanism that doesn't
damage them, and just posting the URI here.

I don't think the problem is with posting the messages
through GG. I'm experiencing some rather weird behavior
while *viewing* the postings through GG. For example, while
viewing your recent message

<[email protected]>

I'm getting the same results while viewing the thread
normally and while asking the GG to show the original,
unparsed, unmangled message:

<quote>

In fact, that's precisely why setting output to XML was giving you
&lt;
in place of < -- because < is a reserved character, and has to be
escaped. (&#60; is equivalent to &lt;, just as &#38; is equivalent to
&amp;).

</quote>

But that's not the case for OP's message

<[email protected]>

In the thread view I'm seeing the following:

<quote>

<xsl:template match="/">
<xsl:text><RELEASE_INFO></xsl:text>
<xsl:apply-templates select="*/publication/releaseStatement" />
<xsl:apply-templates select="*/publication/releaseCode" />
<xsl:text></RELEASE_INFO></xsl:text>
</xsl:template>

</quote>

...while the original message source displays as:

<quote>

<xsl:template match="/">
<xsl:text>&lt;RELEASE_INFO&gt;</xsl:text>
<xsl:apply-templates select="*/publication/releaseStatement" />
<xsl:apply-templates select="*/publication/releaseCode" />
<xsl:text>&lt;/RELEASE_INFO&gt;</xsl:text>
</xsl:template>

</quote>

This is beyond annoying. *sigh* I guess I'll have to find
a reliable nntp server and set up a real newsreader after
all these years...

Joseph Kesselman · Apr 19, 2007

> But that's not the case for OP's message

That sounds like something you could report to Google as a minimal "why
is this one OK when that one isn't" bug. Or, as you say, you could
switch to tools that have already been debugged.

R. P. · Apr 20, 2007

Joe Kesselman said:
By the way, if Google is messing up posting of XML examples, that's a
really strong argument for not using Google when posting examples to
this newsgroup -- or for putting them on a website via some mechanism
that doesn't damage them, and just posting the URI here.

I didn't use Google. I used the Outlook Express news reader with
Comcast's news server.

R. P.

R. P. · Apr 20, 2007

Joseph Kesselman said:
Which, by the way, also has a section on giving good answers. Newbies
by definition are going to write unreasonable code; we should try to
remember that this is usually ignorance, not stupidity, ...

Well, thank you for that. I might add to it that some of us rarely need
xml and spending a large amount of time getting up on the XSLT learning
curve is not a very efficient use of our time. I probably would have to
relearn the whole thing when I need it next time because I would have
forgotten most of it by then. Now if I made most of my living on xml
related stuff, that would change the equation for me radically. I have a
feeling that you two earn most of your living from your xml skills.

R. P.

Joe Kesselman · Apr 20, 2007

R. P. said:
not a very efficient use of our time.

If so, it may be time to hire someone who's already up the curve, at
least as a part-time consultant. (If it isn't worth your time, why is it
worth ours?)

The downside of relying on free advice is that it's up to you to do your
homework, look at existing examples/tutorials and pose the questions in
a form that makes it easy for volunteers to help you -- or to put up
with being grumbled at or ignored when you fail to do so.

Pavel Lepin · Apr 20, 2007

R. P. said:
I might add to it that some of us rarely need xml and
spending large amount of time getting up on the XSLT
learning curve is not a very efficient use of our time.

That's an oft seen attitude. Unfortunately, it's just plain
wrong. If you need to use foo, you either invest in
acquiring a certain familiarity with foo or else hire
someone who did just that. If you don't understand foo,
invoke foo at the peril of suffering your management's
displeasure when the outcome turns out to be a disaster.

Now if I made most of my living on xml related stuff, that
would change the equation for me radically.

XSLT is a DSL, and it's suffering from the same problem that
plagues most of the other DSLs (such as SQL and regexen, to
name a couple of the most prominent). For some reason, J.
Random Developer seems to think he doesn't need to grok the
DSLs he's using.

I have a feeling that you two earn most of your living
from your xml skills.

I've no idea if you're right in Joseph's case, but certainly
not in mine. I earn 95% of my living from my communication
with suits and pointy-haireds, maintenance & bug-hunting,
OOA&D and PHP skills (in roughly that order). The remaining
5% come mostly from my familiarity with JavaScript, HTML
DOM and SQL.

Joe Kesselman · Apr 20, 2007

I have a feeling that you two earn most of your living

I've no idea if you're right in Joseph's case,

Let's see... DOM Working Group, original version of the Xerces DOM
implementation, large percentage of the code in Xalan, influencing some
of IBM's other XML processing chains... No, actually I earn most of my
living from my data structures and algorithm skills; XML is just where I
happen to have been applying those in recent years.
