Preventing the UTF-8 Parser from converting an entity?

Jean-François Michaud · Sep 18, 2006

Hello all,

I'm having a little problem. The UTF-8 parser we are using converts the
newline entity (&#10;) within an attribute that we are using to palliate
CSS limitations.

After the parser has gone through the document, the entity is converted
to \n, which then effectively tosses out the window the behavior we are
getting by keeping the entity AS IS within the document.

Is there a clean and easy way around this?

Any help will be greatly appreciated.

Regards
Jean-Francois Michaud

Bjoern Hoehrmann · Sep 18, 2006

* Jean-François Michaud wrote in comp.text.xml:

I'm having a little problem. The UTF-8 parser we are using converts the
newline entity (&#10;) within an attribute that we are using to palliate
CSS limitations.

I don't understand your question. First, &#10; is not an entity but a
numeric character reference. Second, processing those is independent of
character encodings like UTF-8. Third, I don't see what CSS limitation
you might be referring to here.

After the parser has gone through the document, the entity is converted
to \n, which then effectively tosses out the window the behavior we are
getting by keeping the entity AS IS within the document.

What is "\n" here? What do you mean by "converted"? What do you mean by
keeping it? Processing white-space characters and character references
to them in attribute values is explained in the XML specification. XML
processors keep them to the extent that they are significant. If you
connect the processor to a serializer, the input and output documents
will be canonically equivalent unless one of them has a bug. So there
should be no issue here.

Martin Honnen · Sep 18, 2006

Jean-François Michaud wrote:

I'm having a little problem. The UTF-8 parser we are using converts the
newline entity (&#10;) within an attribute that we are using to palliate
CSS limitations.

&#10; is neither an entity nor an entity reference, but rather a numeric
character reference.
What is a "UTF-8 parser"?

After the parser has gone through the document, the entity is converted
to \n, which then effectively tosses out the window the behavior we are
getting by keeping the entity AS IS within the document.

It is not clear what kind of tool you use and what you finally produce,
but if you want to serialize a DOM or an XSLT result tree to XML markup
and want that newline character to be escaped as &#10;, that is, as a
numeric character reference, then you need an XML serializer that does
that. If you want to serialize such a tree to HTML markup then you need
an HTML serializer that does that.

Richard Tobin · Sep 18, 2006

Jean-François Michaud said:
After the parser has gone through the document, the entity is converted
to \n, which then effectively tosses out the window the behavior we are
getting by keeping the entity AS IS within the document.

Is there a clean and easy way around this?

Not using XML. XML applications are effectively required to treat
character references in content the same way that they treat the
characters referred to. A conforming XML parser will convert it in
the way you describe.

If you want to have something that's like a newline but is treated
differently, then a character reference is not the right approach.
That's not what they're for. Using an element such as <nl/> might be
a better solution.
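
A minimal sketch of the point above, assuming Python's standard
xml.etree.ElementTree parser stands in for "a conforming XML parser" (the
element name p is made up for illustration). Once parsed, a character
reference in content cannot be told apart from the character itself:

```python
import xml.etree.ElementTree as ET

# The same content spelled three ways: decimal reference, hexadecimal
# reference, and a literal linefeed character.
with_decimal = ET.fromstring("<p>first&#10;second</p>")
with_hex = ET.fromstring("<p>first&#xA;second</p>")
with_literal = ET.fromstring("<p>first\nsecond</p>")

# The application receives the same string in every case; the parser does
# not report which spelling was used in the source.
print(repr(with_decimal.text))
print(with_decimal.text == with_hex.text == with_literal.text)
```

This only covers element content. Inside attribute values a literal
linefeed is treated differently from a reference to one, which is why the
serializer matters there.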

-- Richard

Jean-François Michaud · Sep 18, 2006

Richard said:
Not using XML. XML applications are effectively required to treat
character references in content the same way that they treat the
characters referred to. A conforming XML parser will convert it in
the way you describe.

If you want to have something that's like a newline but is treated
differently, then a character reference is not the right approach.
That's not what they're for. Using an element such as <nl/> might be
a better solution.

Understood, but we are using a strange combination of XML + CSS under
the VEX XML editor.

We are displaying the attribute before a bit of text, but because of a
silly CSS limitation (not being able to test for a condition in a
:before pseudo-element), we thought that appending the &#10;
character at the end of the string would do the trick. It does indeed
work, but as soon as we save the document, the character gets converted
to UTF-8 encoding. We HAVE to use this character because VEX doesn't
deal with UTF-8 encoding directly to format its output. Using an <nl/>
element is simply not an option.

Regards
Jean-Francois Michaud

Jean-François Michaud · Sep 18, 2006

Bjoern said:
* Jean-François Michaud wrote in comp.text.xml:

I don't understand your question. First, &#10; is not an entity but a
numeric character reference. Second, processing those is independent of
character encodings like UTF-8. Third, I don't see what CSS limitation
you might be referring to here.

Alright, let me clarify. We allow numeric character references to be
included in our XML document so that special characters can be included
in the output. These numeric sequences get converted to UTF-8 encoding
for proper transformation into yet another XML document, which is then
transformed into PDF using XSLT/XSL-FO. All the way through, encoding
has to abide by UTF-8, hence the reason why the numeric sequences have
to be converted to meet this restriction. The problem is that the XML
editor that we use to display the XML content (using XML + CSS) doesn't
use UTF-8 encoded characters when dealing with formatting. It
recognizes the &#10; character, but not the UTF-8 version of it.

The problem all stems from CSS being unable to allow for me to test a
condition while displaying using a :before pseudo element (I can either
display using :before, or I can test for a condition, but I can't do
both at the same time. Yay for CSS!).

The solution was to append the character &#10; at the end of the string
attribute that we want to display so that the line break only
occurs when the string is non-empty. This works splendidly, but as soon
as we save the document, the engine converts everything to UTF-8
encoding (booo!).

[snip]

Regards
Jean-Francois Michaud

Joseph Kesselman · Sep 18, 2006

The solution was to append the character &#10; at the end of the string
attribute

If you mean inside the attribute value... A properly functioning XML
serializer should recognize line breaks within attribute values as a
special case and escape them as necessary to write them back out,
typically as &#10;.

However, the distinction between &#10;, CR, LF, and CRLF will not be
preserved elsewhere. The only place where XML cares about the difference
between these is in the details of attribute value normalization and
serialization.

And while looking at the parsed version of the data (as output from the
parser but not run back through a serializer), you will always see these
as the newline character.

I'm still not sure from your description which of these applies to your
particular problem. You might want to post a very explicit description
of what your source XML looks like, how you're viewing the result of the
parse, and what you're seeing.

In any case, UTF-8 has nothing to do with any of the above; it's
strictly XML behaviors.
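
As a rough illustration of the serializer behavior described above, here
is a sketch using Python's standard xml.etree.ElementTree. It is one
serializer among many, not the tool used in the original question, and the
names item and label are invented for the example:

```python
import xml.etree.ElementTree as ET

# Source text with a character reference at the end of an attribute value.
source = '<item label="Heading&#10;"/>'

element = ET.fromstring(source)
parsed_label = element.get("label")      # parsed form: a real linefeed

# Serializing writes the linefeed back out as a character reference, so
# the value survives a save/reload cycle.
serialized = ET.tostring(element, encoding="unicode")
reparsed_label = ET.fromstring(serialized).get("label")

print(repr(parsed_label))
print(serialized)
print(parsed_label == reparsed_label)
```

A serializer that wrote a raw linefeed here instead would lose the line
break on the next parse, because attribute value normalization would turn
it into a space.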

Joseph Kesselman · Sep 18, 2006

Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
not designed for XML processing; XSLT was (and is more powerful than CSS).

Jean-François Michaud · Sep 18, 2006

Joseph said:
Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
not designed for XML processing; XSLT was (and is more powerful than CSS).

I know, that would have been my take also. The technology that we are
using is the VEX XML editor. It allows users to update XML content as
if they were in Word, which is not entirely uninteresting, but CSS is
not advanced enough for this XML + CSS combo to work perfectly when
more demanding formatting is necessary. VEX unfortunately uses CSS to
render the output on display. No way around this short of throwing
everything in the garbage altogether, and that's just not gonna happen.

Regards
Jeff

Richard Tobin · Sep 18, 2006

We are displaying the attribute before a bit of text

If the character is in an attribute, rather than content, it should be
output as &#10; or an equivalent reference. This is because an
ordinary linefeed would be normalised to a space character when the
file is read in again.

It does indeed
work, but as soon as we save the document, the character gets converted
to UTF-8 encoding.

Just to be clear about this: linefeed is an ASCII character, and is the
same in UTF-8 as in ASCII.

We HAVE to use this character because VEX doesn't
deal with UTF-8 encoding directly to format its output.

I really don't understand this at all. The encoding is not relevant
here. In your input file, you will have &#10;. A program that reads
(parses) this will have a linefeed character in its data, using
whatever internal encoding it happens to use. UTF-8 only becomes
relevant when you output the file, and as I said a linefeed in an
attribute should be output as &#10; rather than a linefeed character.
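
To make the "encoding is not relevant" point concrete, here is a small
sketch, again assuming Python's xml.etree.ElementTree as the parser. The
element name item, the attribute label, and the helper parse_label are
hypothetical. The same document is stored in two different encodings and
parsed; the attribute value the program sees is identical:

```python
import xml.etree.ElementTree as ET

# One document, parameterized by the encoding named in its XML declaration.
template = (
    '<?xml version="1.0" encoding="{enc}"?>'
    '<item label="Fran\u00e7ois&#10;"/>'
)

def parse_label(encoding):
    """Encode the document as bytes in `encoding`, parse it, return the label."""
    data = template.format(enc=encoding).encode(encoding)
    return ET.fromstring(data).get("label")

utf8_label = parse_label("utf-8")
latin1_label = parse_label("iso-8859-1")

# The parsed value does not depend on how the file was encoded on disk.
print(utf8_label == latin1_label)

# A linefeed is a single byte 0x0A in both ASCII and UTF-8.
print("\n".encode("ascii") == "\n".encode("utf-8"))
```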

-- Richard

Joseph Kesselman · Sep 18, 2006

(parses) this will have a linefeed character in its data [...]

attribute should be output as &#10; rather than a linefeed character.

Absolutely. If you're looking at the parsed form of the attribute's
value, you should see the newline character. If you're looking at the
text form, you should see &#10;. If either is not true, your tools are
broken.

Philippe Poulard · Sep 19, 2006

Jean-François Michaud said:
Hello all,

I'm having a little problem. The UTF-8 parser we are using converts the
newline entity (&#10;) within an attribute that we are using to palliate
CSS limitations.

After the parser has gone through the document, the entity is converted
to \n, which then effectively tosses out the window the behavior we are
getting by keeping the entity AS IS within the document.

Is there a clean and easy way around this?

Any help will be greatly appreciated.

Regards
Jean-Francois Michaud

hi,

[CR], [LF], [CR/LF] are normalized by XML parsers, but character
references are left as-is (the value you see is the character that is
referred to)

that is to say, if you parse the following document:

<?xml version="1.0"?>
<foo bar="abc&#13;&#10;def
ghi"/>

(with [CR/LF] between "def" and "ghi")
you will get this value:

abc
def ghi

(with [CR/LF] between "abc" and "def")
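
The example above can be checked with a short sketch. It assumes the
attribute value contains the references &#13;&#10; between "abc" and "def"
and a literal CR/LF pair between "def" and "ghi", and it uses Python's
xml.etree.ElementTree as the parser:

```python
import xml.etree.ElementTree as ET

# "abc" and "def" are separated by character references to CR and LF;
# "def" and "ghi" are separated by a literal CR/LF line ending.
document = (
    '<?xml version="1.0"?>\n'
    '<foo bar="abc&#13;&#10;def\r\nghi"/>'
)

value = ET.fromstring(document).get("bar")

# The referenced CR and LF survive as real characters, while the literal
# line ending is normalized to a single space.
print(repr(value))
```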

--
Cordialement,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
