Extended Characters in XML

barthome1 · Mar 18, 2005

Hello,

My company collects data from non-US sources. We are starting projects
where this data will be output in an XML document and passed around to
our applications and third party tools.

The data includes some of the extended characters. We get strange
accent marks, italics and the like. These characters have decimal
value in the 200+ range.

So how do you handle these in XML with the assurance that you won't
lose content and the off-the-shelf XML technologies will interpret them
correctly and not simply reject the document as flawed?

We know about the special escape sequences for the reserved XML
characters like '>' and '<'. Is there a standard escape sequence for
the extended characters?

Thanks ahead of time for any help.

Bart
(e-mail address removed)

Andreas Prilop · Mar 18, 2005

> The data includes some of the extended characters. We get strange
> accent marks, italics

Italics??

> and the like. These characters have decimal
> value in the 200+ range.
> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret them
> correctly and not simply reject the document as flawed?

One possibility is to write all of them in the form &#number;
where number is the decimal code position in Unicode.

> Is there a standard escape sequence for the extended characters?

&#number;, which is the same as in SGML/HTML. See
http://www.unics.uni-hannover.de/nhtcapri/multilingual2.html
for examples in various scripts.
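
A minimal Python sketch of the numeric-reference approach described above. The sample string and the helper name to_decimal_refs are assumptions made for illustration; the "xmlcharrefreplace" error handler is part of Python's standard codec machinery and replaces every character the target encoding cannot represent with a decimal reference.

```python
def to_decimal_refs(text: str) -> str:
    # Encoding to ASCII forces every non-ASCII character through the
    # error handler, which emits &#number; with the decimal code point.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

sample = "Café Zürich"  # hypothetical input
print(to_decimal_refs(sample))  # Caf&#233; Z&#252;rich
```

Note that this only rewrites non-ASCII characters; "<" and "&" in the data still need their own escaping.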

Martin Honnen · Mar 19, 2005

> My company collects data from non-US sources. We are starting projects
> where this data will be output in an XML document and passed around to
> our applications and third party tools.
>
> The data includes some of the extended characters. We get strange
> accent marks, italics and the like. These characters have decimal
> value in the 200+ range.

Any XML parser is supposed to support the UTF-8 encoding, so you could
encode your XML documents as UTF-8 and then use all the characters
Unicode supports directly in your document. You only need to
make sure you use an editor that allows creation of UTF-8 encoded
documents. Or you could, as already suggested, escape characters with
the Unicode code point, e.g. &#8364; for the Euro sign €.
<http://www.unicode.org/>
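
As a rough illustration of the UTF-8 route, here is a small Python sketch using the standard library's xml.etree.ElementTree. The element name "price" and the sample text are made up for the example, and the xml_declaration argument assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.Element("price")      # hypothetical element name
root.text = "€ 25, café"        # hypothetical content

# Serialize to UTF-8 bytes with an explicit XML declaration.
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# The euro sign is stored directly as its three UTF-8 bytes,
# not as a numeric character reference.
euro_stored_directly = "€".encode("utf-8") in data

# Parsing the bytes gives back the original text unchanged.
parsed = ET.fromstring(data)
print(euro_stored_directly, parsed.text)
```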

Jukka K. Korpela · Mar 19, 2005

Andreas Prilop said:
> Italics??

That sounds somewhat strange indeed, since normally the font style is
expressed at a level other than character level, e.g. in markup.
(Contrary to populistic propaganda, XML markup is not inherently
"logical"; nothing prevents you from using XML markup for purely
presentational purposes. If you need to store information in a manner
that preserves formatting information, that might be a good idea.
Using <i> for italics as in HTML would be natural then.)

But there _are_ characters in Unicode that are italicized variants of
other characters. Many of them are compatibility characters that have
been included just because they exist as characters in other standards.
There are other cases as well. If this topic is relevant, then the
document "Unicode in XML and other Markup Languages"
http://www.w3.org/TR/unicode-xml/ should be studied.
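
To check whether incoming "italics" are really such compatibility characters, something like the following Python sketch may help. It uses the standard unicodedata module; the choice of U+1D44E as the sample character is an assumption for illustration, and whether folding with NFKC is appropriate depends on the data.

```python
import unicodedata

italic_a = "\U0001D44E"  # sample character, chosen for illustration

# The character name and its compatibility decomposition show that it
# is a font variant of the ordinary letter "a".
name = unicodedata.name(italic_a)
decomposition = unicodedata.decomposition(italic_a)

# NFKC normalization folds the variant back to the plain letter,
# discarding the italic distinction.
folded = unicodedata.normalize("NFKC", italic_a)

print(name, "|", decomposition, "|", folded)
```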

> One possibility is to write all of them in the form &#number;
> where number is the decimal code position in Unicode.

That's certainly a way to represent them in XML, and this might be useful
to protect against problems with encodings (and transcoding). However,
it normally wins nothing and loses a lot in readability of the text in
XML source. (In XML it might be better to use &#xhhhh; where hhhh is
the code in hexadecimal, since character code standards and references
generally use hex.)
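
A short Python sketch of the hexadecimal form, assuming a hypothetical helper named to_hex_refs. It also shows that a parser (here the standard library's ElementTree) treats the decimal reference, the hex reference and the generated reference as the same character.

```python
import xml.etree.ElementTree as ET

def to_hex_refs(text: str) -> str:
    # Keep ASCII as-is; write everything else as &#xhhhh; in hexadecimal.
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

hex_euro = to_hex_refs("€")

# Decimal reference, hand-written hex reference, generated hex reference.
doc = "<a>&#8364; &#x20AC; " + hex_euro + "</a>"
text = ET.fromstring(doc).text
print(hex_euro, text)
```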

If the data needs to be processed using old software too, then all
kinds of problems may arise. If you need to be prepared for _anything_,
then only the invariant subset of ASCII is safe, or mostly safe. But it
would be a mistake to convert data to ASCII using some simplifications
and transmogrifications, unless you _know_ there will be serious and
unsolvable problems otherwise.

Anything that can be called XML technology even in the feeblest sense
_must_ be able to accept data in UTF-8 encoding and at least store and
forward it unmodified, even if it is incapable of rendering all the
characters or recognizing them in a useful way. So the first step
should be to convert the arriving data into UTF-8 in a safe way.
Normally you should get information about the encoding of the data and
do the conversions automatically, but at early phases you might wish to
do some occasional checks to verify the sensibility of the data. It is
not uncommon for text data to be sent incorrectly labelled (as regards
its encoding), or unlabelled (so that the recipient must guess or
deduce what encoding has been used).
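
The conversion step could be sketched in Python roughly as follows. The function names to_utf8 and looks_like_utf8 and the sample bytes are assumptions for illustration, not part of any particular tool.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    # Strict decoding raises UnicodeDecodeError instead of silently
    # substituting characters when the label does not match the bytes.
    text = raw.decode(declared_encoding, errors="strict")
    return text.encode("utf-8")

def looks_like_utf8(raw: bytes) -> bool:
    # Occasional sanity check: does the data decode cleanly as UTF-8?
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

latin1_bytes = "café".encode("iso-8859-1")        # hypothetical arriving data
converted = to_utf8(latin1_bytes, "iso-8859-1")
print(converted, looks_like_utf8(latin1_bytes), looks_like_utf8(converted))
```

A check like this catches only some mislabelling: every byte sequence decodes without error as ISO-8859-1, so data wrongly labelled as ISO-8859-1 will not raise an error and has to be spotted by inspecting the result.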

Quite apart from this, we cannot realistically expect that all Unicode
characters will be adequately processed and rendered. So it's very
relevant what characters there will be in the input data and how it
should be processed. For example, we can probably expect that if some
software is advertised as reading XML data and storing it into a
database and supporting some searching and retrieval, then it will
accept and store any Unicode data in UTF-8 format. But it might fail to
display the data when retrieved, its sorting routines might not work by
Unicode rules, its case-insensitive search might be something rather
trivial that really works for basic Latin letters only, and it might
even fail to display characters properly right to left according to
their directionality.
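
Two of these pitfalls are easy to reproduce in Python; the word lists below are made up for the example. Simple lowercasing does not unify all case variants, and sorting by raw code point is not the order a reader of the language would expect.

```python
words = ["Straße", "STRASSE", "strasse"]   # hypothetical search terms

# A trivial case-insensitive comparison still sees two different words.
lowered = {w.lower() for w in words}

# Unicode case folding maps all three to the same string.
folded = {w.casefold() for w in words}

# Default sorting compares code points, so "Ä" lands after "z".
by_code_point = sorted(["zebra", "Äpfel", "apple"])

print(len(lowered), len(folded), by_code_point)
```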

Shmuel (Seymour J.) Metz · Mar 20, 2005

On 03/18/2005 at 08:17 AM, (e-mail address removed) said:

> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret
> them correctly and not simply reject the document as flawed?

You can't really guarantee anything, but your best bet is probably to
use UTF-8, which is a transform of Unicode into 8-bit bytes. Note that
there are standard entity names for many Unicode characters.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Jukka K. Korpela · Mar 20, 2005

> You can't really guarantee anything, but your best bet is probably
> to use UTF-8, which is a transform of Unicode into 8-bit bytes.

Indeed.

> Note that there are standard entity names for many Unicode
> characters.

No, there aren't - in XML. In XML, the only predefined entity names
are &lt;, &gt;, &amp;, &quot;, and &apos;.

There are "standard entity names" in the sense that the SGML standard
contains a large number of entity declarations as samples, and some of
them have been copied to HTML. But from the XML viewpoint, there is
nothing standard about them; XML is logically independent of the SGML
standard. One might argue that if you declare entities that denote
Unicode characters, it would be advisable to use the same names as in
the SGML standard if possible. But even this is far from clear; the
SGML names are partly ridiculously and obscurely truncated (quickly,
guess what the "mnemonic" &lang; means!). Besides, you don't _need_ the
entities (except &lt; and &amp;) when you use UTF-8.
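
A quick way to see this with Python's standard ElementTree parser (the sample documents are invented for the example): the five predefined entities parse without any declaration, an HTML-style name such as &eacute; is rejected, and the same name works once the document declares it itself.

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no declaration.
predefined = ET.fromstring("<a>&lt;&gt;&amp;&quot;&apos;</a>").text

# An SGML/HTML-style name is undefined in plain XML and is rejected.
try:
    ET.fromstring("<a>caf&eacute;</a>")
    rejected = False
except ET.ParseError:
    rejected = True

# Declaring the entity in the internal DTD subset makes it usable.
doc = '<!DOCTYPE a [<!ENTITY eacute "&#233;">]><a>caf&eacute;</a>'
declared = ET.fromstring(doc).text

print(predefined, rejected, declared)
```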

Peter Flynn · Mar 22, 2005

> Hello,
>
> My company collects data from non-US sources. We are starting projects
> where this data will be output in an XML document and passed around to
> our applications and third party tools.
>
> The data includes some of the extended characters. We get strange
> accent marks, italics and the like. These characters have decimal
> value in the 200+ range.

Accents are normal in many non-English languages, so they probably
aren't "strange" to the originators. As Jukka has pointed out, what
look like italics are probably variant characters which happen to
be sloping.

> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret them
> correctly and not simply reject the document as flawed?

If you use XML software which conforms to the standards then it will handle
all the characters correctly (provided you also conform to the same
standards). If you need to be able to accept pretty much any character
from any source, use the UTF-8 encoding.

> We know about the special escape sequences for the reserved XML
> characters like '>' and '<'. Is there a standard escape sequence for
> the extended characters?

">" is not a reserved character, it's just a character. It only has a
special meaning when it's used to close a start-tag or end-tag. The
only two reserved characters are "<" and "&". The latter is the one you
want for the named or numeric codes for non-ASCII characters, but if you
use UTF-8 then you won't need it at all except for escaping "<" and "&",
as has already been pointed out.
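
In Python, for instance, that escaping can be left to xml.sax.saxutils.escape from the standard library. The sample text and the element name "note" below are assumptions for illustration; the accented letters go into the UTF-8 document untouched.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "AT&T says 3 < 5, naïve café"     # hypothetical content

# escape() rewrites "&" and "<" (and also ">"); non-ASCII letters
# are left alone.
escaped = escape(raw)

# Wrap in a hypothetical element and encode the document as UTF-8.
doc = f"<note>{escaped}</note>".encode("utf-8")

# The parser restores the original text exactly.
round_tripped = ET.fromstring(doc).text
print(escaped)
```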

///Peter
