Extended Characters in XML

barthome1

Hello,

My company collects data from non-US sources. We are starting projects
where this data will be output in an XML document and passed around to
our applications and third party tools.

The data includes some of the extended characters. We get strange
accent marks, italics and the like. These characters have decimal
value in the 200+ range.

So how do you handle these in XML with the assurance that you won't
lose content and the off-the-shelf XML technologies will interpret them
correctly and not simply reject the document as flawed?

We know about the special escape sequences for the reserved XML
characters like '>' and '<'. Are there standard escape sequences for
the extended characters?

Thanks ahead of time for any help.

Bart
(e-mail address removed)
 
Andreas Prilop

The data includes some of the extended characters. We get strange
accent marks, italics
Italics??

and the like. These characters have decimal
value in the 200+ range.
So how do you handle these in XML with the assurance that you won't
lose content and the off-the-shelf XML technologies will interpret them
correctly and not simply reject the document as flawed?

One possibility is to write all of them in the form &#number;
where number is the decimal code position in Unicode.
Are there standard escape sequences for the extended characters?

&#number;, which is the same as in SGML/HTML. See
http://www.unics.uni-hannover.de/nhtcapri/multilingual2.html
for examples in various scripts.
 
Martin Honnen

My company collects data from non-US sources. We are starting projects
where this data will be output in an XML document and passed around to
our applications and third party tools.

The data includes some of the extended characters. We get strange
accent marks, italics and the like. These characters have decimal
value in the 200+ range.

Any XML parser is supposed to support the UTF-8 encoding, so you could
encode your XML documents as UTF-8 and then use all the characters
Unicode supports directly in your document. You only need to make sure
you use an editor that allows creation of UTF-8 encoded documents. Or
you could, as already suggested, escape characters with the Unicode
code point, e.g. &#8364; for the Euro sign €.
<http://www.unicode.org/>
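
The two options described above can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration: the element name and the sample text are made up, and other XML toolkits may serialize differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical record containing non-ASCII characters.
root = ET.Element("item")
root.text = "Café price: 5 €"

# Option 1: write the characters directly, encoded as UTF-8.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# Option 2: write pure ASCII. The serializer replaces every
# non-ASCII character with a numeric character reference.
ascii_bytes = ET.tostring(root, encoding="us-ascii")

# Both byte strings parse back to exactly the same text.
assert ET.fromstring(utf8_bytes).text == root.text
assert ET.fromstring(ascii_bytes).text == root.text
```

In the ASCII output the Euro sign appears as &#8364; and the é as &#233;, while the UTF-8 output carries the characters themselves as multi-byte sequences.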
 
Jukka K. Korpela

Andreas Prilop said:
Italics??

That sounds somewhat strange indeed, since normally the font style is
expressed at a level other than character level, e.g. in markup.
(Contrary to populistic propaganda, XML markup is not inherently
"logical"; nothing prevents you from using XML markup for purely
presentational purposes. If you need to store information in a manner
that preserves formatting information, that might be a good idea.
Using <i> for italics as in HTML would be natural then.)

But there _are_ characters in Unicode that are italicized variants of
other characters. Many of them are compatibility characters that have
been included just because they exist as characters in other standards.
There are other cases as well. If this topic is relevant, then the
document "Unicode in XML and other Markup Languages"
http://www.w3.org/TR/unicode-xml/ should be studied.
One possibility is to write all of them in the form &#number;
where number is the decimal code position in Unicode.

That's certainly a way to represent them in XML, and this might be useful
to protect against problems with encodings (and transcoding). However,
it normally wins nothing and loses a lot in readability of the text in
XML source. (In XML it might be better to use &#xhhhh; where hhhh is
the code in hexadecimal, since character code standards and references
generally use hex.)
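
As a small illustration of the decimal and hexadecimal forms, the following Python sketch checks that both references denote the same character. The helper name char_refs is made up for this example.

```python
import xml.etree.ElementTree as ET

# The same character written as a decimal and as a hexadecimal reference.
decimal_form = ET.fromstring("<p>&#8364;</p>").text
hex_form = ET.fromstring("<p>&#x20AC;</p>").text

# Both resolve to U+20AC, the Euro sign.
assert decimal_form == hex_form == "\u20ac"


def char_refs(ch):
    """Return the (decimal, hexadecimal) references for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"


print(char_refs("€"))  # ('&#8364;', '&#x20AC;')
```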

If the data needs to be processed using old software too, then all
kinds of problems may arise. If you need to be prepared for _anything_,
then only the invariant subset of ASCII is safe, or mostly safe. But it
would be a mistake to convert data to ASCII using some simplifications
and transmogrifications, unless you _know_ there will be serious and
unsolvable problems otherwise.

Anything that you can use XML technology even in the feeblest sense
_must_ be able to accept data in UTF-8 encoding and at least store and
forward it unmodified, even if it is incapable of rendering all the
characters or recognizing them in a useful way. So the first step
should be to convert the arriving data into UTF-8 in a safe way.
Normally you should get information about the encoding of the data and
do the conversions automatically, but at early phases you might wish to
do some occasional checks to verify the sensibility of the data. It is
not uncommon for text data to be sent incorrectly labelled (as regards
its encoding), or unlabelled (so that the recipient must guess or
deduce what encoding has been used).
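
A minimal Python sketch of such a conversion step follows, assuming the sender tells you the encoding. The function name to_utf8 and the sample data are hypothetical.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Transcode raw bytes from the declared encoding to UTF-8.

    Decoding is strict, so bytes that are invalid in the declared
    encoding raise UnicodeDecodeError instead of being silently replaced.
    """
    text = raw.decode(declared_encoding, errors="strict")
    return text.encode("utf-8")


# Hypothetical input: "Málaga" as sent by an ISO-8859-1 source.
latin1_bytes = "Málaga".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

# If the same bytes are wrongly labelled as UTF-8, strict decoding
# fails, because the byte 0xE1 does not start a valid sequence here.
try:
    to_utf8(latin1_bytes, "utf-8")
    mislabel_detected = False
except UnicodeDecodeError:
    mislabel_detected = True
```

Note that this check only works in one direction. UTF-8 data wrongly labelled as ISO-8859-1 decodes without any error, because every byte value is a valid ISO-8859-1 character, so the occasional manual check mentioned above is still worthwhile.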

Quite apart from this, we cannot realistically expect that all Unicode
characters will be adequately processed and rendered. So it's very
relevant what characters there will be in the input data and how it
should be processed. For example, we can probably expect that if some
software is advertised as reading XML data and storing it into a
database and supporting some searching and retrieval, then it will
accept and store any Unicode data in UTF-8 format. But it might fail to
display the data when retrieved, its sorting routines might not work by
Unicode rules, its case-insensitive search might be something rather
trivial that really works for basic Latin letters only, and it might
even fail to display characters properly right to left according to
their directionality.
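
To illustrate the point about case-insensitive search, here is a Python sketch comparing a naive lowercase comparison with one that uses Unicode case folding and normalization. The function name search_key is made up, and real products may of course do something else entirely.

```python
import unicodedata


def search_key(s: str) -> str:
    """Build a comparison key: NFC-normalize, then apply case folding."""
    return unicodedata.normalize("NFC", s).casefold()


# A naive comparison that only lowercases misses this pair.
naive_match = "Straße".lower() == "STRASSE".lower()

# Case folding maps ß to "ss", so the two spellings match.
folded_match = search_key("Straße") == search_key("STRASSE")

# Normalization makes a precomposed É and E + combining acute compare equal.
composed_match = search_key("\u00c9cole") == search_key("E\u0301cole")
```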
 
Shmuel (Seymour J.) Metz

In <[email protected]>, on
03/18/2005
at 08:17 AM, (e-mail address removed) said:
So how do you handle these in XML with the assurance that you won't
lose content and the off-the-shelf XML technologies will interpret
them correctly and not simply reject the document as flawed?

You can't really guarantee anything, but your best bet is probably to
use UTF-8, which is a transform of Unicode into 8-bit bytes. Note that
there are standard entity names for many Unicode characters.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to (e-mail address removed)
 
Jukka K. Korpela

You can't really guarantee anything, but your best bet is probably
to use UTF-8, which is a transform of Unicode into 8-bit bytes.
Indeed.

Note that there are standard entity names for many Unicode
characters.

No, there aren't - in XML. In XML, the only predefined entity names
are &lt;, &gt;, &amp;, &quot;, and &apos;.

There are "standard entity names" in the sense that the SGML standard
contains a large number of entity declarations as samples, and some of
them have been copied to HTML. But from the XML viewpoint, there is
nothing standard about them; XML is logically independent of the SGML
standard. One might argue that if you declare entities that denote
Unicode characters, it would be advisable to use the same names as in
the SGML standard if possible. But even this is far from clear; the
SGML names are partly ridiculously and obscurely truncated (quickly,
guess what the "mnemonic" &lang; means!). Besides, you don't _need_ the
entities (except &lt; and &amp;) when you use UTF-8.
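
This can be checked with any XML parser. The following Python sketch uses the standard library's xml.etree.ElementTree; the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The five predefined entities resolve without any declaration.
predefined = ET.fromstring("<p>&lt;&gt;&amp;&quot;&apos;</p>").text

# An HTML/SGML-style name is undefined unless the document declares it,
# so the parser rejects the document.
try:
    ET.fromstring("<p>caf&eacute;</p>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False

# Declaring the entity in the internal DTD subset makes it usable.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'
).text
```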
 
Peter Flynn

Hello,

My company collects data from non-US sources. We are starting projects
where this data will be output in an XML document and passed around to
our applications and third party tools.

The data includes some of the extended characters. We get strange
accent marks, italics and the like. These characters have decimal
value in the 200+ range.

Accents are normal in many non-English languages, so they probably
aren't "strange" to the originators. As Jukka has pointed out, what
look like italics are probably variant characters which happen to
be sloping.
So how do you handle these in XML with the assurance that you won't
lose content and the off-the-shelf XML technologies will interpret them
correctly and not simply reject the document as flawed?

If you use XML software which conforms to the standards then it will handle
all the characters correctly (provided you also conform to the same
standards). If you need to be able to accept pretty much any character
from any source, use the UTF-8 encoding.
We know about the special escape sequences for the reserved XML
characters like '>' and '<'. Are there standard escape sequences for
the extended characters?

">" is not a reserved character, it's just a character. It only has a
special meaning when it's used to close a start-tag or end-tag. The
only two reserved characters are "<" and "&". The latter is the one you
want for the named or numeric codes for non-ASCII characters, but if you
use UTF-8 then you won't need it at all except for escaping "<" and "&",
as has already been pointed out.
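
A short Python sketch of this, using the standard library's xml.sax.saxutils.escape; the element name and sample text are made up. Note that this particular function also escapes ">", which is harmless but not required.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical text with reserved characters and non-ASCII letters.
raw_text = "Größe < 5 & Preis > 3 €"

# Only the markup characters are escaped; ö, ß and € pass through.
escaped = escape(raw_text)

# Encode the whole document as UTF-8 and check the round trip.
document = f"<note>{escaped}</note>".encode("utf-8")
assert ET.fromstring(document).text == raw_text
```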

///Peter
 
