Andreas Prilop said:
That sounds somewhat strange indeed, since normally the font style is
expressed at a level other than character level, e.g. in markup.
(Contrary to populist propaganda, XML markup is not inherently
"logical"; nothing prevents you from using XML markup for purely
presentational purposes. If you need to store information in a manner
that preserves formatting, that might even be a good idea. Using <i>
for italics as in HTML would be natural then.)
But there _are_ characters in Unicode that are italicized variants of
other characters. Many of them are compatibility characters that have
been included just because they exist as characters in other standards.
There are other cases as well. If this topic is relevant, then the
document "Unicode in XML and other Markup Languages"
http://www.w3.org/TR/unicode-xml/ should be studied.
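To illustrate (a small Python sketch; the particular character is just
an example), the "mathematical italic" letters are separate Unicode
characters with compatibility decompositions back to the plain letters:

    import unicodedata

    italic_a = "\U0001D44E"                          # MATHEMATICAL ITALIC SMALL A
    print(unicodedata.name(italic_a))                # MATHEMATICAL ITALIC SMALL A
    print(unicodedata.normalize("NFKC", italic_a))   # prints a plain "a"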
One possibility is to write all of them in the form &#number;
where number is the decimal code position in Unicode.
That's certainly a way to represent them in XML, and this might be
useful to protect against problems with encodings (and transcoding).
However, it normally gains nothing and loses a lot in readability of
the XML source. (In XML it might be better to use &#xhhhh; where hhhh
is the code in hexadecimal, since character code standards and
references generally use hex.)
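Generating such references is trivial; here is a small Python sketch
(the helper name char_ref is just made up for illustration):

    def char_ref(ch, hexadecimal=True):
        # Numeric character reference for a single character:
        # &#xhhhh; (hexadecimal) or &#number; (decimal).
        return ("&#x%X;" % ord(ch)) if hexadecimal else ("&#%d;" % ord(ch))

    print(char_ref("é"))         # &#xE9;
    print(char_ref("é", False))  # &#233;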
If the data needs to be processed using old software too, then all
kinds of problems may arise. If you need to be prepared for _anything_,
then only the invariant subset of ASCII is safe, or mostly safe. But it
would be a mistake to convert data to ASCII using some simplifications
and transmogrifications, unless you _know_ there will be serious and
unsolvable problems otherwise.
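As a small illustration (Python, with a made-up sample string) of what
such "simplifications" do to the data:

    text = "Ångström – café"
    # A naive conversion that simply drops anything outside ASCII:
    print(text.encode("ascii", "ignore").decode("ascii"))   # "ngstrm  caf"

The diacritics and the dash are silently lost, and there is no way to
get them back.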
Anything that can be called XML technology even in the feeblest sense
_must_ be able to accept data in UTF-8 encoding and at least store and
forward it unmodified, even if it is incapable of rendering all the
characters or recognizing them in a useful way. So the first step
should be to convert the arriving data into UTF-8 in a safe way.
Normally you should get information about the encoding of the data and
do the conversions automatically, but at early phases you might wish to
do some occasional checks to verify the sensibility of the data. It is
not uncommon for text data to be sent incorrectly labelled (as regards
its encoding), or unlabelled (so that the recipient must guess or
deduce what encoding has been used).
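A minimal sketch of that first step in Python (the function name
to_utf8 is just for illustration); strict decoding makes a grossly
wrong label fail loudly instead of silently producing garbage:

    def to_utf8(raw_bytes, declared_encoding):
        # Decode with the declared encoding, re-encode as UTF-8.
        # A wrong label often shows up as a UnicodeDecodeError here.
        return raw_bytes.decode(declared_encoding).encode("utf-8")

    sample = "Ångström".encode("iso-8859-1")   # b'\xc5ngstr\xf6m'
    try:
        to_utf8(sample, "utf-8")               # mislabelled as UTF-8
    except UnicodeDecodeError:
        print("label looks wrong; check the data by hand")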
Quite apart from this, we cannot realistically expect that all Unicode
characters will be adequately processed and rendered. So it's very
relevant what characters there will be in the input data and how it
should be processed. For example, we can probably expect that if some
software is advertised as reading XML data and storing it into a
database and supporting some searching and retrieval, then it will
accept and store any Unicode data in UTF-8 format. But it might fail to
display the data when retrieved, its sorting routines might not work by
Unicode rules, its case-insensitive search might be something rather
trivial that really works for basic Latin letters only, and it might
even fail to display right-to-left characters properly according to
their directionality.
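For example, a naive case-insensitive comparison based on ASCII-style
lowercasing misses Unicode case folding rules (a quick Python sketch;
full language-sensitive collation would need something like the ICU
library):

    print("STRASSE".lower() == "straße".lower())        # False
    print("STRASSE".casefold() == "straße".casefold())  # True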