This post is a follow-up to the post at
http://groups.google.com/group/comp.lang.c++/browse_thread/thread/83a...
as my original question was answered there, but I have some additional
problems now.
Basically what I want to do is: given an input UTF-8 encoded file
containing HTML entity sequences such as "&amp;", I want to be able to
replace these sequences with their UTF-8 representations (i.e. "&").
The UTF-8 representation of "&" is a single byte, with the
value 0x26. Formally, that might be a '&', or it might not.
(In practice, it usually is. Even the IBM mainframe version
of C that I've seen mapped the native EBCDIC to ASCII, so that
within C programs, '&' was 0x26. I'm not sure how this would
have been written to a text file; the more common variants of
EBCDIC encode the & character at a different code point, 0x50.)
What I have so far: looking at some of the source code of the Mozilla
Firefox project, I have a small class that can convert the HTML
sequences into a number representing the Unicode value of that
character, i.e. "&amp;" is represented by the Unicode value 38
(source: http://www.ascii.cl/htmlcodes.htm).
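For reference, here's a minimal sketch of the kind of table-driven
lookup I mean (the entity names covered here are just a few samples
for illustration; the real table is much larger):

    // Sketch of an entity-name to code-point lookup. Only a handful
    // of entries are shown; a real table has a few hundred.
    #include <map>
    #include <string>

    unsigned long entityToCodePoint(const std::string& name)
    {
        static std::map<std::string, unsigned long> table;
        if (table.empty()) {
            table["amp"]   = 0x26;      // '&'
            table["lt"]    = 0x3C;      // '<'
            table["gt"]    = 0x3E;      // '>'
            table["mdash"] = 0x2014;    // em dash
        }
        std::map<std::string, unsigned long>::const_iterator it
            = table.find(name);
        return it == table.end() ? 0 : it->second;  // 0 == not found
    }

So entityToCodePoint("amp") gives me 38, and that is the value I then
need to output.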
My question: how can I use this Unicode value to convert it into the
character "&" and write it to a file/display it on the terminal? I tried
using something along the lines of printf("\u0012"), but that produces
the following compilation error: "\u0012 is not a valid universal
character".
You're not allowed to use universal character names for
characters in the basic character set. A simple "&" will work
in this case, giving you the encoding of an ampersand in
whatever the compiler uses as its default narrow character
encoding (which will be compatible with ASCII/UTF-8/ISO 8859-n
99.9% of the time).
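If you want to handle arbitrary code points, and not just "&", the
simplest thing is to generate the UTF-8 bytes yourself at run time,
which sidesteps the universal character name restriction entirely.
Something along these lines (the function name is mine, and checking
for values above 0x10FFFF is left out):

    // Sketch: encode one Unicode code point as a UTF-8 byte sequence.
    #include <string>

    std::string codePointToUtf8(unsigned long cp)
    {
        std::string result;
        if (cp < 0x80) {                    // 1 byte: 0xxxxxxx
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {          // 3 bytes
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                            // 4 bytes
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return result;
    }

With that, std::cout << codePointToUtf8(38) writes the single byte
0x26, and codePointToUtf8(0x2014) gives the three-byte sequence for
the em dash.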
You really have two separate problems. One is converting the
sequence "&" to whatever internal encoding you are using
(e.g. UTF-8). The second is converting this internal encoding
to whatever the display device (or file) expects. If the
display device can handle UTF-8, you're home free. If it can't,
you'll have to convert the UTF-8 encodings into something it can
handle. In the case of "&", there's a 99.9% chance that the
display device will handle the UTF-8 encoding correctly, since
in this particular case, it is also the ASCII encoding. (And
thus, the encoding in all of the ISO 8859-n character sets as
well. Of course, if you fall into the 0.1% chance, and your
display device uses EBCDIC, then you might not be able to
display it at all.) For other characters, it's far from
obvious, however; something like "&mdash;" maps to Unicode
'\u2014', i.e. the sequence 0xE2, 0x80, 0x94 in UTF-8. Depending
on the encoding used by the display device, you may be able to
map this directly; otherwise you might map it to 0x2D
(hyphen-minus in ASCII), or maybe a sequence of two of them
(there isn't always a good general solution for this). In
some cases, there really isn't any good solution: the input
specifies some Chinese ideograph, and the display device doesn't
have any Chinese ideographs in its fonts. A lot depends on just
what characters you want to support, and how much effort you
want to invest.
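As a concrete illustration of such a fallback, you might do something
like this when the output device only handles ASCII (the particular
replacements are a matter of taste, not anything canonical):

    // Sketch: map a code point to an ASCII approximation.
    #include <string>

    std::string asciiFallback(unsigned long cp)
    {
        if (cp < 0x80)                  // plain ASCII passes through
            return std::string(1, static_cast<char>(cp));
        switch (cp) {
        case 0x2014: return "--";       // em dash -> two hyphen-minus
        case 0x2013: return "-";        // en dash -> hyphen-minus
        case 0x00A0: return " ";        // no-break space -> space
        default:     return "?";        // no reasonable ASCII equivalent
        }
    }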
Note that it's not always simple to know what the display device
actually supports, either. Under X, it is the font which
determines the encoding. If you're managing the windows
yourself, you select the font, and you can probably know what
the encoding is. (The X font specification string has fields
for the encoding.) (I'm not too familiar with Windows; I think
the window manager will always handle UTF-16, mapping it itself
if necessary for the font. But you still have the problem that
not all fonts have all Unicode characters.) If you're outputting
to std::cout, in an xterm, however, you have absolutely no means
of knowing. And if you're outputting to a file, with the idea
that the user will later do a cat, you have the problem that
different windows can use different fonts with different
encodings; the problem is unsolvable. You just have to
establish a convention, tell the user about it, and leave it up
to him. (In the Unix world, or anything networked, I'd use
UTF-8, unless there were some constraints involving legacy
files; in a purely Windows environment, I'd probably use
UTF-16LE.)
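If you do end up writing UTF-16LE, the encoding itself is easy enough
to do by hand; roughly this, leaving aside the BOM and error checking
(the function name is mine, purely for illustration):

    // Sketch: write one code point to a stream as UTF-16LE.
    #include <ostream>

    void writeUtf16LE(std::ostream& out, unsigned long cp)
    {
        if (cp < 0x10000) {             // fits in a single 16-bit unit
            out.put(static_cast<char>(cp & 0xFF));
            out.put(static_cast<char>((cp >> 8) & 0xFF));
        } else {                        // needs a surrogate pair
            unsigned long v = cp - 0x10000;
            unsigned hi = 0xD800 | static_cast<unsigned>(v >> 10);
            unsigned lo = 0xDC00 | static_cast<unsigned>(v & 0x3FF);
            out.put(static_cast<char>(hi & 0xFF));
            out.put(static_cast<char>(hi >> 8));
            out.put(static_cast<char>(lo & 0xFF));
            out.put(static_cast<char>(lo >> 8));
        }
    }

For "&" that's just the two bytes 0x26 0x00; for '\u2014' it's
0x14 0x20.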