This post is a follow-up to the post at
http://groups.google.com/group/comp.lang.c++/browse_thread/thread/83a...
as my original question was answered there, but I have some additional
problems now.
Basically what I want to do is: given an input UTF-8 encoded file
containing HTML entity sequences such as "&amp;", I want to be able to
replace these sequences with their UTF-8 representations (i.e. "&").
The UTF-8 representation of "&" is a single byte, with the
value 0x26. Formally, that might be a '&', or it might not.
(In practice, it usually is. Even the IBM mainframe version
of C that I've seen mapped the native EBCDIC to ASCII, so that
within C programs, '&' was 0x26. I'm not sure how this would
have been written to a text file; the more common variants of
EBCDIC encode the & character at a different code point, 0x50.)
What I have so far: looking at some of the source code of the Mozilla
Firefox project, I have a small class that can convert the HTML
sequences into a number representing the Unicode value of that
character, i.e. "&amp;" is represented by the Unicode value 38
(source: http://www.ascii.cl/htmlcodes.htm).
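For reference, here's a minimal sketch of the kind of table-driven
lookup I mean (the entity names covered here are just a few samples
for illustration; the real table is much larger):

    // Sketch of an entity-name to code-point lookup. Only a handful
    // of entries are shown; a real table has a few hundred.
    #include <map>
    #include <string>

    unsigned long entityToCodePoint(const std::string& name)
    {
        static std::map<std::string, unsigned long> table;
        if (table.empty()) {
            table["amp"]   = 0x26;      // '&'
            table["lt"]    = 0x3C;      // '<'
            table["gt"]    = 0x3E;      // '>'
            table["mdash"] = 0x2014;    // em dash
        }
        std::map<std::string, unsigned long>::const_iterator it
            = table.find(name);
        return it == table.end() ? 0 : it->second;  // 0 == not found
    }

So entityToCodePoint("amp") gives me 38, and that is the value I then
need to output.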
My question: how can I use this Unicode value to convert it into the
character "&" and write it to a file/display it on the terminal? I tried
using something along the lines of printf("\u0012"), but that produces
the following compilation error: "\u0012 is not a valid universal
character".
You're not allowed to use universal character names for
characters in the basic character set. A simple "&" will work
in this case, giving you the encoding of an ampersand in
whatever the compiler uses as its default narrow character
encoding (which will be compatible with ASCII/UTF-8/ISO 8859-n
99.9% of the time).
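If you want to handle arbitrary code points, and not just "&", the
simplest thing is to generate the UTF-8 bytes yourself at run time,
which sidesteps the universal character name restriction entirely.
Something along these lines (the function name is mine, and checking
for values above 0x10FFFF is left out):

    // Sketch: encode one Unicode code point as a UTF-8 byte sequence.
    #include <string>

    std::string codePointToUtf8(unsigned long cp)
    {
        std::string result;
        if (cp < 0x80) {                    // 1 byte: 0xxxxxxx
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {          // 3 bytes
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                            // 4 bytes
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return result;
    }

With that, std::cout << codePointToUtf8(38) writes the single byte
0x26, and codePointToUtf8(0x2014) gives the three-byte sequence for
the em dash.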
You really have two separate problems. One is converting the
sequence "&" to whatever internal encoding you are using
(e.g. UTF-8). The second is converting this internal encoding
to whatever the display device (or file) expects. If the
display device can handle UTF-8, you're home free. If it can't,
you'll have to convert the UTF-8 encodings into something it can
handle. In the case of "&", there's a 99.9% chance that the
display device will handle the UTF-8 encoding correctly, since
in this particular case, it is also the ASCII encoding. (And
thus, the encoding in all of the ISO 8859-n character sets as
well. Of course, if you fall into the 0.1% chance, and your
display device uses EBCDIC, then you might not be able to
display it at all.) For other characters, it's far from
obvious, however; something like "&mdash;" maps to Unicode
'\u2014', i.e. the sequence 0xE2, 0x80, 0x94 in UTF-8. Depending
on the encoding used by the display device, you may be able to
map this directly; otherwise you might map it to 0x2D
(hyphen-minus in ASCII), or maybe a sequence of two of them
(there isn't always a good general solution for this). In
some cases, there really isn't any good solution: the input
specifies some Chinese ideograph, and the display device doesn't
have any Chinese ideographs in its fonts. A lot depends on just
what characters you want to support, and how much effort you
want to invest.
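As a concrete illustration of such a fallback, you might do something
like this when the output device only handles ASCII (the particular
replacements are a matter of taste, not anything canonical):

    // Sketch: map a code point to an ASCII approximation.
    #include <string>

    std::string asciiFallback(unsigned long cp)
    {
        if (cp < 0x80)                  // plain ASCII passes through
            return std::string(1, static_cast<char>(cp));
        switch (cp) {
        case 0x2014: return "--";       // em dash -> two hyphen-minus
        case 0x2013: return "-";        // en dash -> hyphen-minus
        case 0x00A0: return " ";        // no-break space -> space
        default:     return "?";        // no reasonable ASCII equivalent
        }
    }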
Note that it's not always simple to know what the display device
actually supports, either. Under X, it is the font which
determines the encoding. If you're managing the windows
yourself, you select the font, and you can probably know what
the encoding is. (The X font specification string has fields
for the encoding.) (I'm not too familiar with Windows; I think
the window manager will always handle UTF-16, mapping it itself
if necessary for the font. But you still have the problem that
not all fonts have all Unicode characters.) If you're outputting
to std::cout, in an xterm, however, you have absolutely no means
of knowing. And if you're outputting to a file, with the idea
that the user will later do a cat, you have the problem that
different windows can use different fonts with different
encodings; the problem is unsolvable. You just have to
establish a convention, tell the user about it, and leave it up
to him. (In the Unix world, or anything networked, I'd use
UTF-8, unless there were some constraints involving legacy
files; in a purely Windows environment, I'd probably use
UTF-16LE.)
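If you do end up writing UTF-16LE, the encoding itself is easy enough
to do by hand; roughly this, leaving aside the BOM and error checking
(the function name is mine, purely for illustration):

    // Sketch: write one code point to a stream as UTF-16LE.
    #include <ostream>

    void writeUtf16LE(std::ostream& out, unsigned long cp)
    {
        if (cp < 0x10000) {             // fits in a single 16-bit unit
            out.put(static_cast<char>(cp & 0xFF));
            out.put(static_cast<char>((cp >> 8) & 0xFF));
        } else {                        // needs a surrogate pair
            unsigned long v = cp - 0x10000;
            unsigned hi = 0xD800 | static_cast<unsigned>(v >> 10);
            unsigned lo = 0xDC00 | static_cast<unsigned>(v & 0x3FF);
            out.put(static_cast<char>(hi & 0xFF));
            out.put(static_cast<char>(hi >> 8));
            out.put(static_cast<char>(lo & 0xFF));
            out.put(static_cast<char>(lo >> 8));
        }
    }

For "&" that's just the two bytes 0x26 0x00; for '\u2014' it's
0x14 0x20.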