Need help with printing Unicode! (C++ on CentOS)

Zerex71

Zerex71 wrote:
Also, help me understand in your example how my code 0x266D gets
turned into "\xE2\x99\xAD".
Presumably this is UTF-8 encoding of your character.
One thing is the encoding your source file uses, and the other is
what you want to output. I'm not familiar with Eclipse so I cannot
comment on the former. If needed, you can use iconv() to convert from
your encoding to UTF-8.
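As a rough illustration of the iconv() route, here is a minimal sketch that converts a single code point to UTF-8. It assumes a glibc-style iconv that accepts the encoding names "UTF-32LE" and "UTF-8" and takes a non-const `char**` input buffer; the helper name `code_point_to_utf8` is made up for this example.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the thread): convert one Unicode code point
// to UTF-8 by handing it to iconv() as 4 little-endian bytes.
// Assumption: the platform's iconv knows "UTF-32LE" and "UTF-8".
std::string code_point_to_utf8(unsigned long cp) {
    char in[4] = {
        static_cast<char>(cp & 0xFF),
        static_cast<char>((cp >> 8) & 0xFF),
        static_cast<char>((cp >> 16) & 0xFF),
        static_cast<char>((cp >> 24) & 0xFF)
    };
    char out[8];
    char* in_ptr = in;
    char* out_ptr = out;
    size_t in_left = sizeof in;
    size_t out_left = sizeof out;

    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("UTF-8", "UTF-32LE");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    // Everything written so far is the UTF-8 form of the code point.
    return std::string(out, sizeof out - out_left);
}
```

Calling `code_point_to_utf8(0x266D)` should yield the three bytes E2 99 AD discussed above, which can then be written with printf or std::cout like any other byte string.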
The following program works for me on a SuSE Linux and produces some
kind of music sign on the console. My locale is LANG=en_US.utf8.
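#include <stdio.h>
int main() {
  /* UTF-8 encoding of U+266D (MUSIC FLAT SIGN), plus a terminating NUL */
  const unsigned char test[4]={0xE2, 0x99, 0xAD, 0};
  /* The bytes are written as-is; the terminal must decode them as UTF-8 */
  printf("Test: %s\n", test);
  return 0;
}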
hth
Paavo
I just tried that but it did not work for me - but, I'm running the
console output to the Eclipse console tab, not within an xterm.

Can you check what is your LANG setting in this console? Maybe you should
turn to an Eclipse forum?
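One way to answer the LANG question from inside the program itself is to have it report what it sees. The following is only a sketch, assuming a POSIX system that provides nl_langinfo(); `describe_locale` is a made-up helper name. Note that it shows the environment the process inherited from its launcher, not how the console view decodes the bytes it receives.

```cpp
#include <clocale>
#include <cstdlib>
#include <langinfo.h>
#include <string>

// Hypothetical helper: summarize the locale settings visible to this process.
// Side effect: calls setlocale(LC_ALL, ""), which switches the process to the
// locale named by the environment.
std::string describe_locale() {
    const char* lang = std::getenv("LANG");
    const char* active = std::setlocale(LC_ALL, "");
    const char* codeset = nl_langinfo(CODESET);  // e.g. a name such as UTF-8

    std::string s = "LANG=";
    s += lang ? lang : "(unset)";
    s += " locale=";
    s += active ? active : "(setlocale failed)";
    s += " codeset=";
    s += codeset;
    return s;
}
```

Printing the result of `describe_locale()` once from an xterm and once from the Eclipse console would show whether the two launch environments differ.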

hth
Paavo

I don't know if there is a LANG setting for this console, but I will
check. I can check the encoding of the files/project, but that's
about it. I'll see what I can find for this.
 
Zerex71

Zerex71 wrote:





Ok so far.


I think the OS really does not care, at least the Linux OS. The visual
output is produced by certain applications, e.g. xterm. The application
has to know the encoding of the input data it receives. The encoding of
a file can sometimes be extracted from the file itself, as in the case of
XML files or BOM markers in Unicode files, and sometimes it is determined
otherwise, for example by the locale settings.


This is internal business of the visual application or the X window
system (don't know the details).






Maybe, but hardly any of this is relevant. The terminal program
expects some kind of encoding, and you have to provide it. In Linux the
encoding is usually UTF-8. If all your files use the same UTF-8 encoding
internally, then there is no problem, you just output the data. If you
still insist on using wchar_t and UCS-4 encoding internally, then you
have to perform the translation yourself. It's as simple as that.

hth
Paavo

Well, I'm trying to ascertain how much of the problem needs to be
fixed in the code and/or in the output environment. I thought that
making the output use wide format would solve the problem,
but apparently not. Right now I am trying to find out if a font
change in the output console is in order, but I still maintain that my
selected font is capable of displaying these characters properly, so
I'm assuming I'm doing something wrong in the code. However, the fact
that I am getting things like "Here's your character: ???" is somewhat
encouraging, in that it is attempting to print it out, but can't fetch
a suitable glyph for a variety of reasons.

Incidentally, in Java, I didn't have this problem. I was able to use
its Unicode facilities and life was easy, once I figured out how to do
it. I can get it to print most chars. When I went back to look for
that old code which I knew I'd done, I realized I didn't ever try to
do this in C++, and even if I had, it was on WinXP.
 
Zerex71

  Yet you have already spent days asking about it in this newsgroup. If
you had googled about it instead and read a few pieces of documentation,
you would have probably saved yourself a lot of trouble.

  (Not that it's wrong to ask here or anywhere else for help. It's just
that your attitude feels a bit picky. When someone suggests a relatively
easy solution to your problem you dismiss it without even trying to see
how that solution works.)


  Unicode and its encodings are, unfortunately, not a simple matter.
Fortunately people have already gone through the trouble and offer free
libraries to do the hard part.


  You don't have to install it. It's just a set of header files. You put
them anywhere your compiler will find them (e.g. inside your project
directory) and then just #include the appropriate header and start using
it. I gave you a simple example of the usage.

  Don't immediately dismiss a solution just because you don't understand
it in 10 seconds.

And don't immediately dismiss someone because they don't have the
interest or inclination in spending a lot of time for what seems like
it should be a simple answer. Moreover, don't lecture me. See
Paavo's posts for an example of how to do this.
 
Zerex71

I don't understand your statement that hardly any of this is
relevant. I am describing to you my understanding of how characters
are stored and displayed, and am asking for corrections on the model.
It's a closely related tangent to my problem of direct screen output
(not file-based, because I don't yet have that issue to deal with).
 
Zerex71

Zerex71 wrote:







In Unix they say everything is file-based. And what is relevant is the
communication between you (your program) and the one who is listening to
you (in this case xterm or "eclipse console tab"). Yes it is nice to know
how the things go further, in what exact format the fonts are stored and
how the LCD display would make them appear in color, how the cone
receptors in the eye are transforming this into the nerve impulses, how
the brain visual cortex is interpreting the signals and translating back
to visually indistinguished characters, how they are further interpreted
as symbols carrying a specific meaning - yes, that would be nice!

FWIW, I suspect that Unicode fonts are not stored as flat lookup arrays,
rather as piecewise arrays, because of size considerations. But I'm not
at all sure I'm right here.

Paavo

Okay, I tried the following code:

// UTF-8 encoding of 0x266d, followed by the terminating NUL
const unsigned char test[4]={0xE2, 0x99, 0xAD, 0};
printf("Test: %s\n", test);

and it didn't work for me. I get the output: "Test: ���". I did some
more digging on Eclipse and apparently there is a startup option and
also the same option for each defined run configuration:

-Dfile.encoding=UTF8

I'm not sure that I want to change the file encoding, only the output
encoding.

I am also trying to find out how to generate a binary in Eclipse that
can be run outside of Eclipse i.e. the program executable so that I
can just run it in an xterm to see what happens. I am cross-posting
to their forums to see how to get this to work. If that fails, I may
fall back to the utf8 library referenced earlier.

Mike
 
Juha Nieminen

Zerex71 said:
And don't immediately dismiss someone because they don't have the
interest or inclination in spending a lot of time for what seems like
it should be a simple answer.

That's your problem: You want a solution but you are not ready to do
the necessary work to learn the solution. Even when someone outright
gives you a simple answer to your problem, you still immediately
dismissed it because you didn't go through the trouble of spending a few
minutes learning the solution.
Moreover, don't lecture me. See
Paavo's posts for an example of how to do this.

Which post? The one where he basically instructs you to make the UTF-8
encoding by hand?

Well, if you *really* want to encode all your strings to UTF-8 by
hand, then please go right ahead. I won't stop you.

On the other hand, you said you wanted an *easy* solution to this
problem. Using an encoding library is at least a hundred times easier
than trying to do the encoding yourself. But whatever floats your boat.

Using the library I mentioned, encoding one unicode character to UTF-8
is basically one single utf8::append() call. Encoding it by hand
requires quite a lot of work. You could, of course, write an encoder
yourself, but then you would be basically replicating utf8::append().
What would be the point? (Especially since you don't seem to have the
time for this.)

Since you seem to detest the library solution and prefer to make the
UTF-8 encoding yourself, please let me know how that worked for you. I'm
honestly curious.
 
Zerex71

What I'm getting at is, is it really necessary for me to incorporate
all of that stuff just for three lines of code?
 
Juha Nieminen

Zerex71 said:
What I'm getting at is, is it really necessary for me to incorporate
all of that stuff just for three lines of code?

Are you planning on having more unicode data in your program than just
that one symbol? Or do you think in some future program you might want
more extensive support for unicode?

If your answer to either question was yes, then it definitely will pay
off learning how to handle unicode, UTF-8 and related libraries. It will
save you a lot of work in the future.

Note that handling UTF-8 encoded text directly (without ever
converting it to raw unicode values and back) is not always feasible in
all possible situations. For example, advancing in a UTF-8 encoded
string one character at a time is not trivial because UTF-8 is a
variable-length encoding: Some characters will take more than one byte
(between 2 and 4), and in fact, some of the characters can be composite
characters (in other words, composed of more than one unicode value).

Thus if you ever need to write a program which needs to distinguish
between different Unicode characters (let's say, for example, count the
number of characters in a line), using a Unicode/UTF-8 library will
make it enormously easier than trying to do it yourself.
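To give a feel for the simplest version of the counting problem, here is a minimal sketch that counts code points in a UTF-8 string by skipping continuation bytes (those of the form 10xxxxxx). The function name `count_code_points` is invented for this example, and it assumes the input is already valid UTF-8. It does not handle the composite-character case mentioned above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string.
// Every code point starts with exactly one byte that is NOT a continuation
// byte, so counting those bytes counts the code points.
// Assumption: the input is valid UTF-8 (no validation is done here).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)  // not of the form 10xxxxxx
            ++count;
    }
    return count;
}
```

For the four-byte string "A\xE2\x99\xAD" this returns 2. For "e\xCC\x81" (the letter e followed by a combining acute accent) it also returns 2, although a reader sees one character, which is the kind of case where a dedicated library earns its keep.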
 
Zerex71

Zerex71 wrote:




Implementing a Unicode-to-UTF-8 converter is not really so hard. It takes
about 20-30 lines of C code IIRC.
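For reference, a hand-written encoder along those lines might look like the sketch below. This is not code from the thread; `encode_utf8` is a made-up name, and returning an empty string for invalid input is just one possible convention.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF and anything above U+10FFFF).
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                    // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Working through 0x266D by hand: it falls in the three-byte range, its bits 0010 011001 101101 are spread over the pattern 1110xxxx 10xxxxxx 10xxxxxx, and the result is E2 99 AD.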




Not at all. If all your source and input files are in the right encoding,
there should be no need to do anything. I am sure emacs can handle
files in UTF-8 encoding; I don't know anything about Eclipse.

The reason why everything is UTF-8 in Linux is that all the string
interfaces are 8-bit ASCIIZ for historical reasons, with zero bytes used as
string terminators. UTF-8 fits in here perfectly, allowing Unicode content
to pass through such interfaces even though they were not designed
especially for that.
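The point about zero-terminated interfaces can be seen with strlen(). The sketch below stores "A" followed by U+266D in two encodings; the array and function names are invented for this example.

```cpp
#include <cstddef>
#include <cstring>

// "A" followed by U+266D in UTF-8: bytes 41 E2 99 AD, no zero byte inside.
const char utf8_text[] = "A\xE2\x99\xAD";

// The same text in UTF-16LE: bytes 41 00 6D 26, then a 16-bit terminator.
// The second byte is already zero.
const char utf16le_text[] = "A\x00\x6D\x26\x00";

// Hypothetical helper: how many bytes a NUL-terminated C interface sees.
std::size_t visible_length(const char* s) {
    return std::strlen(s);
}
```

`visible_length(utf8_text)` is 4, so the whole string survives a char* interface, while `visible_length(utf16le_text)` is 1 because the interface stops at the first zero byte.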

<rant>
On the other hand, Microsoft made a premature attempt to standardize
16-bit Unicode, but landed on UTF-16 later when they realized 16 bits are
not enough, ending up with basically using the same trick of passing
variable-length elements through fixed-element-size interfaces which
already were present. Sadly, UTF-16 has no benefits over UTF-8
whatsoever, at least in Western countries. The final outcome for Windows
is that each SDK function having string arguments is present in 2
versions (narrow and wide), and there is a huge pile of nasty macros
trying to leverage that for the user programs.

Well, the thousands of programmers have to be kept busy by something,
right? I hope Linux developers do not have time for such nonsense.
</rant>

hth
Paavo

So, as per my posts, I set the file encoding to UTF-8...made sure my
environment is UTF-8 (Linux locale)...and am trying to determine how
to set the runtime (console) output in Eclipse to be UTF-8 (I have
posted to an Eclipse forum, still no response). I figure I shouldn't
have to do anything else in my code. I don't know why it's not
working.
 
Zerex71

I'm not planning on doing anything extensive with Unicode, which is
why I'm not pursuing a more encompassing route. Obviously if I had a
big Unicode issue on my hand, I would have started to think about a
more long-range solution. I definitely would be thinking of a UTF-8
library or incorporating some functionality but right now this is just
on the nit level.
 
Juha Nieminen

Zerex71 said:
I'm not planning on doing anything extensive with Unicode, which is
why I'm not pursuing a more encompassing route.

Well, suit yourself.

Maybe it's just me, but I still don't find using the library so
difficult as you make it sound. And once you have learned it, using it
again in the future will be a breeze.
 
