Unicode Initialization.

Me

I am trying to compile some code I've gotten from someone else. I know I
need a 16-bit Unicode string, because he passes the pointer to functions
that take a (uint16 *). However, there are initializations that look like
this:

typedef unsigned short int ucs2_char;

....
....
....

static const ucs2_char form_feed[] = L"\f";

The above line in gcc gives me the compiler error: 'invalid initializer'

When I change it to the following, everything works fine.

static const ucs2_char *form_feed = L"\f";


What is up with this error?
 
Stephen Sprunk

Me said:
I am trying to compile some code I've gotten from someone else. I know I
need a 16-bit Unicode string, because he passes the pointer to functions
that take a (uint16 *). However, there are initializations that look like
this:

typedef unsigned short int ucs2_char;

The correct type for UCS2 characters is wchar_t. Fix the code to use the
correct type.

static const ucs2_char form_feed[] = L"\f";

The above line in gcc gives me the compiler error: 'invalid initializer'

When I change it to the following, everything works fine.

static const ucs2_char *form_feed = L"\f";

What is up with this error?

What's up is you're using the wrong type; L"\f" is a wide string literal (an
array of wchar_t), not an array of unsigned short ints. The pointer version
should give you a warning as well, since you're doing an implicit conversion
between wchar_t[] and unsigned short*, but your compiler may not be smart
enough to catch that.


typedef unsigned short int ucs2_char;
static const ucs2_char *form_feed = L"\f";
foo.c:2: warning: initialization from incompatible pointer type

typedef unsigned short int ucs2_char;
static const ucs2_char form_feed[] = L"\f";
foo.c:2: invalid initializer

#include <wchar.h>
static const wchar_t *form_feed = L"\f";
static const wchar_t form_feed[] = L"\f";
[ no compile warnings or errors ]
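
A minimal sketch of the rule behind the two diagnostics, assuming a hosted C compiler such as gcc: a wide string literal can initialize an array only when the element type is compatible with wchar_t, so an array of ucs2_char needs an ordinary brace-enclosed list of code units instead. The names form_feed_wide, form_feed_ucs2, ucs2_form_feed, wide_form_feed_units, and ucs2_length are made up for this illustration and do not come from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

typedef unsigned short int ucs2_char;

/* Accepted: the element type matches the literal's element type, wchar_t. */
static const wchar_t form_feed_wide[] = L"\f";

/* For a 16-bit element type, spell out the code units (including the
   terminating zero) instead of relying on L"". */
static const ucs2_char form_feed_ucs2[] = { 0x000C, 0x0000 };

/* Hypothetical accessor for the 16-bit string. */
const ucs2_char *ucs2_form_feed(void)
{
    return form_feed_ucs2;
}

/* Hypothetical helper: number of elements in the wchar_t array,
   terminator included. */
size_t wide_form_feed_units(void)
{
    return sizeof form_feed_wide / sizeof form_feed_wide[0];
}

/* Hypothetical helper: count code units before the terminating zero. */
size_t ucs2_length(const ucs2_char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

The brace-enclosed form is tedious for long strings, but it does not depend on how wide wchar_t happens to be on the target.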

S
 
those who know me have no need of my name

in comp.lang.c i read:
The correct type for UCS2 characters is wchar_t.

wchar_t is something -- perhaps ucs-2 or utf-16, or something else entirely.
i agree wchar_t should be used, but if each character must be a ucs-2 code
point then wchar_t is not appropriate, and neither should L"" be used for a
literal string.
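
Since the width and encoding of wchar_t are left to the implementation, one way to avoid guessing is to ask the compiler. The sketch below relies only on standard macros (CHAR_BIT, WCHAR_MAX, and the optional __STDC_ISO_10646__); the function names are invented for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>

/* Width of wchar_t in bits on this implementation. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if wchar_t can represent every value up to 0xFFFF. */
int wchar_covers_16_bits(void)
{
    return WCHAR_MAX >= 0xFFFF;
}

/* Nonzero if the implementation promises that wchar_t values are
   ISO 10646 (Unicode) code points. */
int wchar_is_iso_10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical guard for code that stores 16-bit code points in wchar_t:
   aborts (unless NDEBUG is defined) when wchar_t is too narrow. */
void require_16_bit_capable_wchar(void)
{
    assert(wchar_covers_16_bits());
}
```

A result of 32 from wchar_bits() says only that wchar_t is wide enough for any code point; it does not make L"" literals usable as arrays of 16-bit units.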
 
Stephen Sprunk

those who know me have no need of my name said:
in comp.lang.c i read:

wchar_t is something -- perhaps ucs-2 or utf-16, or something else entirely.
i agree wchar_t should be used, but if each character must be a ucs-2 code
point then wchar_t is not appropriate, and neither should L"" be used for a
literal string.

Good point; wchar_t is UCS-2 on every platform I've used so I didn't
consider it might differ on another platform. Either way, I think it's what
the original author (and our poster) intended to use, and it's the simplest
and most portable solution for dealing with Unicode.

S
 
Lew Pitcher

Stephen said:
Good point; wchar_t is UCS-2 on every platform I've used so I didn't
consider it might differ on another platform.

FWIW, I believe that wchar_t can refer to one of the IBM
double-byte-character-set (DBCS) EBCDICs when used in IBM's C compiler
on the mainframe.


--

Lew Pitcher, IT Consultant, Enterprise Application Architecture
Enterprise Technology Solutions, TD Bank Financial Group

(Opinions expressed here are my own, not my employer's)
 
Me

Thanks for your input, but here is the problem.

First, I work for a big telecom company (you are probably using their
phone right now). In my project I am porting the phone code to run on Linux
so developers can debug it. The CDMA specification uses two-byte Unicode
characters, and much of the code uses the L"" initializer.

They create a type called ucs2_char that is unsigned short.

At first I made ucs2_char a wchar_t, but I found out that on Linux
wchar_t is 4 bytes in size (4-byte Unicode, UTF-32).

What do I do? Any suggestions?

Also, is there a type in Linux for 2-byte Unicode (UTF-16)?

And is the L"" initializer, in Linux, only for 4-byte Unicode, or can I
configure this in gcc or Linux?
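
One option that keeps both a 4-byte wchar_t and the 16-bit ucs2_char is to convert at run time. The following is only a sketch: wide_to_ucs2 is a hypothetical helper, not part of the phone code, and it assumes that wchar_t values are Unicode code points (as they are with gcc and glibc on Linux). It does not treat the surrogate range specially.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

typedef unsigned short int ucs2_char;

/* Copy a wchar_t string into a 16-bit buffer, one code point per unit.
   Returns the number of units written (not counting the terminator), or
   (size_t)-1 if a character does not fit in 16 bits or dst is too small. */
size_t wide_to_ucs2(const wchar_t *src, ucs2_char *dst, size_t dst_cap)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; src[i] != L'\0'; i++) {
        /* Go through unsigned long so a negative wchar_t is rejected too. */
        unsigned long cp = (unsigned long)src[i];

        /* Leave room for the terminating zero. */
        if (cp > 0xFFFFUL || i + 1 >= dst_cap)
            return (size_t)-1;
        dst[i] = (ucs2_char)cp;
    }
    if (i >= dst_cap)
        return (size_t)-1;
    dst[i] = 0;
    return i;
}
```

This trades a copy per string for source compatibility: existing L"" literals can stay as they are, and only the call sites that need a 16-bit pointer go through the conversion.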



 
Stephen Sprunk

Me said:
At first I made ucs2_char a wchar_t, but I found out that on Linux
wchar_t is 4 bytes in size (4-byte Unicode, UTF-32).

What do I do? Any suggestions?

Also, is there a type in Linux for 2-byte Unicode (UTF-16)?

And is the L"" initializer, in Linux, only for 4-byte Unicode, or can I
configure this in gcc or Linux?

-fshort-wchar will give you a 2-byte wchar_t (UTF-16, not UCS-2) with gcc
2.97 and later. I haven't tested whether this makes wide string literals
compatible with unsigned short *, but it seems likely.
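
This thread predates C11, which later added char16_t and the u"" literal prefix for this situation. A minimal sketch, assuming a C11 compiler and a C library that provides <uchar.h> (glibc does); redefining ucs2_char as char16_t is a suggestion for illustration, not something the original code does, and form_feed_string and ucs2_length are hypothetical names.

```c
#include <assert.h>   /* assert, static_assert (C11) */
#include <limits.h>
#include <stddef.h>
#include <uchar.h>    /* char16_t (C11) */

/* Assumption: the port is free to change how ucs2_char is defined. */
typedef char16_t ucs2_char;

/* Stop the build if this type could not stand in for the 16-bit
   unsigned short the existing functions expect. */
static_assert(sizeof(ucs2_char) == sizeof(unsigned short) &&
              sizeof(ucs2_char) * CHAR_BIT == 16,
              "ucs2_char must be exactly 16 bits wide");

/* u"" yields an array of char16_t no matter how wide wchar_t is. */
static const ucs2_char form_feed[] = u"\f";

/* Hypothetical accessor for the string above. */
const ucs2_char *form_feed_string(void)
{
    return form_feed;
}

/* Hypothetical helper: count code units before the terminating zero. */
size_t ucs2_length(const ucs2_char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

Unlike -fshort-wchar, this leaves wchar_t and the rest of the C library's wide-character functions untouched, at the cost of changing every L"" in the ported code to u"".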

Any further questions on gcc should be directed to gnu.gcc.help, but this
should get you started:
http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Code-Gen-Options.html#Code%20Gen%20Options

S
 
kal

Lew Pitcher said:
FWIW, I believe that wchar_t can refer to one of the IBM
double-byte-character-set (DBCS) EBCDICs when used in IBM's C compiler
on the mainframe.

Perhaps so, insofar as the size of characters (in bits) is concerned.
Even in this regard, what are called DBCS are sometimes actually MBCS
(MultiByte Character Sets).

EBCDIC descended from punched cards. It went from 6-bit BCD to 8-bit
extended BCD (EBCDIC). But ASCII descended from telegraph. It went
from 5-bit telegraph codes to 7-bit ASCII to 8-bit ASCII etc. These
two schemes implement entirely different code points.

Now, wchar_t almost always refers to UCS-2 or UTF-16. The differences
between UCS-2 and UTF-16 were worked out a few years ago, and as
far as code values are concerned they are both the same at present.
The first 128 characters of these are the same as 7-bit ASCII.
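
For code points up to U+FFFF (outside the surrogate range) the UCS-2 and UTF-16 code units are identical; the two differ only in that UTF-16 can also represent code points above U+FFFF as a surrogate pair. A minimal sketch of that encoding rule, with a function name chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2), or 0 if
   cp is a surrogate or lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;

    if (cp < 0x10000) {
        /* Basic Multilingual Plane: same single unit as UCS-2. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Above U+FFFF: split the 20-bit offset into a surrogate pair,
       which UCS-2 has no way to express. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For the form feed in this thread (U+000C) the result is the single unit 0x000C in either encoding, which is why the distinction does not matter for strings limited to the Basic Multilingual Plane.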
 
