Multi-character constants

Mirco Wahab

After reading through some (open) Intel (CPU detection)
C++ source (www.intel.com/cd/ids/developer/asmo-na/eng/276611.htm)
I stumbled upon a sketchy use of multibyte characters

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

260:
unsigned int VendorID[3] = {0, 0, 0};
try // If CPUID instruction is supported
{
...
}
catch (...)
{
...
}
return (
(VendorID[0] == 'uneG') &&
(VendorID[1] == 'Ieni') &&
(VendorID[2] == 'letn')
);

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This seems to work; gcc 4.2 emits a warning:

"warning: multi-character character constant"

and Visual C++ 9 says nothing at all.

What's the matter with multibyte characters now?
I haven't used them and would be glad to learn
whether they are widely implemented and part of
the standard now, or will be soon.

gcc tells us: (http://gcc.gnu.org/onlinedocs/gcc/Characters-implementation.html)
...
[Characters]
...
The value of a wide character constant containing more than
one multibyte character, or containing a multibyte character
or escape sequence not represented in the extended execution
character set (C90 6.1.3.4, C99 6.4.4.4).
...



Regards & Thanks for clearing this

M.
 
James Kanze

Mirco Wahab wrote:
> After reading through some (open) Intel (CPU detection)
> C++ source (www.intel.com/cd/ids/developer/asmo-na/eng/276611.htm)
> I stumbled upon a sketchy use of multibyte characters
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> 260:
> unsigned int VendorID[3] = {0, 0, 0};
> try // If CPUID instruction is supported
> {
> ...
> }
> catch (...)
> {
> ...
> }
> return (
> (VendorID[0] == 'uneG') &&
> (VendorID[1] == 'Ieni') &&
> (VendorID[2] == 'letn')
> );
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> This seems to work, gcc 4.2 emits a warning:
> "warning: multi-character character constant"
> and Visual C++ 9 says nothing at all.
> What's the matter with multibyte characters now?

First, do you mean multi-byte characters (e.g. UTF-8) or
multicharacter literals? Your example doesn't contain any
multi-byte characters, only multicharacter literals.

> I didn't use them and would be glad to learn if they are
> widely implemented and part of the standard soon/now?

Multicharacter literals are a holdover from the original C. As
far as I can tell, they have no use, and are of no interest
whatsoever. And what they mean is implementation defined. All
of which is probably why g++ warns about them.

Multi-byte characters are becoming more and more frequent as
applications shift to UTF-8, for reasons of
internationalization. True support is still spotty, but getting
there; the next version of the standard will require it (to some
degree---there still won't be functions like isdigit which work
on them).
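
To make the distinction concrete, here is a minimal sketch (not from the original posts; the helper names byte_length and utf8_length are made up for illustration). It assumes C++14 or later and a valid UTF-8 string, and counts characters by skipping continuation bytes, which have the bit pattern 10xxxxxx. The single character U+00E9 (e with acute accent) takes two bytes in UTF-8, and that is what "multi-byte character" refers to. A multicharacter literal such as 'ab' is something else entirely: several characters between single quotes, giving one int.

```cpp
#include <cstddef>

// Number of bytes before the terminating '\0'.
constexpr std::size_t byte_length(const char* s)
{
    std::size_t n = 0;
    while (s[n] != '\0')
        ++n;
    return n;
}

// Number of UTF-8 encoded characters: every byte that is not a
// continuation byte (10xxxxxx) starts a new character.
// Assumes the input is valid UTF-8.
constexpr std::size_t utf8_length(const char* s)
{
    std::size_t n = 0;
    for (; *s != '\0'; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0u) != 0x80u)
            ++n;
    return n;
}
```

With these, "\xC3\xA9" (the UTF-8 encoding of U+00E9) has a byte_length of 2 but a utf8_length of 1.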

> gcc tells us: (http://gcc.gnu.org/onlinedocs/gcc/Characters-implementation.html)
> ...
> [Characters]
> ...
> The value of a wide character constant containing more than
> one multibyte character, or containing a multibyte character
> or escape sequence not represented in the extended execution
> character set (C90 6.1.3.4, C99 6.4.4.4).
> ...

Implementation defined behavior is required to be documented by
the implementation. In this case, you've cut the only
significant bit, a link to the implementation defined behavior,
where you'll find:

The compiler evaluates a multi-character character constant
a character at a time, shifting the previous value left
by the number of bits per target character, and then
or-ing in the bit-pattern of the new character truncated
to the width of a target character. The final
bit-pattern is given type int, and is therefore signed,
regardless of whether single characters are signed or
not (a slight change from versions 3.1 and earlier of
GCC). If there are more characters in the constant than
would fit in the target int the compiler issues a
warning, and the excess leading characters are ignored.

For example, 'ab' for a target with an 8-bit char would
be interpreted as `(int) ((unsigned char) 'a' * 256 +
(unsigned char) 'b')', and '\234a' as `(int) ((unsigned
char) '\234' * 256 + (unsigned char) 'a')'.

(Technically, this documentation only applies to C, I think.
But I would be very surprised if C++ did differently.)

But since this is implementation defined, the above is only
valid for gcc (although it does seem to be a frequent behavior).
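
The rule quoted above can be sketched as a small function. This is only an illustration of the documented gcc behaviour, assuming 8-bit char, a 32-bit int and C++14 or later; pack_gcc_style is a made-up name, not something gcc provides. It also shows why the Intel code spells the constants backwards: as far as I know, CPUID leaf 0 on x86 returns the vendor string with its first character in the low-order byte of the register, so "Genu" arrives as 0x756E6547, which is what gcc computes for 'uneG'.

```cpp
#include <cstdint>

// Mimics the documented gcc evaluation of a multi-character constant:
// shift the running value left by one char width (8 bits assumed),
// then OR in the next character. With a 32-bit result, characters
// beyond the last four are shifted out, i.e. excess leading
// characters are ignored.
constexpr std::uint32_t pack_gcc_style(const char* s)
{
    std::uint32_t value = 0;
    for (; *s != '\0'; ++s)
        value = (value << 8) | static_cast<unsigned char>(*s);
    return value;
}
```

Under these assumptions, pack_gcc_style("ab") is 'a' * 256 + 'b', and pack_gcc_style("uneG") is 0x756E6547. Another compiler is free to produce a different value for the real literal 'uneG'.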
 
James Kanze

> Mirco Wahab wrote:
>> [...]
>> gcc tells us:
>> (http://gcc.gnu.org/onlinedocs/gcc/Characters-implementation.html)
>> ...
>> [Characters]
>> ...
>> The value of a wide character constant containing more than
>> one multibyte character, or containing a multibyte character
>> or escape sequence not represented in the extended execution
>> character set (C90 6.1.3.4, C99 6.4.4.4).
>> ...
> They are part of C++ since before the first Standard, IIRC.
> The problem with them, however, is that the order of the bytes
> in memory depends on the endianness of the system (or other
> factors). Also, they don't have the type 'char', they have
> the type 'int' and their representation is
> implementation-defined (see [lex.ccon]/1).

They were part of K&R C, where a character literal always had
type int. Even in C, however, the only place I've seen them
used was for generating the "magic" for certain types of files
in very early Unix. (Presumably, the author of the code "knew"
what his compiler did.) They're one of those misfeatures which
we can't get rid of for reasons of backwards compatibility.
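
If the goal is only to recognise the vendor string, the multicharacter literals can be avoided altogether. A rough sketch, not taken from the Intel source: it assumes 8-bit bytes, C++14 or later, and that the three values are the EBX, EDX and ECX results of CPUID leaf 0 in that order (which is how the VendorID array quoted at the start of the thread appears to be filled). The names matches and is_genuine_intel are hypothetical.

```cpp
#include <cstdint>

// True if the four bytes of reg, taken from the least significant
// byte upwards, equal the four characters of text. Using shifts
// rather than looking at memory keeps the result independent of the
// host's byte order and of how the compiler evaluates 'abcd'.
constexpr bool matches(std::uint32_t reg, const char* text)
{
    for (int i = 0; i < 4; ++i)
        if (((reg >> (8 * i)) & 0xFFu) != static_cast<unsigned char>(text[i]))
            return false;
    return true;
}

// Assumed register order: EBX, EDX, ECX.
constexpr bool is_genuine_intel(std::uint32_t ebx,
                                std::uint32_t edx,
                                std::uint32_t ecx)
{
    return matches(ebx, "Genu") && matches(edx, "ineI") && matches(ecx, "ntel");
}
```

The comparison then reads in the natural order, and the only implementation-specific part left is obtaining the register values in the first place.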
 
