Richard Smith
I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:
int i = 'foo';
I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (as does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.
Using GCC on i386, I find that
'foo' == ('f' << 16 | 'o' << 8 | 'o');
Because i386 is little-endian, this implies it lays out the literal as
"oof\0", and this is confirmed if I look at the object code
generated. I must admit, this surprised me. Certainly this choice is
permitted, and it's easiest for the compiler to parse as it's just a
base-256 integer. But the only sensible reason I can think of for
using multicharacter literals is when doing binary I/O. Short strings
the length of a machine word appear in a number of binary formats --
e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
and "WAVE" in the WAV audio format. If I were writing in assembly, I
might well convert these manually to 32-bit integers and then simply
dump them; and I can possibly imagine wanting to do that in C or C++
when writing low-level code. But if I do that with GCC's
multicharacter literals, they have the wrong byte order: I would have
to dump 'EVAW' instead of 'WAVE'.
It seems unlikely that GCC would make an inconvenient implementation
choice for no good reason, so presumably, then, there is (or once was)
another use for these that's eluding me. Can anyone suggest what it
is?
Richard