Data Types Problem

Ben

Hey,

I can't figure out what's going on here. I'm trying to have an unsigned
short pointer point to an array of chars, and I expect the numerical
value of the short to be the sum of the two next chars, where the
first one is left-shifted by 8 bits (see the code below). It seems
that the value is actually the sum where the SECOND char is left-
shifted by 8 bits, and I can't figure out why. Isn't the 'a' the most
significant byte in the value below? Can someone explain this to me?
Thanks.

#include <stdio.h>

int main(int argc, char *argv[])
{ // main program

    unsigned short *lpw;
    char str[3] = "ab";

    lpw = (unsigned short *)&str[0]; // make lpw point to the 'a' in str
    // print out the value of *lpw, as an unsigned short
    printf("*lpw == %hu\n", *lpw);

    // CORRECT ONE
    // print out the equivalent value, by adding 'a' and 'b' shifted properly
    printf("*lpw should be equal to: %d\n", 'a' + ('b' << 8));
    // INCORRECT ONE
    printf("*lpw should be equal to: %d\n", 'b' + ('a' << 8));

} // end main()
 
Victor Bazarov

Ben said:
> I can't figure out what's going on here. I'm trying to have an unsigned
> short pointer point to an array of chars, and I expect the numerical
> value of the short to be the sum of the two next chars, where the
> first one is left-shifted by 8 bits (see the code below). It seems
> that the value is actually the sum where the SECOND char is
> left-shifted by 8 bits, and I can't figure out why.

What's the endianness of your system? If you don't know what it is, you
might want to read about it...
> Isn't the 'a' the most
> significant byte in the value below? Can someone explain this to me?
> [..]

V
 
Ben

Thanks, I figured it out. I also found this great article that
basically explained my exact code:
http://www.ibm.com/developerworks/aix/library/au-endianc/index.html?ca=drs-#list3

As it turns out, my system is little-endian. I never thought
endianness was something programmers had to worry about - now I know
better.
 
joshuamaurice

You need to know more. Read:
http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html
Arbitrary pointer casting, such as a C-style cast or a reinterpret_cast
from char* to short*, *does not work*. It did not work in C. It does
not work in C++. Your code can break on gcc versions 3.4.1 and higher
with -O3. (I think that's the right version number and optimization
level.) For example:

void swap_words(int * x)
{
    short * s = (short*)x;
    short tmp = s[0];
    s[0] = s[1];
    s[1] = tmp;
}

int main()
{
    if (sizeof(int) != 2 * sizeof(short)) return 1;
    int x = 42;
    swap_words(&x);
    return x;
}

On my Linux box with gcc 3.4.3, compiled with g++ test.cpp, the program
returns 0 (the swapped value is 42 << 16, and the exit status keeps only
its low 8 bits). Compiled with g++ test.cpp -O3, the program returns 42.
The C and C++ standards dictate that accessing an object through a
pointer of the wrong kind produces undefined behavior. For the most
part, using the result of a reinterpret_cast or a C-style cast which
cannot be rewritten as a static_cast produces undefined behavior.
Also, casting to void* and then casting the void* back to anything but
the *exact same type*, then using the result, produces undefined
behavior. A pointer to base class or pointer to derived class is not
good enough.
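
For comparison, a sketch of the same word swap done with memcpy, so that the int is never accessed through a short lvalue. The name swap_words_memcpy is made up for illustration, and the sketch keeps the same assumption the main() above checks at run time, sizeof(int) == 2 * sizeof(short).

```cpp
#include <cstring>

// Hypothetical rewrite of swap_words: the bytes of the int are copied into
// two shorts, swapped, and copied back.
void swap_words_memcpy(int *x)
{
    static_assert(sizeof(int) == 2 * sizeof(short), "layout assumption");
    short halves[2];
    std::memcpy(halves, x, sizeof halves);  // copy the int's bytes out
    short tmp = halves[0];
    halves[0] = halves[1];
    halves[1] = tmp;
    std::memcpy(x, halves, sizeof halves);  // copy the swapped bytes back
}
```

With this version the result should not depend on the optimization level, since the compiler has to treat memcpy as possibly touching the int.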

Off the top of my head (emphasizing probably incomplete), the
exceptions are
-1- reading or writing a POD through a char* or unsigned char*. This
may not be explicitly allowed by the standard (there was a fun thread
on this topic earlier this month), but we believe it was the intent as
justified by numerous passages in the C++ standard.
-2- You can reinterpret_cast between a pointer to POD and a pointer to
type of its first element, either way, and use the pointers as
normal.
-3- reinterpret_casting between a pointer to POD and a pointer to a
different POD, as long as you only access the common leading part, if
any. Though this may just be accessing the common leading part if
they're both members of a union, but I imagine the stronger form is
intended, and I would think that any implementation which allows
exception 2 must allow exception 3.

If you really need to reinterpret_cast between things, then you can
use one of the above exceptions, or you can use one of these
alternatives:
- As an extension to the C standard, and either conforming to the C++
standard or as an extension to the C++ standard (depending on your
interpretation of its wording), most compilers allow writing to one
member of a union and reading from a different member, working as
expected.
- memcpy and the other C standard library functions (like memmove)
always work for POD types. This is possibly the only option actually
guaranteed by the standard.
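
As a sketch of the memcpy option applied to a char array like str, something along these lines should do. The name load_ushort is made up for illustration. The value you get still depends on the byte order of the machine, and the caller has to make sure there are at least sizeof(unsigned short) readable chars behind the pointer.

```cpp
#include <cstring>

// Hypothetical helper: copies the first sizeof(unsigned short) bytes at p
// into an unsigned short instead of reading them through a cast pointer.
unsigned short load_ushort(const char *p)
{
    unsigned short value;
    std::memcpy(&value, p, sizeof value);
    return value;
}
```

Usage would be unsigned short w = load_ushort(str); in place of the cast and dereference.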

Let's take your program
#include <stdio.h>

int main(int argc, char *argv[] )
{
unsigned short *lpw;
char str[3] = "ab";
lpw = (unsigned short *)&str[0];
printf("*lpw == %hu\n", *lpw );
printf("*lpw should be equal to: %d\n", 'a' + ('b' << 8) );
printf("*lpw should be equal to: %d\n", 'b' + ('a' << 8) );
}

You could use the union extension to rewrite it correctly as:

#include <stdio.h>
int main(int argc, char *argv[] )
{
unsigned short *lpw;
char str[3] = "ab";

union { char c[2]; unsigned short s; }; //anonymous union
c[0] = str[0];
c[1] = str[1];
lpw = & s;

printf("*lpw == %hu\n", *lpw );
printf("*lpw should be equal to: %d\n", 'a' + ('b' << 8) );
printf("*lpw should be equal to: %d\n", 'b' + ('a' << 8) );
}

All optimizing compilers should eliminate the extra loads and stores
from the extra assignments to the members of the anonymous union,
making it just as fast as if you had gone to assembly or turned off
strict aliasing with the gcc option -fno-strict-aliasing.

Finally, to be thorough, you're making several other assumptions which
are not portable. You're assuming that CHAR_BIT == 8, that there are
8 bits in a char. There might be more. You're then assuming that
sizeof(unsigned short) == 2, that there are 2 chars in an unsigned short.
This again may not be true.
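
If the code really depends on those two assumptions, one way to make them explicit is to have the compiler check them. This is only a sketch and needs a compiler with static_assert (C++11 or later); CHAR_BIT comes from <climits>.

```cpp
#include <climits>

// Fail the build instead of silently misbehaving if the assumptions
// about the platform do not hold.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit chars");
static_assert(sizeof(unsigned short) == 2,
              "this code assumes a 2-char unsigned short");
```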
 
joshuamaurice

Sorry, one last thing. When doing such type punning hackery, note that
all bit representations are valid values for char and unsigned char.
However, if you write to a short through a char*, and then read that
short, you may have written a trap representation or something,
possibly causing a signal which would kill your process. (Insert
various other scenarios.) Type punning can be implementation specific,
and you should know what you're doing.
 
James Kanze

> I can't figure out what's going on here. I'm trying to have an
> unsigned short pointer point to an array of chars, and I
> expect the numerical value of the short to be the sum of the
> two next chars, where the first one is left-shifted by 8 bits
> (see the code below).

Why would you expect something like that? Accessing elements in
an array of char's through a pointer to unsigned short is
undefined behavior. On at least one of the systems I use, it
can result in a core dump. Don't do it.
> It seems that the value is actually the sum where the SECOND
> char is left-shifted by 8 bits, and I can't figure out why.

Because it is undefined behavior. In general, you can't access
a type A through an lvalue of type B without undefined behavior.
(The one exception: you can access any type through an lvalue
of type char or unsigned char.)
> Isn't the 'a' the most significant byte in the value below?
> Can someone explain this to me? Thanks.
> #include <stdio.h>
> int main(int argc, char *argv[] )
> { // main program
> unsigned short *lpw;
> char str[3] = "ab";
> lpw = (unsigned short *)&str[0]; // make lpw point to the 'a' in str

lpw can't point to the 'a' in str. The 'a' in str has type
char, and only a char* can point to it.

Note that the alignment requirements of unsigned short and char
might be different. It's quite possible that on some machines,
lpw could not possibly point to the 'a' of str, and it's
certainly the case that on most machines, trying to access
through lpw, if it does point to the 'a' in str, may cause a
core dump. (But since it's undefined behavior, only some of the
time. Depending, e.g. on what you've defined before or after
str.)

The conversion you do is a reinterpret_cast. It should only be
done in very low level code, and only if you really, really know
what you're doing.
> printf("*lpw == %hu\n", *lpw ); // print out the value of *lpw, as an unsigned short
> // CORRECT ONE
> printf("*lpw should be equal to: %d\n", 'a' + ('b' << 8) ); // print out the equivalent value, by adding 'a' and 'b' shifted properly
> // INCORRECT ONE
> printf("*lpw should be equal to: %d\n", 'b' + ('a' << 8) ); // print out the equivalent value, by adding 'a' and 'b' shifted properly

The correct way to do this is exactly what you've written:

unsigned short value = str[0] << 8 | str[1] ;
or
unsigned short value = str[1] << 8 | str[0] ;

depending on what you want.
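
One caveat with the shift expressions: plain char may be signed, so a byte value above 127 can sign-extend when it is promoted to int. A sketch that converts through unsigned char first is below. The function names are made up for illustration, and the code assumes 8-bit chars.

```cpp
// Hypothetical helpers, assuming CHAR_BIT == 8. Going through unsigned char
// keeps a negative char from sign-extending into the high bits.
unsigned short first_char_high(const char *p)
{
    unsigned hi = static_cast<unsigned char>(p[0]);
    unsigned lo = static_cast<unsigned char>(p[1]);
    return static_cast<unsigned short>(hi << 8 | lo);
}

unsigned short second_char_high(const char *p)
{
    unsigned lo = static_cast<unsigned char>(p[0]);
    unsigned hi = static_cast<unsigned char>(p[1]);
    return static_cast<unsigned short>(hi << 8 | lo);
}
```

Both functions give the same result on every machine, whatever its byte order, because the value is built arithmetically rather than read from memory as a short.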
 
James Kanze

The problem in your code isn't endianness. The problem is that
you have undefined behavior.
> Isn't the 'a' the most
> significant byte in the value below? Can someone explain this to me?
> [..]
> Thanks, I figured it out. I also found this great article that
> basically explained my exact code: http://www.ibm.com/developerworks/aix/library/au-endianc/index.html?c...

That's not a "great article". It's a good presentation of how
not to do things, full of hackers' solutions which only work on a
few specific machines.
> As it turns out, my system is little-endian. I never thought
> endianness was something programmers had to worry about - now
> I know better.

Endianness is normally something you don't have to worry about.
Unless you're doing something seriously wrong.
 
James Kanze

> Sorry, one last thing. When doing such type punning hackery,
> note that all bit representations are valid values for char
> and unsigned char. However, if you write to a short through a
> char*, and then read that short, you may have written a trap
> representation or something, possibly causing a signal which
> would kill your process. (Insert various other scenarios.)
> Type punning can be implementation specific, and you should
> know what you're doing.

He's also assuming that a pointer to a char will always be
sufficiently aligned to access an unsigned short. This is not
guaranteed on most of the systems I know---his code will likely
core dump on a Sun Sparc, for example.
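
To see whether a particular pointer happens to be suitably aligned, a rough check along these lines can be used. The name aligned_for_ushort is made up for illustration. The sketch assumes that converting a pointer to std::uintptr_t yields its numeric address, which is implementation-defined but holds on common platforms. Passing the check does not make the cast and dereference legal; it only shows the alignment part of the problem.

```cpp
#include <cstdint>

// Hypothetical helper: reports whether p is a multiple of the alignment
// that unsigned short requires on this implementation.
bool aligned_for_ushort(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(unsigned short) == 0;
}
```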
 
