Byte ordering and array access

  • Thread starter Benjamin M. Stocks

Robin Haigh

stathis gotsis said:
That was food for thought, but I think you went too low-level. Yes, memory
hides all internal implementation details, collects the 8 bits of a byte,
which may be scattered on the chip, and gives you the byte. I believe that
the real question is whether we can access pieces of data smaller than a
byte in a real memory. If we cannot, then all possible processor-specific
endiannesses are just the ways we can put two or more bytes into some piece
of memory.


You need two different orderings before you can discuss how they relate to
each other.

When you store a 16-bit unsigned integer value into 2 bytes of
byte-addressable memory (and this didn't arise before byte-addressing), by
common custom and convention (but no absolute rule) you encode it base-256,
i.e. the byte values you store will be x/256 and x%256.

On that assumption, you now have an ordering by significance -- one byte is
the "big" byte -- and also an ordering by memory address, so you can talk
about which byte (by significance) is the low-address byte, i.e. endianness.
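
As a rough sketch of the paragraph above (the helper names are made up for illustration, and the value is assumed to fit in 16 bits), the same two byte values x/256 and x%256 can be laid out in either address order, and both layouts decode to the same value:

```c
#include <assert.h>

/* Hypothetical helpers, not from any library. Assumes x <= 65535. */

/* Big-endian: the "big" byte x/256 goes at the low address. */
static void store16_big(unsigned char *p, unsigned int x)
{
    p[0] = (unsigned char)(x / 256);
    p[1] = (unsigned char)(x % 256);
}

/* Little-endian: the "little" byte x%256 goes at the low address. */
static void store16_little(unsigned char *p, unsigned int x)
{
    p[0] = (unsigned char)(x % 256);
    p[1] = (unsigned char)(x / 256);
}

/* Decoding only needs to know which address holds which significance. */
static unsigned int load16_big(const unsigned char *p)
{
    return p[0] * 256u + p[1];
}

static unsigned int load16_little(const unsigned char *p)
{
    return p[1] * 256u + p[0];
}

/* Both layouts hold the same two byte values, just at swapped addresses. */
static void check16(unsigned int x)
{
    unsigned char b[2], l[2];

    store16_big(b, x);
    store16_little(l, x);
    assert(load16_big(b) == x && load16_little(l) == x);
    assert(b[0] == l[1] && b[1] == l[0]);
}
```

Calling check16(0x1234u) stores 0x12 and 0x34 in both arrays; only the addresses they land at differ.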

With the bits involved in bitwise operations, you have an ordering by
significance, but only that. There's no low-address end or left-hand end or
any other positional description. You can certainly access the LSB, but
every way of doing so refers to it by significance, essentially. So you
can't talk about relative bit-ordering, because you can't see anything for
it to be relative to.

Of course this changes when you serialise the bits in a byte onto a serial
communications line. Then, you do have another ordering, so the hardware
does have to agree on the bit-endianness and reassemble the byte values as
transmitted. But, unlike the cpu vendors, the bus and network vendors (by
some miracle) do have this all sorted out, and we don't actually get to see
bit-swapped bytes, so we treat it as a non-issue. The danger that you fear
was potentially real, but has been averted.
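
To make the serial-line case concrete, here is a minimal sketch (hypothetical helper names, 8-bit bytes assumed) that puts one byte value "on the wire" as an array of bits in each of the two possible orders. It only models the idea; real hardware does this below the level a C program can see.

```c
#include <assert.h>

/* Illustrative only: wire[0] is the first bit sent, wire[7] the last. */

static void send_lsb_first(unsigned int v, unsigned char wire[8])
{
    int i;
    for (i = 0; i < 8; i++)
        wire[i] = (unsigned char)((v >> i) & 1u);
}

static void send_msb_first(unsigned int v, unsigned char wire[8])
{
    int i;
    for (i = 0; i < 8; i++)
        wire[i] = (unsigned char)((v >> (7 - i)) & 1u);
}

/* The receiver has to assume the same order to rebuild the value. */
static unsigned int recv_lsb_first(const unsigned char wire[8])
{
    unsigned int v = 0;
    int i;
    for (i = 0; i < 8; i++)
        v |= (unsigned int)wire[i] << i;
    return v;
}

static unsigned int recv_msb_first(const unsigned char wire[8])
{
    unsigned int v = 0;
    int i;
    for (i = 0; i < 8; i++)
        v = (v << 1) | wire[i];
    return v;
}

/* Matching orders round-trip; this is what agreeing hardware guarantees. */
static void check_wire(unsigned int v)
{
    unsigned char wire[8];

    send_lsb_first(v, wire);
    assert(recv_lsb_first(wire) == v);
    send_msb_first(v, wire);
    assert(recv_msb_first(wire) == v);
}
```

If the two ends disagreed, the receiver would see the bit-reversed byte: sending 0x1D MSB-first and reading it LSB-first yields 0xB8.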


The terms "left-shift" and "right-shift" are motivated by the fact that in
America and many other countries, when numbers are written down in
place-value notation, we write the big end on the left. If numbers were
normally written the other way round, e.g. 000,000,1 for a million, the
names would have been reversed. This hasn't got anything to do with cpu
architecture.
 

Chris Torek

Well, in my previous example I allowed someone to split the 2-byte whole
into 4-bit entities. I was wondering if that can happen in real-life
systems. Is the byte the atom (the smallest entity that cannot be further
split) in the context of endianness? If it is, then my example resides in
the field of imagination.

So, the question becomes, is there a con "struct" in C that will
allow you to ask someone/something to split a value into, say,
"two:4" bit pieces? :)

struct S {
unsigned int a:4, b:4, c:4, d:4;
} x = { 7, 1, 8, 10 };

The next question might be: "who is doing the splitting?" (There
are two possible answers. Either the compiler is doing it all on
its own, as directed by whoever wrote that compiler; or the compiler
is doing it with some assistance from the CPU. In the first case,
the compiler-writer chooses the endian-ness. In the second, the
compiler-writer colludes with the chip-maker to choose the
endianness. Note that even chip designers sometimes change their
minds: the numbering of bits on the 680x0 is different for the Bxxx
instructions on the original 68000, and the BFxxx instructions that
were added to the 68020 or 68030 [I forget which].)
The question you should ask yourself is: who is this entity that
is splitting up your whole, and why are you giving him, her, or it
permission to do so? What will he/she/it do with the pieces? Who
or what will re-assemble them later, and will all the various
entities doing this splitting-up and re-assembling cooperate?

If *you* do the splitting-up yourself:

unsigned char split[4];
unsigned long value;

split[0] = (value >> 24) & 0xff;
split[1] = (value >> 16) & 0xff;
split[2] = (value >> 8) & 0xff;
split[3] = value & 0xff;

and *you* do the re-assembling later:

value = (unsigned long)split[0] << 24;
value |= (unsigned long)split[1] << 16;
value |= (unsigned long)split[2] << 8;
value |= (unsigned long)split[3];

will you co-operate with yourself? Will that guarantee that you
get the proper value back?

I think I will get the proper value back.

Indeed you will.

The same applies to bitfields.

C's bitfields are very tempting to the embedded-systems programmer
writing, e.g., a SCSI or USB IO system (it turns out that USB is
essentially "SCSI over serial lines", as far as protocol goes
anyway). SCSI -- and hence USB -- disk commands and responses
are full of sub-byte fields. C's bitfields *appear* to map to
SCSI bitfields ... but if you use them for this, you give up all
control to the compiler and/or CPU, and those may not arrange your
fields the way you intended.

If you write out explicit shift-and-mask code, it will work on
every system that is actually capable of supporting the hardware.
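
As a small sketch of what explicit shift-and-mask code for a sub-byte field can look like: the layout below (a 3-bit field in bits 7..5 and a 5-bit field in bits 4..0 of one byte) and the function names are invented for illustration, not taken from any SCSI or USB specification.

```c
#include <assert.h>

/* Hypothetical layout: bits 7..5 = 3-bit field, bits 4..0 = 5-bit field. */

static unsigned char pack_3_5(unsigned int hi3, unsigned int lo5)
{
    /* Mask each field to its width, then move it to its position. */
    return (unsigned char)(((hi3 & 0x07u) << 5) | (lo5 & 0x1fu));
}

static unsigned int get_hi3(unsigned char b)
{
    return (b >> 5) & 0x07u;
}

static unsigned int get_lo5(unsigned char b)
{
    return b & 0x1fu;
}

/* The field positions are fixed by the code, not by the compiler or CPU. */
static void check_pack(unsigned int hi3, unsigned int lo5)
{
    unsigned char b = pack_3_5(hi3, lo5);

    assert(get_hi3(b) == (hi3 & 0x07u));
    assert(get_lo5(b) == (lo5 & 0x1fu));
}
```

For example, pack_3_5(5, 0x12) yields 0xB2 on any implementation, whereas a struct with 3-bit and 5-bit bitfields may place the fields in either order.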
 

Joe Wright

Robin said:
The terms "left-shift" and "right-shift" are motivated by the fact that in
America and many other countries, when numbers are written down in
place-value notation, we write the big end on the left. If numbers were
normally written the other way round, e.g. 000,000,1 for a million, the
names would have been reversed. This hasn't got anything to do with cpu
architecture.

Very good Robin. You have nailed it well. Our conventional number system
is said to be 'Arabic'. This is not because we might recognize the digit
4 in Arabic, but because numbers are written right to left, low order
first with place and value reserved for the concept of zero.
 

stathis gotsis

You need two different orderings before you can discuss how they relate to
each other.

When you store a 16-bit unsigned integer value into 2 bytes of
byte-addressable memory (and this didn't arise before byte-addressing), by
common custom and convention (but no absolute rule) you encode it base-256,
i.e. the byte values you store will be x/256 and x%256.

On that assumption, you now have an ordering by significance -- one byte is
the "big" byte -- and also an ordering by memory address, so you can talk
about which byte (by significance) is the low-address byte, i.e.
endianness.


Yes, I was curious whether real systems use encodings other than this
common one, leading to other possibilities for endianness.

With the bits involved in bitwise operations, you have an ordering by
significance, but only that. There's no low-address end or left-hand end or
any other positional description. You can certainly access the LSB, but
every way of doing so refers to it by significance, essentially. So you
can't talk about relative bit-ordering, because you can't see anything for
it to be relative to.

So, I come to the conclusion that shifting operations hide endianness from
the programmer. Maybe one could reveal endianness this way?

#include <stdio.h>

int main(void)
{
    int i = 0;
    unsigned int a = 0xabcdabcd;
    unsigned char *b;
    b = (unsigned char *)&a;

    /* Print each byte of a, lowest address first. */
    while (i < (int)sizeof(a))
    {
        printf("%d byte: %x\n", i + 1, (unsigned int)b[i]);
        i++;
    }

    return 0;
}

Of course this changes when you serialise the bits in a byte onto a serial
communications line. Then, you do have another ordering, so the hardware
does have to agree on the bit-endianness and reassemble the byte values as
transmitted. But, unlike the cpu vendors, the bus and network vendors (by
some miracle) do have this all sorted out, and we don't actually get to see
bit-swapped bytes, so we treat it as a non-issue. The danger that you fear
was potentially real, but has been averted.
The terms "left-shift" and "right-shift" are motivated by the fact that in
America and many other countries, when numbers are written down in
place-value notation, we write the big end on the left. If numbers were
normally written the other way round, e.g. 000,000,1 for a million, the
names would have been reversed. This hasn't got anything to do with cpu
architecture.

Yes, that is clear to me now.
 

Rod Pemberton

stathis gotsis said:
So, I come to the conclusion that shifting operations hide endianness from
the programmer. Maybe one could reveal endianness this way?

#include <stdio.h>

int main(void)
{
    int i = 0;
    unsigned int a = 0xabcdabcd;
    unsigned char *b;
    b = (unsigned char *)&a;

    /* Print each byte of a, lowest address first. */
    while (i < (int)sizeof(a))
    {
        printf("%d byte: %x\n", i + 1, (unsigned int)b[i]);
        i++;
    }

    return 0;
}


There are two standard methods to determine endianness. See the code below.
There are big-endian and little-endian machines. Old 16-bit little-endian
machines (PDP-11) became middle-endian in a 32-bit world. Will the same
happen to little-endian 32-bit Intel CPUs in a 64-bit world? And what will
they call it? Middle-middle-endianness?

Wikipedia on endianness:
http://en.wikipedia.org/wiki/Endianness

Rod Pemberton
----

#include <stdio.h>

/* Used by Method 2 to view a long as its individual bytes. */
union { long Long; char Char[sizeof(long)]; } u;

int main (void)
{
    /* Method 1: read the lowest-addressed byte of an int via a char pointer. */
    int x = 1;

    if ( *(char *)&x == 1)
        printf("Register addressing is right-to-left,"
               " LSB stored in memory first, \n "
               "or little endian - memory addressing "
               "decreases from MSB to LSB.\n");
    else
        printf("Register addressing is left-to-right,"
               " MSB stored in memory first, \n "
               "or big endian - memory addressing "
               "increases from MSB to LSB .\n");

    /* Method 2: store 1 in the long and check the first and last bytes. */
    u.Long = 1;
    if (u.Char[0] == 1)
        printf("Register addressing is right-to-left,"
               " LSB stored in memory first, \n "
               "or little endian - memory addressing "
               "decreases from MSB to LSB.\n");
    else if (u.Char[sizeof(long)-1] == 1)
        printf("Register addressing is left-to-right, "
               "MSB stored in memory first, \n "
               "or big endian - memory addressing "
               "increases from MSB to LSB .\n");
    else
        printf("Addressing is strange\n");

    return 0;
}

/* MSB - most significant byte */
/* LSB - least significant byte */
/* */
/* little endian - the endian (i.e., LSB) is at the little address in memory */
/*   memory addressing decreases from MSB to LSB */
/*   LSB stored in memory first */
/*   register addressing is right-to-left */
/* big endian - the endian (i.e., LSB) is at the big address in memory */
/*   memory addressing increases from MSB to LSB */
/*   MSB stored in memory first */
/*   register addressing is left-to-right */
/* */
/*   big     little    endian order */
/*   0123    0123      memory addresses 0,1,2,3 */
/*   ASDF    ASDF      memory order, character A stored at 0, etc. */
/* */
/*   ASDF    FDSA      register order, MSB...LSB */
/*   M..L    M..L      MSB is bits 24-31 */
/*   S..S    S..S      LSB is bits 0-7 */
/*   B..B    B..B */
/* */
/* big endian order improves string processing and */
/* eliminates the need for special string instructions, */
/* thereby reducing the instruction set for the cpu (RISC), */
/* but arithmetic and branching need more circuitry to adjust */
/* for the changing location of the LSB depending on data size. */
/* Words or double words can be loaded into a register */
/* and the string byte ordering remains the same. */
/* little endian order improves arithmetic and branching by */
/* keeping the LSB in the same location independent of data */
/* size and eliminates the need for extra math circuitry, */
/* but needs string instructions, which create a larger */
/* instruction set for the cpu (CISC). */
/* Loading words or double words into a register using */
/* integer instructions (non-string instructions) reverses */
/* the string byte order, i.e., AS -> SA, ASDF -> FDSA */
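
Both methods above look at only one or two bytes, so they cannot tell a middle-endian layout from the other two. A possible variation, sketched here under the assumption that uint32_t from <stdint.h> is available and occupies exactly four bytes, stores a value whose four bytes are all different and compares the whole pattern. The function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* b holds the bytes of the 32-bit value 0x01020304, lowest address first. */
static const char *classify_order(const unsigned char b[4])
{
    static const unsigned char big[4]    = { 1, 2, 3, 4 };
    static const unsigned char little[4] = { 4, 3, 2, 1 };
    static const unsigned char middle[4] = { 2, 1, 4, 3 }; /* PDP-11 style */

    if (memcmp(b, big, 4) == 0)    return "big";
    if (memcmp(b, little, 4) == 0) return "little";
    if (memcmp(b, middle, 4) == 0) return "middle";
    return "strange";
}

/* Inspect the machine this runs on. Assumption: sizeof(uint32_t) == 4. */
static const char *host_order(void)
{
    uint32_t v = 0x01020304u;
    unsigned char b[4];

    assert(sizeof v == 4);
    memcpy(b, &v, sizeof v);   /* copy out the object representation */
    return classify_order(b);
}
```

A caller could simply print the result, for example with puts(host_order()). The answer is a property of the platform, so no particular output is claimed here.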
 

Kenny McCormack

....
There are two standard methods to determine endianness. See the code below.
There are big-endian and little-endian machines. Old 16-bit little-endian
machines (PDP-11) became middle-endian in a 32-bit world. Will the same
happen to little-endian 32-bit Intel CPUs in a 64-bit world? And what will
they call it? Middle-middle-endianness?

Not portable. Can't discuss it here. Blah, blah, blah.

(Like mental illness and union SREGS...)
 

Keith Thompson

Joe Wright said:
Very good Robin. You have nailed it well. Our conventional number system
is said to be 'Arabic'. This is not because we might recognize the digit
4 in Arabic, but because numbers are written right to left, low order
first with place and value reserved for the concept of zero.

<OT>
Yes, our numbering system is called "Arabic" (or "Hindu-Arabic") --
but it's precisely because the Europeans adopted the system from the
Arabs, including the appearance of the digits.

Google "Hindu-Arabic numbers" for details.

Our confusion over big-endian vs. little-endian numeric
representations probably goes back to the fact that Arabic is written
right-to-left, most European languages are written left-to-right, but
Europe adopted Arabic numbers without changing the order in which
they're written. (I'm not 100% certain on that last point.)
</OT>
 

stathis gotsis


Thank you very much Rod.
 

Joe Wright

Keith said:
<OT>
Yes, our numbering system is called "Arabic" (or "Hindu-Arabic") --
but it's precisely because the Europeans adopted the system from the
Arabs, including the appearance of the digits.

Google "Hindu-Arabic numbers" for details.

Our confusion over big-endian vs. little-endian numeric
representations probably goes back to the fact that Arabic is written
right-to-left, most European languages are written left-to-right, but
Europe adopted Arabic numbers without changing the order in which
they're written. (I'm not 100% certain on that last point.)
</OT>
<OT>
If you look at the ten Arabic numerals in Arabic, you will not recognize
many if any. And none of it has to do with endianness. That's a
Lilliputian thing about eggs and which end of them to open. :)
</OT>
 

Robin Haigh

stathis gotsis said:
endianness.


Yes, I was curious whether real systems use encodings other than this
common one, leading to other possibilities for endianness.

The issue that the standard goes out of its way to mention is padding bits.
Presumably there was a reason for this. Padding bits are hidden in the
integer value, but exposed in the byte-values.

So, once you've decided to allow for padding bits, it would be tortuous and
pointless to try to say anything else about byte-encoding. Arbitrary
padding makes the issue much more general -- byte-swapping and bit-shuffling
are reduced to special cases. The byte-values can be much more complex
functions of the integer value being encoded, so you may as well just say
they can be any reversible function, regardless of what hardware is out
there. Code will be portable so long as it doesn't try to access the bytes
of the object representation by address (other than for simple copying), and
it won't be if it does.
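
A minimal sketch of the "simple copying" exception, using the standard memcpy (the function names are made up here): the bytes of the object representation are moved but never inspected, so neither byte order nor padding bits can affect the result.

```c
#include <assert.h>
#include <string.h>

/* Copy an unsigned int through a byte buffer without interpreting the bytes. */
static unsigned int copy_via_bytes(unsigned int x)
{
    unsigned char raw[sizeof x];
    unsigned int y;

    memcpy(raw, &x, sizeof x);   /* the bytes stay opaque */
    memcpy(&y, raw, sizeof y);   /* same bytes back into the same type */
    return y;
}

/* The value survives whatever the layout of raw[] happens to be. */
static void check_copy(unsigned int x)
{
    assert(copy_via_bytes(x) == x);
}
```

What would not be portable is any code that reads raw[0] and draws a conclusion about the value of x from it.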


So, I come to the conclusion that shifting operations hide endianness from
the programmer.

I would have said that endianness disappears when you fetch values from
byte-addressed memory into the processor. For an operation not to "hide
endianness", you would have to have some way of saying which is the "left"
or perhaps "leading" end independently of significance.

Maybe one could reveal endianness this way?

#include <stdio.h>

int main(void)
{
    int i = 0;
    unsigned int a = 0xabcdabcd;
    unsigned char *b;
    b = (unsigned char *)&a;

    /* Print each byte of a, lowest address first. */
    while (i < (int)sizeof(a))
    {
        printf("%d byte: %x\n", i + 1, (unsigned int)b[i]);
        i++;
    }

    return 0;
}


Yes, you can do that. The output will be platform-dependent, or worse if
there are padding bits. You will get clues (though not totally unambiguous
information) about the way your platform splits an unsigned int into bytes
and the order of those bytes in memory.

You won't learn anything about the bitwise physical storage of byte-values
in memory bytes or unsigned values in the processor. Bits are only exposed
to the programmer as powers of two ordered by significance. (Obviously
these "value bits" correspond to hardware bits in all normal hardware. But
the standard doesn't actually specify the hardware or depend on it being
normal -- on a really bizarre architecture, e.g. not based on powers of 2,
the real bits would have to be hidden and the value bits of the standard
would have to be emulated.)

Highly non-portable code uses the above type-punning method to write binary
values into fixed external data formats such as network packets. As several
people said earlier, you can do the equivalent job portably using either
arithmetic operations or shift/mask operations.
 

CBFalconer

Keith said:
.... snip ...

Our confusion over big-endian vs. little-endian numeric
representations probably goes back to the fact that Arabic is
written right-to-left, most European languages are written
left-to-right, but Europe adopted Arabic numbers without
changing the order in which they're written. (I'm not 100%
certain on that last point.) </OT>

I am five and thirty percent sure of something or other. I.e.
European languages have not been devoid of endian wars in the past.

--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
 

CBFalconer

Robin said:
.... snip ...

Highly non-portable code uses the above type-punning method to
write binary values into fixed external data formats such as
network packets. As several people said earlier, you can do the
equivalent job portably using either arithmetic operations or
shift/mask operations.

That's because shift operations are defined in terms of values, as
multiplication or division by powers of 2. That's also why you should
limit shift operations to unsigned ints: shifting negative signed
values is undefined or implementation-defined.
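
A short sketch of that value-based definition for unsigned operands. The helper names are invented for this example, and the width check assumes unsigned int has no padding bits.

```c
#include <assert.h>
#include <limits.h>

/* Computes 2 to the power n; n must be smaller than the width of unsigned int. */
static unsigned int two_to_the(unsigned int n)
{
    unsigned int p = 1;

    while (n--)
        p *= 2u;
    return p;
}

/* For unsigned x: x << n is x * 2^n (wrapping modulo 2^width),
   and x >> n is x / 2^n. Byte order in memory never enters into it. */
static void check_shift(unsigned int x, unsigned int n)
{
    assert(n < sizeof(unsigned int) * CHAR_BIT);
    assert((x << n) == x * two_to_the(n));
    assert((x >> n) == x / two_to_the(n));
}
```

For example, check_shift(0xabcdu, 4) passes on big-endian and little-endian machines alike, because both sides of each comparison are computed from values rather than from bytes in memory.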

 

Keith Thompson

CBFalconer said:
I am five and thirty percent sure of something or other. I.e.
European languages have not been devoid of endian wars in the past.

Yoda fan club member of I am.

Or, to quote a bumper sticker I saw a reference to some years ago:

4TH [HEART] IF HONK THEN

Once I get that time machine I keep talking about, I'm going to get
the original CPU designers together and get them to settle on one
consistent endianness. I don't care which, just pick one.
 

ozbear

<OT>
Yes, our numbering system is called "Arabic" (or "Hindu-Arabic") --
but it's precisely because the Europeans adopted the system from the
Arabs, including the appearance of the digits.

Google "Hindu-Arabic numbers" for details.

Our confusion over big-endian vs. little-endian numeric
representations probably goes back to the fact that Arabic is written
right-to-left, most European languages are written left-to-right, but
Europe adopted Arabic numbers without changing the order in which
they're written. (I'm not 100% certain on that last point.)
</OT>

Slightly backwards... while Arabic text is written right to left,
numbers in Arabic are written left to right.

oz
 
