List of undefined behaviour and other sneeky bugs

Jens Gustedt · May 3, 2012

Am 05/03/2012 09:46 PM, schrieb Keith Thompson:

You have an unnamed union. Some compilers support this as an extension,
but it's not valid in standard C.

C11 has anonymous structures and unions (6.7.2.1 p 13)

Jens

John Reye · May 3, 2012

By the way: sometimes there might be good reasons to use unions.
Here is an example to set the most significant byte of an unsigned
long long to 0x12U.
I'm pretty sure that nothing will be faster than this ->

typedef unsigned long long my_ull;

union ACCESS_FAST {
my_ull v;
unsigned char ca[sizeof(my_ull)];
};

my_ull var;
((union ACCESS_FAST *) &var)->ca[sizeof(union ACCESS_FAST)-1] =
0x12U;
printf("%llx\n", var);

Is there any problem with this code?

Alternative:

union ACCESS_FAST var2;
var.ca[sizeof(var2.ca)-1] = 0x12U;
printf("%llx\n", var2.v);

Event better is this, which uses the macro MSB_LVALUE:

#include <stdio.h>

#define MSB_LVALUE(var) (((unsigned char*)&(var))[sizeof(var)-1])

int main(void)
{
typedef unsigned long long my_ull;

my_ull var2 = (my_ull)-1ULL;
MSB_LVALUE(var2) = 0x12U;
printf("%llx\n", var2);

unsigned i = (unsigned)-1U;

MSB_LVALUE(i) = 0x12U;
printf("%x\n", i);
}

Nothing wrong with this, right?
Thanks.

Ben Pfaff · May 3, 2012

John Reye said:
So if you use a particular union access member... for a particular
union in memory, then you should always access that particular union-
variable, with the same member?

Is that the only intended use?

I believe that is the intention. I couldn't find relevant text
in the Rationale.

In particular: is it a misuse of unions, to use them in order to
enable different memory-accesses (word, byte, etc.) to the same
portion of memory???? I have (mis)used it like this in above posts
quite often.
Is that strictly wrong?

I think there might be an "out" for accessing any type as an
array of unsigned char this way, but I believe that in general
this is strictly wrong.

I use "memcpy" to achieve a similar effect.

John Reye · May 3, 2012

By the way: sometimes there might be good reasons to use unions.
Here is an example to set the most significant byte of an unsigned
long long to 0x12U.
I'm pretty sure that nothing will be faster than this ->

Click to expand...

typedef unsigned long long my_ull;

Click to expand...

union ACCESS_FAST {
my_ull v;
unsigned char ca[sizeof(my_ull)];
};

Click to expand...

my_ull var;
((union ACCESS_FAST *) &var)->ca[sizeof(union ACCESS_FAST)-1] =
0x12U;
printf("%llx\n", var);

Click to expand...

Is there any problem with this code?

Alternative:

Click to expand...

union ACCESS_FAST var2;
var.ca[sizeof(var2.ca)-1] = 0x12U;
printf("%llx\n", var2.v);

Click to expand...

Event better is this, which uses the macro MSB_LVALUE:

#include <stdio.h>

#define MSB_LVALUE(var) (((unsigned char*)&(var))[sizeof(var)-1])

int main(void)
{
typedef unsigned long long my_ull;

my_ull var2 = (my_ull)-1ULL;
MSB_LVALUE(var2) = 0x12U;
printf("%llx\n", var2);

unsigned i = (unsigned)-1U;

MSB_LVALUE(i) = 0x12U;
printf("%x\n", i);

}

Nothing wrong with this, right?
Thanks.

Ah sorry... the above does NOT set the MSB. It would happen to do that
only on little endian machines, but not on big endian.
The code above rather does something else: It sets the largest-
addressed bytes inside the variable.

I cannot see anyway to optimize access to the real MSB (most signif.
byte) in a platform-independant way... in the same way as a union
would provide, unless I also check endianness.

If I use the "normal" way, there always seems to be a lot of shifting
involved.

But anyway... here is a nice macro hack, that does the shifting etc.
for all unsigned types:

#include <stdio.h>
#include <limits.h>

#define GET_TYPE_INT_OR_LARGER_0(var) (var & 0)

#define GET_SHIFT_CHAR_BIT(var) ((sizeof(var) >= sizeof(int)) ?
CHAR_BIT : (CHAR_BIT * (sizeof(int)-sizeof(var)+1)))

/* works only with unsigned */
#define SET_MSB_RVALUE(var, msb) ((var &
((GET_TYPE_INT_OR_LARGER_0(var) -1U) >> GET_SHIFT_CHAR_BIT(var))) |
((GET_TYPE_INT_OR_LARGER_0(var) + msb) << (CHAR_BIT*(sizeof(var)-1))))

#define MSB_VAL 0X12U

int main(void)
{
typedef unsigned long long my_ull;

my_ull var2 = (my_ull)-1ULL;
var2 = SET_MSB_RVALUE(var2, MSB_VAL);
printf("%llx\n", var2);

unsigned i = (unsigned)-1U;

i = SET_MSB_RVALUE(i, MSB_VAL);
printf("%x\n", i);

unsigned short s = (unsigned short)-1U;
s = SET_MSB_RVALUE(s, MSB_VAL);
printf("%x\n", s);

unsigned char c = (unsigned char)-1U;
c = SET_MSB_RVALUE(c, MSB_VAL);
printf("%x\n", c);

return 0;
}

You may be interested in the tricks used, to make the macro work for
all unsigned types... from "unsigned char" to "unsigned long long".

James Kuyper · May 3, 2012

On 05/03/2012 05:11 PM, John Reye wrote:
....

So if you use a particular union access member... for a particular
union in memory, then you should always access that particular union-
variable, with the same member?

Is that the only intended use?

Not quite. In C99, the standard was changed to describe how essentially
all real-world implementations handled unions. Reading a different
member than the last one written now sometimes has defined behavior, so
long as it's not a trap representation. Specifically, it results in
reinterpretation of the bits that were set by the write, in accordance
with the representation for the member being read. Since the standard
gives implementations a great deal of freedom with how values of
different types are represented, code which takes advantage of that fact
can never be portable. However, if you're willing to restrict the
portability of your code to platforms where the relevant types are
represented the way that your code assumes they are represented, it
could be useful code - for those platforms.

In particular: is it a misuse of unions, to use them in order to
enable different memory-accesses (word, byte, etc.) to the same
portion of memory???? I have (mis)used it like this in above posts
quite often.
Is that strictly wrong?

Whenever you read a member that is larger than the last one written to,
the bytes that are not part of of the smaller member will have
unspecified values. In that case, it's not sufficient to know how the
different types are represented on this particular implementation of C.
You'll also need to know precisely what that particular compiler does
with such code. The portability of such code is correspondingly more
restricted.

Keith Thompson · May 3, 2012

John Reye said:
Will I really get garbage? I think not. The lowest-addressed byte is
0x12, and the other bytes of mystruct.u.b are what they were before,
right?
A trap? How can that possibly be justified? Why does the C language
have unions then? What is the intended use of unions?

No, the other bytes are not *necessarily* what they were before.

N1570 6.2.6.1p7 (I think it's already been quoted in this thread):

When a value is stored in a member of an object of union type,
the bytes of the object representation that do not correspond to
that member but do correspond to other members take unspecified
values.

If that other member's type has trap representations, then the
unspecified byte values could cause that member to take on a trap
representation.

If you really want to store 0x12 in the lowest-addressed byte of an int
without touching the other bytes, you can do it like this:

*(char*)&int_obj = 0x12;

But that's a somewhat odd thing to do. What goal are you trying to
achieve by storing 0x12 in the lowest-addressed byte of an int? If
you'll tell us that, perhaps we can help you find a better solution.

Keith Thompson · May 3, 2012

Jens Gustedt said:
Am 05/03/2012 09:46 PM, schrieb Keith Thompson:

C11 has anonymous structures and unions (6.7.2.1 p 13)

Interesting.

Here's the paragraph (quoting N1570):

An unnamed member whose type specifier is a structure specifier
with no tag is called an *anonymous structure*; an unnamed member
whose type specifier is a union specifier with no tag is called
an *anonymous union*. The members of an anonymous structure or
union are considered to be members of the containing structure
or union. This applies recursively if the containing structure
or union is also anonymous.

Nobody · May 4, 2012

You forgot the "PDP-endian" case: 16-bit values are stored as
low-byte followed by high-byte, while 32-bit values are stored as
high-word followed by low-word (i.e. 0x01020304 is stored as
0x0102,0x0304 which is stored as 0x02,0x01,0x04,0x03).

For a 4-byte value, there are 24 possible permutations, and a conforming
implementation could use any of them.

John Reye · May 4, 2012

No. That's wrong.
Concerning
i = a = i+1;
Assignment, is not a sequence point.
The above code does not say whether the value of (i+1)
is stored first in (a) or in (i).

This is defined:

for (i = 0; i < sizeof a / sizeof *a; ++i) {
a = i + 1; /* set a to {1, 2, 3, ...} */
}

I believe that is incorrect.
The sequence point only makes a difference for statements that contain
side-effects.
So this is not allowed
a = i++;
because = is no sequence point, and contains side-effects (i++).
See http://c-faq.com/expr/seqpoints.html

But this is very much allowed:
i = a = i+1;
It uses the normal fact assignment is right-associative and that no
side-effects exist.

Also (less importantly): it may be more efficient than doing
a = i+1;
i++;

since I am helping an exessively "simple" compiler understand, that it
need not load i into a register for the 2nd line (to then be
incremented). Instead, the incremented value (i+1) may already be in a
register, and (after the loop... if i is still needed) ready to be
written to the memory-location of i.

James Kuyper · May 4, 2012

No. That's wrong.
Concerning
ï¿½ ï¿½ i = a = i+1;
Assignment, is not a sequence point.
The above code does not say whether the value of (i+1)
is stored first in (a) or in (i).

This is defined:

ï¿½ ï¿½ for (i = 0; i < sizeof a / sizeof *a; ++i) {
ï¿½ ï¿½ ï¿½ ï¿½ a = i + 1; ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ /* set a to {1, 2, 3, ...} */
ï¿½ ï¿½ }

Click to expand...

I believe that is incorrect.
The sequence point only makes a difference for statements that contain
side-effects.
So this is not allowed
a = i++;
because = is no sequence point, and contains side-effects (i++).
See http://c-faq.com/expr/seqpoints.html

But this is very much allowed:
i = a = i+1;
It uses the normal fact assignment is right-associative and that no
side-effects exist.

That statement reflects a common misunderstanding of what "side-effect"
means. You might think that the fact that i=3 sets the value of i to 3
is the main effect; the fact that i=3 is itself an expression with a
value of 3 is less important (it's certainly not a fact that I use very
often). However, what the standard sees things the other way around. It
says:

Accessing a volatile object, modifying an object, modifying a file, or calling a function
that does any of those operations are all _side effects_, (5.1.2.3p2)

Click to expand...

I've used underscores to mark the fact that "side effect" is in italics,
which is the standard's convention for marking this sentence as the
definition of the italicized term. The expression "i = a = i + 1"
modifies two objects: 'i' and a, so it does indeed have side effects.

In C99, the expression "i = a = i + 1" had undefined behavior because
the value of i was both stored by this expression, and read for purposes
other than determining the value to be stored in 'i', without any
intervening sequence points to determine which of those events occurs
first. The exception for determining the value to be stored allows "i =
i + 1", since the value of 'i' must be read in order to calculate the
value of 'i+1' which is to be stored in i.

However, in the case of 'i = a = i+1", the value of 'i' is also read
to determine which element of 'a' gets updated. It's unclear whether the
value of i should be updated before or after determining which element
to update. The C committee could have chosen to specify the sequence,
but they reached the conclusion that such code is inherently confusing,
and therefore should be discouraged. They did this by not merely leaving
the sequence unspecified, but explicitly stating that the behavior of
your entire program is undefined if it uses such code. The standard
never says that you can't do something; it only tells you what you can
and cannot depend upon happening, if you should choose to do something.
In this case, the standard imposes NO requirements on the behavior of a
program containing such code; you can't even count on it being allowed
to compile.

This is not a hard requirement to work around. Just make the desired
sequence crystal clear by writing either:

a = i + 1;
i = a;

or

i = i + 1;
a = i;

whichever one matches what you actually intended i=a= i+1 to do.

The relevant clause in C99 was one of the most confusingly worded parts
of that standard, and provoked a lot of argument. In C2011, I gather
that this whole issue has been more precisely defined by rewording
things in terms of the "visible sequence of side effect" and by
specifying that the evaluation of each expression may be "sequenced
before", "sequenced after", or "indeterminately sequenced" relative to
the evaluation of other expressions. The C2011 wording is greatly
complicated by the fact that it must now deal with the possibility of
multi-threaded code. As a result, I've not had time to review exactly
what the new standard says about such things. The new wording might make
the behavior of this code well-defined; but I doubt it - I think the
committee still believes that such code is inherently confusing, and
therefore should still be discouraged.

Old Wolf · May 7, 2012

A trap? How can that possibly be justified? Why does the C language
have unions then? What is the intended use of unions?

Unions are for saving memory, when you know you
need to store a value of some sort, but the value
could be one of various types. If there were no
unions you'd either have to have a struct (wasting
memory), or use a byte buffer and write some ugly
cast and alignment stuff.

They're not supposed to be a hack to do type conversions,
although that hasn't stopped people trying that over the years

Jens Gustedt · May 7, 2012

Unions are for saving memory, when you know you
need to store a value of some sort, but the value
could be one of various types. If there were no
unions you'd either have to have a struct (wasting
memory), or use a byte buffer and write some ugly
cast and alignment stuff.

They're not supposed to be a hack to do type conversions,
although that hasn't stopped people trying that over the years

This is not hacking this is foreseen as such by the standard:

- Unions have been the only way to guarantee that an object is
correctly aligned for a collection of types. (Since C11 there are also
_Alignas and friends.)
- Other than casting pointers, unions avoid aliasing problems.

Jens

Tim Rentsch · May 8, 2012

Yes. But the you won't be able to use the 'b' element of
the union since reading a different member than has been
set the last time round also invokes undefined behavior.

That's wrong. Reading a member other than the last member
set has defined behavior. In cases where the member read
has a size larger than the last member set (as is the case
here), the value read depends on unspecified values in
some bytes. If those unspecified values cause the object
representation for the type in question to be a trap
representation, then that will cause undefined behavior.
But simply reading a member other than the last member
set is not in and of itself undefined behavior. And the
usual case, ie, reading a member whose size is no larger
than that of the member set, commonly has well-defined
and sometimes useful behavior.

Tim Rentsch · May 8, 2012

John Reye said:
Thanks for the warning about sizeof!! I missed that.

Yes. But the you won't be able to use the 'b' element of
the union since reading a different member than has been
set the last time round also invokes undefined behavior.

Click to expand...

Oh no, please no. Why on earth would that be undefined behaviour?
[snip]

It isn't. On platforms that have trap representations
it is possible for undefined behavior to arise, but
that's because of reading a trap representation, not
because of reading a union member other than the last
one set.

Tim Rentsch · May 8, 2012

John Reye said:
Thanks for the exact C-standard reference.
Still: this is very unsettling.

Would this solve the issue ?? ->

struct
{
char a;
union {
int b;
struct {
char byte0;
char byte1;
char byte2;
char byte3;
};
};
} mystruct;
mystruct.byte0 = 0x12;
int tmp = mystruct.b;
f(b);

If it does solve the issue, and the lowest-address-byte of tmp is
0x12, then:
Ahhh it is not portable. Other ints have only 2 bytes.
Is there a portable way of fixing this??

If for some reason you want to set just the first byte of an
int, it can be done thusly:

union {
char c[ sizeof (int) ];
int i;
} stuff;

stuff.i = <some value>;
stuff.c = <first-byte value>;
... use stuff.i ...

Tim Rentsch · May 8, 2012

Ben Pfaff said:
I believe that is the intention. I couldn't find relevant text
in the Rationale.

If you read the Defect Report (sorry, I don't know the
number offhand) that prompted the famous "type punning"
footnote, I think you'll find that the type punning use
cases were expected and intended all along, eg, even in
C89/C90. I presume they were important to make defined
because a significant amount of pre-ANSI C code relied
on them working.

In particular: is it a misuse of unions, to use them in order to
enable different memory-accesses (word, byte, etc.) to the same
portion of memory???? I have (mis)used it like this in above posts
quite often.
Is that strictly wrong?

Click to expand...

I think there might be an "out" for accessing any type as an
array of unsigned char this way, but I believe that in general
this is strictly wrong. [snip]

How do you reconcile this view with the statement in
clear and plain English about how union member access
works to accomplish type punning? I think anyone who
takes the time to read through the relevant sections
defining the semantics involved will find accessing a
union member is defined irrespective of which member
was last set (although the result may depend on
unspecified values and consequently undefined behavior
because of trap representations, but the access itself
is well-defined).

Tim Rentsch · May 8, 2012

James Kuyper said:
On 05/03/2012 05:11 PM, John Reye wrote:
...

Not quite. In C99, the standard was changed to describe how essentially
all real-world implementations handled unions.

I believe this statement is a mispreresentation of the actual
history; a more accurate statement is that the wording was
changed but the intended behavior remained the same. There
is evidence for this view in the Defect Report that prompted
the footnote accompanying 6.5.2.3 p3 (which is currently #95
in N1570, but the footnote was also present in N1256, #82).

Reading a different
member than the last one written now sometimes has defined behavior, so
long as it's not a trap representation. Specifically, it results in
reinterpretation of the bits that were set by the write, in accordance
with the representation for the member being read. Since the standard
gives implementations a great deal of freedom with how values of
different types are represented, code which takes advantage of that fact
can never be portable. However, if you're willing to restrict the
portability of your code to platforms where the relevant types are
represented the way that your code assumes they are represented, it
could be useful code - for those platforms.

And there are some signifcant cases where the representations
are guaranteed to match, regardless of platform.

Whenever you read a member that is larger than the last one written to,
the bytes that are not part of of the smaller member will have
unspecified values. In that case, it's not sufficient to know how the
different types are represented on this particular implementation of C.
You'll also need to know precisely what that particular compiler does
with such code. The portability of such code is correspondingly more
restricted.

That's right, and another important case might be mentioned.
If the value written was a struct, then any padding bytes
of the struct (and padding bits, presumably) take unspecified
values. So unless what is being read matches up with (or
is a subset of) one of the members of the struct type written,
the same advisories about unspecified values and compiler
dependency apply.

Tim Rentsch · May 8, 2012

So that you can store values of different types in the
same location. But, of course, it hardly makes sense to
expect that after one has written a value of type A in-
to that location to be able to read something of a dif-
ferent type from it. Or do you often put a handkerchief
into a drawer and, when you open it again, it's a rabbit?

Besides the useful cases that are platform dependent,
there are also cases where the same representation
is guaranteed between different types, on all platforms.

Tim Rentsch · May 8, 2012

Old Wolf said:
A trap? How can that possibly be justified? Why does the C language
have unions then? What is the intended use of unions?

Click to expand...

Unions are for saving memory, when you know you
need to store a value of some sort, but the value
could be one of various types. If there were no
unions you'd either have to have a struct (wasting
memory), or use a byte buffer and write some ugly
cast and alignment stuff.

They're not supposed to be a hack to do type conversions,
[snip]

In fact they were, and are. And that's why the Standard
defines how such cases work.

Tim Rentsch · May 8, 2012

pete said:
John said:

1)
a = i++;

http://c-faq.com/expr/evalorder1.html

So this attempt at optimisation is wrong:

int i;
int a[10];
for (i = 0; i < sizeof(a); )
a = i++; /* set a to {1, 2, 3, ...} */

Correctly optimized, it should be:
for (i = 0; i < sizeof(a); )
i = a = i+1; /* set a to {1, 2, 3, ...} */

Click to expand...

No. That's wrong.
Concerning
i = a = i+1;
Assignment, is not a sequence point.
The above code does not say whether the value of (i+1)
is stored first in (a) or in (i). [snip alternative]

The behavior of

i = a = i+1;

is well-defined. I refer you to 6.5.16 p3 in N1570, specifically
the second-to-last sentence. By the time the store into 'i'
commences, the value (lvalue) computation of 'a' has already
completed, because it is sequenced before delivering the value
to be stored into 'i'.

The stores to 'i' and 'a' may happen in any order, but the
lvalue 'a' must be determined before any store into 'i'.

setjmp and undefined behaviour	2	Nov 30, 2013
Adding adressing of IPv6 to program	1	Feb 16, 2023
Bounds Checking as Undefined Behaviour?	29	Jul 29, 2010
Undefined behaviour ?	11	May 6, 2009
Array of structs function pointer	10	Jul 16, 2023
strange behaviour of realloc()	3	May 23, 2013
Problem with a login script, SESSION user rights and put this together so it works with the other pages and MySQL. Code examples.	2	May 5, 2023
Defined and undefined C pointer manipulation	34	Apr 26, 2014

List of undefined behaviour and other sneeky bugs

Jens Gustedt

John Reye

Ben Pfaff

John Reye

James Kuyper

Keith Thompson

Keith Thompson

Nobody

John Reye

James Kuyper

Old Wolf

Jens Gustedt

Tim Rentsch

Tim Rentsch

Tim Rentsch

Tim Rentsch

Tim Rentsch

Tim Rentsch

Tim Rentsch

Tim Rentsch

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads