Seebs
>> [...]I recently spent quite a while tracking down a bug.
>> What finally revealed to me what was happening was:
>
> We want to know details.
Okay, this one is a bit off-topic, but part of the curse of C is that
sometimes what's going wrong ISN'T portable.
What was submitted to me was a bug report that some half-baked test of
POSIX functionality was failing on some embedded targets running Linux.
There's a bit of a spoiler in the previous post, but I'll just present this
as is.
Okay, the basic test as written was that the application creates a bunch
of stuff, including malloced memory, static variables, environment variables,
and so on. Then it calls fork().
DIGRESSION: fork() is a UNIXism. What you need to know is that
this is the Unixy way to make new processes. After a successful
call to fork(), you have two copies of your program. It doesn't
load a different program; it just duplicates the current running
program's memory space and state. The system call returns 0 if
you're in the child process, and the new process's process ID (which
is greater than zero) if you're in the parent. (Or -1 if it failed.)
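A minimal sketch of the idiom, for the non-Unix folks. This is not the
test in question, just the shape of a fork() call:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");     /* failed; no child exists */
        return 1;
    }
    if (pid == 0) {
        /* child: runs with a copy of the parent's memory and state */
        printf("child: fork() returned 0\n");
        _exit(0);
    }
    /* parent: fork() returned the child's process ID */
    printf("parent: child is pid %ld\n", (long) pid);
    wait(NULL);             /* wait for the child to exit */
    return 0;
}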
After this, the child process then checks that all the values are the same
as they were in the parent. Well. The test of the malloced object fails.
I asked the testers for more detail, and they reported that if they ran
the compiler on this test case directly, it worked. So I got asked to stare
at it some more.
Lemme quote you, approximately, the code in question. I quote this because
it's totally irrelevant, but it's just awkward enough to make you wonder
whether it's really irrelevant:
void *malloced;
malloced = (void *) malloc(sysconf(_SC_PAGESIZE));
*(double *) malloced = 2.3;
[...]
if (*(double *) malloced != 2.3) {
        /* error here */
}
(sysconf(_SC_PAGESIZE) turns out to be 4096).
Okay, was there a prototype for malloc in scope? Yes there was. Was
sysconf(_SC_PAGESIZE) yielding meaningful results? Yup. Okay, so next
up I came up with "some stupid floating point exactness bug", so I tried
2.125, and later 3.
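Why those values? 2.3 has no exact binary representation, while 2.125
(that's 17/8) and 3 do, so switching to them rules out rounding. If you're
curious, %a makes the difference visible; a quick sketch, with the output
shown as glibc formats it:

#include <stdio.h>

int main(void) {
    /* %a prints a double's exact bits as a hexadecimal float */
    printf("%a\n", 2.3);   /* 0x1.2666666666666p+1: rounded repeating fraction */
    printf("%a\n", 2.125); /* 0x1.1p+1: exact */
    printf("%a\n", 3.0);   /* 0x1.8p+1: exact */
    return 0;
}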
I extracted the test program's build instructions (not trivial, the build
script hid them), extracted the other files it #included, and so on.
Eventually got a reproducer. After a bit more testing, I got to the
following conclusion: It was "-O2" breaking it. If I compiled without
optimization, it ran fine.
Okay, now what?
1. Make "malloced" a double *, allocate sizeof(double) space for it.
No change.
2. Change it to 3.0 instead of 2.3. No change.
3. printf("%f\n", *malloced) => yields 3.000000.
4. ... ooookay, then. Let's try printing the value "*malloced != 3.0".
Sure enough, that's 1.
5. Precision problems? Print %a. Get 0x1.8p+1, which is exactly 3.0. Well, that didn't help.
So, after a bit more, I finally decide to try something crazy:
6. printf("%f\n", 3.0);
That yields... 0.000000. AH-HAH!
Now you see the marble in the oatmeal. We've been assuming that the malloced
memory changed value somehow, but what if the thing it was COMPARED to changed
value? Well. At -O2, it turns out, the compiler notices that there's more
than one instance of "3.0" in the source code, and combines them into a
single constant, which it keeps in a floating point register.
The actual problem:
Apparently, under some circumstances, a program which does not contain any
identifiable floating point operations does not get marked as needing its
floating point registers saved on process context switch, or properly
copied into a child process. So here, the optimizer had quietly made a
program with no real floating point operations depend on a register the
kernel didn't realize it needed to save. I hand the bug to the kernel team
with my analysis, and a day or so later someone backports a fix from
another kernel branch that handles this case.
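For the record, here's roughly the shape the whole thing reduces to. This
is my reconstruction, not the original test; build it with -O2 so the
compiler pools the two 3.0 literals, and on a fixed kernel it prints
nothing:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    double *malloced = malloc(sizeof *malloced);
    if (malloced == NULL)
        return 1;
    *malloced = 3.0;        /* first use of the literal 3.0 */
    if (fork() == 0) {
        /* child: at -O2 the compiler may reuse the 3.0 it already has
         * in a floating point register rather than reloading it; if the
         * kernel didn't carry that register into the child, this test
         * sees garbage even though the malloced memory is intact */
        if (*malloced != 3.0)
            printf("mismatch: %f vs %f\n", *malloced, 3.0);
        _exit(0);
    }
    wait(NULL);
    return 0;
}

And if it does fire, the second %f prints 0.000000 for the same reason the
comparison failed: that 3.0 is coming out of the clobbered register too.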
This isn't especially C-related, but it gets at the kind of thing I find
to be one of the harder areas of C debugging: learning what *kinds* of
things are happening once optimizers are involved, and also remembering
that on a modern machine, the chances are quite good that your program is
removed from the CPU entirely and put back several times a second, or
more, during ordinary operation.
We tend to have this view of "the state of the machine", but it's a convenient
fiction. There is no machine, there is no state. Heck, modern x86 CPUs
don't actually execute x86 code, in general; they translate it at execution
time into a completely different stream of micro-operations so they can
execute it out of order.
As always, there is a tension between the drive to know how things really
work, and the need to remember that in practice you don't really know how
they work, because the model is still just a fiction you made up to let
you reason about something.
-s