Please disprove this Double-Checked Locking "fix"

G

Gerhard Fiedler

Joshua said:
That may be how they're commonly implemented, but that's not the
guaranteed semantics. Two different mutexes may as a matter of fact on
a given implementation give "happens-before" effects between the two
different mutexes, but there's nothing guaranteed about it.

...

With that, it sees:
someMutex.lock();
<blah1>
someMutex.unlock();
<blah2>
someMutex.lock();
<blah3>
someMutex.unlock();

The compiler sees a lock, unlock, lock, unlock, in straightline code,
without branching (or exceptions, or volatile (to keep signal handlers
correct)). The compiler is totally free to replace that with:
someMutex.lock();
<blah1>
<blah2>
<blah3>
someMutex.unlock();

This is exactly what I don't understand -- I guess I have to look deeper
into the different mutex/lock implementations and their guarantees. I
was under the impression that through the lock/unlock pairs, it is
and that <blah3> happens said:
Now, back to my original much more controversial statement - can a
compiler simply remove a lock unlock pair? Ex:
mutex.lock();
mutex.unlock();
Maybe. I mentioned "clever ruse" with whole program optimization in
mind. (However, upon thinking about it, I just showed that you don't
even need whole program optimization.) Without whole program
optimization, I think no. Could someone please more educated weigh
in?

Thus far, after 10 minutes of attempts just now to write a conforming
race-free program where you could tell the difference if a compiler
simply removed an "empty" mutex acquire release pair, the only
programs I can find are ones that would deadlock before the change,
and not deadlock after the change. A deadlock is observable behavior,
so a compiler cannot remove it for that reason.

Which would mean that if the compiler wants to remove it, it has to be
able to prove that it can't deadlock. This would be a useful diagnostic
output :)

Thanks,
Gerhard
 
J

Joshua Maurice

This is exactly what I don't understand -- I guess I have to look deeper
into the different mutex/lock implementations and their guarantees. I
was under the impression that through the lock/unlock pairs, it is
guaranteed that <blah1> happens before <blah2>, and that <blah3> happens
after it.

Unless I'm totally off, this is a quite straightforward
transformation. It may be bad from a QoI perspective depending on the
particulars of the code, and it may be a bad optimization depending on
the particulars, but it is definitely allowed by the C++0x memory
model (and Java 1.5+ memory model) AFAIK.

You said that "I was under the impression that [...] it is guaranteed
that <blah1> happens before <blah2>, [...]". Why is that? More
particularly, what does that mean?

Let me try to explain. Taking a step back, consider:
int a = 1;
int b = 2;
global_c = a + b;
global_d = a - b;

We have no guarantees that the assignment to global_c happens before
global_d. Why? This is equivalent to the phrasing that no conforming
program could tell the difference if in fact the compiler "switched"
the two. It's the basic "as-if" rule.

Going back to:
someMutex.lock();
<blah1>
someMutex.unlock();
<blah2>
someMutex.lock();
<blah3>
someMutex.unlock();

Using the "as-if" rule, the compiler is totally free to transform that
to the following, under some rather specific rules.
someMutex.lock();
<blah1>
<blah2>
<blah3>
someMutex.unlock();

It can only do so under some very specific situations. The compiler
has to be able to guarantee that no control flow path goes out of that
block except the normal flow past the end. (Also volatile and signal
handlers might be annoying, so let's ignore that for now.) Now, think
of a program that could possibly tell the difference between those two
code fragments? AFAIK, you could not. Because of the vagueries of
scheduling, no conforming program would have its behavior changed in a
way that would produce results not allowed by the memory model, and
thus the compiler is entirely free to do that under the "as-if" rule.

To put that another way, the difference between the two programs is
that another thread might acquire the mutex when the first thread is
in <blah2>, and thus see <blah1> without being guaranteed to see
<blah2> nor <blah3>. That is, it is allowed for the compiler and
scheduler to do so. It is not required. It is perfectly allowed for a
"malicious" scheduler and implementation to guarantee that it will /
never/ allow another thread to acquire that mutex, forcing it to wait,
while a thread is in <blah2>. Your conforming program could not tell
the difference because you're not guaranteed that some other thread
will be scheduled at just the right time to acquire the mutex when the
first thread is in <blah2>.

As I said earlier, there are somewhat severe restrictions on when a
compiler could do this. If exceptions can be thrown, or early returns,
and such, the compiler has to be very careful to guarantee the same
"as-if" behavior. Also, gotos or loops could mess this up, especially
badly in light of QoI. IIRC, there's some fluff in the C++0x standard
that things must become visible to other threads "eventually", or even
"in a reasonable amount of time", and probably (?) something similar
along the lines that a thread waiting on a mutex will "eventually"
acquire it, or even "in a reasonable amount of time". If you combine
those two sections like that, you might have just broken the
"eventually" or "in a reasonable amount of time" guarantees.

Thus, as I mentioned earlier, as <blah1> and <blah2> were empty in the
earlier example, none of that can happen - it's safe to combine the
critical sections. And it's a clear win to combine the empty critical
section with the non-empty one. After that, the compiler can apply
another variation of the "as-if" rule to move the assignment to the
local variable from outside the critical section to inside the
critical section. Finally, it's free to use the normal single-thread
rules to remove the unnecessarily local variable.
 
G

Gerhard Fiedler

Joshua said:
Unless I'm totally off, this is a quite straightforward
transformation. It may be bad from a QoI perspective depending on the
particulars of the code, and it may be a bad optimization depending
on the particulars, but it is definitely allowed by the C++0x memory
model (and Java 1.5+ memory model) AFAIK.

FWIW, I was thinking of C++03 with pthread (GCC) or Windows (VC++)
threading models.
You said that "I was under the impression that [...] it is guaranteed
that <blah1> happens before <blah2>, [...]". Why is that? More
particularly, what does that mean?

Let me try to explain. Taking a step back, consider:
int a = 1;
int b = 2;
global_c = a + b;
global_d = a - b;

We have no guarantees that the assignment to global_c happens before
global_d. Why? This is equivalent to the phrasing that no conforming
program could tell the difference if in fact the compiler "switched"
the two. It's the basic "as-if" rule.

Understood and agreed -- without locks. I was under the impression that
locks change this, by adding "observable" points to the program.
Going back to:
someMutex.lock();
<blah1>
someMutex.unlock();
<blah2>
someMutex.lock();
<blah3>
someMutex.unlock();

Using the "as-if" rule, the compiler is totally free to transform that
to the following, under some rather specific rules.
someMutex.lock();
<blah1>
<blah2>
<blah3>
someMutex.unlock();

It can only do so under some very specific situations. The compiler
has to be able to guarantee that no control flow path goes out of that
block except the normal flow past the end. (Also volatile and signal
handlers might be annoying, so let's ignore that for now.) Now, think
of a program that could possibly tell the difference between those two
code fragments? AFAIK, you could not.

How about the typical situation: short-running, locked synchronizing
code with long-running, unlocked working code? blah1 (retrieving tasks
out of a shared queue) and blah3 (placing results into a shared queue)
each a few milliseconds, blah2 (calculating the results) a few hours or
days, and a few threads doing this in parallel. I'm sure I can tell the
difference between the two versions.
Because of the vagueries of scheduling, no conforming program would
have its behavior changed in a way that would produce results not
allowed by the memory model, and thus the compiler is entirely free
to do that under the "as-if" rule.

I understand that this may be a QoI issue, but then again, if I can't
guarantee that the two days that blah2 runs are in fact unlocked, I
really can't use this system for this task. QoI of the compiler here
becomes QoI of the application, and if the compiler is free to do this,
it doesn't allow me to guarantee my application spec -- which I assume
to be a typical threaded application.
To put that another way, the difference between the two programs is
that another thread might acquire the mutex when the first thread is
in <blah2>, and thus see <blah1> without being guaranteed to see
<blah2> nor <blah3>. That is, it is allowed for the compiler and
scheduler to do so. It is not required.

In my example, which is nothing strange, it is required, and to
implement it, I need to use a compiler with a threading/memory model
that guarantees me that blah2 runs unlocked, so that it can run in
parallel. This of course if the scheduler allows it to do so -- but
since the compiler doesn't control the scheduler, it can't really assume
much about it and IMO needs to unlock after blah1, even though it
doesn't know what the scheduler does with this. Whether it makes sense
to unlock and relock around blah2 /I/ need to decide when writing the
program, considering the particularities of the scheduler I'm intending
to run it on.
IIRC, there's some fluff in the C++0x standard that things must become
visible to other threads "eventually", or even "in a reasonable
amount of time", and probably (?) something similar along the lines
that a thread waiting on a mutex will "eventually" acquire it, or
even "in a reasonable amount of time". If you combine those two
sections like that, you might have just broken the "eventually" or
"in a reasonable amount of time" guarantees.

This doesn't sound good :) This relates to my earlier comment that I
think I need to look into the guarantees that the individual
implementations provide -- if this here reflects the C++03 guarantees,
IMO they alone are not very useful in practice. To go back to the
example, I need to have the guarantee that blah2 runs unlocked (within
the overall scheduling granularity of the target system, of course).

Thanks,
Gerhard
 
J

Joshua Maurice

FWIW, I was thinking of C++03 with pthread (GCC) or Windows (VC++)
threading models.

For most intents and purposes, the memory models of C++0x, POSIX
pthreads, win32, and Java 1.5 and higher, are the same. There's some
important differences between them (like Java "final" keyword, and the
Java "out of thin air" guarantee), but for the basics like visibility
guarantees w.r.t. mutexes, it's the same all around.
You said that "I was under the impression that [...] it is guaranteed
that <blah1> happens before <blah2>, [...]". Why is that? More
particularly, what does that mean?
Let me try to explain. Taking a step back, consider:
  int a = 1;
  int b = 2;
  global_c = a + b;
  global_d = a - b;
We have no guarantees that the assignment to global_c happens before
global_d. Why? This is equivalent to the phrasing that no conforming
program could tell the difference if in fact the compiler "switched"
the two. It's the basic "as-if" rule.

Understood and agreed -- without locks. I was under the impression that
locks change this, by adding "observable" points to the program.

With multi-threading, the program no longer has exactly one possible
allowed execution. (In practice, I would guess most single threaded
programs have more than one allowed execution given things like
implementation defined behavior, and various QoI and system specs like
stack size, heap space, and so on.)

With multi-threading, this changes. No longer are you guaranteed a
particular execution. Instead, there is a set of allowed executions.
The set of allowed executions is quite annoying and complex to define
formally, but you are right that locks restrict this set. Some
executions are not allowed because they violate some of the ordering
rules of the memory model, such as the ordering rules required by
locks. However, you are still left with a rather large set of allowed
executions, and a conforming implementation is free to choose to do
whichever it wants.
How about the typical situation: short-running, locked synchronizing
code with long-running, unlocked working code? blah1 (retrieving tasks
out of a shared queue) and blah3 (placing results into a shared queue)
each a few milliseconds, blah2 (calculating the results) a few hours or
days, and a few threads doing this in parallel. I'm sure I can tell the
difference between the two versions.

I was using an idiomatic meaning of "tell the difference". Of course
one could tell the difference in practice between if a compiler does
tail-call recursion optimization and not, but a conforming program
would not have its formal observable behavior changed. That's what I
meant by "You couldn't tell the difference". I meant that a conforming
program with or without tail call recursion would still have its
formal observable behavior be one of the allowed behaviors.

Similarly, if the compiler does something "bad" like combine critical
sections when it shouldn't, you could tell the difference in practice,
just like you could tell the difference in practice between tail-call
recursion optimization and no tail-call recursion optimization.
Whether the compiler does or does not do those things is QoI. The
formal behavior with or without combining the critical sections is one
of the allowed behaviors.
I understand that this may be a QoI issue, but then again, if I can't
guarantee that the two days that blah2 runs are in fact unlocked, I
really can't use this system for this task.

Then you have a QoI issue. By the same token, you would have QoI
issues if the compiler otherwise did something else stupid. However,
combining or not combining those sections results in a formally
allowed observable behavior, and thus the compiler is allowed to do
it.
QoI of the compiler here
becomes QoI of the application, and if the compiler is free to do this,
it doesn't allow me to guarantee my application spec -- which I assume
to be a typical threaded application.

Then harass your compiler writer to fix it, or the scheduler writer to
fix it, or use a different compiler or scheduler. There are so many
ways that the implementation can be malicious. This is just one of
them.
In my example, which is nothing strange, it is required, and to
implement it, I need to use a compiler with a threading/memory model
that guarantees me that blah2 runs unlocked, so that it can run in
parallel.

You also need a compiler that properly does data flow analysis
optimizations, constant folding, constant propagation, common sub-
expression elimination, and so on. You also need a compiler that
doesn't simply produce a maliciously convoluted object file.

It's basically impossible for any general purpose portable memory
model to do as you want. How could you possibly specify that to work
with different kinds of scheduler, unknown schedulers that aren't even
written yet? It depends too much on the vagaries of the scheduler and
other such things. I wouldn't worry about it. At least, I would worry
about it as much as I would worry about stack sizes in programs. Most
of the time you don't need to. Sometimes you do, and you need to check
the outputted assembly.

Also, if you did do that, completely disallow combining nearby
critical sections, then you just removed some optimization
opportunities for IMHO no particularly good reason. I want the
compiler to be able to optimize the OP's problem by combining an empty
critical section with a nearby non-empty one.
This of course if the scheduler allows it to do so -- but
since the compiler doesn't control the scheduler, it can't really assume
much about it and IMO needs to unlock after blah1, even though it
doesn't know what the scheduler does with this. Whether it makes sense
to unlock and relock around blah2 /I/ need to decide when writing the
program, considering the particularities of the scheduler I'm intending
to run it on.

Again, perhaps it's bad QoI, and the compiler ought to fix it, but it
would result in an allowed formal observable behavior, an allowed
execution, and thus the compiler would be free to do it under the "as
if" rule.
This doesn't sound good :)  This relates to my earlier comment that I
think I need to look into the guarantees that the individual
implementations provide -- if this here reflects the C++03 guarantees,
IMO they alone are not very useful in practice.

C++03 doesn't have any relevant guarantees. AFAIK, this here reflects
all of the usual memory models (C++0x, POSIX pthreads, Java 1.5+,
win32).
To go back to the
example, I need to have the guarantee that blah2 runs unlocked (within
the overall scheduling granularity of the target system, of course).

As I said earlier, good luck attempting to write a portable
specification to that end, and you don't really "need" it. You need it
as much as you need more than 10 bytes of stack space, and you need it
as much as you need a compiler that won't maliciously pessimize your
programs. If you need more than that, you need something more than
what a portable standard can give you, and you need to check / write
the assembly yourself or maybe use some realtime POSIX extensions.
 
G

Gerhard Fiedler

Joshua said:
As I said earlier, good luck attempting to write a portable
specification to that end, and you don't really "need" it. You need
it as much as you need more than 10 bytes of stack space, and you
need it as much as you need a compiler that won't maliciously
pessimize your programs. If you need more than that, you need
something more than what a portable standard can give you, and you
need to check / write the assembly yourself or maybe use some
realtime POSIX extensions.

Check out this document
<http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.9908> about
the pthread guarantees. I haven't yet read (and understood) all of it,
but it seems to me that in proves that both lock and unlock require
memory fences (albeit different ones), and that it is not allowed to
reorder memory operations across locks (and only in a specific direction
across unlocks).

I interpret this such that a compiler (that claims to implement the
pthread spec) is not allowed to remove a lock operation -- even if
there's nothing between the lock and the following unlock, and even if
it is to combine two lock/unlock pairs --, as this would remove a memory
fence and allow operations to be reordered across a code position across
which they weren't allowed to be reordered before removing the lock
operation.

Gerhard
 
J

Joshua Maurice

Check out this document
<http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.9908> about
the pthread guarantees. I haven't yet read (and understood) all of it,
but it seems to me that in proves that both lock and unlock require
memory fences (albeit different ones), and that it is not allowed to
reorder memory operations across locks (and only in a specific direction
across unlocks).

I interpret this such that a compiler (that claims to implement the
pthread spec) is not allowed to remove a lock operation -- even if
there's nothing between the lock and the following unlock, and even if
it is to combine two lock/unlock pairs --, as this would remove a memory
fence and allow operations to be reordered across a code position across
which they weren't allowed to be reordered before removing the lock
operation.

At first I was slightly concerned that I didn't know what was going
on, but then I found this in the paper:

Finally, we show that the rules for reordering memory op-
erations across locking primitives are in fact different from
what we believe most implementors would have expected.
In particular, a memory operation immediately following a
pthread mutex unlock operation may always be moved to just
before the pthread mutex unlock. But the corresponding re-
ordering of a pthread mutex lock with a preceding memory op-
eration is not generally safe. More general movement into critical
sections is possible in the absence of the pthread mutex trylock
call.
As we point out in section 3, many current implementations ei-
ther do not follow these rules, or add extra fences. Nor is it com-
pletely apparent that they should, since these semantics appear to
have been largely accidental. On the other hand, an understanding
of these issues does seem to be essential to any revision of the cur-
rent rules.

Ah yes, I knew about this. Reading further confirms that the problem
is the one I'm thinking of. It's basically a bug in the POSIX pthreads
standard. The POSIX pthreads standard as written requires that a
try_lock succeed if no one else is holding the lock. This has some
rather interesting implications which the paper lays out. The
implications are that locks require double the "usually required"
fences, or other contortions such as the paper describes (specifically
disallow code motion past a lock).

Also, as the paper notes, almost all of the pthreads implementers
overlooked this bug, and if you write a conforming program that abuses
try_lock in interesting ways, it'll produce an unallowed execution on
nearly all pthreads implementations (colloquially "break").

C++0x fixes this problem by allowing try_lock to spuriously fail, that
is fail to acquire the lock even though no one else holds it. (AFAIK),
it is not the intent that implementers will make try_lock spuriously
fail, but programs must assume that it can spuriously fail. This
restricts the set of formally race free programs to ones that allow
the intended implementation of locks, and allow code motion past a
lock. Thus, C++0x avoids this nasty little problem.

Note that their "Theorem 6.6" proves that you can indeed move
operations past a lock /if/ there's no (broken) trylocks. The problems
only arise with (broken) trylocks, which C++0x fixes.

----

However, that is immaterial to the problem at hand. Even if you
couldn't do the code transformations that I was discussing due to the
particulars of the situation, you seem to have a more fundamental
problem. The fundamental problem is you don't believe in the "as if"
rule. C++ spells it out quite clearly with "C++03, 1.9 Program
execution / 1". In a C++0x draft n3126, it's still "1.9 Program
execution / 1", reproduced here:

1 The semantic descriptions in this International Standard define a
parameterized nondeterministic abstract
machine. This International Standard places no requirement on the
structure of conforming implementations.
In particular, they need not copy or emulate the structure of the
abstract machine. Rather, conforming
implementations are required to emulate (only) the observable behavior
of the abstract machine as explained
below. (5)

Footnote 5:
5) This provision is sometimes called the “as-if†rule, because an
implementation is free to disregard any requirement of this
International Standard as long as the result is as if the requirement
had been obeyed, as far as can be determined from the
observable behavior of the program. For instance, an actual
implementation need not evaluate part of an expression if it can
deduce that its value is not used and that no side effects affecting the
observable behavior of the program are produced.

Also, see C++0x draft n3126, "1.9 Program execution / 5":
5 A conforming implementation executing a well-formed program shall
produce the same observable behavior
as one of the possible executions of the corresponding instance of the
abstract machine with the same program
and the same input. However, if any such execution contains an
undefined operation, this International
Standard places no requirement on the implementation executing that
program with that input (not even
with regard to operations preceding the ï¬rst undefined operation).

I am unable to find something as explicitly clear as that in the POSIX
standards offhand, but I assume it's in there somewhere. At the very
least, I can find plenty of occurrences of the phrase "as if" in the
specs for the individual threading functions, such as in:
http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html
The implementation shall behave as if at all times there is at most
one owner of any mutex.
Also, I'm pretty sure that's how all the C, C++, Java, etc.,
implementers interpret it.

This is the general spirit of all of the C++-relevant memory models,
including win32, pthreads, and C++0x. (This is also true for the Java
memory model.) It's also true for C and C++ programs in all ways - not
just threading. An implementation is free to run your program however
it wants as long as it produces one of the allowed executions. It
doesn't even need to be the same execution each time. For non-trivial
threaded programs, it's highly likely that it will be a different
execution each time.

Again, I understand your concerns, but compilers, linkers, schedulers,
hardware etc., will continue to optimize, and the rule that lets them
optimize is the "as if" rule. They optimize to make your program run
better. That's their job. Their job is to take your program, transform
the assembly of it (or whatever code they work with) to make it
better. Threads are no exception.

Sometimes the optimization is broken and produces an unallowed
execution, but that's different than what we're talking about here.
We're talking about an optimization that produces an allowed
execution, but the execution is undesirable or unacceptable to the
programmer. Such is life writing real C and C++ programs. If your
implementation ever does such a broken optimization, thus I suggest
you complain to your implementers, switch implementations, or write
the relevant parts in assembly (and hope that the problem is in the
compiler, as the linker, scheduler, hardware, etc., can still get
you).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,141
Messages
2,570,818
Members
47,367
Latest member
mahdiharooniir

Latest Threads

Top