Relaxed ordering is intended to have minimal overhead on all systems, so
it provides no ordering guarantees. On systems that always provide these
ordering guarantees anyway, putting memory_order_acquire on the fetch_add
probably costs little. On systems that truly exhibit relaxed ordering,
requiring that the relaxed fetch_add participate in the release sequence
could add considerable overhead.
Consider my example above on a distributed system where the processors
are conceptually "a long way" apart, and data synchronization is
explicit.
With the current WP, processor 2 only needs to synchronize access to
y. If the relaxed op were part of the release sequence, it would also
need to handle the synchronization data for x, so that processor 3 got
the "right" values for x and y.
In your example, yes, one has to use a non-relaxed rmw. But consider the
following example:
#include <atomic>
#include <thread>

struct object
{
    std::atomic<int> rc;
    int data;

    void acquire()
    {
        // relaxed increment: does not participate in the release
        // sequence under the current draft wording
        rc.fetch_add(1, std::memory_order_relaxed);
    }

    void release()
    {
        if (1 == rc.fetch_sub(1, std::memory_order_release))
        {
            std::atomic_thread_fence(std::memory_order_acquire);
            data = 0;
            delete this;
        }
    }
};
object* g_obj;

void thread1();
void thread2();
void thread3();

int main()
{
    g_obj = new object;
    g_obj->data = 1;
    g_obj->rc = 3; // one reference per thread
    std::thread th1(&thread1);
    std::thread th2(&thread2);
    std::thread th3(&thread3);
    th1.join();
    th2.join();
    th3.join();
}
void thread1()
{
    volatile int data = g_obj->data;
    g_obj->release(); // T1-1
}

void thread2()
{
    g_obj->acquire(); // T2-1
    g_obj->release(); // T2-2
    g_obj->release(); // T2-3
}

void thread3()
{
    g_obj->release(); // T3-1
}
From the point of view of the current C++0x draft this code contains a
race on g_obj->data. But I think this code is perfectly legal from the
hardware point of view.
Consider the following order of execution:
T1-1
T2-1 - here the release sequence is broken, because of the relaxed rmw
T2-2 - but here the release sequence is effectively "resurrected from
the dead", because the thread which executed the relaxed rmw now
executes a non-relaxed rmw
T2-3
T3-1
So I think that T1-1 must 'synchronize-with' T3-1.
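For comparison, this is roughly what one has to write today, since under
the current draft wording only a non-relaxed rmw keeps the release
sequence intact. A minimal sketch of acquire() with the rest of the class
unchanged; the choice of memory_order_acquire for the increment is just
one possible non-relaxed ordering, following the remark above about
putting memory_order_acquire on the fetch_add:

    void acquire()
    {
        // non-relaxed rmw: stays within the release sequence under the
        // current draft wording, at the cost of extra ordering on
        // weakly ordered hardware
        rc.fetch_add(1, std::memory_order_acquire);
    }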
A formal definition would be something like this:
A release sequence on an atomic object M is a maximal contiguous sub-
sequence of side effects in the modification order of M, where the
first operation is a release, and every subsequent operation
— is performed by the same thread that performed the release, or
— is a non-relaxed atomic read-modify-write operation, or
— is a *relaxed* atomic read-modify-write operation.
A loaded release sequence on an atomic object M with respect to an
evaluation A is the part of the release sequence from its beginning up
to and including the side effect whose value is loaded by evaluation A.
An evaluation A that performs a release operation on an object M
synchronizes with an evaluation B that performs an acquire operation
on M and reads a value written by any side effect in the release
sequence headed by A, *if* for every relaxed rmw operation in the loaded
release sequence there is a subsequent non-relaxed rmw operation in the
loaded release sequence executed by the same thread.
More precisely: *if* for every relaxed rmw operation (executed by a
thread other than the one that performed the release)...
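To illustrate the intent outside the reference-counting example, here is
a minimal sketch (the names thread_a/thread_b/thread_c and the counter m
are purely illustrative, not taken from the example above). Thread A
heads a release sequence, thread B first does a relaxed rmw and then, in
the same thread, a non-relaxed rmw, and thread C does an acquire load.
If I read the wordings correctly, under the proposed definition the
assert cannot fire whenever thread C reads the value 3, while under the
current draft wording this is not guaranteed for the modification order
where B's relaxed increment falls between A's release increment and B's
acq_rel increment.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> m(0);
    int payload = 0;

    void thread_a()
    {
        payload = 42;
        m.fetch_add(1, std::memory_order_release); // heads a release sequence
    }

    void thread_b()
    {
        m.fetch_add(1, std::memory_order_relaxed); // breaks the sequence under the current wording
        m.fetch_add(1, std::memory_order_acq_rel); // same thread: "resurrects" it under the proposal
    }

    void thread_c()
    {
        // value 3 can only be written by the last of the three increments
        // in the modification order of m
        if (3 == m.load(std::memory_order_acquire))
            assert(42 == payload); // guaranteed under the proposed wording
    }

    int main()
    {
        std::thread a(&thread_a), b(&thread_b), c(&thread_c);
        a.join();
        b.join();
        c.join();
    }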
I'm trying to make the definitions more "permissive", thus making more
usage patterns that are correct from the hardware point of view legal
from the C++0x point of view.
What do you think?
Dmitriy V'jukov