question re. usage of "static" within static member functions of a class


Chris M. Thomasson

[...]
DCL is the most common application I see of this flawed reasoning. The
heart of DCL is that one relies on investigating the possible
interleavings of source code "instructions" to see if it will work
out. IMHO, as soon as you modify DCL to be correct in C or C++, it is
no longer DCL.

What modifications of the DCL algorithm itself are needed in order to get it
to work in C/C++? I don't have to modify the actual algorithm in order to
get it to work in C/C++. I only have to implement the algorithm using highly
non-portable methods.



You have fundamentally changed the basis of your belief
of its correctness when you add in proper synchronization. No longer
are you doing a single check outside of all synchronization,
the hallmark of DCL. It only superficially resembles DCL.

Unless I am misunderstanding you, I have to disagree here. The algorithm
needs proper synchronization on the fast-path and on the slow-path. On the
fast-path it needs a data-dependent load memory barrier, and on the slow
path it needs the mutex along with release semantics. There is no "single
check outside of all synchronization". AFAICT, that's a misunderstanding on
how the algorithm actually works. Skipping the mutex acquisition/release is
the only thing that DCL actually optimizes. Just because you can skip a call
to the mutex (e.g., fast-path) does not mean that you're somehow "outside of
all synchronization".



Now, arguably, the disagreement between your post, Jerry, and mine is
merely one over the definition of terms. I hope I've generated more light
than heat, but I don't think
it will be useful to get into a discussion of whether DCL is by
definition broken or merely usually incorrectly implemented. Define it
however you want. I still think that DCL by definition is incorrect in
C and C++, and modifications to make it correct render it no longer
DCL.

DCL is DCL. Even if I implement it in pure ASM, I still cannot get away from
using proper synchronization. If I implement DCL in C/C++ using
platform/compiler specific guarantees, well, it's still DCL, not something
different.
 

Chris M. Thomasson

[snip probably correct impl]
I should probably just shut up and take my losses. I am still right
that there is no mutex on windows without runtime init, but your
workaround, which Boost also uses to do boost_once IIRC, is quite
good.

Ahhh. I forgot that Boost does something like that. It works, but it's kind
of "hackish"...

;^)



I forget which versions, but Windows also has its own native
version of pthread_once, so you could also use that.

It's on Windows Vista and higher.



Also, as you
mentioned, volatile would work, for whichever VS versions and
platforms compile it into barriers.

Yes.
 

Francesco

On 10 Sep, 13:04, James Kanze <[email protected]> wrote:

     [...]
Uh, well, I've read your code but I didn't think about the
fact that ourInstance had to be initialized before being
compared to NULL - in other words, I failed to catch the
subtlety.

You missed the fact that the function used to initialize the
variable uses the previous value of the variable.

In fact, it's a clever trick, with both the positive and
negative implications of "clever". The suggestion elsewhere in this thread
of using a separate boolean variable to accomplish this achieves
exactly the same results, and is doubtlessly more readable.
Can I rely on this default, hidden initialization for static
members of any type?

For any variable.  The standard describes the initialization of
objects with static lifetime as occurring in three phases: zero
initialization, static initialization and dynamic
initialization.  In practice, there's no way to distinguish the
first two, since both occur before any code you write is
executed.  (In practice, on most systems, the first two are done
by the system, when the program is loaded, either by copying an
image from disk or by zapping the block containing the variables
with 0's.  This only works, of course, on systems where null
pointers and 0.0 are represented by all 0 bits, but that's the
vast majority of the systems today.)
Something like this:
-------
class A {
  static int data;
  public:
    static int get_data();
};
int A::data = get_data();
int A::get_data() {
  if (data == 0) data = 42;
  return data;
}

Or using 0.  Yes.  For class types, it applies recursively,
until you reach a basic type which can be "zero initialized".
The one exception is references, since by definition, a
reference can't be zero initialized.

Good, now seems that I got it. Thanks for the further explanation
James.

Cheers,
Francesco
 

Jerry Coffin

On Sep 10, 4:07 pm, Jerry Coffin <[email protected]> wrote:

[ ... ]
Sorry, what? There's something resembling threading guarantees in
C++03?

No, "lacks" would mean "does not have", and almost any kind of
threading guarantee requires some kind of memory barrier.

[ ... ]
The problem is that most people I talk to at work have certain notions of
how threading works. One of these notions is that one can reason about
threading by considering all possible interleavings of instructions
(which, of course, is incorrect). Sadly, most of my colleagues still don't
"get it". The prototypical example I use: "Suppose a thread does a write A
and a write B in source code. Physically, after that processor core has
executed both store instructions, another thread may see write A and not
write B, another thread may see write B and not write A, another thread
may see both, and another thread may see neither." If one was taught
threading incorrectly, which I think is the case for most university
graduates nowadays, it is very hard to break those old habits and
understand how threading really works.
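
Here is a sketch of exactly that scenario using C++11 atomics (the code and
names are mine, not from the post). With relaxed ordering, the standard
really does allow the two observers to disagree about which write they see:
___________________________________________________________________
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> A{0}, B{0};

void writer()
{
    A.store(1, std::memory_order_relaxed);   // write A
    B.store(1, std::memory_order_relaxed);   // write B
}

void observer1(int* saw_B_without_A)
{
    int b = B.load(std::memory_order_relaxed);
    int a = A.load(std::memory_order_relaxed);
    *saw_B_without_A = (b == 1 && a == 0);    // permitted outcome
}

void observer2(int* saw_A_without_B)
{
    int a = A.load(std::memory_order_relaxed);
    int b = B.load(std::memory_order_relaxed);
    *saw_A_without_B = (a == 1 && b == 0);    // also a permitted outcome
}

int main()
{
    int r1 = 0, r2 = 0;
    std::thread t1(writer), t2(observer1, &r1), t3(observer2, &r2);
    t1.join(); t2.join(); t3.join();
    std::printf("saw B without A: %d, saw A without B: %d\n", r1, r2);
}
___________________________________________________________________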

I think the problem is really somewhat more fundamental: with
threading, it's almost pointless to try to figure out every scenario
that _could_ happen. Instead, you have to think strictly in terms of
what you've _made_ happen, and assume something wrong will happen
unless you've prevented it.

[ ... ]
IMHO, as soon as you modify DCL to be correct in C or C++, it is
no longer DCL. You have fundamentally changed the basis of your belief
of its correctness when you add in proper synchronization.

I can't agree. Just for an obvious example, see page 12 of the paper
you cited, which gives (most of) an implementation of the DCLP using
memory barriers to ensure correctness. Note, in particular, that it's
essentially identical to the original (incorrect) code other than the
addition of two memory barriers (added only as comments, since the
current version of C++ doesn't include them).

With C++ 0x, we'll have acquire and release memory barriers, and
we'll be able to implement exactly that code -- and I can't see any
way you can reasonably call it anything other than a DCL -- it still
has the double-check and locking, just with memory barriers inserted
to ensure that it actually works.
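
For reference, here is roughly what that looks like with the C++11 atomics
that were eventually standardized (a sketch, not the paper's exact code):
the two commented-out barriers become an acquire load on the first check
and a release store when the pointer is published.
___________________________________________________________________
#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance() {
        Singleton* p = s_instance.load(std::memory_order_acquire); // 1st check
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(s_mutex);
            p = s_instance.load(std::memory_order_relaxed);         // 2nd check
            if (p == nullptr) {
                p = new Singleton;
                s_instance.store(p, std::memory_order_release);     // publish
            }
        }
        return p;
    }
private:
    Singleton() {}
    static std::atomic<Singleton*> s_instance;
    static std::mutex s_mutex;
};

std::atomic<Singleton*> Singleton::s_instance{nullptr};
std::mutex Singleton::s_mutex;
___________________________________________________________________
It still reads as double-checked locking; the barriers are simply spelled
as memory orderings on the loads and stores instead of comments.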
No longer
are you doing a single check outside of all synchronization, the
hallmark of DCL. It only superficially resembles DCL.

I'd say the hallmark of DCL is obvious from the name: double-checked
locking. As long as you still do double-checked locking, insertion of
memory barriers doesn't change the fact that you're still doing
double-checked locking, and therefore implementing the double-checked
locking pattern.
 

James Kanze

[ ... ]
Firstly and most importantly, you're using double checked
locking, which is broken in effectively all C++
implementations. Don't do that. Please read, continue to
re-read if you don't get it, this excellent paper:
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
This is a decent paper, but it should be kept in context. Up
until shortly before he (helped) write that paper, Andrei
seems to have thought that the 'volatile' keyword was
sufficient to give assurances necessary for multithreading (in
essence that reading or writing a volatile variable acted as a
memory barrier).

I seem to recall that it was a couple of years before the paper
in question. Andrei did (naively) propose a means of achieving
thread safety using volatile. In fact, his solution worked, but
not for the reasons he thought---his solution actually only used
the fact that volatile is part of the type system. In the
following discussions, however, he quickly realized (and openly
admitted) that his understanding wasn't complete; since then
(and before writing the paper in question), he completed it.
The paper was also thoroughly reviewed before publication, to
ensure accuracy.
That paper followed shortly after he realized that this just
wasn't so. The tone of the paper is unfortunate though -- it
comes off as basically saying there's a problem with
double-checked locking, which really isn't the case at all.

This depends on how you define double-checked locking. There
is a definite problem in the code presented by Vlissides in his
original article, and that is what most people understand by
double checked locking. IIRC, the paper in question does make
it clear that double checked locking can be made to work using
assembler (at least on most platforms) or perhaps some
additional, system specific requests (other than just mutexes),
and while I don't think the paper mentions it, it can also be
made to work using thread local storage. In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.
The problem is that C++ (up through the 2003 standard) simply
lacks memory barriers. Double-checked locking is one example
of code that _needs_ a memory barrier to work correctly -- but
it's only one example of many.

It can be made to work with thread local storage as well,
without memory barriers.

[...]
The real problem was never with the DCLP itself, but with
attempting to do multi-threaded programming without the tools
necessary for the job.

Yes. The "problem" with DCLP is in fact just a symptom of a
larger problem, of people not understanding what is and is not
guaranteed (and to a lesser degree, of people not really
understanding the costs---acquiring a non-contested mutex is
really very, very cheap, and usually not worth trying to avoid).
 

Chris M. Thomasson

[...]
It can be made to work with thread local storage as well,
without memory barriers.

Indeed; something along the lines of:

<pseudo-code typed in newsreader>
___________________________________________________________________
template<typename T, unsigned T_id = 0>
struct non_destroying_singleton
{
    static T& instance()
    {
        /* per-thread cache; needs `static __thread' for GCC's TLS */
        static __thread T* l_instance = NULL;

        if (! l_instance)
        {
            static T* g_instance = NULL;

            static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

            lock_guard lock(g_mutex);

            if (! (l_instance = g_instance))
            {
                l_instance = g_instance = new T();
            }
        }

        return *l_instance;
    }
};
___________________________________________________________________




No memory barriers required! Yeah!

;^)
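
For comparison, here is the same idea in standard C++11, assuming a
conforming thread_local implementation (a sketch with my naming, not
Chris's exact code):
___________________________________________________________________
#include <mutex>

template<typename T>
struct non_destroying_singleton_cxx11
{
    static T& instance()
    {
        // Per-thread cache: once this thread has been through the slow
        // path once, later calls need no synchronization at all.
        static thread_local T* l_instance = nullptr;

        if (! l_instance)
        {
            static T* g_instance = nullptr;
            static std::mutex g_mutex;

            std::lock_guard<std::mutex> lock(g_mutex);

            if (! (l_instance = g_instance))
            {
                l_instance = g_instance = new T();   // never destroyed
            }
        }

        return *l_instance;
    }
};
___________________________________________________________________
(In C++11 the function-local statics are themselves initialized
thread-safely, so a plain function-local `static T instance;` would also be
initialized safely, though it would then be destroyed at exit, which the
above deliberately avoids.)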


[...]
 

Jerry Coffin

(e-mail address removed)>, (e-mail address removed)
says...

[ ... using a memory barrier ]
In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.

I have to disagree with both of these. First, a memory barrier is
quite a bit different from a mutex. Consider (for example) a store
fence. It simply says that stores from all previous instructions must
complete before any stores from subsequent instructions (and a read
barrier does the same, but for reads). It's basically equivalent to a
sequence point, but for real hardware instead of a conceptual model.

As far as cost goes: a mutex normally uses kernel data, so virtually
every operation requires a switch from user mode to kernel mode and
back. The cost for that will (of course) vary between systems, but is
almost always fairly high (figure a few thousand CPU cycles as a
reasonable minimum).

A memory barrier will typically just prevent combining a subsequent
write with a previous one. As long as there's room in the write queue
for both pieces of data, there's no cost at all. In the (normally
rare) case that the CPU's write queue is full, a subsequent write has
to wait for a previous write to complete to create an empty spot in
the write queue. Even in this worst case, it's generally going to be
around an order of magnitude faster than a switch to kernel mode and
back.
It can be made to work with thread local storage as well,
without memory barriers.

Well, yes -- poorly stated on my part. It requires _some_ sort of
explicit support for threading that's missing from the current and
previous versions of C++, but memory barriers aren't the only
possible one.

[ ... ]
Yes. The "problem" with DCLP is in fact just a symptom of a
larger problem, of people not understanding what is and is not
guaranteed (and to a lesser degree, of people not really
understanding the costs---acquiring a non-contested mutex is
really very, very cheap, and usually not worth trying to avoid).

At least under Windows, this does not fit my experience. Of course,
Windows has its own cure (sort of) for the problem -- rather than
using a mutex (with its switch to/from kernel mode) you'd usually use
a critical section instead. Entering a critical section that's not in
use really is very fast.

Then again, a critical section basically is itself just a double-
checked lock (including the necessary memory barriers). They have two
big limitations: first, unlike a normal mutex, they only work between
threads in a single process. Second, they can be quite slow when/if
there's a great deal of contention for the critical section.
 

Keith H Duggar

Not knowing the requirements, it's hard to say. My usual
implementation is:

class Data
{
private:
    static Data*    ourInstance ;
    Data() {}
    ~Data() {}

public:
    static Data&    instance() ;
    // ...
} ;

Data* Data::ourInstance = &instance() ;

Data&
Data::instance()
{
    if ( ourInstance == NULL ) {
        ourInstance = new Data ;
    }
    return *ourInstance ;
}

This solves the threading issues (for the most part), and avoids
any order of destruction issues, by not destructing the object.

I think it's worth noting that the implementation above always
instantiates Data even if it is never used by another part of
the program; i.e., the above is not the "lazy" instantiation of
the original implementation.

KHD
 

Joshua Maurice

(e-mail address removed)>, (e-mail address removed)
says...

At least under Windows, this does not fit my experience. Of course,
Windows has its own cure (sort of) for the problem -- rather than
using a mutex (with its switch to/from kernel mode) you'd usually use
a critical section instead. Entering a critical section that's not in
use really is very fast.

Then again, a critical section basically is itself just a double-
checked lock (including the necessary memory barriers). They have two
big limitations: first, unlike a normal mutex, they only work between
threads in a single process. Second, they can be quite slow when/if
there's a great deal of contention for the critical section.

Well, with contention, how slow would a CriticalSection be compared to
a WIN32 Mutex object? I would presume about the same, and with that
assumption, I would say one should use CriticalSections as your mutex
implementation on windows unless you actually need the extra
functionality, like locking the same object across processes, which in
my experience is quite rare.
 

Jerry Coffin

[ ... ]
Well, with contention, how slow would a CriticalSection be compared
to a WIN32 Mutex object? I would presume about the same, and with
that assumption, I would say one should use CriticalSections as
your mutex implementation on windows unless you actually need the
extra functionality, like locking the same object across processes,
which in my experience is quite rare.

We're getting off topic, so I won't try to get into a lot of detail,
but with sufficient contention, a critical section can become
_substantially_ slower than direct use of a mutex.

A couple of years ago (or so) there was a thread about this on
comp.os.ms-windows.programmer.win32. IIRC, somebody eventually posted
an improved version that didn't suffer nearly as bad a slowdown.
Unfortunately, given the degree to which Google has broken searching
on the newsgroup archive, I can't tell you exactly where to get that.
 

Chris M. Thomasson

James Kanze said:
And doesn't avoid the order of initialization issues the other
versions were designed to avoid.



Not knowing the requirements, it's hard to say. My usual
implementation is:

class Data
{
private:
    static Data*    ourInstance ;
    Data() {}
    ~Data() {}

public:
    static Data&    instance() ;
    // ...
} ;

Data* Data::ourInstance = &instance() ;

Data&
Data::instance()
{
    if ( ourInstance == NULL ) {
        ourInstance = new Data ;
    }
    return *ourInstance ;
}

This solves the threading issues (for the most part), and avoids
any order of destruction issues, by not destructing the object.

FWIW, this is how to break it using threads:

http://groups.google.com/group/comp.lang.c++/msg/769cda6552764337
 

Chris M. Thomasson

Jerry Coffin said:
(e-mail address removed)>, (e-mail address removed)
says...

[ ... using a memory barrier ]
In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.

I have to disagree with both of these. First, a memory barrier is
quite a bit different from a mutex. Consider (for example) a store
fence. It simply says that stores from all previous instructions must
complete before any stores from subsequent instructions (and a read
barrier does the same, but for reads). It's basically equivalent to a
sequence point, but for real hardware instead of a conceptual model.


As far as cost goes: a mutex normally uses kernel data, so virtually
every operation requires a switch from user mode to kernel mode and
back.

Well, a `CRITICAL_SECTION' is a mutex. A HANDLE returned from
`CreateMutex()' is also a mutex. Therefore, a mutex implementation does not
need to always enter the kernel. IMVHO, Windows made a poor choice of naming
their intra-process mutex. IMO, a "critical section" does not refer to the
synchronization primitive, but to the actual locked portion of the caller's
code.

The cost for that will (of course) vary between systems, but is
almost always fairly high (figure a few thousand CPU cycles as a
reasonable minimum).

A memory barrier will typically just prevent combining a subsequent
write with a previous one.

A #StoreLoad memory barrier can cripple performance by forcing the previous
stores to be performed before any subsequent load can be committed. This
defeats the purpose of caching and pipeline:

http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce

As long as there's room in the write queue
for both pieces of data, there's no cost at all.

There is a cost when you start getting into load-to-store and store-to-load
ordering constraints. For instance, the following store membar:


MEMBAR #StoreStore



is MUCH less expensive when compared to a version which has a store-to-load
ordering constraint:

MEMBAR #StoreLoad | #StoreStore



You need a store-to-load ordering in the acquisition portion of a
traditional mutex. This is why even user-space fast-pathed mutexes can have
some nasty overheads even in the uncontended case.
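
Mapping that onto C++11 fences, roughly (my correspondence, not an exact
equivalence): store-to-store ordering needs only a release fence, while a
store-to-load constraint effectively requires the sequentially consistent
fence, which is MFENCE on IA-32 and is the expensive case being described.
___________________________________________________________________
#include <atomic>

std::atomic<int> g_data{0};
std::atomic<int> g_flag{0};

void store_store_only()
{
    g_data.store(42, std::memory_order_relaxed);
    // Roughly MEMBAR #StoreStore: earlier stores become visible no later
    // than later stores. Cheap (often a no-op) on TSO hardware.
    std::atomic_thread_fence(std::memory_order_release);
    g_flag.store(1, std::memory_order_relaxed);
}

int store_load_constraint()
{
    g_flag.store(1, std::memory_order_relaxed);
    // The store-to-load (#StoreLoad) part needs the full seq_cst fence,
    // i.e. MFENCE on IA-32 -- the costly one.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return g_data.load(std::memory_order_relaxed);
}
___________________________________________________________________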



In the (normally
rare) case that the CPU's write queue is full, a subsequent write has
to wait for a previous write to complete to create an empty spot in
the write queue. Even in this worst case, it's generally going to be
around an order of magnitude faster than a switch to kernel mode and
back.


Well, yes -- poorly stated on my part. It requires _some_ sort of
explicit support for threading that's missing from the current and
previous versions of C++, but memory barriers aren't the only
possible one.


At least under Windows, this does not fit my experience. Of course,
Windows has its own cure (sort of) for the problem -- rather than
using a mutex (with its switch to/from kernel mode) you'd usually use
a critical section instead.

Entering a critical section that's not in
use really is very fast.

Not in all cases... Try a test with something like the following crude
pseudo-code:
________________________________________________________________
struct data
{
    int array[128];
};


static struct data g_data = { { 0 } };


void locked_reader_threads()
{
    unsigned iterations;
    CRITICAL_SECTION mutex;

    InitializeCriticalSection(&mutex);

    for (iterations = 0 ;; ++iterations)
    {
        EnterCriticalSection(&mutex);

        LeaveCriticalSection(&mutex);

        for (int i = 0; i < 128; ++i)
        {
            int x = g_data.array[i];

            if (x) abort();
        }
    }

    DeleteCriticalSection(&mutex);
}


void naked_reader_threads()
{
    unsigned iterations;

    for (iterations = 0 ;; ++iterations)
    {
        for (int i = 0; i < 128; ++i)
        {
            int x = g_data.array[i];

            if (x) abort();
        }
    }
}
________________________________________________________________




See how many iterations each reader thread can perform per-second under
worst case load for prolonged periods of time (e.g., like sustained high
intensity bursts of traffic on a database server or something). The
`locked_reader_threads()' should perform fewer reads per second per thread.



Then again, a critical section basically is itself just a double-
checked lock (including the necessary memory barriers).

AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive mutexes.
Could the inter-process mutexes share the same properties? I think so.



They have two
big limitations: first, unlike a normal mutex, they only work between
threads in a single process. Second, they can be quite slow when/if
there's a great deal of contention for the critical section.

WRT slow, are you referring to the old implementation of `CRITICAL_SECTION's
which used to hand off ownership? IIRC, MS changed the implementation to
allow for a thread to sneak in and acquire the mutex which increases
performance considerably. However, unlike handing off ownership, it's not
fair and can be subjected to "starvation".
 

Chris M. Thomasson

Jerry Coffin said:
[ ... ]
Well, with contention, how slow would a CriticalSection be compared
to a WIN32 Mutex object? I would presume about the same, and with
that assumption, I would say one should use CriticalSections as
your mutex implementation on windows unless you actually need the
extra functionality, like locking the same object across processes,
which in my experience is quite rare.

We're getting off topic, so I won't try to get into a lot of detail,
but with sufficient contention, a critical section can become
_substantially_ slower than direct use of a mutex.

Are you writing about a `CRITICAL_SECTION' implementation that hands off
ownership?
 

James Kanze

"James Kanze" <[email protected]> wrote in message
[...]
Not knowing the requirements, it's hard to say. My usual
implementation is:
class Data
{
private:
    static Data*    ourInstance ;
    Data() {}
    ~Data() {}
public:
    static Data&    instance() ;
    // ...
} ;
Data* Data::ourInstance = &instance() ;
Data&
Data::instance()
{
    if ( ourInstance == NULL ) {
        ourInstance = new Data ;
    }
    return *ourInstance ;
}
This solves the threading issues (for the most part), and
avoids any order of destruction issues, by not destructing
the object.
FWIW, this is how to break it using threads:

Yes. If you start a thread which uses it from the constructor,
you will have problems. That's what the "(for the most part)"
refers to. (In my actual library, this restriction is
documented.)

In practice, I've never found this to be an issue. Nor have I
found Keith Duggar's point, that the object will always be
constructed, even if it is never used, an issue---just the
opposite, in most of my applications, start-up time is more or
less free, but I have time constraints when responding to a
request, so anything that can be moved out into the program
initialization phase is good. Still, it's a point that should
be documented (and that IIRC, I forgot to document in my generic
implementation).
 

James Kanze

(e-mail address removed)>,
(e-mail address removed) says...
[ ... using a memory barrier ]
In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.
I have to disagree with both of these.

You're arguing against actual measurements made on a Sun Sparc,
under Solaris.
First, a memory barrier is quite a bit different from a mutex.
Consider (for example) a store fence. It simply says that
stores from all previous instructions must complete before any
stores from subsequent instructions (and a read barrier does
the same, but for reads). It's basically equivalent to a
sequence point, but for real hardware instead of a conceptual
model.
Certainly.

As far as cost goes: a mutex normally uses kernel data,

Since when? This isn't the case on any of the systems I'm
familiar with (Solaris, Linux and Windows). In all cases, the
mutex (CriticalSection, under Windows) only goes into kernel
mode if there is a conflict. (Note that unlike the Windows
implementation, under Solaris or Linux, this is true even if
the mutex is shared with another process, or if there is a time
out on the wait.)
so virtually every operation requires a switch from user mode
to kernel mode and back. The cost for that will (of course)
vary between systems, but is almost always fairly high (figure
a few thousand CPU cycles as a reasonable minimum).

A few thousand CPU cycles seems quite high to me, given the
timings I've made (all under Solaris on a Sparc, some time ago),
but it is expensive, yes (a couple of hundred). That's why
Solaris and Linux avoid it, and why Windows offers a
"simplified" mutex (which they misleadingly call
CriticalSection), which avoids it.
A memory barrier will typically just prevent combining a
subsequent write with a previous one. As long as there's room
in the write queue for both pieces of data, there's no cost at
all.

A memory barrier ensures that no following operations become
visible until the previous operations are guaranteed visible.
At least on a Sparc (again, the only machine on which I've made
measurements), this can be very expensive---easily ten times the
cost of a "normal" instruction.

[...]
Well, yes -- poorly stated on my part. It requires _some_ sort
of explicit support for threading that's missing from the
current and previous versions of C++, but memory barriers
aren't the only possible one.

If you're talking about the current and previous versions of
C++, something like pthread_create formally invokes undefined
behavior, so something more is certainly necessary for
multithreaded applications. If you're talking about C++ plus
Posix or Windows, then at least under Posix (and I think
Windows), there is support for thread local storage. Given the
interface under Posix, I suspect that it can be rather expensive
to use, however (but I've not measured it), which would account
for the fact that it's not often suggested. If accessing a
thread local variable is no more expensive than accessing a
normal static variable, and each thread makes a lot of requests
to the instance function, using the thread local variable
solution could be a definite winner.
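
A sketch of the Posix interface being referred to (pthread_key_create /
pthread_getspecific): every access is a library call rather than a plain
variable read, which is presumably where the suspected cost would come
from. The slow_path() below is a hypothetical stand-in for the
mutex-guarded creation, and the names are mine.
___________________________________________________________________
#include <pthread.h>

struct Data { int value; };

static pthread_key_t  g_key;
static pthread_once_t g_key_once = PTHREAD_ONCE_INIT;

static void make_key() { pthread_key_create(&g_key, 0); }

static Data* slow_path()            // hypothetical mutex-guarded creation
{
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static Data* g_instance = 0;

    pthread_mutex_lock(&mtx);
    if (g_instance == 0) g_instance = new Data();   // never destroyed
    Data* p = g_instance;
    pthread_mutex_unlock(&mtx);
    return p;
}

Data& instance()
{
    pthread_once(&g_key_once, make_key);
    // The per-thread lookup is a function call, not a plain static access.
    Data* cached = static_cast<Data*>(pthread_getspecific(g_key));
    if (cached == 0) {
        cached = slow_path();
        pthread_setspecific(g_key, cached);
    }
    return *cached;
}
___________________________________________________________________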
[ ... ]
Yes. The "problem" with DCLP is in fact just a symptom of a
larger problem, of people not understanding what is and is not
guaranteed (and to a lesser degree, of people not really
understanding the costs---acquiring a non-contested mutex is
really very, very cheap, and usually not worth trying to avoid).
At least under Windows, this does not fit my experience. Of
course, Windows has its own cure (sort of) for the problem --
rather than using a mutex (with its switch to/from kernel
mode) you'd usually use a critical section instead. Entering a
critical section that's not in use really is very fast.

What Windows calls a CriticalSection is, in fact, a mutex, and
is what I use under Windows when I need a mutex to protect
between threads (as opposed to between processes).

Note that the Windows implementation of boost::mutex uses
CriticalSection.
Then again, a critical section basically is itself just a
double- checked lock (including the necessary memory
barriers). They have two big limitations: first, unlike a
normal mutex, they only work between threads in a single
process. Second, they can be quite slow when/if there's a
great deal of contention for the critical section.

If there's a lot of contention, any locking mechanism will be
expensive. Between processes... The Posix mutex works between
processes, with no kernel code if there is no contention. On
the other hand (compared to Windows), it doesn't use an
identifier; the mutex object itself (pthread_mutex_t) must be in
memory mapped to both processes.
 

Jerry Coffin

(e-mail address removed)>,
(e-mail address removed) says...
[ ... using a memory barrier ]
In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.
I have to disagree with both of these.

You're arguing against actual measurements made on a Sun Sparc,
under Solaris.

The degree of similarity (or lack thereof) between a memory barrier and a
mutex 1) is entirely platform independent, and 2) isn't really open to
measurement.

Based on what you've said, it comes down to this: the platforms with
which you're familiar include a double-checked lock in their
implementation of a mutex (as long as under Windows you treat
"mutex" as meaning "critical section").

Going back to your original statement, that there's little point in
using double-checked locking, I'd certainly agree that when the
system builds a double-checked lock into its mutex (or what you use
as a mutex anyway), then there's little or no gain from duplicating
that in your own code.

[ ... ]
If there's a lot of contention, any locking mechanism will be
expensive.

Yes, but in Windows if there's a lot of contention, a critical
section is a lot _more_ expensive than a mutex.
Between processes... The Posix mutex works between
processes, with no kernel code if there is no contention. On
the other hand (compared to Windows), it doesn't use an
identifier; the mutex object itself (pthread_mutex_t) must be in
memory mapped to both processes.

In other words, as I pointed out above, it's doing a double-checked
lock. It manipulates some data directly (with appropriate memory
barriers), and iff that fails, uses the real mutex that involves a
switch to kernel mode (and allows the kernel to schedule another
thread).
 

Chris M. Thomasson

Jerry Coffin said:
(e-mail address removed)>,
(e-mail address removed) says...
[ ... using a memory barrier ]
In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.
I have to disagree with both of these.

You're arguing against actual measurements made on a Sun Sparc,
under Solaris.

The degree of similarity (or lack thereof) between a memory barrier and a
mutex 1) is entirely platform independent, and 2) isn't really open to
measurement.

Based on what you've said, it comes down to this: the platforms with
which you're familiar include a double-checked lock in their
implementation of a mutex (as long as under Windows you treat
"mutex" as meaning "critical section").

Going back to your original statement, that there's little point in
using double-checked locking, I'd certainly agree that when the
system builds a double-checked lock into its mutex (or what you use
as a mutex anyway), then there's little or no gain from duplicating
that in your own code.

FWIW, there is a fairly big difference between the fast-path of double-
checked locking and the fast-path of a mutex. The check of a DCL algorithm
does not involve any atomic RMW operations. However, Peterson's algorithm
aside for a moment, the check of a mutex does use atomic RMW.
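
To make that contrast concrete (a sketch, my naming): the DCL check is a
single load, while even an uncontended lock acquisition has to perform an
atomic read-modify-write such as an exchange or compare-and-swap.
___________________________________________________________________
#include <atomic>

struct Widget;                              // contents irrelevant here

std::atomic<Widget*> g_widget{nullptr};     // DCL-protected pointer
std::atomic<bool>    g_lock{false};         // toy user-space lock word

// DCL fast-path check: one (acquire) load, nothing written, no locked
// bus cycle.
bool dcl_fast_path_hit()
{
    return g_widget.load(std::memory_order_acquire) != nullptr;
}

// Mutex fast path: even uncontended, it must do an atomic RMW (here an
// exchange), which is the extra cost being pointed out above.
bool mutex_try_acquire()
{
    return ! g_lock.exchange(true, std::memory_order_acquire);
}
___________________________________________________________________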



[ ... ]
If there's a lot of contention, any locking mechanism will be
expensive.

Yes, but in Windows if there's a lot of contention, a critical
section is a lot _more_ expensive than a mutex.

I am not exactly sure why it would be a "lot _more_ expensive". I can see a
certain advantage, but it should not make that much of a difference. I will
post (*) a quick and dirty example program at the end of this message which
attempts to show performance difference between `CRITICAL_SECTION, `HANDLE'
mutex, and no lock at all...



In other words, as I pointed out above, it's doing a double-checked
lock. It manipulates some data directly (with appropriate memory
barriers), and iff that fails, uses the real mutex that involves a
switch to kernel mode (and allows the kernel to schedule another
thread).

The difference is that DCL does not need to manipulate/mutate data in order
to skip mutex acquisition. It only needs to perform a data-dependent load,
which is a plain naked load on basically every system out there except DEC
Alpha.









(*)
________________________________________________________________________
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>




#define READERS 4
#define ITERS 999999
#define L2DEPTH 64
#define L2CACHE 64
/* #define USE_CRITICAL_SECTION */
/* #define NO_LOCKS */




#define ALIGN(p, a) \
((((DWORD_PTR)(p)) + ((DWORD_PTR)(a)) - 1U) & \
((DWORD_PTR)(-((LONG_PTR)(a)))))


#define ALIGN_PTR(p, a) ((void*)ALIGN(p, a))


#define ALIGN_BUFSIZE(s, a) \
(((DWORD_PTR)(s)) + ((DWORD_PTR)(a)) - 1U)


#define ALIGN_CHECK(p, a) \
(! (((DWORD_PTR)(p)) % ((DWORD_PTR)(a))))




#if ! defined (NO_LOCKS)
# if defined (USE_CRITICAL_SECTION)
typedef CRITICAL_SECTION mutex_type;

# define mutex_create(s) \
InitializeCriticalSection((s))

# define mutex_destroy(s) \
DeleteCriticalSection((s))

# define mutex_lock(s) \
EnterCriticalSection((s))

# define mutex_unlock(s) \
LeaveCriticalSection((s))

# else
typedef HANDLE mutex_type;

# define mutex_create(s) \
(*(s) = CreateMutex(NULL, FALSE, NULL))

# define mutex_destroy(s) \
CloseHandle(*(s))

# define mutex_lock(s) \
WaitForSingleObject(*(s), INFINITE)

# define mutex_unlock(s) \
ReleaseMutex(*(s))
# endif


#else
typedef int mutex_type;

# define mutex_create(s) ((void)0)
# define mutex_destroy(s) ((void)0)
# define mutex_lock(s) ((void)0)
# define mutex_unlock(s) ((void)0)

#endif





struct l2cache
{
char buffer[L2CACHE];
};


struct data
{
struct l2cache array[L2DEPTH];
};


struct global
{
mutex_type mutex;
char l2pad2_1[L2CACHE - sizeof(mutex_type)];
struct data data;
LONG finish_ref;
HANDLE finished;
char l2pad2_2[L2CACHE - (sizeof(LONG) + sizeof(HANDLE))];
};




typedef char static_assert
[
! (sizeof(struct global) % L2CACHE) &&
! (sizeof(struct data) % L2CACHE) &&
sizeof(struct data) / L2CACHE == L2DEPTH
? 1 : -1
];




static char g_raw_buffer
[
ALIGN_BUFSIZE(sizeof(struct global), L2CACHE)
] = { '\0' };


static struct global* g_global = NULL;




unsigned WINAPI
reader_thread(void* state)
{
unsigned i;
struct l2cache cmp = { { '\0' } };

for (i = 0; i < ITERS; ++i)
{
unsigned d;

mutex_lock(&g_global->mutex);

for (d = 0; d < L2DEPTH; ++d)
{
if (memcmp(g_global->data.array + d,
&cmp,
sizeof(cmp)))
{
abort();
}
}

mutex_unlock(&g_global->mutex);
}

if (! InterlockedDecrement(&g_global->finish_ref))
{
SetEvent(g_global->finished);
}

return 0;
}




int main(void)
{
size_t i;
unsigned id;
double end;
clock_t start;
HANDLE tid[READERS];
unsigned long int iter_avg_per_thread, iter_avg_total;

g_global = ALIGN_PTR(g_raw_buffer, L2CACHE);

assert(ALIGN_CHECK(g_global, L2CACHE));

g_global->finished = CreateEvent(NULL, FALSE, FALSE, NULL);
g_global->finish_ref = READERS;

mutex_create(&g_global->mutex);

for (i = 0; i < READERS; ++i)
{
tid[i] = (HANDLE)_beginthreadex(NULL,
0,
reader_thread,
NULL,
CREATE_SUSPENDED,
&id);
}

start = clock();

for (i = 0; i < READERS; ++i)
{
ResumeThread(tid[i]);
}

WaitForSingleObject(g_global->finished, INFINITE);

end = ((double)(clock() - start)) / CLOCKS_PER_SEC;

if (end)
{
iter_avg_per_thread =
(unsigned long int)(ITERS / end);

iter_avg_total =
(unsigned long int)((ITERS * READERS) / end);
}

else
{
iter_avg_per_thread = ITERS;
iter_avg_total = ITERS * READERS;
}

for (i = 0; i < READERS; ++i)
{
WaitForSingleObject(tid[i], INFINITE);
CloseHandle(tid[i]);
}

mutex_destroy(&g_global->mutex);

CloseHandle(g_global->finished);

printf("Threads: %u\n"
"Time: %.3f ms\n"
"Total Iterations Per-Second: %lu\n"
"Iterations Per-Second/Per-Thread: %lu\n",
READERS,
end * 1000.0,
iter_avg_total,
iter_avg_per_thread);

return 0;
}

________________________________________________________________________




You define `NO_LOCKS' for a lock-free version, you define
`USE_CRITICAL_SECTION' for a version using critical-sections, and undefine
`NO_LOCKS' and `USE_CRITICAL_SECTION' for the kernel mutex version. Here is
the output I get for the `CRITICAL_SECTION' version:

Threads: 4
Time: 28297.000 ms
Total Iterations Per-Second: 141357
Iterations Per-Second/Per-Thread: 35339




Here is the Kernel mutex:

Threads: 4
Time: 28078.000 ms
Total Iterations Per-Second: 142460
Iterations Per-Second/Per-Thread: 35615




Here is lock-free:

Threads: 4
Time: 11515.000 ms
Total Iterations Per-Second: 347372
Iterations Per-Second/Per-Thread: 86843




This is on XP with an old P4 3.06GHz HyperThreaded processor. Anyway, I
cannot see that much of a difference between the two mutex based versions.
However, I can see a major difference in the lock-free one! This does not
really prove anything, but it's a little bit interesting.

;^)
 

Jerry Coffin

[email protected] says... said:
Well, a `CRITICAL_SECTION' is a mutex. A HANDLE returned from
`CreateMutex()' is also mutex. Therefore, a mutex implementation does not
need to always enter the kernel. IMVHO, Windows made a poor choice of naming
their intra-process mutex.

Oddly, it's a name that was inherited from OS/2. One of the real
strengths of OS/2's APIs (in general) was a rather nice naming
convention.
IMO, a "critical section" does not refer to the
synchronization primitive, but to the actual locked portion of the caller's
code.

I'd agree with that -- and EnterCriticalSection and
LeaveCriticalSection seem to reflect that intent as well.

[ ... ]
A #StoreLoad memory barrier can cripple performance by forcing the
previous stores to be performed before any subsequent load can be
committed. This defeats the purpose of caching and pipeline:

That's not (usually) entirely true. A typical processor maintains
coherency between the cache and the memory -- i.e. anything that's
been written to the cache is considered visible throughout the
system.

The memory barrier normally affects the CPUs write queue between the
CPU proper, and the cache (specifically, the first level of cache
that isn't write-through). If the writes involved happened to be to
areas of memory that are uncachable, then you'd see a really serious
performance hit. Otherwise, while there will be some performance hit,
it'll typically be _fairly_ small (the write queue is rarely more
than about 10 deep, and can typically write one entry something like
every clock or two).

Of course, if you have a cache that doesn't do snooping to ensure
that everything written to the cache is immediately visible to the
rest of the system, things get uglier in a hurry...

[ ... ]
AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive mutexes.
Could the inter-process mutexes share the same properties? I think so.

Yes, it could. It would require putting the data used for the fast-
path lock into memory that's visible to all processes involved.

[ ... ]
WRT slow, are you referring to the old implementation of
`CRITICAL_SECTION's which used to hand off ownership? IIRC, MS
changed the implementation to allow for a thread to sneak in and
acquire the mutex which increases performance considerably.
However, unlike handing off ownership, it's not fair and can be
subjected to "starvation".

Well, I haven't looked at it again recently to check, so things could
have changed since then. What I'm talking about, however, was where
EnterCriticalSection would basically do a spin-lock, repeatedly
trying to acquire the mutex, and only after several attempts would it
give up and let another thread run.
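
Something along the lines of what is being described, sketched with C++11
atomics (the spin count and the yield fallback are made up; a real
CRITICAL_SECTION eventually blocks on a kernel object rather than yielding):
___________________________________________________________________
#include <atomic>
#include <thread>

class spin_then_yield_lock
{
    std::atomic<bool> locked_{false};
public:
    void lock()
    {
        // First spin a bounded number of times, hoping the owner lets go...
        for (int spins = 0; spins < 4000; ++spins) {
            if (! locked_.exchange(true, std::memory_order_acquire))
                return;
        }
        // ...then stop burning CPU and let another thread run.
        while (locked_.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();
    }

    void unlock()
    {
        locked_.store(false, std::memory_order_release);
    }
};
___________________________________________________________________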
 

Chris M. Thomasson

Jerry Coffin said:
(e-mail address removed) says... [...]

A #StoreLoad memory barrier can cripple performance by forcing the
previous stores to be performed before any subsequent load can be
committed. This defeats the purpose of caching and pipeline:

That's not (usually) entirely true. A typical processor maintains
coherency between the cache and the memory -- i.e. anything that's
been written to the cache is considered visible throughout the
system.

The memory barrier normally affects the CPUs write queue between the
CPU proper, and the cache (specifically, the first level of cache
that isn't write-through). If the writes involved happened to be to
areas of memory that are uncachable, then you'd see a really serious
performance hit. Otherwise, while there will be some performance hit,
it'll typically be _fairly_ small (the write queue is rarely more
than about 10 deep, and can typically write one entry something like
every clock or two).

Of course, if you have a cache that doesn't do snooping to ensure
that everything written to the cache is immediately visible to the
rest of the system, things get uglier in a hurry...

FWIW, I know that a `#StoreLoad' membar tanks performance in certain
algorithms. Take plain SMR (Safe Memory Reclamation) hazard pointers. The
algorithm requires a `#StoreLoad' on every store into a hazard pointer. This
really shows up when one is iterating through a linked data-structure. I
have included a crude little program at the end of this message which
attempts to demonstrate the performance difference between a membar based
SMR and a naked version.
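
Roughly where that barrier sits in an SMR-style read (a sketch with my
naming, not the exact published algorithm): the hazard pointer is published
and then the shared pointer is re-checked, and the store-to-load ordering
between those two steps is the #StoreLoad (MFENCE on IA-32) being measured
below.
___________________________________________________________________
#include <atomic>

struct node { std::atomic<node*> next{nullptr}; };

std::atomic<node*> g_head{nullptr};
std::atomic<node*> g_hazard{nullptr};   // a single hazard slot, for brevity

node* acquire_hazard()
{
    node* n;
    do {
        n = g_head.load(std::memory_order_relaxed);
        g_hazard.store(n, std::memory_order_relaxed);        // publish hazard
        // Store-to-load ordering: the hazard store must be visible before
        // g_head is re-read; this fence is the per-node MFENCE cost.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    } while (n != g_head.load(std::memory_order_relaxed));   // validate
    return n;   // safe to traverse from n until g_hazard is cleared
}
___________________________________________________________________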



AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive
mutexes.
Could the inter-process mutexes share the same properties? I think so.

Yes, it could. It would require putting the data used for the fast-
path lock into memory that's visible to all processes involved.
Indeed.



[...]
WRT slow, are you referring to the old implementation of
`CRITICAL_SECTION's which used to hand off ownership? IIRC, MS
changed the implementation to allow for a thread to sneak in and
acquire the mutex which increases performance considerably.
However, unlike handing off ownership, it's not fair and can be
subjected to "starvation".

Well, I haven't looked at it again recently to check, so things could
have changed since then. What I'm talking about, however, was where
EnterCriticalSection would basically do a spin-lock, repeatedly
trying to acquire the mutex, and only after several attempts would it
give up and let another thread run.

Yeah. `CRITICAL_SECTION's are adaptive. However, what about:

SetCriticalSectionSpinCount(&mutex, 0);

?


lol... ;^)








Here is a crappy little test program for GCC and ia-32:
______________________________________________________________________
#include <time.h>
#include <stdio.h>
#include <pthread.h>




#define MFENCE() __asm__ __volatile__ ("MFENCE" : : : "memory")




#define RUNS 2
#define DEPTH 256
#define ITERS 1000000
#define READERS 4




struct node
{
struct node* volatile next;
};


static struct node g_nodes[DEPTH];
static struct node* volatile g_head = NULL;
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;




void
node_init(void)
{
unsigned i;

for (i = 0; i < DEPTH; ++i)
{
g_nodes[i].next = g_head;
g_head = g_nodes + i;
}
}




void*
reader_naked(void* state)
{
unsigned i;

for (i = 0; i < ITERS; ++i)
{
struct node* n = g_head;

while (n)
{
n = n->next;
}
}

return 0;
}


void*
reader_plain_smr(void* state)
{
unsigned i;

for (i = 0; i < ITERS; ++i)
{
struct node* n = g_head;
MFENCE();

while (n)
{
n = n->next;
MFENCE();
}
}

return 0;
}


void*
reader_locked(void* state)
{
unsigned i;

for (i = 0; i < ITERS; ++i)
{
struct node* n;

pthread_mutex_lock(&g_mutex);

n = g_head;

while (n)
{
n = n->next;
}

pthread_mutex_unlock(&g_mutex);
}

return 0;
}




int
main(void)
{
unsigned i, r;
pthread_t tid[READERS];
clock_t start;
double end;

node_init();


for (r = 0; r < RUNS; ++r)
{
printf("executing test run %d of %d...\n",
r + 1,
RUNS);




/* Naked */
start = clock();

for (i = 0; i < READERS; ++i)
{
pthread_create(tid + i, NULL, reader_naked, NULL);
}

for (i = 0; i < READERS; ++i)
{
pthread_join(tid[i], NULL);
}

end = (double)(clock() - start) / CLOCKS_PER_SEC;

printf("reader_naked - %.3f ms / %lu reads per-second/per-thread\n",
end * 1000.0,
(unsigned long int)(ITERS / ((end) ? end : 1)));




/* Plain SMR */
start = clock();

for (i = 0; i < READERS; ++i)
{
pthread_create(tid + i, NULL, reader_plain_smr, NULL);
}

for (i = 0; i < READERS; ++i)
{
pthread_join(tid[i], NULL);
}

end = (double)(clock() - start) / CLOCKS_PER_SEC;

printf("reader_plain_smr - %.3f ms / %lu reads
per-second/per-thread\n",
end * 1000.0,
(unsigned long int)(ITERS / ((end) ? end : 1)));




/* Locked */
start = clock();

for (i = 0; i < READERS; ++i)
{
pthread_create(tid + i, NULL, reader_locked, NULL);
}

for (i = 0; i < READERS; ++i)
{
pthread_join(tid[i], NULL);
}

end = (double)(clock() - start) / CLOCKS_PER_SEC;

printf("reader_locked - %.3f ms / %lu reads
per-second/per-thread\n\n",
end * 1000.0,
(unsigned long int)(ITERS / ((end) ? end : 1)));
}

puts("\n\ncompleted!");

return 0;
}

______________________________________________________________________




Here is the output I get on my old P4 3.06GHz HyperThreaded processor:
______________________________________________________________________
executing test run 1 of 2...
reader_naked - 546.000 ms / 1831501 reads per-second/per-thread
reader_plain_smr - 18313.000 ms / 54606 reads per-second/per-thread
reader_locked - 3484.000 ms / 287026 reads per-second/per-thread

executing test run 2 of 2...
reader_naked - 563.000 ms / 1776198 reads per-second/per-thread
reader_plain_smr - 18312.000 ms / 54608 reads per-second/per-thread
reader_locked - 3547.000 ms / 281928 reads per-second/per-thread



completed!
______________________________________________________________________




The only difference between `reader_naked()' and `reader_plain_smr()' is
that the latter has the same memory barrier overhead that a plain SMR
implementation would have. I can say that the `MFENCE' instruction on IA-32
can really devastate performance. I have seen too many examples and
measurements. BTW, you will get similar performance numbers on SPARC or
PowerPC. The locked version beats the plain SMR version because, blocking
aside for a moment, it amortizes memory barrier usage across the depth of
the iteration...
 
