Singleton_pattern and Thread Safety

J

Joshua Maurice

I am aware of double checked locking pattern yes and this is not the
double checked locking pattern (there is only one check of the pointer
if you look).  If a pointer read/write is atomic is should be fine (on
the implementation I use it is at least).

You've hidden the second check with the static keyword.

Example: Consider:

SomeType& foo()
{
static SomeType foo;
return foo;
}

For a C++03 implementation, it's likely implemented with something
like:

SomeType& foo()
{
static bool b = false; /*done before any runtime execution, stored
in the executable image */
static char alignedStorage[sizeof(SomeType)]; /*with some magic
for alignment */
if ( ! b)
new (alignedStorage) SomeType();
return * reinterpret_cast<SomeType*>(alignedStorage);
}

That's your double check.

For C++0x, it will not be implemented like that. Instead, it will be
implemented in a thread-safe way that makes your example entirely
redundant.
 
J

Joshua Maurice

On 11/12/2010 03:12, Joshua Maurice wrote:
I am aware of double checked locking pattern yes and this is not the
double checked locking pattern (there is only one check of the pointer
if you look).  If a pointer read/write is atomic is should be fine (on
the implementation I use it is at least).

You've hidden the second check with the static keyword.

Example: Consider:

  SomeType& foo()
  {
    static SomeType foo;
    return foo;
  }

For a C++03 implementation, it's likely implemented with something
like:

  SomeType& foo()
  {
    static bool b = false; /*done before any runtime execution, stored
in the executable image */
    static char alignedStorage[sizeof(SomeType)]; /*with some magic
for alignment */
    if ( ! b)
      new (alignedStorage) SomeType();
    return * reinterpret_cast<SomeType*>(alignedStorage);
  }

That's your double check.

For C++0x, it will not be implemented like that. Instead, it will be
implemented in a thread-safe way that makes your example entirely
redundant.

Err, that should be:

SomeType& foo()
{
static bool b = false; /*done before any runtime execution, stored
in the executable image */
static char alignedStorage[sizeof(SomeType)]; /*with some magic
for alignment */
if ( ! b)
{
new (alignedStorage) SomeType();
b = true;
}
return * reinterpret_cast<SomeType*>(alignedStorage);
}
 
J

Joshua Maurice

The problem with the traditional double checked locking pattern is twofold:

1) The "checks" are straight pointer comparisons and for the second
check the pointer may not be re-read after the first check due to
compiler optimization.
2) The initialization of the pointer may be re-ordered by the CPU to
happen before the initialization of the singleton object is complete.

I think you are confusing the checking issue.  I am acquiring a lock
before this hidden check of which you speak is made and this check is
not the same as the initial fast pointer check so issue 1 is not a problem.

As far as issue 2 is concerned my version (on VC++ at least) is solved
via my lock primitive which should emit a barrier on RAII construction
and destruction and cause VC++ *compiler* to not re-order stores across
a library I/O call (if I am wrong about this a liberal sprinkling of
volatile would solve it).

I should have stated in the original post that my solution is not
portable as-is but it is a solution for a particular implementation
(which doesn't preclude porting to other implementations). :)

First, you are incorrect about the issues of double checked locking,
and threading in general. I again suggest that you read:
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
You are ignoring hardware cache issues and single pipeline reordering
issues.

Also, it may work on the current x86_32 processor, but it probably (?)
won't work on windows on all current available and future hardware.
 
C

Chris M. Thomasson

[...]
Hmm, I think I see why I might need the first barrier: is it due to loads
being made from the singleton object before the pointer check causing
problems for *clients* of the function? any threading experts care to
explain?

http://lwn.net/Articles/5159

http://mirror.linux.org.au/linux-mandocs/2.6.4-cset-20040312_2111/read_barrier_depends.html

http://groups.google.com/group/comp.lang.c++.moderated/msg/e500c3b8b6254f35

Basically, the only architecture out there which requires a data-dependant
acquire barrier after the initial atomic load of the shared instance pointer
is a DEC Alpha...


BTW, there is a way to provide DCL along with allowing for the destruction
of the singleton. The simplest method is to have the singleton return an
smart pointer that has the strong-thread safety guarantee. Here is one such
pointer implementation:

http://atomic-ptr-plus.sourceforge.net


Here is another experimental one:

http://webpages.charter.net/appcore/vzoom/refcount



Keep in mind that `boost/std::smart_ptr' does not provide strong-thread
safety guarantee...
 
I

Ian Collins

I think of cowboy as being positive. Cowboys were/are hard-working,
practical guys.

Google "cowboy builder" for the interpretation from the other side of
the pond!
 
C

Chris M. Thomasson

Leigh Johnston said:
[...]
Hmm, I think I see why I might need the first barrier: is it due to
loads
being made from the singleton object before the pointer check causing
problems for *clients* of the function? any threading experts care to
explain?

http://lwn.net/Articles/5159

http://mirror.linux.org.au/linux-mandocs/2.6.4-cset-20040312_2111/read_barrier_depends.html

http://groups.google.com/group/comp.lang.c++.moderated/msg/e500c3b8b6254f35

Basically, the only architecture out there which requires a
data-dependant
acquire barrier after the initial atomic load of the shared instance
pointer
is a DEC Alpha...
[...]

Thanks, so in summary my version should work on my implementation
(IA-32/VC++) and probably would work on other implementations except DEC
Alpha for which an extra barrier would be required.

Are you referring to this one:

http://groups.google.com/group/comp.lang.c++/msg/547148077c2245e2


I believe I have kind of solved the problem. I definitely see what you are
doing here and agree that you can get a sort of "portable" acquire/release
memory barriers by using locks. However, the lock portion of a mutex only
has to contain an acquire barrier, which happens to be the _wrong_ type for
producing an object. I am referring to the following snippet of your code:

<Leigh Johnston thread-safe version of Meyers singleton>
___________________________________________________________
static T& instance()
{
00: if (sInstancePtr != 0)
01: return static_cast<T&>(*sInstancePtr);
02: { // locked scope
03: lib::lock lock1(sLock);
04: static T sInstance;
05: { // locked scope
06: lib::lock lock2(sLock); // second lock should emit memory
barrier here
07: sInstancePtr = &sInstance;
08: }
09: }
10: return static_cast<T&>(*sInstancePtr);
}
___________________________________________________________




Line `06' does not produce the correct memory barrier. Instead, you can try
something like this:


<pseudo-code and exception saftey aside for a moment>
___________________________________________________________
struct thread
{
pthread_mutex_t m_acquire;
pthread_mutex_t m_release;


virtual void user_thread_entry() = 0;


static void thread_entry_stub(void* x)
{
thread* const self = static_cast<thread*>(x);
pthread_mutex_lock(&m_release);
self->user_thread_entry();
pthread_mutex_unlock(&m_release);
}
};


template<typename T>
T& meyers_singleton()
{
static T* g_global = NULL;
T* local = ATOMIC_LOAD_DEPENDS(&g_global);

if (! local)
{
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
thread* const self_thread = pthread_get_specific(...);

pthread_mutex_lock(&g_mutex);
00: static T g_instance;

// simulated memory release barrier
01: pthread_mutex_lock(&self_thread->m_acquire);
02: pthread_mutex_unlock(&self_thread->m_release);
03: pthread_mutex_lock(&self_thread->m_release);
04: pthread_mutex_unlock(&self_thread->m_acquire);

// atomically produce the object
05: ATOMIC_STORE_NAKED(&g_global, &g_instance);
06: local = &g_instance;

pthread_mutex_unlock(&g_mutex);
}

return local;
}
___________________________________________________________


The code "should work" under very many existing POSIX implementations. Here
is why...


- The implied release barrier contained in line `02' cannot rise above line
`01'.

- The implied release barrier in line `02' cannot sink below line `03'.

- Line `04' cannot rise above line `03'.

- Line '03' cannot sink below line `04'.

- Line `00' cannot sink below line `02'.

- Lines `05, 06' cannot rise above line `03'.


Therefore the implied release barrier contained in line `02' will always
execute _after_ line `00' and _before_ lines `05, 06'.




Keep in mind that there are some fairly clever mutex implementations that do
not necessarily have to execute any memory barriers for a lock/unlock pair.
Think exotic asymmetric mutex impl. I have not seen any in POSIX
implementations yet, but I have seen them used for implementing internals of
a Java VM...


[...]
 
C

Chris M. Thomasson

Leigh Johnston said:
On 11/12/2010 18:47, Chris M. Thomasson wrote: [...]

Thanks for the info. At the moment I am only concerned with IA-32/VC++
implementation which should be safe.

FWIW, an atomic store on IA-32 has implied release memory barrier semantics.
Also, an atomic load has implied acquire semantics. All LOCK'ed atomic RMW
operations basically have implied full memory barrier semantics:

http://www.intel.com/Assets/PDF/manual/253668.pdf
(read chapter 8)



Also, latest VC++ provides acquire/release for volatile load/store
respectively:

http://msdn.microsoft.com/en-us/library/12a04hfd(v=VS.100).aspx
(read all)



So, even if you port over to VC++ for X-BOX (e.g., PowerPC), you will get
correct behavior as well.


Therefore, I don't think you even need the second lock at all. If you are
using VC++ you can get away with marking the global instance pointer
variable as being volatile. This will give release semantics when you store
to it, and acquire when you load from it on Windows or X-BOX, Itanium...


[...]
 
J

Joshua Maurice

On 11/12/2010 18:47, Chris M. Thomasson wrote: [...]

Thanks for the info.  At the moment I am only concerned with IA-32/VC++
implementation which should be safe.

FWIW, an atomic store on IA-32 has implied release memory barrier semantics.
Also, an atomic load has implied acquire semantics. All LOCK'ed atomic RMW
operations basically have implied full memory barrier semantics:

http://www.intel.com/Assets/PDF/manual/253668.pdf
(read chapter 8)

Also, latest VC++ provides acquire/release for volatile load/store
respectively:

http://msdn.microsoft.com/en-us/library/12a04hfd(v=VS.100).aspx
(read all)

So, even if you port over to VC++ for X-BOX (e.g., PowerPC), you will get
correct behavior as well.

Therefore, I don't think you even need the second lock at all. If you are
using VC++ you can get away with marking the global instance pointer
variable as being volatile. This will give release semantics when you store
to it, and acquire when you load from it on Windows or X-BOX, Itanium...

Or, you know, you could just do it "the right way" the first time and
put in all of the correct memory barriers to avoid undefined behavior
according to the C++ standard in order to well, avoid, undefined
behavior. It's not like it will actually put in a useless no-op when
using the appropriate C++0x atomics as a normal load on that
architecture apparently has all of the desired semantics. Why write
unportable code which you have to read arch manuals to prove its
correctness when you can write portable code which you can prove its
correctness from the much simpler C++0x standard?

Moreover, are you ready to say that you can foresee all possible
compiler, linker, hardware, etc., optimizations in the future which
might not exist yet, and you know that they won't break the code?
Sure, the resultant assembly output is correct at the moment according
to the x86 assembly docs, but that is no guarantee that the C++
compiler will produce that correct assembly in the future. It could
implement cool optimizations that would break the /already broken/ C++
code. This is why you write to the appropriate standard. When in C++
land, write to the C++ standard.

I strongly disagree with your implications Chris that Leigh is using
good practice with his threading nonsense non-portable hacks,
especially if/when C++0x comes out and is well supported.

PS: If you are implementing a portable threading library, then
eventually someone has to use the non-portable hardware specifics.
However, only that person / library should have to, not the writer of
what should be portable general purpose code.

PPS: volatile has no place in portable code as a threading primitive
in C or C++. None. It never has. Please stop perpetuating this myth.
 
Ö

Öö Tiib

[...]

How there can be such an heated discussion about singleton anti-
pattern? If several singletons get constructed during concurrent
grabbing of instance() then destroy them until there remains one. Or
better none. No harm made since singletons are trash anyway.
 
J

Joshua Maurice

[...]

How there can be such an heated discussion about singleton anti-
pattern? If several singletons get constructed during concurrent
grabbing of instance() then destroy them until there remains one. Or
better none. No harm made since singletons are trash anyway.

I haven't been talking about singletons. I've been having a heated
discussion over incorrect, or at least non-portable bad-style,
threading code.
 
C

Chris M. Thomasson

Chris M. Thomasson said:
template<typename T>
T& meyers_singleton()
{
static T* g_global = NULL;
T* local = ATOMIC_LOAD_DEPENDS(&g_global);

if (! local)
{
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
thread* const self_thread = pthread_get_specific(...);

pthread_mutex_lock(&g_mutex);
00: static T g_instance;

// simulated memory release barrier
01: pthread_mutex_lock(&self_thread->m_acquire);
02: pthread_mutex_unlock(&self_thread->m_release);
03: pthread_mutex_lock(&self_thread->m_release);
04: pthread_mutex_unlock(&self_thread->m_acquire);


here is a simplification:


// simulated memory release barrier
pthread_mutex_unlock(&self_thread->m_release);
pthread_mutex_lock(&self_thread->m_release);


No code below the lock can rise above it, and no code above the unlock can
sink below it.

This will produce proper release barrier on many existing POSIX
implementations for a plurality of architectures. I cannot think of a
PThread mutex implementation that gets around all explicit membars in the
internal `pthread_mutex_lock/unlock' code.


I do know of some Java proposals...
 
J

Joshua Maurice

On 11/12/2010 18:47, Chris M. Thomasson wrote:
[...]
How there can be such an heated discussion about singleton anti-
pattern? If several singletons get constructed during concurrent
grabbing of instance() then destroy them until there remains one. Or
better none. No harm made since singletons are trash anyway.
I haven't been talking about singletons. I've been having a heated
discussion over incorrect, or at least non-portable bad-style,
threading code.

Bullshit.  I never said it was portable.  It works for me on the
implementation I use.  What is bad-style is Kanze's leaky, broken
(multiple TU) singleton method.  What is bad-style is designing classes
without any consideration of object destruction.

One can instantiate multiple "Meyers Singletons" before creating any
threads to avoid any singleton related threading code.  No object
destruction problems.

You continue to assert that a memory leak is bad a priori. The rest of
us in this thread disagree with that claim. Instead, we program to
tangible requirements, such as cost, time to market, meets (business)
use case. We also keep in mind less tangible but still important
considerations, like maintainability and reusability.

"No memory leaks" in the sense you're using has never been a
(business) use case for any of my projects. However, arguably, you
have a good point with regards to maintainability and reusability
w.r.t. the use case of repeated loading and unloading of the same DLL
in the same process.

With that out of the way, let's go back to the threading issue. I
never claimed that you claimed that it was portable. I said the code
is very badly written because writing the threading code the correct
way:
1- carries no additional runtime cost,
2- has less time to code (as you don't need to whip out an x86
assembly manual),
3- let's others easily verify your code's correctness (as it doesn't
require them to whip out an x86 assembly manual),
4- better guarantees that future compiler or hardware upgrades won't
break your program,
5- gives portability,
6- and finally makes you look like not someone who writes bad code.
 
C

Chris M. Thomasson

[...]
On 11/12/2010 18:47, Chris M. Thomasson wrote:
[...]
FWIW, an atomic store on IA-32 has implied release memory barrier
semantics.
Also, an atomic load has implied acquire semantics. All LOCK'ed atomic
RMW
operations basically have implied full memory barrier semantics:
[...]
Also, latest VC++ provides acquire/release for volatile load/store
respectively:

http://msdn.microsoft.com/en-us/library/12a04hfd(v=VS.100).aspx
(read all)

So, even if you port over to VC++ for X-BOX (e.g., PowerPC), you will
get
correct behavior as well.

Therefore, I don't think you even need the second lock at all. If you
are
using VC++ you can get away with marking the global instance pointer
variable as being volatile. This will give release semantics when you
store
to it, and acquire when you load from it on Windows or X-BOX, Itanium...
Or, you know, you could just do it "the right way" the first time and
put in all of the correct memory barriers to avoid undefined behavior
according to the C++ standard in order to well, avoid, undefined
behavior.

Using C++ with multiple threads is already undefined behavior by default
because the C++ standard does not know anything about threading. BTW,
besides using C and 100% pure POSIX Threads, what exactlyis "the right
way" to implement advanced synchronization primitives?

I personally choose to implement all of my sensitive concurrent algorithms
in assembly language. I try to contain the implementation code in externally
assembled object files that provide an external common C API. This is
"fairly safe", because "most" compilers treat calls to unknown external
functions as a sort of implicit "compiler barrier". However, aggressive
link-time optimizations have the ability to mess things up. Luckily, most
compilers that provide such features also have a way to turn them on, or
off.

Yes its like juggling chainsaws! Hopefully, the upcoming C++ standard is
going to keep the chance of decapitation down to a minimum...

;^)



It's not like it will actually put in a useless no-op when
using the appropriate C++0x atomics as a normal load on that
architecture apparently has all of the desired semantics. Why write
unportable code which you have to read arch manuals to prove its
correctness when you can write portable code which you can prove its
correctness from the much simpler C++0x standard?

What C++0x standard? Is it totally completed yet? How many compilers support
all of the functionality?

Well, I need to write code that works now. Unfortunately, this means I have
to implement non-portable architecture/platform specific code and abstract
it away under a common API.

C++0x is going to make things oh SO much easier!

:^D



Moreover, are you ready to say that you can foresee all possible
compiler, linker, hardware, etc., optimizations in the future which
might not exist yet, and you know that they won't break the code?

Hell no. However, I can document that my code works on existing compilers
and architecture combinations, and add all the caveats. For instance, I have
to document that link-time optimizations should probably be turned off for
any code that makes use of my synchronization primitives.



Sure, the resultant assembly output is correct at the moment according
to the x86 assembly docs, but that is no guarantee that the C++
compiler will produce that correct assembly in the future. It could
implement cool optimizations that would break the /already broken/ C++
code. This is why you write to the appropriate standard. When in C++
land, write to the C++ standard.

There is no threading in C++ standard. Heck, I think its even undefined
behavior to use POSIX Threads with C++; does a thread cancellation run
dtors?



I strongly disagree with your implications Chris that Leigh is using
good practice with his threading nonsense non-portable hacks,
especially if/when C++0x comes out and is well supported.

If Leigh sticks with the most current MSVC++ versions, he should be just
fine. That compiler happens to add extra semantics to `volatile'. So, if you
only use MSVC++, then you CAN use `volatile' for implementing threading
constructs. And the code will be portable to any architecture that the most
current version of MSVC++ happens to support, IA-32/64, PowerPC, and
Itanium, to name a few... For instance, MSVC++ does not emit any memory
barriers for volatile loads/stores on IA-32/64. However, it MUST emit the
proper barriers on PowerPC and Itanium.



PS: If you are implementing a portable threading library, then
eventually someone has to use the non-portable hardware specifics.
However, only that person / library should have to, not the writer of
what should be portable general purpose code.

Totally agreed! I have to write architecture/platform specific code and
abstract it away. The users only need to work with the _API_. No need to
worry about how it's actually implemented under the covers. You should
probably read the following post, and the entire thread when you get the
time:

http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/29ea516c5581240e



FWIW, here is an example code for an atomic reference counted pointer with
strong-thread safety guarantee:


http://webpages.charter.net/appcore/vzoom/refcount/


I can port this to basically any IA-32 based operation system that has a GCC
compiler. If I want to port it to PPC, well, I need to create a code in PPC
assembly language, make it adhere to the common API, and BAM! We are in
business. For some reason this website comes to mind:


http://predef.sourceforge.net/


That information is oh so EXTREMELY __valuable__ to me!!!



PPS: volatile has no place in portable code as a threading primitive
in C or C++. None. It never has. Please stop perpetuating this myth.

It definitely has a place in most recent versions of MSVC++.


Basically, the only way that you can get 100% portable threading right now
is to use PThreads and compilers that comply with POSIX standard. Read this
for more info on how a compiler, GCC of all things, can break POSIX code:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/63f6360d939612b3

Please read the whole thread; David Butenhof has some very interesting posts
in there!


:^)
 
C

Chris M. Thomasson

Chris M. Thomasson said:
news:784150c0-0156-4685-9aa7-66559383960e@a28g2000prb.googlegroups.com... [...]
PS: If you are implementing a portable threading library, then
eventually someone has to use the non-portable hardware specifics.
However, only that person / library should have to, not the writer of
what should be portable general purpose code.

Totally agreed! I have to write architecture/platform specific code and
abstract it away. The users only need to work with the _API_. No need to
worry about how it's actually implemented under the covers. You should
probably read the following post, and the entire thread when you get the
time:

http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/29ea516c5581240e

BTW, I was posting as `SenderX' at the time...

;^)

[...]
 
J

Joshua Maurice

As Chris pointed out the only problem with my version compared to the
version given in document by Meyers and Alexandrescu that you seem so
fond of is the lack of a memory barrier after the initial fast check but
this is only a problem for a minimal number of CPUs as the load is
dependent.  If I had to port my code to run on such CPUs I simply have
to add this extra barrier.

You didn't address one of my bigger concerns. If/when C++0x becomes in
wide use, it's possible that compilers will start optimizing more
aggressively, and without the memory barrier, it's possible that it
could break. You have shown that the current C++ undefined behavior
produces correct assembly. That's one possible outcome of undefined
behavior. A compiler upgrade could break it.

Also, if you consider it acceptable to write broken non-portable code
when portable code is better in every respect, good for you.
In the real world people write non-portable code all the time as doing
so is not "incorrect".

They don't in my workplace.
 
C

Chris M. Thomasson

[...]
You didn't address one of my bigger concerns. If/when C++0x becomes in
wide use, it's possible that compilers will start optimizing more
aggressively, and without the memory barrier, it's possible that it
could break.

How would you add the memory barrier now without resulting to an abstraction
that is based on 100% non-portable architecture specific code? What I would
personally do is create a low-level standard interface for each arch, say
something like:

extern void ia32_membar_acquire_depends(void);
extern void ia64_membar_acquire_depends(void);
extern void ppc_membar_acquire_depends(void);
extern void alpha_membar_acquire_depends(void);



The functions above would each be implemented in architecture specific
assembly language, and externally assembled into separate object files. Only
`alpha_membar_acquire_depends()' would contain the actual memory barrier.
All the others would boil down to NOPS. Then, I would use compile-time
pre-defined pre-processor macros to select the correct version, perhaps
something like:

#if defined (ARCH_DEC_ALPHA)
# defined membar_acquire_depends alpha_membar_acquire_depends
#elif defined (ARCH_IA32)
# defined membar_acquire_depends ia32_membar_acquire_depends
#elif defined (ARCH_...)

/* blah, blah */

#else
# error "Sorry, my library is not compatible with this environment!"
#endif


This is why the information contained within the wonderful:

http://predef.sourceforge.net

project is so darn important to me.



You have shown that the current C++ undefined behavior
produces correct assembly. That's one possible outcome of undefined
behavior. A compiler upgrade could break it.

Link-time optimization can possibly break it. Also, some compilers require
you to explicitly insert compiler barriers in order for them to not re-order
code. However, luckily, link-time optimization aside for a moment... Most
compilers will indeed treat calls to completely unknown externally assembled
functions as "compiler-barriers".

However, AFAICT, Leigh Johnston is only using MSVC++. This compiler has
explicit documentation that gives certain guarantees to `volatile'. It has
compiler intrinsic for compiler barriers, memory barriers, atomic
operations, ect... So, if he sticks with that specific compiler, then
everything should be just fine.
 
C

Chris M. Thomasson

[...]
Basically, the only way that you can get 100% portable threading right now
is to use PThreads and compilers that comply with POSIX standard. Read
this
for more info on how a compiler, GCC of all things, can break POSIX code:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/63f6360d939612b3

Please read the whole thread; David Butenhof has some very interesting
posts
in there!

Here is one of those posts:

http://groups.google.com/group/comp.programming.threads/msg/729f412608a8570d
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,144
Messages
2,570,823
Members
47,369
Latest member
FTMZ

Latest Threads

Top