std::deque Thread Safety Situation

NvrBst

No - you are not correct. Imagine a situation where the pushing thread
pushes the object to the deque and then increases the counter.
However, the size gets written through the cache whereas the object
stays (partly) in the cache. Now the reader will see that there is an
object in the deque, but in reality it was never written to a
location that allows the other thread to see it (it is on another
core), so it reads rubbish where the object should be.

I don't quite understand what you say ("on another core" part), but
I'm pretty sure it is safe. The pop'ing thread is only taking the
first element, and the push'ing thread is appending to the end.

The size isn't incremented until after the object is written, so when
the popping thread reads 1 element, that element definitely exists (it
can't be in cache and not already written to the internal array).
Furthermore, there is a "version" number incremented internally
whenever a change is made; this forces the cache to re-validate an
element. If the size reads 0 then there can't be any problem, and if
the size reads 2+ then there can't be any problem with the first
element, so the size reading "1" should be the only possible error
case; and, as stated, the item is available before the size is 1, and
the version is incremented as well.

There's no way the popping thread can attempt to access the 1st element
when it actually isn't available, because it'd read the size as "0". I
think what you were talking about is that an old cached _array[0] was
there that was popped; however, the "push thread" added a new element,
incremented size, then the pop thread took the cached "_array[0]". Again,
this situation can't happen because the version is incremented and any
request for _array[0] would be re-validated when it sees the version
doesn't match.

There might be a possible problem if the capacity is reached and it
attempts to automatically grow the internal array; however, as long as
your push thread checks capacity before adding, there shouldn't be a
problem.
 
James Kanze

I don't quite understand what you say ("on another core"
part), but I'm pretty sure it is safe. The pop'ing thread is
only taking the first element, and the push'ing thread is
appending to the end.

I don't have the original code handy to verify, but what you
describe above is only safe if all accesses are synchronized
somehow.
The size isn't incremented until after the object is written,

In the thread which is pushing. Unless specific precautions are
taken (external synchronization at the system level, or fence or
membar instructions), different threads are not guaranteed to
see memory writes in the same order.
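As an illustration of that point (a hypothetical single-producer/single-consumer sketch, not code from this thread, written with C++11 atomics for brevity; a mutex gives the same guarantee), compare a plain size counter, which provides no ordering at all, with a release store paired with an acquire load:

    #include <atomic>
    #include <cstddef>

    // Hypothetical fixed-capacity container: one pushing thread, one popping thread.
    struct NaiveQueue {
        int data[16];
        std::size_t size = 0;              // plain variable: no ordering guarantee

        void push(int v) {
            data[size] = v;
            size = size + 1;               // another core may observe this store
                                           // before it observes the data store
        }
    };

    struct OrderedQueue {
        int data[16];
        std::atomic<std::size_t> size{0};

        void push(int v) {
            data[size.load(std::memory_order_relaxed)] = v;
            // release: every write above becomes visible to a thread whose
            // acquire load observes the incremented value
            size.fetch_add(1, std::memory_order_release);
        }

        bool tryFront(int& out) {
            if (size.load(std::memory_order_acquire) == 0)
                return false;              // nothing published yet
            out = data[0];                 // safe: pairs with the release increment
            return true;
        }
    };

Neither sketch handles wrap-around or multiple producers; it only shows the visibility issue being described here.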
 
zaimoni

I don't quite understand what you say ("on another core" part), but
I'm pretty sure it is safe. The pop'ing thread is only taking the
first element, and the push'ing thread is appending to the end.

It's absolutely safe only if the std::deque has been specially
implemented so that each member function has "enough"
synchronization. This would be a vendor extension. Please accept my
apologies for my initial mistake here.

This plausibly would be via memory-fence instructions injected via
compiler extensions (either in source, or as a reaction to something
like the volatile attribute).
The size isn't incremented until after the object is written,

There is no guarantee that there is a single size field. The
temporary "inconsistent state" issue is worse with .size()
than .empty(), but it's theoretically still there.
so when
the popping thread reads 1 element that element definitely exists (it
can't be in cache and not already written to the internal array).

Even assuming that writes from cache to main memory are in time order
(which is not obvious, if not outright wrong), and that there is a size
field, there is no guarantee that the size field will be updated after
actually adding the element in cache.
 
gpderetta

That's not the case. You're assuming that a second thread will see data
changes in the order in which the first thread makes them, and that
isn't the case unless you intervene to synchronize the two threads. In a
multi-threaded application running on a single processor that
assumption is valid,

Barring compiler optimizations. Hardware is not the only layer that
can reorder reads and writes.

Such code is not safe even in a single-CPU scenario.
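A hedged illustration of that compiler-level hazard (hypothetical code, not taken from this thread): with plain variables the optimizer may keep the flag in a register and hoist its load out of the loop, so even on one core the reader can spin forever or read a stale payload. Declaring the flag std::atomic, or guarding both variables with a mutex, removes that freedom.

    // Hypothetical illustration of compiler reordering/hoisting with plain variables.
    bool ready = false;
    int  payload = 0;

    void writer() {
        payload = 42;
        ready = true;          // the compiler may also move this store ahead of payload
    }

    void reader() {
        while (!ready) {       // the load may be hoisted out of the loop entirely
            /* spin */
        }
        int x = payload;       // even if the loop exits, payload may still read as 0
        (void) x;
    }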
 
Jerry Coffin


[ ... ]
Avoid it, or wrap it? I use it regularly for communicating
between threads; my Queue class is based on it.

I'd generally avoid it. Wrapping it works, but presents trade offs I
rarely find useful.

The problem with std::deque is that it can and will resize itself when
you add an item and there's not currently room to store that item. Doing
so, however, normally involves allocating a block of memory from the
free store. As such, this operation can be fairly slow. That translates
to a fairly substantial amount of time that's being spent just in
storing the data rather than processing the data. It only really makes
sense when your work items are sufficiently coarse-grained that the
dynamic allocation will be substantially faster than processing a single
work item.

I generally prefer to keep the processing sufficiently fine grained that
when/if the queue is full, it's likely to be faster to wait for an
existing item to be processed than to allocate more memory for the
queue. This devotes nearly all CPU time to processing the data instead
of allocating memory to store it.

Now, my own work with multiple threads has mostly been fairly easy to
break up into relatively small "chunks", and has been mostly quite
processor intensive. Those mean that 1) giving up processor time means
the consumer side can run faster, and 2) the time it would take to
allocate memory for queue would exceed the time taken to remove (at
least) one item from the queue.

OTOH, I can certainly imagine situations where those weren't true. An
obvious example would be something like a network router. In this case,
the "consumer" side is basically sending out network packets. More CPU
time won't speed up network transmissions, and transmitting a single
network packet undoubtedly takes longer than allocating a block of
memory.
And if you replace buffer, in_pos and out_pos with
std::deque<T>, where's the problem.

There isn't necessarily a problem -- but I think the choice between the
two is one of some real trade offs rather than just a matter of whether
you manipulate the pointers into the queue yourself, or use some pre-
written code to do the job.

Of course, you can/could also use reserve() and size() on the deque to
manage it as a fixed-size container, but I don't think at that point
you're really gaining much by using a deque at all. OTOH, it might still
make sense when/if you had both fixed- and variable-sized queues, and
this allowed you to share most code between the two.
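A hedged sketch of that fixed-capacity idea (hypothetical names, not the actual code discussed here, using C++11 std::mutex and std::condition_variable for brevity; note that std::deque itself provides no reserve(), so the bound is enforced explicitly). The producer blocks when the queue is full, so back-pressure replaces reallocation:

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>

    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

        void send(T item) {
            std::unique_lock<std::mutex> lock(mutex_);
            // back-pressure: wait instead of letting the deque reallocate
            notFull_.wait(lock, [this] { return queue_.size() < capacity_; });
            queue_.push_back(std::move(item));
            notEmpty_.notify_one();
        }

        T receive() {
            std::unique_lock<std::mutex> lock(mutex_);
            notEmpty_.wait(lock, [this] { return !queue_.empty(); });
            T item = std::move(queue_.front());
            queue_.pop_front();
            notFull_.notify_one();
            return item;
        }

    private:
        std::size_t capacity_;
        std::mutex mutex_;
        std::condition_variable notFull_;
        std::condition_variable notEmpty_;
        std::deque<T> queue_;
    };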
 
zaimoni

I don't quite understand what you say ("on another core" part), but
I'm pretty sure it is safe. The pop'ing thread is only taking the
first element, and the push'ing thread is appending to the end.

You're assuming that std::deque has its own internal mutex (or other
compiler trickery reacting to volatile) so that each individual member
function call has its own lock/unlock against that mutex. This is not
required by the C++ standard, but it can be remedied by a template
wrapper class, at a noticeable cost in efficiency.

A quick check on Google suggests this is not true (MSDN Developer
Network search hit) even for a C# Queue; in C# you would have to use
the Queue.Synchronized method to obtain a wrapper that was moderately
thread-safe.

Without that hidden mutex, here is the problem with unsynchronized access
in C++ (with a size field as an implementation detail that is incremented
before actually doing the push; translate the calls to C# as required):
1. Core push-thread #1 cache: q.empty() | main RAM | Core pop-thread cache: processing former last record
2. Core push-thread #1 cache: 1==q.size(), main queue inconsistent | main RAM: q.empty() | Core pop-thread cache: processing former last record
3. Core push-thread #1 cache: 1==q.size(), main queue inconsistent | main RAM: 1==q.size(), main queue inconsistent | Core pop-thread cache: processing former last record
4. Core push-thread #1 cache: 1==q.size(), main queue inconsistent | main RAM: 1==q.size(), main queue inconsistent | Core pop-thread cache: wants to test !q.empty(), chooses to refresh cache from main RAM
5. Core push-thread #1 cache: 1==q.size(), main queue inconsistent | main RAM: 1==q.size(), main queue inconsistent | Core pop-thread cache: 1==q.size(), main queue inconsistent; !q.empty(), so it invokes q.front() -> UNDEFINED BEHAVIOR

Worst-case scenario leaves the queue in a permanently inconsistent
state from the push and pop both trying to execute at once. All of
the above should translate into C# fairly directly for an
unsynchronized Queue.

Without an internal-detail size field, 0 < q.size() will have undefined
behavior in the pop-thread (from the inconsistent state). !q.empty()
might be well-defined even then.

With a hidden mutex providing per-method synchronization, the sole pop-
thread is forced to wait until the push-thread has completed, and testing
q.empty() / 0 < q.size() works, as long as there is exactly one
pop-thread, with no explicit critical sections needed at all.
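A minimal sketch of the per-method-locking wrapper mentioned above (hypothetical names, C++11 std::mutex used for brevity; a Boost or platform mutex works the same way). Note that it only makes individual calls atomic: a separate empty()-then-front() sequence is still a race, which is why the wrapper also exposes a combined try_pop_front():

    #include <deque>
    #include <mutex>

    template <typename T>
    class LockedDeque {
    public:
        void push_back(const T& v) {
            std::lock_guard<std::mutex> lock(mutex_);
            q_.push_back(v);
        }

        bool empty() const {
            std::lock_guard<std::mutex> lock(mutex_);
            return q_.empty();
        }

        // Combined test-and-pop: calling empty() and then front() as two
        // separate (individually locked) calls would still be a race.
        bool try_pop_front(T& out) {
            std::lock_guard<std::mutex> lock(mutex_);
            if (q_.empty())
                return false;
            out = q_.front();
            q_.pop_front();
            return true;
        }

    private:
        mutable std::mutex mutex_;   // mutable so empty() can stay const
        std::deque<T> q_;
    };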
 
James Kanze

It's absolutely safe only if the std::deque has been specially
implemented so that each member function has "enough"
synchronization.

I don't think that that's possible, given the interface.
 
James Kanze

[ ... ]
Avoid it, or wrap it? I use it regularly for communicating
between threads; my Queue class is based on it.
I'd generally avoid it. Wrapping it works, but presents trade
offs I rarely find useful.

I find the trade off of using tried and tested code, rather than
having to write it myself, useful. (Of course, if you already
had your basic queue working before STL came along...)
The problem with std::deque is that it can and will resize
itself when you add an item and there's not currently room to
store that item. Doing so, however, normally involves
allocating a block of memory from the free store. As such,
this operation can be fairly slow. That translates to a fairly
substantial amount of time that's being spent just in storing
the data rather than processing the data. It only really makes
sense when your work items are sufficiently coarse-grained
that the dynamic allocation will be substantially faster than
processing a single work item.

I suppose that that could be an issue in some cases. For the
most part, my queues never contain very many entries at one
time, although there is a lot of pushing and popping, and the
applications run for very long times (sometimes years without
stopping). Which means that the queue effectively reaches its
largest size very quickly, and then there is no more dynamic
allocation from it. (On the other hand, my messages tend to be
polymorphic, and dynamically allocated.)
I generally prefer to keep the processing sufficiently fine
grained that when/if the queue is full, it's likely to be
faster to wait for an existing item to be processed than to
allocate more memory for the queue. This devotes nearly all
CPU time to processing the data instead of allocating memory
to store it.

I'll admit that I've never had any performance problems with
deque in this regard, although I can easily imagine applications
where it might be the case.
Now, my own work with multiple threads has mostly been fairly
easy to break up into relatively small "chunks", and has been
mostly quite processor intensive. Those mean that 1) giving up
processor time means the consumer side can run faster, and 2)
the time it would take to allocate memory for queue would
exceed the time taken to remove (at least) one item from the
queue.
OTOH, I can certainly imagine situations where those weren't
true. An obvious example would be something like a network
router. In this case, the "consumer" side is basically sending
out network packets. More CPU time won't speed up network
transmissions, and transmitting a single network packet
undoubtedly takes longer than allocating a block of memory.

Yes. One of the advantages of forwarding to a queue, and
letting another process handle it, is that you can get back to
listening at the socket that much faster. And since the system
buffers tend to have a fairly small, fixed size...
There isn't necessarily a problem -- but I think the choice
between the two is one of some real trade offs rather than
just a matter of whether you manipulate the pointers into the
queue yourself, or use some pre-written code to do the job.
Of course, you can/could also use reserve() and size() on the
deque to manage it as a fixed-size container, but I don't
think at that point you're really gaining much by using a
deque at all. OTOH, it might still make sense when/if you had
both fixed- and variable-sized queues, and this allowed you to
share most code between the two.

In the end, the main gain was that I didn't have to write any
code to manage the queue. And that other people, maintaining
the software, didn't have to understand code that I'd written.
 
NvrBst

That's not the case. You're assuming that a second thread will see data
changes in the order in which the first thread makes them, and that
isn't the case unless you intervene to synchronize the two threads. In a
multi-threaded application running on a single processor that
assumption is valid, but for general threading it isn't. When there are
multiple processors running multiple threads, each processor has its
own memory cache, and writes to one cache aren't automatically
coordinated with reads from another cache. Seeing a 1 for the size
doesn't mean that the contents of the element that was written have
also been transferred. You have to ensure that all newly written data
has been transferred to the cache for the processor that's reading the
data, and you do that with a mutex or a condition variable.

Let me be blunt: you're in over your head here. Java and C#  make it
look like it's easy to write multi-threaded applications, but they just
paper over the complexities and make it easy to write an application
that works just fine until it crashes, which might happen immediately,
and might happen six months from now. Writing fast, robust
multi-threaded applications requires top down design, and can't be done
merely with libraries.

Get a good book on multi-threading. Dave Butenhof's "Programming with
POSIX Threads" would be a good place to start. It's about POSIX, but
the concepts are pretty much the same everywhere.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)

I'm still fairly confident that it would be safe for the given
situation. Queue.Synchronized(Queue<T>) will synchronize a Queue
as said before, which is definitely needed for most multithreaded
situations; however, for this specific situation I still think it's
safe. Before, I was talking about a single processor, which I'm 100%
confident is fine, but I still think it's fine for multiple
processors. The only way it wouldn't be safe in a multi-processor
environment is if the push wrote to the CPU cache on all processors
when it adds an element, which I don't think it does; or if the other
CPU caches are somehow updated by the Queue's "Push". Maybe I am in
over my head, but the problem you're trying to tell me about is, I think:

Thread A) Queue is Empty, Pushes Element, Increments Version,
Increments Size, ??Updates 2nd CPU Cache?? ("B" starts before the "??")
Thread B) Reads Size as 1, Pops Element, but the CPU Cache for _array[0]
was cached previously, returning garbage; Increments Version, Decrements
Size.

What I assumed was that the CPU cache misses and falls through
to reading it from RAM (thus getting the correct element), or that .NET
would see the Queue version has changed and the CPU cache would just be
initially bypassed, or that there is a shared cache and each CPU cache
goes to the main cache (which is definitely updated in the above
situation) to make sure it's valid. I don't know for sure, but I
still believe it's safe for the given situation.

All this is .NET Framework, C#, Queue<T> - not C++ deque. My
computers are single-processor, which might be why I have never run
into the problem personally, but it will be something I keep in mind
if I do run into a crash I can't explain :) Thanks.
 
Jerry Coffin

[ ... ]
I find the trade off of using tried and tested code, rather than
having to write it myself, useful. (Of course, if you already
had your basic queue working before STL came along...)

The problem is that the part that's tried and tested is downright
trivial compared to what's not -- and at least from the looks of things
in this thread, using that tried and tested code makes it even more
difficult to get the hard part right, so I suspect it's a net loss.

...and yes, if memory serves, my queue code was originally written under
OS/2 1.2 or thereabouts (originally in C, which still shows to some
degree).

[ ... ]
I suppose that that could be an issue in some cases. For the
most part, my queues never contain very many entries at one
time, although there is a lot of pushing and popping, and the
applications run for very long times (sometimes years without
stopping). Which means that the queue effectively reaches its
largest size very quickly, and then there is no more dynamic
allocation from it.

In fairness, though I hadn't really thought about it previously, it's
likely that when I've profiled it, I've tended to use relatively short
runs, which will tend to place an undue emphasis on the initialization,
so it's likely to make less difference overall than it may have looked
like. In any case, I'd agree that it's rarely likely to be a huge factor.

[ ... ]
In the end, the main gain was that I didn't have to write any
code to manage the queue. And that other people, maintaining
the software, didn't have to understand code that I'd written.

I suppose it might make a difference, but I have a hard time believing
that anybody who could understand the semaphore part would have to pause
for even a moment to understand the queue part. Duplicating it without
any fence-post errors might take a little longer, but even that's still
not exactly rocket science.
 
James Kanze

[ ... ]
I find the trade off of using tried and tested code, rather
than having to write it myself, useful. (Of course, if you
already had your basic queue working before STL came
along...)
The problem is that the part that's tried and tested is
downright trivial compared to what's not -- and at least from
the looks of things in this thread, using that tried and
tested code makes it even more difficult to get the hard part
right, so I suspect it's a net loss.

I don't see where the actual implementation of the queue would
affect the rest, unless you're doing some tricky optimizations
to make it lock-free. (std::deque is obviously not lock-free; you
need locks around the accesses to it.) And your code didn't
seem to be trying to implement a lock-free queue.

On the other hand, since I work under Unix, my queues used a
condition rather than semaphores, so the issues are somewhat
different, but basically, all I did was copy the code for the
wait and send from Butenhof, and stick an std::deque in as the
queue itself. You can't get much simpler.
...and yes, if memory serves, my queue code was originally
written under OS/2 1.2 or thereabouts (originally in C, which
still shows to some degree).

Which explains why you find the implementation of the queue so
simple :). Most things are simple when you've been maintaining
them for so many years. My experience is that unless you
implement the queue trivially, using something like std::list,
getting the border conditions right isn't always that obvious
(but once you've got them right, the code is trivial).
[ ... ]
In the end, the main gain was that I didn't have to write any
code to manage the queue. And that other people, maintaining
the software, didn't have to understand code that I'd written.
I suppose it might make a difference, but I have a hard time
believing that anybody who could understand the semaphore part
would have to pause for even a moment to understand the queue
part. Duplicating it without any fence-post errors might take
a little longer, but even that's still not exactly rocket
science.

Well, my own implementation uses condition variables, rather
than semaphores. But it's really, really simple. FWIW
(modified to use Boost, rather than my home-built classes, and
with support for handling timeouts removed):

#include <cassert>
#include <deque>
#include <memory>

#include <boost/thread/condition_variable.hpp>
#include <boost/thread/mutex.hpp>

class QueueBase
{
public:
    QueueBase() ;
    ~QueueBase() ;

    void send( void* message ) ;
    void* receive() ;

    bool isEmpty() const ;

private:
    // mutable so that isEmpty() can lock it in a const member function
    mutable boost::mutex        mutex ;
    boost::condition_variable   cond ;
    std::deque< void* >         queue ;
} ;

template< typename T >
class Queue : public QueueBase
{
public:
    ~Queue()
    {
        // drain and delete any messages still queued
        while ( ! isEmpty() ) {
            receive() ;
        }
    }

    void send( std::auto_ptr< T > message )
    {
        QueueBase::send( message.get() ) ;
        message.release() ;
    }

    std::auto_ptr< T > receive()
    {
        return std::auto_ptr< T >(
            static_cast< T* >( QueueBase::receive() ) ) ;
    }
} ;

QueueBase::QueueBase()
{
}

QueueBase::~QueueBase()
{
    assert( queue.empty() ) ;
}

void
QueueBase::send( void* message )
{
    boost::unique_lock< boost::mutex > lock( mutex ) ;
    queue.push_back( message ) ;
    cond.notify_one() ;
}

void*
QueueBase::receive()
{
    boost::unique_lock< boost::mutex > lock( mutex ) ;
    while ( queue.empty() ) {
        cond.wait( lock ) ;
    }
    void* result = queue.front() ;
    queue.pop_front() ;
    return result ;
}

bool
QueueBase::isEmpty() const
{
    boost::unique_lock< boost::mutex > lock( mutex ) ;
    return queue.empty() ;
}
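A brief usage sketch for the class above (an added example, not part of the original post; Message, producer, and consumer are hypothetical names, and std::auto_ptr is kept to match the pre-C++11 style of the code), showing one producer thread and one consumer thread:

    #include <boost/thread/thread.hpp>
    #include <iostream>

    struct Message
    {
        explicit Message( int v ) : value( v ) {}
        int value ;
    } ;

    Queue< Message > queue ;

    void producer()
    {
        for ( int i = 0 ; i < 10 ; ++i ) {
            queue.send( std::auto_ptr< Message >( new Message( i ) ) ) ;
        }
    }

    void consumer()
    {
        for ( int i = 0 ; i < 10 ; ++i ) {
            std::auto_ptr< Message > m = queue.receive() ;
            std::cout << m->value << '\n' ;
        }
    }

    int main()
    {
        boost::thread c( consumer ) ;
        boost::thread p( producer ) ;
        p.join() ;
        c.join() ;
        return 0 ;
    }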
 
