Ok. Where to start. Let me just type out what's going through my head
here, and maybe people can add.
The basic idea is that we have some threads doing work as part of a
"job". An external thread has access to the job object, and it can
issue a stop request on the job. The threads doing the job need to
periodically check some global-like stop flag, and if it's set, then
they need to do the stop - leave whatever data in a consistent state,
release any resources, and have all threads actually end.
A thread could not end from being blocked. In this case, I want to
distinguish between two different kinds of blocking - "temporarily"
blocking on an external thing like a disk read, and blocking on
internal logic like on a condition variable. At this point in time, I
don't think any of us really care to solve the first problem of a
short, but blocking, system call like a file read. It'll finish up
relatively quickly no matter what we do, and trying to end it early is
way too much work for us at the moment. The blocking we care about is
inside userland where we might block indefinitely, waiting for some
other userland action to occur which might never occur because that
thread noticed the stop flag and died before completing that action.
So, I would guess we have two basic options to handle this. One, we
can make all userland logic blocking use timeouts. Every single
relevant condition variable wait, etc., would be on a timeout, and
after each timeout the stop flag is checked. This is quite tedious and
error prone, and more importantly it has bad performance
characteristics - the bigger the number the slower the response time,
but a small number means more busy waiting. Ideally, we would like to
avoid the rule of every blocking wait must use a timeout.
So, that leaves plan two. When a thread blocks on something in a sane
program, it expects to be woken up again. The problem comes when the
guy who was supposed to wake it up noticed the stop flag and died
early without doing whatever is required to wake up the blocked
thread. This is about as far as I got, but I was thinking about basic
interfaces, like queues, and I was wondering to myself if there was
some abstraction which could use RAII to ensure that we cannot die
without waking up the other side of the queue. This is where I'm
currently stuck. There's too many idioms and patterns for inter-thread
communication for me to be able to armchair reason about. The goal I
had was that correct usage in the non-stop case implies correct usage
for the stop case, but I don't really see how this could be
accomplished offhand.
Ok. To continue. We're concerned about stuff like the following case:
Lock lock(mutex);
while ( ! condition())
{ //no timeout
wait(condition_variable, mutex);
}
In this case, on the normal execution path for a correct program,
there is another thread which will acquire mutex, change condition to
be true, and signal condition_variable. The problem is that on the
stop request execution path, that other thread may not change the
condition to true and signal that condition_variable. Hell, during a
stop request, the producer thread may not even make it to the queue
loop - the creating thread may have just noticed the stop request and
never made the producer thread, or the producer thread noticed the
stop flag before it made it to the queue loop, etc.
I think we can all agree that, give no timeout waits, the consumer
needs to be changed to look like:
Lock lock(mutex);
for (;
{ if (job_is_stopped())
throw stop_exception();
if (condition())
break;
wait(condition_variable, mutex);
}
For a concurrent_queue, you would need to change the consumer to check
the stop flag before processing what might be a sentry value sent by
another thread to get it to wake up to notice the stop flag.
for (;
{ Work work = queue.atomicBlockingTopAndPop();
if (job_is_stopped())
throw stop_exception();
if (is_no_more_work_sentry(work))
break;
do_work(work);
}
For a semaphore, the code "if (job_is_stopped()) throw
stop_exception();" should be sprinkled about liberally.
Presumably something similar must be done for the blocking side of any
other blocking inter-thread communication abstraction.
What is important to me is where is the code which wakes up these
blocking consumers so the blocking consumer can check the stop flag?
We can discuss whether we can sentry values for the queue, or for it
to return a pair so pair.second can signal a sentry value, or add
functions "stop", or whatever variation. The hard part is ensuring
that whatever scheme you used it called on every possible stop path.
For this, I think we need to think about this in a different way. We
need to define a new responsibility like object ownership.
For the queue example, at some point the queue is made, and then it is
published in a safe way to the producers and blocking consumers. After
this handoff process, consumer threads start blocking on the queue, so
during this handoff process, someone must gain the additional
responsibility to wake up all consumer threads during a stop. I see
this very much akin to object ownership, and I do want to use RAII to
solve but. However, I don't know who should gain this responsibility,
or where it should be. Those answers depend greatly on how the
producers and consumers are started, how the queue is safely published
to them, etc. The question then becomes "What is good practice
concerning this?", "Any good idioms or patterns?", and "Is this
feasible and maintainable, or are we going to get some indefinitely
blocked threads sometimes when we stop a job?".
PS: I'm still thinking about how non-sequentially-consistent atomics
play into this. At the very least, let's consider the
condition_variable case:
Lock lock(mutex);
for (;
{ if (job_is_stopped())
throw stop_exception();
if (condition())
break;
wait(condition_variable, mutex);
}
If that stop flag does not have all writes and reads sequentially
consistent, then it's possible for my above idea to fail.
Specifically, there is a thread X which has responsibility to wake up
the blocked consumer on a stop. Suppose that thread notices a stop,
then signals that condition variable. Could that consumer thread wake
up and not notice the stop flag? From my (somewhat poor)
understanding, yes. It could go right back to sleep. In fact, it could
require an arbitrary number of broadcasts when the stop flag does not
have sequential consistency. Non-sequential-consistency makes this
even harder to reason about, and I'm not sure I know anything useful
at this point. I just know enough to know I'm screwed.