I pulled this out from another thread since it seems to be a good topic:
there was a bunch of hypothesizing going on about the design of
safety/life-critical systems with regard to how errors ("exceptions",
tomato/tomahto) are handled. At least one person suggested that abort()
constitutes a fail-fast design (see http://en.wikipedia.org/wiki/Fail-fast),
which seems completely wrong.
C++ Question:
Is C++ used in life-critical systems? Expound please.
Non-C++-specific Question:
Recognizing that higher-level supervision (by other systems) is surely a
common design in critical systems, that notwithstanding, how does any one
specific program handle bugs it detects in itself at runtime in released
software?
I've never worked on them, but the main school of thought for critical
systems is:
1- Test a lot, code reviews, etc., to decrease the bug count as much
as possible.
2- Fault tolerance through components where each component has fault
isolation from the rest of the components, where the components
provide redundancy, backups and rollover, etc.
3- Fail-fast in each component.
Basically, the goal is to provide a working product at the end of the
day. You can do this by reducing bug-count, but in most programming
languages, and especially in C and C++, a single bug anywhere in the
process can completely corrupt the process. So, have multiple
independent processes so that if one fails, another can take over. I
used the example of a unix process because unix processes have decent
fault isolation from each other - that is, if one process has a bug
and fails, it's unlikely to affect another process. Fault isolation is
key to allowing backups and rollover.
The processes which fail should fail fast - you're in an unknown
state, so hell if you know what will happen if you execute anything,
such as logging code, error recovery code, and so on. This applies
only to bugs, aka unexpected failures. If the failure is expected
(such as disk is full), and you prepared for it, then you don't need
to instantly die. However, an unexpected null pointer access is
generally a good example of where you should just die immediately,
preferably leaving the equivalent of a core dump so someone can look
at it and fix the problem.
I emphasized processes above, but that was just a specific example. A
process can still affect another process, so perhaps fault isolation
at the hardware level is called for. Perhaps you're worried about the
power supply, so you get a backup power supply as well. The main ideas
of fault tolerance through multiple components with fault isolation
don't change, but the specifics of the situation do.
I recall reading a post where a NASA space probe had a bug, and
it tripped something and gave the equivalent of a core dump. NASA had
planned for this correctly, and had put in a backup system with fault
isolation so that they were able to get that core dump, debug it, and
upload new code to the probe.
IIRC, there was another example, I forget offhand from where, where a
life critical system was made with 4 separate "computers". 3 of the
"computers" implemented most of the functionality, redundantly. Each
was using a different algorithm implemented by different teams. The
last of the "computers" was a simple thing that took a vote of the 3
main guys. If they all agreed, the vote taker acted on it. If there
was a 2-1 split, the vote taker would reset the "wrong" guy.