> Again, I don't think anyone is suggesting a naive retry of a failed
> job on OOM.
That's precisely what you suggested when you claimed that all we had
to do was roll back down the stack!
> We're suggesting logging the error first, and possibly
> more sophisticated schemes of retry.
Then you need to define these "more sophisticated schemes", and define
such a scheme that's actually worth implementing most of the time.
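For reference, the simplest non-naive scheme looks something like the
sketch below: bounded retries with a backoff. (run_job is a
hypothetical stand-in for whatever unit of work you have.) Note how
much it has to assume to be correct: the job leaves no partial state
behind when it throws, the memory pressure is transient, and the
reporting in the catch block doesn't itself need to allocate. Getting
those assumptions to hold is the part that's rarely worth it.

#include <chrono>
#include <cstdio>
#include <new>
#include <thread>
#include <vector>

// Hypothetical unit of work; assumed to leave no partial state behind
// if it throws.
static void run_job()
{
    std::vector<char> buffer(1u << 30);  // may throw std::bad_alloc
    // ... do the actual work with buffer ...
}

// Retry the job a few times after bad_alloc, backing off between
// attempts; give up after max_attempts and let the caller decide
// whether to terminate.
static bool run_with_retry(int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            run_job();
            return true;
        } catch (const std::bad_alloc&) {
            // stderr is unbuffered, so this report is unlikely to need
            // a fresh allocation of its own.
            std::fprintf(stderr, "OOM on attempt %d\n", attempt);
            std::this_thread::sleep_for(std::chrono::seconds(1 << attempt));
        }
    }
    return false;
}

int main()
{
    return run_with_retry(3) ? 0 : 1;
}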
> Interesting. I've tried to google for more information on this topic,
> but there's not much I can find offhand about the actual initial
> motivations for overcommit, and the current motivations for leaving
> overcommit in.
> Furthermore, the current situation is even more broken than I thought.
> It seems there's no good way to limit (virtual) memory usage on a per-
> user basis in Linux (and presumably other unix-like OSes (?)).
Yes, there is. You can set the maximum size of the VAS per-process,
and the maximum number of processes, which together provide a hard
upper bound. If you want finer-grained controls, you have to patch the
Linux kernel; there are various patches out there that can accomplish
what you want. Some other UNIX systems (e.g., Solaris) provide
finer-grained controls out of the box.
There are other ways, such as containers and virtualization, to
accomplish similar feats. They're rarely worth it.
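For the record, this is roughly what those two knobs look like set
programmatically with setrlimit() (RLIMIT_NPROC is a Linux/BSD thing
rather than strict POSIX, the numbers here are arbitrary, and in
practice you'd usually set them with ulimit or limits.conf rather than
in code):

#include <sys/resource.h>
#include <cstdio>

int main()
{
    // Cap this process's virtual address space at 512 MiB.  Past that,
    // malloc returns NULL and operator new throws, instead of the OOM
    // killer showing up much later.
    rlimit as_limit;
    as_limit.rlim_cur = 512UL * 1024 * 1024;
    as_limit.rlim_max = 512UL * 1024 * 1024;
    if (setrlimit(RLIMIT_AS, &as_limit) != 0)
        std::perror("setrlimit(RLIMIT_AS)");

    // Cap the number of processes this user may run.  Combined with the
    // per-process VAS cap, that's the hard upper bound I mentioned.
    rlimit np_limit;
    np_limit.rlim_cur = 100;
    np_limit.rlim_max = 100;
    if (setrlimit(RLIMIT_NPROC, &np_limit) != 0)
        std::perror("setrlimit(RLIMIT_NPROC)");

    // ... the rest of the program runs under those limits ...
    return 0;
}

Since rlimits are inherited across fork/exec, doing this in a login
shell or wrapper constrains everything the user starts from it.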
> Thus,
> your options are 1- overcommit and the OOM killer on a relatively
> random process, or 2- no overcommit and an OOM failure return code or
> exception from malloc et al in a relatively random process. This is
> completely broken.
Compared to what alternative? The traditional alternative is a kernel
panic, and that's not necessarily better (or worse). For good or ill,
we've become accustomed to the assumption that we can treat memory as
an endless resource. Most of the time, that assumption works out
pretty well. When it falls apart, it's not shocking that the resulting
consequences are pretty terrible.
> I often wonder how such states can persist for so
> long. It's not just me, right? Other people do see how this is broken,
> right? Why is no one fixing this? It can't be that hard to implement
> per-user limits.
Because per-user limits don't fix the problem, unless you limit every
user on the system in such a fashion that their combined limits never
exceed your commit limit. Even then, you're still not promised that
the "memory hog" is the one that's going to be told there's no more
memory. That's why we don't bother: identifying the "memory hog" is
much too hard for a computer.
It may well be a broken situation, but there's also not a good
solution.
> Why do you claim that those are their goals? Has a stakeholder ever
> given you such requirements?
Yes, plenty of people here seem to be convinced that handling OOM (by
not crashing) is a requirement for writing "robust" software; that
means being able to perform some sort of action (such as logging) and
continue onward after the OOM condition. However, some people here
also seem to believe that performing I/O doesn't affect program state,
so their requirements are probably not worth fulfilling.
Still, this plays into my larger point: even if you have a situation
where you can respond to an OOM condition in some meaningful fashion
other than termination, you're still not assured of success. Put more
plainly, handling OOM doesn't ipso facto ensure additional
robustness. You need to be able to handle the OOM condition and have
a reasonable assurance that your response will actually succeed. It
may be an overstatement on my part to say, "Succeed no matter what",
but making the software more "robust" certainly means tending closer
to that extreme than the opposite.
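For what it's worth, about the only pattern that gives you that
assurance is to pre-pay for the failure path: set some memory aside at
startup and hand it back the moment OOM hits, so the logging and
shutdown code has something to work with. A rough sketch is below (the
sizes are arbitrary, the exhaustion loop is purely for demonstration,
and none of it behaves as described unless allocation can actually
fail, i.e. overcommit is off or RLIMIT_AS is set as above):

#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

// Emergency reserve released on OOM so the recovery path (logging,
// flushing, orderly shutdown) has a reasonable chance of succeeding.
static char* emergency_reserve = nullptr;

static void on_out_of_memory()
{
    // Hand the reserve back so the code below can allocate if it must.
    delete[] emergency_reserve;
    emergency_reserve = nullptr;

    std::fputs("out of memory; shutting down\n", stderr);
    std::exit(EXIT_FAILURE);   // or set a flag and unwind, if that's viable
}

int main()
{
    emergency_reserve = new char[64 * 1024];   // 64 KiB set aside up front
    std::set_new_handler(on_out_of_memory);

    // Normal work goes here; any failed operator new now ends up in
    // on_out_of_memory().  For demonstration, exhaust memory on purpose:
    std::vector<char*> blocks;
    for (;;)
        blocks.push_back(new char[16 * 1024 * 1024]);
}

Even then, notice the handler still can't do much more than report and
shut down cleanly, which is exactly my point about the value of
"handling" OOM.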
As a concrete example, writing all this OOM handling code does me no
good if, when my process finally hits an OOM condition, my whole
computer is going to die anyway. For many applications, this is one of
only two reasons why they'll ever see an OOM condition.
Of course, the reality is that all of this rarely makes software more
robust, because actual robust systems generally are pretty tolerant of
things like program termination. As I've said before, frequently it's
even preferable to terminate, even when it would be possible to
recover from the error. This makes the value proposition of handling
OOM even less worthwhile.
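In practice that tolerance lives outside the program: something
watches the process and restarts it when it dies, whatever the reason
(OOM kill, an abort out of a bad_alloc, a plain crash). A trivial
sketch of the idea is below; the ./worker binary is made up, and real
systems lean on init, systemd, daemontools and the like for this job.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Keep restarting the worker whenever it exits, for any reason.  The
// robustness lives here, not in OOM-handling code inside the worker.
int main()
{
    for (;;) {
        pid_t pid = fork();
        if (pid < 0) {              // fork itself failed; try again later
            std::perror("fork");
            sleep(5);
            continue;
        }
        if (pid == 0) {             // child: become the worker
            execl("./worker", "worker", (char*)nullptr);  // hypothetical binary
            _exit(127);             // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);   // parent: wait for the worker to die
        std::fprintf(stderr, "worker exited (status %d); restarting\n", status);
        sleep(1);                   // crude backoff so a crash loop doesn't spin
    }
}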
> However, in the above quote, I talked about using non-portable
> mechanisms to achieve some degree of reliability, which is relevant to
> C++ and this newsgroup.
Yes, I know. And my whole point from the start is that handling OOM
is too much of a pain to bother with. Having to give up writing
portable code definitely falls under "too much of a pain to bother
with" for lots of code and lots of programmers. None of this should be
controversial in the least or require this much discussion.
Adam