Well, as I know, I agree with that. But you also know that I am
convinced that the IT world is STILL heading away from there :-(
when i got to do the resource manager
http://www.garlic.com/~lynn/subtopic.html#fairshare
http://www.garlic.com/~lynn/subtopic.html#wsclock
one of the supporting processes was an automated test & benchmarking
process
http://www.garlic.com/~lynn/subtopic.html#bench
and eventually did one sequence of 2000 benchmarks taking 3 months
elapsed time before first customer ship.
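for a rough sense of scale, the run rate can be worked out as below. this is only a back-of-envelope sketch: the 90-day figure is an assumption (the post only says "3 months"), and it treats the benchmarks as running back-to-back around the clock.

```python
# Back-of-envelope rate for the benchmark sequence described above.
BENCHMARKS = 2000      # figure given in the post
ELAPSED_DAYS = 90      # assumption: "3 months" taken as roughly 90 days

# Average number of benchmarks completed per elapsed day.
per_day = BENCHMARKS / ELAPSED_DAYS

# Average elapsed minutes available per benchmark if run continuously.
minutes_each = ELAPSED_DAYS * 24 * 60 / BENCHMARKS

print(f"{per_day:.1f} benchmarks/day, one every {minutes_each:.1f} minutes")
```

under those assumptions that is a little over 22 benchmarks a day, or roughly one an hour, which is only practical if the process is automated.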
with the benchmarking process ... was able to define almost any sort
of workload characteristics ... including very stressful ones that
turned out were guaranteed to crash the system. early on as a result of
this ... one of the side-tracks was to go in and completely redesign
and rewrite the kernel serialization infrastructure ... that resulted
in the elimination of all stress-test induced system failures as well
as all observed situations of hung/zombie processes .... some of the
hung/zombie as well as other fault diagnostic stuff
http://www.garlic.com/~lynn/subtopic.html#dumprx
the main product release after the release of the resource manager
included SMP support ... which had a lot of dependencies on the
resource manager code ... which then created a business problem.
http://www.garlic.com/~lynn/subtopic.html#smp
the resource manager was the first foray into priced kernel software
with guidelines that direct hardware support kernel software was
still free. having smp "free" software dependent on priced (resource
manager) software violated those business guidelines. As a result ..
something like 80 percent of code in original resource manager was
removed and incorporated into the "base, free" kernel software.
so problem reporting and fix process had a procedure that assigned
a number (that was incremented sequentially) and fixes for specific
problems were given the same number. with the very first version
of vm370, the numbering started ... and as far as i know may never
have been reset (i think i've recently seen references to sequential
numbers in the 60k range?).
Anyway ... you have to be diligent to maintain kernel integrity
(avoiding failures because of dangling activities after processes have
gone away) and/or to avoid hung/zombie processes (waiting for some
activity to complete). A couple releases after the resource manager went out,
there was a fix introduced into the dispatcher to ignore various
kinds of events under certain circumstances; this had a fix number of
something like 15,3xx (or maybe 15,1xx?). In any case, it resulted
in re-introducing hung/zombie processes.
This occurred after i had redone the i/o system to make it bullet proof
so the disk engineering lab could do their work in an operating
system environment ... instead of doing everything with dedicated,
stand-alone machines:
http://www.garlic.com/~lynn/subtopic.html#disk
I created an update that removed the effects of the fix that
re-introduced hung/zombie processes ... and tried to find out what was
the original justification for generating it in the first place.
of course, all the ha/cmp work was pretty much trying to figure out
how to retrofit availability and assurance to existing
infrastructures
http://www.garlic.com/~lynn/subtopic.html#hacmp
the stuff for electronic commerce was someplace in-between ... characterize
all possible failure modes anywhere in the infrastructure ... and then
define recovery and/or diagnostic processes to handle every possible
scenario. ... aka the previous electronic commerce ref
http://www.garlic.com/~lynn/2004p.html#23
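the "every possible scenario" discipline can be sketched as a table with a completeness check. this is a minimal illustration, not the process actually used: the component names, failure modes, and action strings are all hypothetical, made up for the example.

```python
from itertools import product

# Hypothetical components and failure modes, for illustration only;
# none of these names come from the original post.
COMPONENTS = ["browser", "network", "web_server", "payment_gateway"]
FAILURE_MODES = ["timeout", "crash", "corrupt_reply"]

# Each (component, failure mode) pair maps to a recovery or diagnostic action.
RECOVERY = {
    (c, m): f"retry-then-report:{c}:{m}"
    for c, m in product(COMPONENTS, FAILURE_MODES)
}
# Deliberately leave one scenario uncovered to show what the check catches.
del RECOVERY[("payment_gateway", "corrupt_reply")]


def uncovered_scenarios(components, modes, recovery):
    """Return every (component, mode) pair that has no defined action."""
    return [pair for pair in product(components, modes) if pair not in recovery]


missing = uncovered_scenarios(COMPONENTS, FAILURE_MODES, RECOVERY)
print(missing)
```

the point of the check is that the cross product is enumerated mechanically, so a scenario cannot be forgotten just because nobody thought to write a handler for it.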
however, in approx. the same electronic commerce time-frame ... we
looked at taking a more systemic approach. we had a JAD with taligent
about their environment ... taking the analogy from the original
os/360 system services .... what taligent characteristics would be
needed for assurance and availability system services. we had a one
week JAD where we walked thru all of the taligent infrastructure,
specifying what needed to be added/changed for assurance and
availability. at the end, we had come up with two new
availability/assurance frameworks ... and, in addition, about a 30%
hit to the existing taligent frameworks
you would still need the up-front failure analysis ... but could
possibly reduce application code lines from a 4-10 times increase to
possibly only a 1.5 to 2.0 times increase (with some higher level
assurance and availability abstractions being provided by the taligent
infrastructure).
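to make those multipliers concrete, here is the arithmetic for a made-up application. the 10,000-line baseline is purely an assumption for illustration; only the 4-10 and 1.5-2.0 factors come from the text above.

```python
def hardened_size(base_lines, low_factor, high_factor):
    """Return the (low, high) line-count range after adding
    assurance/availability code at the given growth factors."""
    return (base_lines * low_factor, base_lines * high_factor)


BASE_LINES = 10_000  # hypothetical application size, not from the post

# Application carries all failure handling itself: 4-10 times increase.
without_frameworks = hardened_size(BASE_LINES, 4, 10)

# Higher-level abstractions supplied by the infrastructure: 1.5-2.0 times.
with_frameworks = hardened_size(BASE_LINES, 1.5, 2.0)

print(without_frameworks, with_frameworks)
```

for that hypothetical baseline the range drops from 40,000-100,000 lines to 15,000-20,000 lines, with the up-front failure analysis still required in both cases.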
random past taligent references:
http://www.garlic.com/~lynn/2000e.html#46 Where are they now : Taligent and Pink
http://www.garlic.com/~lynn/2000e.html#48 Where are they now : Taligent and Pink
http://www.garlic.com/~lynn/2000.html#10 Taligent
http://www.garlic.com/~lynn/2001j.html#36 Proper ISA lifespan?
http://www.garlic.com/~lynn/2001n.html#93 Buffer overflow
http://www.garlic.com/~lynn/2002.html#24 Buffer overflow
http://www.garlic.com/~lynn/2002i.html#60 Unisys A11 worth keeping?
http://www.garlic.com/~lynn/2002j.html#76 Difference between Unix and Linux?
http://www.garlic.com/~lynn/2002m.html#60 The next big things that weren't
http://www.garlic.com/~lynn/2003d.html#45 IBM says AMD dead in 5yrs ... -- Microsoft Monopoly vs. IBM
http://www.garlic.com/~lynn/2003e.html#28 A Speculative question
http://www.garlic.com/~lynn/2003g.html#62 IBM says AMD dead in 5yrs ... -- Microsoft Monopoly vs. IBM
http://www.garlic.com/~lynn/2003j.html#15 A Dark Day
http://www.garlic.com/~lynn/2004c.html#53 defination of terms: "Application Server" vs. "Transaction Server"
http://www.garlic.com/~lynn/2004l.html#49 "Perfect" or "Provable" security both crypto and non-crypto?