axeman
Synopsis:
A variant of a typical host availability / pinger script has performed
well for many years. Multiple daemons process various lists at various
intervals with various timeouts. The tool was recently modified to
support running a sequence of tests per host (e.g. ping, then a TCP port
test, ...) rather than just one test. The daemons run fine for days, but
then some of them suddenly start receiving a non-zero return code for
every command/test they perform: specifically 16777215, which is what a
$? of -1 becomes after the >> 8 shift. Searches suggest problems with
CHLD signals, though those have never been a problem before. I would
appreciate any insight.
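For reference, the arithmetic behind that number can be sketched as
below. This is a minimal illustration, not code from the daemon:
decode_status is a hypothetical helper name, and the 0xFFFFFFFF mask is
only there to imitate a 32-bit Perl on a build with wider integers.

```perl
use strict;

# On a 32-bit Perl, -1 is treated as 0xFFFFFFFF by the shift operator,
# so shifting it right by 8 bits gives 0x00FFFFFF.
printf "%d\n", (-1 & 0xFFFFFFFF) >> 8;    # 16777215

# Hypothetical helper: decode a raw wait status as found in $?.
# -1 is not an exit status at all, so it is checked before shifting.
sub decode_status {
    my ($status) = @_;
    return ('failed', undef)         if $status == -1;   # reason is in $!
    return ('signal', $status & 127) if $status & 127;   # killed by a signal
    return ('exit',   $status >> 8);                     # normal exit code
}
```

With a check like this, a status of -1 would show up as its own case
instead of looking like an exit code of 16777215.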
Code versions:
AIX 4.3.3
Perl 5.005_03
Basic daemon model:
....
sub timed_out {    # ALRM signal handler for command time-out
    die "timed out";
}
....
$SIG{'HUP'} = 'IGNORE'; # don't die on these signals
$SIG{'PIPE'} = 'IGNORE';
$SIG{'TERM'} = 'IGNORE';
$SIG{'ALRM'} = \&timed_out;
$SIG{'USR1'} = \&quiesce;
use POSIX ":sys_wait_h";
....
foreach $test ( split(/;/,$TESTS) ) {
    # std wrapper for timed operation, return code in $rc, output in @out
    ($rc,@out) = eval {
        alarm($timeout);
        $test =~ s/HOST/$check/g;
        $test[$testCount] = $test;
        @eout = `$test 2>&1`;
        $erc = ($? >> 8);
        alarm(0);
        return ($erc,@eout);
    };
    if( $@ =~ /^timed out/ ) {
        $rc = 1;
        $timeouts++;
        $test_timeout[$testCount] = 1;
    }
    $test_rc[$testCount] = $rc;
    $test_console[$testCount] = join('',@out);
    $testCount++;
    $spawned++;
    last if( $rc == 0 );    # successful test
}
....
# clean up any hung children for every 10 or more spawned processes
if( $spawned > 10 ) {
    # NOTE - also new code - this recursively traverses the process tree
    # and kill KILL's any children
    reap;
    $spawned = 0;
}
# clean up zombies - not done w/signal handler due to unreliable signals
while( ($waitedPid = waitpid(-1, &WNOHANG)) > 0 ) {}
....
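One way to narrow this down might be to record $! at the moment $?
comes back as -1, since perlfunc says (for system) that -1 means the
program could not be started or the wait(2) call failed. A minimal
sketch, assuming a hypothetical run_test wrapper that is not part of
the daemon above and a Unix shell for the sample command:

```perl
use strict;

# Hypothetical wrapper around backticks.
# Returns (exit code or -1, errno text, output lines).
sub run_test {
    my ($cmd) = @_;
    my @out = `$cmd 2>&1`;
    # Copy both variables at once, before anything can overwrite them.
    my ($status, $err) = ($?, "$!");
    if ($status == -1) {
        # The command was not run or could not be waited for; $err says why.
        return (-1, $err, @out);
    }
    return ($status >> 8, '', @out);
}

my ($rc, $err, @out) = run_test('echo ok');
print "rc=$rc err=$err\n";
```

Logging the errno text for the failing daemons would at least show
whether the -1 comes from a failed fork or from a failed wait on the
child.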