help analyzing cause of return code

axeman · Feb 22, 2006

Synopsis:

A variant of a typical host availability / pinger script has performed
well for many years. Multiple daemons process various lists at various
intervals with various timeouts. The tool was recently modified to
support attempting sequences of tests (i.e. ping and TCP port test, ...
vs. just one test). The daemons will run fine for days, but then some
will suddenly receive non-zero return codes for every command/test they
perform. Specifically, the return code is 16777215 (-1 before the shift >> 8).
Searches have suggested problems with CHLD signals, though they have
never been a problem before. I would appreciate any insight.
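
For readers wondering where 16777215 comes from, the arithmetic can be sketched as below. This assumes the raw status is treated as an unsigned 32-bit value, as a 32-bit Perl build would do; `shifted_rc` is a hypothetical helper name, not part of the posted daemon.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the same ">> 8" as the posted code to a raw
# wait status, masking to 32 bits to mimic a 32-bit Perl build.
sub shifted_rc {
    my ($status) = @_;
    return ( $status & 0xFFFFFFFF ) >> 8;
}

print shifted_rc(-1),  "\n";    # 16777215: the value reported above
print shifted_rc(256), "\n";    # 1: an ordinary "exit 1" status
```

In other words, the odd number is not an exit code from the test command at all; it is what the shift produces when $? itself is -1.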

Code versions:

AIX 4.3.3
Perl 5.005_03

Basic daemon model:

....

sub timed_out {    # ALRM signal handler for command time-out
    die "timed out";
}

....

$SIG{'HUP'} = 'IGNORE'; # don't die on these signals
$SIG{'PIPE'} = 'IGNORE';
$SIG{'TERM'} = 'IGNORE';
$SIG{'ALRM'} = \&timed_out;
$SIG{'USR1'} = \&quiesce;
use POSIX ":sys_wait_h";

....

foreach $test ( split(/;/,$TESTS) ) {

    # std wrapper for timed operation, return code in $rc, output in @out
    ($rc,@out) = eval {
        alarm($timeout);
        $test =~ s/HOST/$check/g;
        $test[$testCount] = $test;
        @eout = `$test 2>&1`;
        $erc = ($? >> 8);
        alarm(0);
        return ($erc,@eout);
    };
    if( $@ =~ /^timed out/ ) {
        $rc = 1;
        $timeouts++;
        $test_timeout[$testCount] = 1;
    }
    $test_rc[$testCount] = $rc;
    $test_console[$testCount] = join('',@out);
    $testCount++;
    $spawned++;

    last if( $rc == 0 );    # successful test
}

....

# clean up any hung children for every 10 or more spawned processes

if( $spawned > 10 ) {
    reap;    # NOTE - also new code - this recursively traverses the
             # process tree and kill KILL's any children
    $spawned = 0;
}

# clean up zombies - not done w/signal handler due to unreliable signals

while( ($waitedPid = waitpid(-1, &WNOHANG)) > 0 ) {}

....

usenet · Feb 22, 2006

axeman said:
vs. just one test). The daemons will run fine for days, but then some
will suddenly receive non-zero return codes for every command/test they
perform.

Is your process reaper reaping? For some odd reason, AIX has an
insanely-low default max-pid-per-user limitation (I think default is
256 - I usually run it at 1024). Check "smitty chgsys" and check your
process table.

You would have messages in /var/spool/mail if you were pid-starved.
And, of course, if the process is running as root, I don't think it
would matter, since (I believe) root is not limited.

FWIW, whatever is happening here probably (almost surely) has nothing
to do with Perl.

axeman · Feb 22, 2006

Thanks David.

Unfortunately, it is running as root (even though the limit is low -
128 - and there is no related mail). The reaper is misnamed (not my code); it
just kills hung test procs but does not reap their exit status. That's
what the asynchronous 'while( ($waitedPid = waitpid(-1, &WNOHANG)) > 0
) {}' line does. CHLD signals are not mapped (i.e. left to DEFAULT).
Curiously, if I do map them to a handler or IGNORE, the bad return code
always occurs.

xhoster · Feb 23, 2006

axeman said:
Synopsis:

A variant of a typical host availability / pinger script has performed
well for many years. Multiple daemons process various lists at various
intervals with various timeouts.

How often are the timeouts actually activated?

The tool was recently modified to
support attempting sequences of tests (i.e. ping and TCP port test, ...
vs. just one test).

Did these changes change how often timeouts were actually activated?

AIX 4.3.3
Perl 5.005_03
...
sub timed_out {    # ALRM signal handler for command time-out
    die "timed out";
}

Does the handler need to re-install itself after being activated
on your system?

($rc,@out) = eval {
    alarm($timeout);
    $test =~ s/HOST/$check/g;
    $test[$testCount] = $test;
    @eout = `$test 2>&1`;
    $erc = ($? >> 8);
    alarm(0);
    return ($erc,@eout);
};
if( $@ =~ /^timed out/ ) {
    $rc = 1;
    $timeouts++;
    $test_timeout[$testCount] = 1;
}

If $@ is defined but not timed out, shouldn't you do something about it?

Xho
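
The point about $@ can be illustrated with a rough sketch of the same eval/alarm pattern that treats a timeout and any other exception differently. This is only one possible shape, not the poster's elided code: `run_timed` and the keys of the hash it returns are hypothetical names, and it assumes a Unix-like shell is available for the backticks.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the posted eval/alarm pattern.
# Returns a hash ref with keys rc, out and timed_out.
sub run_timed {
    my ($cmd, $timeout) = @_;
    my ($status, @out);

    # Handler stays installed for the whole sub, so the alarm(0) after
    # the eval is still covered by it.
    local $SIG{ALRM} = sub { die "timed out\n" };

    my $ok = eval {
        alarm($timeout);
        @out    = `$cmd 2>&1`;
        $status = $?;
        alarm(0);
        1;
    };
    my $err = $ok ? '' : ( $@ || "unknown error\n" );
    alarm(0);    # cancel any alarm still pending after a die

    if ( $err eq "timed out\n" ) {
        # Same convention as the posted code: a timeout counts as rc 1.
        return { rc => 1, out => '', timed_out => 1 };
    }
    die $err if $err ne '';    # anything else is unexpected: do not hide it

    # Keep -1 recognizable instead of shifting it into a huge number.
    my $rc = $status == -1 ? -1 : $status >> 8;
    return { rc => $rc, out => join( '', @out ), timed_out => 0 };
}

my $r = run_timed( "echo hi", 5 );
print "rc=$r->{rc} timed_out=$r->{timed_out}\n";
```

A call such as run_timed("sleep 30", 2) would be expected to come back with timed_out set, though the child itself may keep running, which is the job of the separate kill step in the original daemon.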

xhoster · Feb 23, 2006

axeman said:
Thanks David.

Unfortunately, it is running as root (even though the limit is low -
128 - and there is no related mail). The reaper is misnamed (not my code); it
just kills hung test procs but does not reap their exit status. That's
what the asynchronous 'while( ($waitedPid = waitpid(-1, &WNOHANG)) > 0
) {}' line does. CHLD signals are not mapped (i.e. left to DEFAULT).
Curiously, if I do map them to a handler or IGNORE, the bad return code
always occurs.

qx{} automatically waits for the job it spawns--that is how it sets $?.
If you set $SIG{CHLD}, it will interfere with qx{}'s wait.

Xho
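
To make the consequence concrete: when that internal wait fails (for instance because the child was already collected elsewhere, or because CHLD is set to IGNORE on a platform that then auto-reaps children), $? is left at -1 rather than a real status. A minimal sketch of checking for that before shifting follows; `decode_status` and its return keys are hypothetical names, not anything from the posted daemon.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a raw $? value instead of shifting it blindly.
sub decode_status {
    my ($status) = @_;

    # -1 means no status was obtained (start or wait failure); check $!.
    return { kind => 'no_status' } if $status == -1;

    # Low 7 bits hold the signal number if the child was killed by a signal.
    return { kind => 'signal', signal => $status & 127 } if $status & 127;

    # Otherwise the high byte is the ordinary exit code.
    return { kind => 'exit', code => $status >> 8 };
}

my @out = `echo hello 2>&1`;
my $st  = decode_status($?);
print "kind=$st->{kind}\n";
```

Logging the 'no_status' case together with $! at the moment it happens would show whether the daemons are hitting "No child processes" (ECHILD) when they go bad.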

axeman · Feb 23, 2006

Multiple daemons process various lists at various

How often are the timeouts actually activated?

Rarely, i.e. only when a test fails / system is down, and most are
usually up.

Did these changes change how often timeouts were actually activated?

No.

Does the handler need to re-install itself after being activated
on your system?

As mentioned, there is no handler, exit statuses are gathered
asynchronously.

If $@ is defined but not timed out, shouldn't you do something about it?

Yes, clearly. That code was left out (the ellipses ...) because it was
not relevant to the problem.

qx{} automatically waits for the job it spawns--that is how it sets $?.
If you set $SIG{CHLD}, it will interfere with qx{}'s wait.

Thanks, that makes sense.

xhoster · Feb 23, 2006

Note: snipped material restored with "] ]".

] ] > sub timed_out { # ALRM signal handler for command time-out
] ] > die "timed out";
] ] > }

As mentioned, there is no handler, exit statuses are gathered
asynchronously.

If the thing whose comment says "ALRM signal handler" is not a handler,
then what the heck is it? And why is it commented thusly?

Xho

axeman · Feb 23, 2006

Lol. Thought you meant a handler for CHLD. No, the ALRM handler does
not need to be reinstalled.
