Ara.T.Howard
i've got 30 processes running on 30 machines, running jobs taken from an
nfs-mounted queue. recently i started seeing random core dumps from them.
i've isolated the bit of code that causes the core dumps - it's this:
  class JobRunner
  #{{{
    attr :job
    attr :jid
    attr :cid
    attr :shell
    attr :command

    def initialize job
    #{{{
      @job = job
      @jid = job['jid']
      @command = job['command']
      @shell = job['shell'] || 'bash'
      @r, @w = IO.pipe
      @cid =
        Util::fork do
          @w.close
          STDIN.reopen @r
          if $want_to_core_dump
            keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
            256.times do |fd|
              next if keep.include? fd
              begin
                IO::new(fd).close
              rescue Errno::EINVAL, Errno::EBADF
              end
            end
          end
          if File::basename(@shell) == 'bash' || File::basename(@shell) == 'sh'
            exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
          else
            exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '-l'
          end
        end
      @r.close
    #}}}
    end

    def run
    #{{{
      @w.puts @command
      @w.close
    #}}}
    end
  #}}}
  end
now here's the tricky bit. the core dump doesn't happen here - it happens at
some random time later, and then again sometimes it doesn't. the context this
code executes in is complex, but here's the gist of it:
  - an sqlite database transaction is started - this opens some files like
    db-journal, etc.
  - a job is selected from the database
  - the job runner is forked - this closes all open files except stdin,
    stdout, stderr, and the com pipe
  - the job pid and other accounting is committed to the database
the reason i'm trying to close all the files in the first place is that the
parent eventually unlinks some of them while the child still has them open -
this triggers nfs silly renames when running on nfs (.nfsXXXXXXXX files).
this causes no harm, as the child never uses these fds - but with 30 machines
i end up with 90 or more .nfsXXXXXXXX files lying around looking ugly. these
eventually go away when the child exits, but some of these children run for 4
or 5 or 10 days, so the ugliness is constantly in my face - sometimes growing
to be quite large.
back to the core dump...
basically, if i DO close all the filehandles i'll, maybe, core dump sometime
later IN THE PARENT. if i do NOT close them, the parent never core dumps. the
core dumps are totally random and show nothing in common except one thing -
they all show a signal received in the stack trace. i'm guessing this is
SIGCHLD. i have some signal handlers set up for stopping/restarting that look
exactly like this:
  trap('SIGHUP') do
    $signaled = $sighup = true
    warn{ "signal <SIGHUP>" }
  end

  trap('SIGTERM') do
    $signaled = $sigterm = true
    warn{ "signal <SIGTERM>" }
  end

  trap('SIGINT') do
    $signaled = $sigint = true
    warn{ "signal <SIGINT>" }
  end
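for completeness, one way to keep work out of the handlers entirely (a sketch
using nothing beyond the stdlib - the pipe names and `pending_signals` are
mine, not from the queue code) is the self-pipe trick: the trap body does
nothing but a nonblocking one-byte write, and the event loop drains the pipe:

```ruby
# self-pipe sketch: the trap body does nothing but a nonblocking write,
# so no real work ever happens in signal context
SIG_R, SIG_W = IO.pipe

trap('SIGHUP')  { SIG_W.write_nonblock 'H' rescue nil }
trap('SIGTERM') { SIG_W.write_nonblock 'T' rescue nil }

# called from the event loop: drain the pipe and report what arrived
def pending_signals
  sigs = []
  while IO.select [SIG_R], nil, nil, 0
    sigs << SIG_R.read_nonblock(1)
  end
  sigs
end
```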
in my event loop i obviously take the appropriate steps for the $sigXXX flags.
as i said, however, i don't think these handlers are responsible, since they
never actually run - these signals are not being sent. i DO fork for every job
though, so that's why i'm guessing the signal is SIGCHLD.
so - here's the question: what kind of badness could closing fds in the child
be causing in the PARENT? i'm utterly confused at this point and don't really
know where to look next... could this be a ruby bug, or am i just breaking
some unix law and getting bitten?
thanks for any advice.
kind regards.
-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================