interrupted system call w/ Queue.get

Philip Winston

We have a multiprocess Python program that uses Queue to communicate
between processes. Recently we've seen some errors while blocked
waiting on Queue.get:

IOError: [Errno 4] Interrupted system call

What causes the exception? Is it necessary to catch this exception
and manually retry the Queue operation? Thanks.

We have some Python 2.5 and 2.6 machines that have run this program
for many thousands of hours with no errors. But we have one 2.5 machine and
one 2.7 machine that seem to get the error very often.
 
James Mills

> We have a multiprocess Python program that uses Queue to communicate
> between processes. Recently we've seen some errors while blocked
> waiting on Queue.get:
>
> IOError: [Errno 4] Interrupted system call
>
> What causes the exception? Is it necessary to catch this exception
> and manually retry the Queue operation? Thanks.

Are you getting this when your application is shut down?

I'm pretty sure you can safely ignore this exception and
continue.

cheers
James
 
Roy Smith

Philip Winston said:
> We have a multiprocess Python program that uses Queue to communicate
> between processes. Recently we've seen some errors while blocked
> waiting on Queue.get:
>
> IOError: [Errno 4] Interrupted system call
>
> What causes the exception?

Unix divides system calls up into "slow" and "fast". The difference is
how they react to signals.

Fast calls are things which are expected to return quickly. A canonical
example would be getuid(), which just returns a number it looks up in a
kernel data structure. Fast syscalls cannot be interrupted by signals.
If a signal arrives while a fast syscall is running, delivery of the
signal is delayed until after the call returns.

Slow calls are things which may take an indeterminate amount of time to
return. An example would be a read on a network socket; it will block
until a message arrives, which may be forever. Slow syscalls get
interrupted by signals. If a signal arrives while a slow syscall is
blocking, the call returns EINTR. This lets your code have a chance to
do whatever is appropriate, which might be clean up in preparation for
process shutdown, or maybe just ignore the interrupt and re-issue the
system call.

Here's a short Python program which shows how this works (tested on
MacOS-10.6, but should be portable to just about any POSIX box):

-----------------------------------------------------
#!/usr/bin/env python

import socket
import signal
import os

def handler(sig_num, stack_frame):
    return  # no-op; having a handler installed is what lets the signal interrupt recv()

print "my pid is", os.getpid()
signal.signal(signal.SIGUSR1, handler)
s = socket.socket(type=socket.SOCK_DGRAM)
s.bind(("127.0.0.1", 0))
s.recv(1024)
-----------------------------------------------------

Run this in one window. It should print out its process number, then
block on the recv() call. In another window, send it a SIGUSR1. You
should get something like:

play$ ./intr.py
my pid is 6969
Traceback (most recent call last):
  File "./intr.py", line 14, in <module>
    s.recv(1024)
socket.error: [Errno 4] Interrupted system call

> Is it necessary to catch this exception
> and manually retry the Queue operation? Thanks.

That's a deeper question which I can't answer. My guess is the
interrupted system call is the Queue trying to acquire a lock, but
there's no predicting what the signal is. I'm tempted to say that it's
a bug in Queue that it doesn't catch this exception internally, but
people who know more about the Queue implementation than I do should
chime in.

> We have some Python 2.5 and 2.6 machines that have run this program
> for many thousands of hours with no errors. But we have one 2.5 machine and
> one 2.7 machine that seem to get the error very often.

Yup, that's the nature of signal delivery race conditions in
multithreaded programs. Every machine will behave a little bit
differently, with no rhyme or reason. Google "undefined behavior" for
more details :) The whole posix signal delivery mechanism dates back
to the earliest Unix implementations, long before there were threads or
networks. At this point, it's got many layers of duct tape.
 
Jean-Paul Calderone

> We have a multiprocess Python program that uses Queue to communicate
> between processes. Recently we've seen some errors while blocked
> waiting on Queue.get:
>
> IOError: [Errno 4] Interrupted system call
>
> What causes the exception? Is it necessary to catch this exception
> and manually retry the Queue operation? Thanks.

The exception is caused by a syscall returning EINTR. A syscall will
return EINTR when a signal arrives and interrupts whatever that syscall
was trying to do. Typically a signal won't interrupt the syscall
unless you've installed a signal handler for that signal. However,
you can avoid the interruption by using `signal.siginterrupt` to
disable interruption on that signal after you've installed the
handler.
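
A minimal sketch of that suggestion, assuming a POSIX system. SIGUSR1 and the record-only `handler` are placeholders chosen for illustration, not something taken from the original program. `signal.siginterrupt` is called after `signal.signal` because installing a handler resets the restart behaviour for that signal:

```python
import os
import signal

received = []  # signal numbers seen by the handler


def handler(sig_num, stack_frame):
    # Hypothetical handler for illustration only: record the signal and return.
    received.append(sig_num)


# Install the handler first; signal.signal() makes the signal
# interrupt system calls again, so siginterrupt() has to come second.
signal.signal(signal.SIGUSR1, handler)

# flag=False asks the OS to restart system calls interrupted by SIGUSR1
# instead of letting them fail with EINTR.
signal.siginterrupt(signal.SIGUSR1, False)
```

With the flag set to False, calls interrupted by that signal are restarted rather than failing with EINTR. Some calls may still not be restartable on a given platform, so this reduces the problem rather than guaranteeing it away.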

As for the other questions - I don't know, it depends how and why it
happens, and whether it prevents your application from working
properly.

Jean-Paul
 
Philip Winston

> The exception is caused by a syscall returning EINTR. A syscall will
> return EINTR when a signal arrives and interrupts whatever that syscall
> was trying to do. Typically a signal won't interrupt the syscall
> unless you've installed a signal handler for that signal. However,
> you can avoid the interruption by using `signal.siginterrupt` to
> disable interruption on that signal after you've installed the
> handler.
>
> As for the other questions - I don't know, it depends how and why it
> happens, and whether it prevents your application from working
> properly.

We did not try "signal.siginterrupt" because we were not installing
any signal handlers ourselves; perhaps some library code is doing it
without our knowing. I also still don't know which signal was causing
the problem.

Instead, based on Dan Stromberg's reply
(http://code.activestate.com/lists/python-list/595310/), I wrote a
drop-in replacement for Queue called RetryQueue, which fixes the
problem for us:

from multiprocessing.queues import Queue
import errno

def retry_on_eintr(function, *args, **kw):
    # Call function, retrying for as long as it fails with EINTR.
    while True:
        try:
            return function(*args, **kw)
        except IOError, e:
            if e.errno == errno.EINTR:
                continue
            else:
                raise

class RetryQueue(Queue):
    """Queue which will retry if interrupted with EINTR."""
    def get(self, block=True, timeout=None):
        return retry_on_eintr(Queue.get, self, block, timeout)
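
For anyone wanting to try the retry helper without a real signal, here is a small sketch written with the `except ... as` spelling, which needs Python 2.6 or later. `FlakyCall` is a hypothetical stand-in for a blocking call that is interrupted a fixed number of times; it is not part of Queue or multiprocessing:

```python
import errno


def retry_on_eintr(function, *args, **kw):
    """Call function(*args, **kw), retrying while it fails with EINTR."""
    while True:
        try:
            return function(*args, **kw)
        except IOError as e:
            if e.errno != errno.EINTR:
                raise  # any other I/O error is a real failure


class FlakyCall(object):
    """Hypothetical callable that raises EINTR a few times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        if self.calls <= self.failures:
            raise IOError(errno.EINTR, "Interrupted system call")
        return value


flaky = FlakyCall(failures=2)
result = retry_on_eintr(flaky, "item")
print(result)
```

The helper only swallows EINTR; every other IOError still propagates to the caller. On Python 3.5 and later, PEP 475 has the interpreter retry most interrupted system calls itself, so a wrapper like this mainly matters for the 2.x versions discussed in this thread.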

As to whether this is a bug or just our own malignant signal-related
settings I'm not sure. Certainly it's not desirable to have your
blocking waits interrupted. I did see several EINTR issues in Python
but none obviously about Queue exactly:
http://bugs.python.org/issue1068268
http://bugs.python.org/issue1628205
http://bugs.python.org/issue10956

-Philip
 
