select() call and filedescriptor out of range in select error

k3xji

Hi all,

We have a select-based server written in Python. Occasionally, maybe
twice a month, a weird problem occurs: select() fails with a
"filedescriptor out of range in select()" error. This is of course a
normal error and is handled gracefully. Our policy is to take down a few
users so that select() can handle the next cycle. However, once this
error occurs, the following call fails as well:

self.__Sockets.remove(socket)

self.__Sockets is the very basic list of sockets we use in our IO
loop. The call fails with:
remove(x): x not in list

First of all, in our entire application there is no line of code like
remove(x), meaning there is no x variable. Second, the exception shows
the line number of the code above. So
self.__Sockets.remove(socket) fails with remove(x): x not in
list....

I cannot understand the problem. It happens sporadically, and it
feels as if the ValueError from the select() call somehow corrupts the
list structure itself in Python. I am not sure whether something like
that is possible.

Thanks in advance,
 
Ned Deily

That error message is a generic exception message. It just means the
object to be removed is not in the list. For example:
a, b = 1, 2
l = [a, b]
l.remove(a)
l.remove(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: list.remove(x): x not in list

If the problem is that the socket object in question no longer exists,
you can protect your code there by enclosing the remove operation in a
try block, like:

try:
    self.__Sockets.remove(socket)
except ValueError:
    pass
 
Steven D'Aprano

Hi all,

We have a select-based server written in Python. Occasionally, maybe
twice a month, a weird problem occurs: select() fails with a
"filedescriptor out of range in select()" error. This is of course a
normal error and is handled gracefully. Our policy is to take down a few
users so that select() can handle the next cycle. However, once this
error occurs, the following call fails as well:

self.__Sockets.remove(socket)

self.__Sockets is the very basic list of sockets we use in our IO loop.
The call fails with:
remove(x): x not in list


Please show the *exact* error message, including the traceback, by
copying and pasting it. Do not retype it by hand, or summarize it, or put
it into your own words.


First of all, in our entire application there is no line of code like
remove(x), meaning there is no x variable.

Look at this example:
sockets = []
sockets.remove("Hello world")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: list.remove(x): x not in list


"x" is just a placeholder. It doesn't refer to an actual variable x.

Second, the exception shows the line number of the code above. So
self.__Sockets.remove(socket) fails with remove(x): x not in list....

Exactly.


I cannot understand the problem. It happens sporadically, and it
feels as if the ValueError from the select() call somehow corrupts the
list structure itself in Python. I am not sure whether something like
that is possible.

Anything is possible, but it's not likely. What's far more likely is that
you have a bug in your code, and that somehow, under rare circumstances,
it tries to remove something from a list that was never inserted into the
list. Or it tries to remove it twice.

My guess is something like this:

try:
    socket = get_socket()
    self._sockets.append(socket)
except SomeError:
    pass
# later on
self._sockets.remove(socket)
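
The guess above can be reproduced in a few lines. This is only a sketch of the suspected failure pattern, not the poster's actual code: `buggy_cycle`, `failing_get_socket` and the use of OSError in place of SomeError are all hypothetical, and plain strings stand in for socket objects.

```python
def failing_get_socket():
    # Hypothetical stand-in for an accept() that fails.
    raise OSError("simulated accept failure")


def buggy_cycle(sockets, get_socket, sock):
    # 'sock' arrives holding a stale value left over from an earlier cycle.
    try:
        sock = get_socket()      # raises, so 'sock' is never rebound
        sockets.append(sock)     # never reached
    except OSError:
        pass                     # error swallowed; 'sock' is still stale
    # later on
    sockets.remove(sock)         # ValueError: the stale object is not in the list


if __name__ == "__main__":
    tracked = ["listener"]
    try:
        buggy_cycle(tracked, failing_get_socket, "stale-client")
    except ValueError as exc:
        print("remove failed:", exc)
```

The list itself is never corrupted here; the failure comes from removing an object that was never appended.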
 
James Mills

If the problem is that the socket object in question no longer exists,
you can protect your code there by enclosing the remove operation in a
try block, like:


The question that remains to be seen however is:

Why does your list contain dirty data? Your code has likely removed
the socket object from the list before, so why is it attempting to
remove it again?

I would suggest you take another look at your code's logic rather than
patch up the code with a "band-aid solution".

cheers
James
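
One way to look for the root cause rather than hide it is to record who asked for the unexpected removal. The sketch below assumes a hypothetical helper `remove_socket` and a hypothetical logger name; it uses only the standard `logging` and `traceback` modules and is not taken from the poster's server.

```python
import logging
import traceback

logger = logging.getLogger("socket_server")  # hypothetical logger name


def remove_socket(sockets, sock):
    """Remove sock from sockets; log the caller's stack if it is missing.

    Returns True if the socket was removed, False if it was not tracked.
    """
    try:
        sockets.remove(sock)
        return True
    except ValueError:
        # The exception traceback alone only shows this function, so the
        # call stack is captured to see which code path asked for the removal.
        stack = "".join(traceback.format_stack())
        logger.error("socket %r not in list %r; requested from:\n%s",
                     sock, sockets, stack)
        return False
```

Because the stack is logged only on the rare failure path, this should not add much to the log volume, and the recorded call stack shows which code path attempted the second removal.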
 
k3xji

Please show the *exact* error message, including the traceback, by
copying and pasting it. Do not retype it by hand, or summarize it, or put
it into your own words.

Unfortunately this is not possible. The logging system I designed only
gives the following information; as we have millions of log entries per
day for custom exceptions, I did not include the full traceback. Here is
all I have:

1448) 15/09/10 20:02:08 - [*] ERROR: Physical max client limit
reached. Please contact maintenance.filedescriptor out of range in
select()[scSocketServer.py:215:][Port:515]

The code generating the error is:

try:
    self.__ReadersInCycle, self.__WritersInCycle, e = \
        select(self.__Sockets, self.__WritersInCycle, [],
               base.scOptions.scOPT_SELECT_TIMEOUT)
except ValueError, e:
    LogError('Physical max client limit reached.'
             ' Please contact maintenance.' + str(e))
    self.scSvr_OnClientPhysicalLimitReached()
    # define a policy here
    continue

First of all, in our entire application there is no line of code like
remove(x), meaning there is no x variable.

Look at this example:
sockets = []
sockets.remove("Hello world")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: list.remove(x): x not in list

Ok. Thanks.

Anything is possible, but it's not likely. What's far more likely is that
you have a bug in your code, and that somehow, under rare circumstances,
it tries to remove something from a list that was never inserted into the
list. Or it tries to remove it twice.

My guess is something like this:

try:
    socket = get_socket()
    self._sockets.append(socket)
except SomeError:
    pass
# later on
self._sockets.remove(socket)

Hmm.. Might be, but inside the self.__Sockets list there is the
ListenSocket(), which is the real listening socket. Naturally, I am
using it in the read list of select() on every server cycle. The weird
thing is that the ListenSocket itself triggers the "not in list"
exception, too! And one thing I am sure of is that I have not written
any kind of code that removes the listen socket from the list; that is
just impossible. Additionally, there are very few places where I
traverse the __Sockets list, for optimization. The only places where I
delete something from the __Sockets list are:

1) a user disconnects (normal disconnect, authentication or ping
timeout)
3) server is being stopped or restarted

Other than that, there is no access to that variable from outside
objects; as can be seen, it is also private. And please keep in mind
that this bug has been there for about a year, so many code reviews
have passed without noticing the type of error you are suggesting.

And more information on system: I am running Python 2.4 on CentOS.

By the way, after digging through the logs and the system, it turns out
select(..) is hitting the per-process FD limit. Although the system-wide
ulimit is unlimited, I think Python's "selectmodule.c" enforces a limit
of 1024. I am getting the error after hitting that limit and somehow,
as I just explained, the __ListenSocket is being removed from the read
list, which causes it to be lost, and the server instance is just lost
forever. Putting a try..except around that code and re-initializing the
server port is a solution, but I guess a bad one, because I will not
have found the root cause.

Thanks in advance,
 
Steven D'Aprano

The question that remains to be seen however is:

Why does your list contain dirty data? Your code has likely removed the
socket object from the list before, so why is it attempting to remove it
again?

I would suggest you take another look at your code's logic rather than
patch up the code with a "band-aid solution".

Well said.
 
