jamesnichols3
I have a Java application that makes a large number of outbound
webservice calls over HTTP/TCP. The hosts contacted are a fixed set
of about 2000 hosts and a web service call is made to each of them
approximately every 5 minutes by a pool of 200 Java threads. Over
time, on average a percentage of these hosts are unreachable for one
reason or another, so there is a persistent count of sockets in the
SYN_SENT state in the range of about 60-80. This is fine, as these
failed connection attempts eventually time out.
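For reference, a count like the one above can be reproduced from inside the JVM with a rough sketch such as the following. It assumes the Linux /proc/net/tcp layout, where the fourth column ("st") holds the socket state in hex and 02 means SYN_SENT. The class and method names are made up for illustration. If the JVM opens IPv6 dual-stack sockets, /proc/net/tcp6 would need the same treatment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of the application described in this post.
public class SynSentCounter {
    // Hex state code for SYN_SENT in the "st" column of /proc/net/tcp.
    static final String SYN_SENT = "02";

    // Counts data rows whose state column equals SYN_SENT.
    // Data rows start with an index such as "0:"; the header row does not.
    static int countSynSent(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            // Columns: sl, local_address, rem_address, st, ...
            if (fields.length > 3 && fields[0].endsWith(":") && SYN_SENT.equals(fields[3])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: IPv4 socket table; pass another path (e.g. /proc/net/tcp6) as an argument.
        Path table = Paths.get(args.length > 0 ? args[0] : "/proc/net/tcp");
        System.out.println("SYN_SENT sockets: " + countSynSent(Files.readAllLines(table)));
    }
}
```

Logging this number periodically would show exactly when the count jumps from the baseline to 200.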
However, after approximately 38 hours of operation, all outbound
connection attempts get stuck in the SYN_SENT state. It happens
instantaneously, where we go from the baseline of about 60-80 sockets
in SYN_SENT to a count of 200 (matching the number of Java threads
that make these calls). I've tried several things to clear this
problem up, including:
1) Restarting the Java application
2) ip route flush cache
3) Start/stop networking
4) rmmod/insmod the kernel driver for the NIC
5) Tuning of /proc/sys/net/ipv4/tcp_syn_retries
6) Disabling /proc/sys/net/ipv4/tcp_syncookies
However, after each of these countermeasures, the outbound connections
still get stuck in SYN_SENT. During this time, I am still able to SSH
to the box and run wget www.google.com, etc., so the problem appears to
be specific to the hosts that I'm accessing via the webservices. The
only thing that makes this problem go away is to restart the entire
Linux box. Once I do this and restart my application it works
perfectly fine... for 38 hours until it occurs again.
I'm running kernel 2.6.18 on RedHat, but have had this problem occur
on other kernel versions. I've also had this problem occur on
different boxes, NICs, routers, co-location facilities, and several
other variables. The only thing in common is my application and the
fact that it is Linux, so I have to believe that my application is
causing something weird in the kernel, since an application restart
doesn't help.
Any ideas?