After a while all outbound connections get stuck in SYN_SENT

jamesnichols3

I have a Java application that makes a large number of outbound
webservice calls over HTTP/TCP. The hosts contacted are a fixed set
of about 2000 hosts and a web service call is made to each of them
approximately every 5 minutes by a pool of 200 Java threads. Over
time, on average a percentage of these hosts are unreachable for one
reason or another, so there is a persistent count of sockets in the
SYN_SENT state in the range of about 60-80. This is fine, as these
failed connection attempts eventually time out.

However, after approximately 38 hours of operation, all outbound
connection attempts get stuck in the SYN_SENT state. It happens
instantaneously, where we go from the baseline of about 60-80 sockets
in SYN_SENT to a count of 200 (corresponding to the # of java threads
that make these calls). I've tried several things to clear this
problem up, including:

1) Restarting the Java application
2) ip route flush cache
3) Start/stop networking
4) rmmod/insmod the kernel driver for the NIC
5) Tuning of /proc/sys/net/ipv4/tcp_syn_retries
6) Disabling /proc/sys/net/ipv4/tcp_syncookies

However, after each of these countermeasures, the outbound connections
still get stuck in SYN_SENT. During this time, I am still able to SSH
to the box and run wget www.google.com, etc, so the problem appears to
be specific to the hosts that I'm accessing via the webservices. The
only thing that makes this problem go away is to restart the entire
Linux box. Once I do this and restart my application it works
perfectly fine... for 38 hours until it occurs again.

I'm running kernel 2.6.18 on RedHat, but have had this problem occur
on other kernel versions. I've also had this problem occur on
different boxes, NICs, routers, co-location facilities, and several
other variables. The only thing in common is my application and the
fact that it is Linux, so I have to believe that my application is
causing something weird in the kernel, since an application restart
doesn't help.

Any ideas?
 
Owen Jacobson

SYN_SENT means the local host has transmitted a SYN requesting the
creation of a connection but has not yet received either an RST
response indicating that nothing's listening or a SYN-ACK response
indicating that something *is* listening. Probable culprits would be,
in roughly descending order,

- firewall problems,
- the remote host has gone down or is not responding to network
traffic,
- firewall problems,
- misconfiguration somewhere in between your machine and the remote
host, and
- firewall problems.

Dig up a copy of Wireshark and watch the actual network traffic
between your machine and the host you're calling services on to see
which of these is likely. If possible run it from both inside and
outside your own firewall so you can see whether your firewall is blocking
the returning SYN-ACK or even the outgoing SYN.
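
As a rough illustration of the three outcomes described above (a SYN-ACK
comes back, an RST comes back, or nothing comes back at all), here is a
minimal Python sketch that classifies a single connect attempt. The function
name `probe` and the 5-second default timeout are assumptions made for
illustration; they are not taken from the original application.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify one TCP connect attempt (hypothetical helper).

    "open"    -> handshake completed (a SYN-ACK was received)
    "refused" -> an RST came back, so nothing is listening
    "timeout" -> no answer within `timeout`; the socket sat in SYN_SENT
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "no route to host" or a DNS failure.
        return "error: %s" % exc
```

Running something like this against a handful of the affected hosts while the
problem is active would show whether the attempts are refused or simply never
answered. A steady stream of "timeout" results matches sockets hanging in
SYN_SENT.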
 
jamesnichols3

Hi,

I've had this problem over multiple types of firewall devices, versions, and
configurations. It's not possible for me to packet capture outside of the
firewall. Unfortunately, the data rate is such that it's nearly impossible
to gain many insights from the internal packet capture that I can take. This
problem is occurring when connecting to thousands of hosts spread out all over
the internet, so it's highly unlikely that they are all going down at once or
that there is some misconfiguration that occurs every 38 hours. It is indicative
of something systematic happening in the OS, but I can't figure out what it
is.
 
Owen Jacobson

Maybe the NIC sucks.
 
Jim Garrison

jamesnichols3 said:
It's happened with a couple of different NICs too :(

You say you can't capture packets outside the firewall...
how about between the firewall and your failing system?
Can you insert a cheap Ethernet hub (NOT a switch) and
attach a second Linux system running Wireshark to the
hub? The first step in debugging is deciding where the
response SYN-ACK packets are being lost: either outside
the firewall or between the firewall and your box. If
you see the responses on the monitor system the problem
is in your box. If you don't, the problem is upstream.

Note that the setup I described is not equivalent to
capturing packets on your failing box. It might be
dropping the packets.
 
Nigel Wade

jamesnichols3 said:
Hi,

I've had this problem over multiple types of firewall devices, versions, and
configurations. It's not possible for me to packet capture outside of the
firewall. Unfortunately, the data rate is such that it's nearly impossible
to gain many insights from the internal packet capture that I can take. This
problem is occurring when connecting to thousands of hosts spread out all over
the internet, so it's highly unlikely that they are all going down at once or
that there is some misconfiguration that occurs every 38 hours. It is indicative
of something systematic happening in the OS, but I can't figure out what it
is.

Are you running iptables on the system in question? What happens if you disable
it?

It's just possible that the state table is filling up so ESTABLISHED,RELATED
packets are no longer being accepted. This would result in the SYN,ACK response
from the remote end being dropped, and a socket hung in the SYN_SENT state.

You can look at the iptables state table using some esoteric magic incantation,
which I can't remember offhand. I should have it in my firewall notes, I'll try
to locate it (it's not something I have to do very often...)
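
For anyone wanting to inspect the state table without the incantation, here
is a small Python sketch that tallies TCP entries by state from conntrack
text. It assumes the usual line layout of /proc/net/ip_conntrack (protocol
name, protocol number, seconds to expiry, then the TCP state); the sample
lines are made up for illustration, and the function name `tcp_states` is
hypothetical.

```python
from collections import Counter


def tcp_states(lines):
    """Count TCP conntrack entries by state (e.g. SYN_SENT, ESTABLISHED)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if "tcp" not in fields:
            continue  # skip udp, icmp and blank lines
        i = fields.index("tcp")
        # Assumed layout: "tcp", protocol number, timeout, state, ...
        if i + 3 < len(fields):
            counts[fields[i + 3]] += 1
    return counts


# Made-up sample entries in the assumed format.
SAMPLE = [
    "tcp      6 431990 ESTABLISHED src=192.0.2.10 dst=198.51.100.7 "
    "sport=40001 dport=80 src=198.51.100.7 dst=192.0.2.10 sport=80 "
    "dport=40001 [ASSURED] use=1",
    "tcp      6 110 SYN_SENT src=192.0.2.10 dst=203.0.113.5 sport=40002 "
    "dport=80 [UNREPLIED] src=203.0.113.5 dst=192.0.2.10 sport=80 "
    "dport=40002 use=1",
    "tcp      6 95 SYN_SENT src=192.0.2.10 dst=203.0.113.9 sport=40003 "
    "dport=80 [UNREPLIED] src=203.0.113.9 dst=192.0.2.10 sport=80 "
    "dport=40003 use=1",
    "udp      17 20 src=192.0.2.10 dst=192.0.2.1 sport=5353 dport=53 use=1",
]

print(dict(tcp_states(SAMPLE)))
```

On a live box the same function could be fed the open file instead of
`SAMPLE`. Reading it normally requires root. If the table were full, the
total would sit near ip_conntrack_max.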
 
Martin Gregorie

jamesnichols3 said:
However, after approximately 38 hours of operation, all outbound
connection attempts get stuck in the SYN_SENT state. It happens
instantaneously, where we go from the baseline of about 60-80 sockets
in SYN_SENT to a count of 200 (corresponding to the # of java threads
that make these calls). I've tried several things to clear this
problem up, including:

1) Restarting the Java application

Are you saying that all sockets are immediately stuck, i.e., no
successful connections at all, after you restart the application?

If so, the problem, as others have said, has to be outside your
application.

OTOH, if it runs for a while and then hangs up again have you tried
periodically closing and re-opening each socket in case something rots
it after a large number of connect/disconnect cycles?
 
jamesnichols3

Yes, I am running iptables. My ip_conntrack_max is set to 65K, so I don't
think I'm filling that up. I can't really disable it during actual
application usage... what I have done is:

1) stop the application
2) run /etc/init.d/iptables stop
3) run /etc/init.d/iptables start
4) Restart the application

And all the outbound connections get stuck in SYN_SENT
 
jamesnichols3

Yes, when I restart the application all of the outbound connections
immediately get stuck in SYN_SENT. One or two might make it out, but 99% get
stuck in SYN_SENT until all of the threads responsible for outbound
connections are stuck waiting on sockets in this state. The sockets are opened
and closed by each of the 200 threads at least every 5 minutes or so.
 
Nigel Wade

jamesnichols3 said:
Yes, I am running iptables. My ip_conntrack_max is set to 65K, so I don't
think I'm filling that up.

Unlikely yes, but why guess? You might be the target of a SYN flood DoS attack,
or have an errant network application or appliance. Take a look
at /proc/net/ip_conntrack.

jamesnichols3 said:
I can't really disable it during actual
application usage...

It would only need to be for a few seconds whilst you ran a packet capture. You
want to be sure that all packets received on the network interface are visible
to wireshark. Is the machine in question plugged into a managed switch? If so
you might be able to set one port to monitoring mode and see all traffic on the
switch allowing you to see traffic to that machine externally.

jamesnichols3 said:
what I have done is:

1) stop the application
2) run /etc/init.d/iptables stop
3) run /etc/init.d/iptables start
4) Restart the application

And all the outbound connections get stuck in SYN_SENT

When in this state what does your iptables state table look like? For each
connection in the SYN_SENT state you should have an equivalent entry in the
ip_conntrack state table. When you start a new connection does it go into the
state table, and in what state? Does this affect other network applications? If
you run wireshark and capture packets whilst a new connection is being
attempted, what does that show?

It really sounds like you have a problem with iptables, or an external
networking appliance. Something is dropping the outbound SYN, or the SYN/ACK
replies. I doubt very much that it's due to Java.
 
jamesnichols3

Unlikely yes, but why guess? You might be the target of a SYN flood DoS attack,
or have an errant network application or appliance. Take a look
at /proc/net/ip_conntrack

Yes, there are only a few thousand entries in ip_conntrack.

It would only need to be for a few seconds whilst you ran a packet capture. You
want to be sure that all packets received on the network interface are visible
to wireshark. Is the machine in question plugged into a managed switch? If so
you might be able to set one port to monitoring mode and see all traffic on the
switch allowing you to see traffic to that machine externally.
what I have done is:
[quoted text clipped - 4 lines]
And all the outbound connections get stuck in SYN_SENT

When in this state what does your iptables state table look like? For each
connection in the SYN_SENT state you should have an equivalent entry in the
ip_conntrack state table. When you start a new connection does it go into the
state table, and in what state? Does this affect other network applications? If
you run wireshark and capture packets whilst a new connection is being
attempted, what does that show?

The connections do end up in ip_conntrack, in SYN_SENT state. This only
affects the outbound webservices traffic. I can ssh into/out of the box and
wget www.google.com, but can't contact the webservice hosts, even using
wget/telnet/etc. It's something at the OS level, so I'm pretty sure that
Java's usage of networking is doing something at the OS level over time.

It really sounds like you have a problem with iptables, or an external
networking appliance. Something is dropping the outbound SYN, or the SYN/ACK
replies. I doubt very much that it's due to Java.

I agree, I think that it is the workload caused by Java that is triggering
something in the OS. It really can't be a router or firewall, as I have
completely rebuilt this part of the infrastructure several times over the
past several years and the problem is still there. The only thing that makes
the problem go away is rebooting the box.
 
Martin Gregorie

jamesnichols3 said:
Yes, when I restart the application all of the outbound connections
immediately get stuck in SYN_SENT. One or two might make it out, but 99% get
stuck in SYN_SENT until all of the threads responsible for outbound
connections are stuck waiting on sockets in this state. The sockets are opened
and closed by each of the 200 threads at least every 5 minutes or so.

In that case I agree with everybody else: the problem is most probably
external to the Java app. However I do have an additional suggestion:

It would be useful to know WHERE the stoppage happens. 'traceroute' may
help here. Running it with the -p option lets you trace the route to a
specific port at the destination and the -T uses SYN to do the probing.

Try running "traceroute -T -p port host" against one of your usual
targets when nothing is stuck and before you start your application.
After that host becomes stuck stop your application and try traceroute
again with the same command line arguments and see how far the second
traceroute gets before it blocks.
 
Nigel Wade

jamesnichols3 said:
The connections do end up in ip_conntrack, in SYN_SENT state. This only
affects the outbound webservices traffic. I can ssh into/out of the box and
wget www.google.com, but can't contact the webservice hosts, even using
wget/telnet/etc. It's something at the OS level, so I'm pretty sure that
Java's usage of networking is doing something at the OS level over time.

It's unlikely to be at the OS level. The OS won't differentiate between Java
opening a socket and ssh opening a socket. Also, it is almost certainly not at the
application level, the SYN has been sent so the request to open the socket has
got to the transport layer.

What your diagnostics show is that a SYN has been sent by the transport layer of
the network stack (this has been detected by iptables, within the kernel).
Where this has gone to you haven't yet established. Without external
diagnostics you are pretty much flying blind. You need to talk to your network
support people and ask them to help you find out what is going on. Either the
SYN is not being delivered to the remote server, or the response is not getting
back to your system. Either way it's a networking problem either at a very low
level in your system, or a routing/firewalling problem between your system and
the remote machine. You need to establish where the SYN, or the SYN/ACK
response, is disappearing.

jamesnichols3 said:
I agree, I think that it is the workload caused by Java that is triggering
something in the OS. It really can't be a router or firewall, as I have
completely rebuilt this part of the infrastructure several times over the
past several years and the problem is still there. The only thing that makes
the problem go away is rebooting the box.

There's a possibility that you are falling foul of some resource limit in the
networking. If previous sockets haven't been fully closed then the remote
server (or its firewall etc.) may not be allowing you to establish a new one.
Re-booting will probably result in a disconnect at the remote end. Have a look
at your netstat and see what network connections there are existing between
your machine and those webservice providers.
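
If netstat output is too noisy at this connection rate, the same information
can be tallied from /proc/net/tcp. The Python sketch below groups sockets by
remote address and state. It assumes the standard Linux layout of that file
(fields: slot, local address, remote address, state code) and a little-endian
machine such as x86, where the hex IPv4 address appears byte-reversed. The
helper names and the sample lines are made up for illustration.

```python
import socket
import struct
from collections import Counter

# Subset of the kernel's TCP state codes as printed in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
}


def decode_ipv4(hex_addr):
    """Turn '050200C0:0050' into ('192.0.2.5', 80); assumes little-endian."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def connections_by_remote(lines):
    """Count sockets per (remote ip, remote port, state)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue  # header line or malformed entry
        ip, port = decode_ipv4(fields[2])
        state = TCP_STATES.get(fields[3], fields[3])
        counts[(ip, port, state)] += 1
    return counts


# Made-up sample in the assumed format (only the first four fields matter).
SAMPLE = [
    "  sl  local_address rem_address   st tx_queue rx_queue",
    "   0: 0A0200C0:9C41 050200C0:0050 02 00000001:00000000",
    "   1: 0A0200C0:9C42 050200C0:0050 01 00000000:00000000",
    "   2: 0A0200C0:9C43 070200C0:0050 02 00000001:00000000",
]

for key, n in sorted(connections_by_remote(SAMPLE).items()):
    print(key, n)
```

Lingering entries for a given webservice host in states other than SYN_SENT
would support the idea that earlier sockets were never fully closed.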
 
John W. Kennedy

Nigel said:
It's unlikely to be at the OS level. The OS won't differentiate between Java
opening a socket and ssh opening a socket.

Windows with the usual sort of anti-virus software will (or will seem to).


--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
 
jamesnichols3

I figured out a countermeasure. When the 38-hour limit was hit and the
connections started to get stuck in SYN_SENT, I disabled tcp_sack in the Linux
kernel. Almost instantly, the SYN_SENT connections cleared up and
connectivity was restored. I believe there is a bug in the tcp_sack
implementation and that, under my application's workload, a memory structure or
something similar fills up after 38 hours and causes this behavior.
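
The post does not say how tcp_sack was switched off. One plausible way,
sketched here in Python, is to write 0 to /proc/sys/net/ipv4/tcp_sack, the
same directory as the tcp_syn_retries and tcp_syncookies settings mentioned in
the first post. The helper names are hypothetical, and writing to the real
file requires root.

```python
from pathlib import Path

# Kernel switch for TCP selective acknowledgements (1 = on, 0 = off).
SACK_PATH = Path("/proc/sys/net/ipv4/tcp_sack")


def read_flag(path=SACK_PATH):
    """Return the current integer value of a sysctl-style file."""
    return int(Path(path).read_text().strip())


def write_flag(value, path=SACK_PATH):
    """Write an integer value to a sysctl-style file."""
    Path(path).write_text("%d\n" % value)


# Example (as root): turn SACK off only if it is currently on.
# if read_flag() == 1:
#     write_flag(0)
```

A change made this way does not survive a reboot, so it would need to be
reapplied at startup for the countermeasure to be permanent.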
 
Nigel Wade

John said:
Windows with the usual sort of anti-virus software will (or will seem to).

But the OP is using iptables (so I presume Linux), which is a simple packet
level filter. It knows nothing about what application generated the packet.
 
