safety of timeout()

leon breedt

hi,

i've seen discussion previously regarding the safety of timeout;
however, as i understand it, this usage is safe (@client is a connected
socket):

begin
  Timeout.timeout(30) do
    line = @client.gets
  end
rescue Timeout::Error
  @client.close
end

am i wrong? all the real work is done elsewhere once complete requests
have been received.

leon
 
Yukihiro Matsumoto

Hi,

In message "Re: safety of timeout()"

|i've seen discussion previously regarding the safety of timeout,
|however, as i understand it, this usage is safe (@client is connected
|socket)

Define 'safety' first. The interpreter should not dump core with this
usage, but there is no guarantee of exactly 30 seconds.

matz.
 
leon breedt

> Define 'safety' first. The interpreter should not dump core with this
> usage, but there is no guarantee of exactly 30 seconds.
i'm not too worried about the precision, more just concerned about
cases that can interrupt the timeout.

like when a signal is received from OS while inside blocking code
wrapped by timeout().

how does the scope of trap() work for these situations? does trap()
work at the level of the real process? or current virtual thread?
 
Yukihiro Matsumoto

Hi,

In message "Re: safety of timeout()"

|like when a signal is received from OS while inside blocking code
|wrapped by timeout().

A signal is received immediately (at the next safe point) and is
delivered to the main thread.

|how does the scope of trap() work for these situations? does trap()
|work at the level of the real process? or current virtual thread?

As stated above, trap handlers run in the main thread.
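A small experiment can confirm where a trap handler runs. This sketch assumes CRuby on a Unix-like OS (USR1 does not exist on Windows); it sends the signal from a non-main thread and records which thread executes the handler.

```ruby
# Assumes CRuby on a Unix-like OS (there is no USR1 signal on Windows).
handler_thread = nil
Signal.trap("USR1") { handler_thread = Thread.current }

worker = Thread.new do
  # Send the signal to our own process from a non-main thread.
  Process.kill("USR1", Process.pid)
end
worker.join

# Give the main thread a moment to reach a safe point and run the handler.
50.times do
  break if handler_thread
  sleep(0.01)
end

puts "handler ran in main thread: #{handler_thread.equal?(Thread.main)}"
```

Even though the worker thread raised the signal, the handler is executed by the main thread.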

matz.
 
Brian Candler

> i've seen discussion previously regarding the safety of timeout;
> however, as i understand it, this usage is safe (@client is a connected
> socket):
>
> begin
>   Timeout.timeout(30) do
>     line = @client.gets
>   end
> rescue Timeout::Error
>   @client.close
> end
>
> am i wrong? all the real work is done elsewhere once complete requests
> have been received.

That's true. The problem is that the timeout thread raises an exception
asynchronously. In the above case your main work is just @client.gets, but
if it were doing something more important, that work could be interrupted.
In particular, even code within an 'ensure' block can be interrupted. It's a
common pattern to use 'ensure' to do cleanup work, but if the timeout occurs
at just the wrong time, the cleanup may not be completed.

This program demonstrates it:

---- 8< ------------------------
require 'timeout'

def bar
  sleep(4)
  raise "wibble" # optional
end

def foo
  bar
ensure
  puts "Cleanup started..."
  sleep(2)                  # the 5-second timeout fires during this sleep
  puts "Cleanup finished"   # so this line is never reached
end

begin
  Timeout.timeout(5) do
    foo
  end
rescue Exception => e
  p e
end
---- 8< ------------------------

You can try it both with and without the 'raise "wibble"' line. In both
cases the cleanup code in the 'ensure' block does not complete.
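Why the ensure block gets cut short is easier to see in a simplified sketch of how a timeout helper can be built. This is not the actual timeout.rb source; `naive_timeout` and `NaiveTimeoutError` are hypothetical names, and the sketch assumes Ruby 2.3 or later (for `&.`). The key point is that a watchdog thread uses `Thread#raise` to inject an exception into the caller, wherever the caller happens to be.

```ruby
# Hypothetical error class for this sketch (the real library raises Timeout::Error).
class NaiveTimeoutError < StandardError; end

# Simplified timeout helper: a watchdog thread sleeps, then raises an
# exception *inside* the calling thread, at whatever line it is running.
def naive_timeout(seconds)
  caller_thread = Thread.current
  watchdog = Thread.new do
    sleep(seconds)
    caller_thread.raise(NaiveTimeoutError, "execution expired")
  end
  yield
ensure
  # Stop the watchdog if the block finished first. There is still a small
  # window between `yield` returning and this line where the raise can land.
  watchdog&.kill
end

log = []
begin
  naive_timeout(0.3) do
    begin
      sleep(0.1)                 # the "real work" finishes in time...
    ensure
      log << :cleanup_started
      sleep(0.5)                 # ...but cleanup is still running at 0.3 s
      log << :cleanup_finished   # never reached: the exception lands above
    end
  end
rescue NaiveTimeoutError
  log << :timed_out
end
p log
```

As in Brian's program, the log shows the cleanup starting and the timeout firing, but `:cleanup_finished` never appears.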

As a more realistic example, imagine some code like this:

Timeout.timeout(30) do
  File.open("mylog", "a") do |f|
    # ... do stuff
  end
end

File.open with a block does several things:
1. open the file
2. yield it to the block
3. close the file in an 'ensure' section
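Those three steps can be written out by hand. This is only an illustration of the pattern, not File's actual source; `open_and_yield` is a hypothetical name.

```ruby
# Hypothetical helper mirroring the three steps File.open performs with a block.
def open_and_yield(path, mode)
  file = File.open(path, mode)  # 1. open the file
  begin
    yield file                  # 2. yield it to the block
  ensure
    file.close                  # 3. close it, even if the block raised.
                                #    An asynchronous timeout that lands in this
                                #    ensure before close runs leaves the file open.
  end
end
```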

If you were really unlucky, and the timeout occurred at exactly the same
time as the 'ensure' section was being executed, then the file could remain
open.

At least, that's what I understand to be the crux of the issue. In many
cases it's not going to be a major concern. You may be able to rewrite the
code to make it safer by pushing the timeouts down to the lowest possible
level. The above example could be rewritten safely as:

File.open("mylog", "a") do |f|
  Timeout.timeout(30) do
    # ... do stuff
  end
end
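Taken all the way down, a timeout can become a deadline on the I/O call itself, so nothing is ever raised asynchronously. The sketch below swaps Timeout for `IO.select` plus `read_nonblock` to read one line from a socket (such as leon's `@client`) with a deadline. The helper name is hypothetical, and it assumes Ruby 2.3 or later for `exception: false`.

```ruby
# Hypothetical helper: read up to a newline, giving up after `seconds`.
# Returns the data read (including the newline), or nil on timeout or EOF.
# Simplification: bytes that arrive after the newline in the same chunk are
# returned too; a real reader would keep them for the next call.
def read_line_with_deadline(io, seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  buffer = String.new
  until buffer.include?("\n")
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return nil if remaining <= 0
    # Wait until the IO is readable, but no longer than the time left.
    return nil unless IO.select([io], nil, nil, remaining)
    chunk = io.read_nonblock(4096, exception: false)
    return nil if chunk.nil?            # EOF before a full line
    next if chunk == :wait_readable     # spurious wakeup; wait again
    buffer << chunk
  end
  buffer
end
```

In leon's case this would be used as `line = read_line_with_deadline(@client, 30)`, closing the socket when it returns nil. The timeout can only "happen" at the `IO.select` call, never inside someone else's ensure block.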

Regards,

Brian.
 
Paul Brannan

> At least, that's what I understand to be the crux of the issue. In many
> cases it's not going to be a major concern. You may be able to rewrite the
> code to make it safer by pushing the timeouts down to the lowest possible
> level. The above example could be rewritten safely as:
>
> File.open("mylog", "a") do |f|
>   Timeout.timeout(30) do
>     # ... do stuff
>   end
> end

I like your description of the problem, Brian. I've tried to explain
this before and not been able to articulate it quite so well.

I would like to add one additional problem to your explanation, though:
timeout exceptions can prevent you from knowing whether or not an
operation actually succeeded. We have this problem with CORBA timeouts
in particular; a remote call is made, and the process on the other side
is running slowly. It does, however, complete the operation, just as
the timeout exception is being fired. So do I treat the operation as a
success (since I know the request reached the other side, as I didn't
get a communications failure exception), or do I treat the operation as a
failure (since I did get an exception and I did fail to get a return
value from the call)?

A solution I've used in the past has been to use an event loop and let
it handle the timeouts. The timeout then occurs only when the event
loop has control; when the timeout does occur, a proc is called that
handles the timeout. This proc may raise an exception or it may take
some other action.

For long-running operations, I periodically yield control to the event
loop when it is safe.
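Paul doesn't show his event loop, so the following is only a hypothetical, minimal sketch of the idea using Fibers (`TinyEventLoop`, `add_timeout` and `spawn` are made-up names; Ruby 2.1+ is assumed for `Process.clock_gettime`). Timer handlers fire only while the loop has control, between task steps, and a long-running task hands control back with `Fiber.yield` at points it considers safe.

```ruby
# Hypothetical minimal event loop: timers fire only between task steps,
# never in the middle of one, so no exception lands inside an ensure block.
class TinyEventLoop
  def initialize
    @timers = []   # [deadline, handler] pairs
    @tasks  = []   # Fibers that yield back at safe points
  end

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  # Run the handler once, the first time the loop has control after `seconds`.
  def add_timeout(seconds, &handler)
    @timers << [now + seconds, handler]
  end

  def spawn(&work)
    # The :finished sentinel tells the loop this task has completed.
    @tasks << Fiber.new { work.call; :finished }
  end

  def run
    until @tasks.empty?
      due, @timers = @timers.partition { |deadline, _| deadline <= now }
      due.each { |_, handler| handler.call }
      # Resume each task until its next Fiber.yield; drop finished ones.
      @tasks.reject! { |task| task.resume == :finished }
    end
  end
end

ev = TinyEventLoop.new
steps = 0
timed_out = false
ev.add_timeout(0.05) { timed_out = true }   # the handler just sets a flag
ev.spawn do
  until timed_out
    steps += 1                               # one unit of long-running work
    sleep(0.01)
    Fiber.yield                              # safe point: let the loop fire timers
  end
end
ev.run
```

Here the timeout handler only sets a flag; as Paul notes, it could instead raise or take some other action, but whatever it does happens at a point the task chose.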

Paul
 
Brian Candler

> I would like to add one additional problem to your explanation, though:
> timeout exceptions can prevent you from knowing whether or not an
> operation actually succeeded. We have this problem with CORBA timeouts
> in particular; a remote call is made, and the process on the other side
> is running slowly. It does, however, complete the operation, just as
> the timeout exception is being fired. So do I treat the operation as a
> success (since I know the request reached the other side, as I didn't
> get a communications failure exception), or do I treat the operation as a
> failure (since I did get an exception and I did fail to get a return
> value from the call)?

I think that's a broader problem, and not specific to Ruby timeouts.

In the simplest case it's a pure race: with a 30 second timeout, what
happens if you get a response after 29.99 seconds or 30.01 seconds? In my
opinion, the borderline case doesn't really matter; you can assert that if a
Timeout::Error is fired, then you did not get a response in time. If the
server subsequently responds after 30.01 or 40 or 50 seconds, then tough; it
was too late, by definition.

However, I guess what you're really worried about is something more
fundamental: did the command actually complete on the CORBA server? Did the
server change state? Should I resubmit the command later?

You can see that this cannot be handled by timeouts alone. For example:

(1) the command might have completed after 27 seconds, but due to network
congestion, the reply did not get back until after 32 seconds. (=> the
command completed in time, but you were unable to detect this)

(2) the command might continue to execute after your timeout exception
fires, and complete after say 31 seconds. Even if the timeout exception then
goes on to drop the CORBA connection or try to chase the command with an
"abort" message of some sort, it's still a race which might be lost.

In order to be able to tell with certainty whether your command was accepted
AND acted upon, I believe you really need to use a sequence-number type of
mechanism, where both ends keep track of which messages they have sent and
have been acknowledged by the other side.

--> submit command N
... timeout
<-- response N "1234" (ignored by client; it was too late)

--> resubmit command N
<-- response N "1234" (from cache)

--> submit command N+1
<-- response N+1 "9876", etc.

Each end needs to keep track of the last command or response sent, so that
it can be resent if necessary. At the server side, if command N is received
a second time, the previous (remembered) response is resent; that's because
the command already executed the first time and changed the system state, so
attempting to perform the command again could fail. The client sending
command N+1 is an implicit acknowledgement that the response from command N
has been received, and no longer needs to be remembered.
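The server-side rule described above can be sketched in a few lines. This is a hypothetical, in-memory illustration (`SequencedEndpoint` and `handle` are made-up names) for one logical connection; it deliberately omits the persistence, sequence-number initialisation and reset procedures that a real protocol would need.

```ruby
# Hypothetical server side of the sequence-number scheme, for one logical
# connection. In-memory only: no persistence, initialisation or reset.
class SequencedEndpoint
  def initialize(&executor)
    @executor = executor   # performs the real, state-changing command
    @last_seq = nil
    @last_response = nil
  end

  def handle(seq, command)
    # Duplicate of the command already executed: resend the remembered
    # response instead of performing the command a second time.
    return @last_response if seq == @last_seq

    unless @last_seq.nil? || seq == @last_seq + 1
      raise ArgumentError, "unexpected sequence number #{seq} (last was #{@last_seq})"
    end
    # Receiving N+1 implicitly acknowledges response N, so it can be forgotten.
    @last_response = @executor.call(command)
    @last_seq = seq
    @last_response
  end
end

executions = 0
server = SequencedEndpoint.new { |command| executions += 1; "result #{executions} for #{command}" }
first  = server.handle(1, "debit 10")   # executed
resent = server.handle(1, "debit 10")   # client timed out and resubmits command N
second = server.handle(2, "debit 5")    # N+1 implicitly acknowledges N
```

The resubmitted command N gets the cached response, and the executor runs only twice for three requests.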

Unfortunately, building a protocol like this *properly* is difficult, and if
done right you will end up with something which looks very much like TCP or
the X.25 link layer. You need procedures to initialise the sequence numbers
and reset them in the case of gross errors, such as one end or the other
forgetting its sequence number. Ideally the sequence numbers and message
buffers should be persistent across application restarts (i.e. they are
stored in a database). Each "logical connection" between two endpoints needs
to be distinct with its own sequence number set. And you will need to choose
appropriate retransmission parameters.

This is a common problem though, and I'd certainly like to see a generic
encapsulation protocol which handles it properly. I think if done right, it
would work over multiple transport layers (e.g. HTTP POST, or even exchanges
of E-mail messages). If anyone knows of such a thing, I'd love to see it.

Regards,

Brian.
 
Ara.T.Howard

> This is a common problem though, and I'd certainly like to see a generic
> encapsulation protocol which handles it properly. I think if done right, it
> would work over multiple transport layers (e.g. HTTP POST, or even exchanges
> of E-mail messages). If anyone knows of such a thing, I'd love to see it.

check out spread:

" Spread is a toolkit that provides a high performance messaging service that
is resilient to faults across external or internal networks. Spread functions
as a unified message bus for distributed applications, and provides highly
tuned application-level multicast and group communication support. Spread
services range from reliable message passing to fully ordered messages with
delivery guarantees, even in case of computer failures and network partitions.
Spread is designed to encapsulate the challenging aspects of asynchronous
networks and enable the construction of scalable distributed applications,
allowing application builders to focus on the differentiating components of
their application."

- http://www.spread.org/
- http://raa.ruby-lang.org/project/rb_spread/

i have a patched version of the latest ruby binding.

regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Paul Brannan

> In order to be able to tell with certainty whether your command was
> accepted AND acted upon, I believe you really need to use a
> sequence-number type of mechanism, where both ends keep track of which
> messages they have sent and have been acknowledged by the other side.

I don't think sequencing messages is sufficient to solve the problem.
A protocol like what you describe provides reliable messaging, but
not much more. For example, suppose I want to fail over to the backup
system if I time out -- I can do this, but I run the risk of performing
the operation more than once. At that point it becomes a question of
policy (can I afford to take that risk, or is that risk truly
necessary?).

If there were any easy solutions, then a lot of real-time researchers
would be out of work.

Paul
 
leon breedt

Hi,

Thanks for the detailed elaborations, folks. Good to know I'm not
unique in finding this non-trivial to do as correctly as possible :)

> I don't think sequencing messages is sufficient to solve the problem.
> A protocol like what you describe provides reliable messaging, but
> not much more. For example, suppose I want to fail over to the backup
> system if I time out -- I can do this, but I run the risk of performing
> the operation more than once. At that point it becomes a question of
> policy (can I afford to take that risk, or is that risk truly
> necessary?).
In my case, I'm lucky enough that each operation requires only one
message from the client to my server, so it becomes a matter of being
able to safely determine the identity of the request so that a
subsequent request with the same identity would be discarded.

In my case, both the primary and backup system would use the same
RDBMS data source to keep track of what's been processed.

This server exists purely to prevent the problem of clients
accidentally submitting the same request twice, in the realm of credit
card payments.

Determining the identity correctly to allow valid second attempts
through is interesting. At the moment, I use serial numbers as well,
but I'm not entirely happy with this, as there was no negotiation
process to obtain these.

I also provide the guarantee to the client app that as soon as I've
acknowledged a request, it's been persisted, and the server will
attempt to process it until it gets a deterministic OK/FAILED result.

So if the client times out before receiving my acknowledgement, and
they resubmit, they'll receive the in-progress error, and can send a
query to determine the status.
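The guarantee leon describes (persist before acknowledging, reject a second request with the same identity, allow a status query) can be sketched as follows. This is a hypothetical illustration: `RequestLedger` and its methods are made-up names, and an in-memory Hash guarded by a Mutex stands in for the shared RDBMS table that a primary/backup pair would actually need.

```ruby
# Hypothetical stand-in for the shared RDBMS table: an in-memory Hash
# guarded by a Mutex. A real primary/backup pair would need the database.
class RequestLedger
  DuplicateRequest = Class.new(StandardError)

  def initialize
    @status = {}   # request_id => :in_progress, :ok or :failed
    @lock = Mutex.new
  end

  # Record the request *before* acknowledging it. Any later request with
  # the same identity is rejected rather than processed a second time.
  def accept(request_id)
    @lock.synchronize do
      if @status.key?(request_id)
        raise DuplicateRequest, "#{request_id} already received (#{@status[request_id]})"
      end
      @status[request_id] = :in_progress
    end
    :acknowledged
  end

  # Called once processing reaches a deterministic OK/FAILED result.
  def complete(request_id, ok)
    @lock.synchronize { @status[request_id] = ok ? :ok : :failed }
  end

  # Lets a client that timed out find out what happened to its request.
  def query(request_id)
    @lock.synchronize { @status[request_id] }
  end
end

ledger = RequestLedger.new
ack = ledger.accept("txn-42")          # client times out before seeing this...
duplicate_rejected =
  begin
    ledger.accept("txn-42")            # ...and resubmits
    false
  rescue RequestLedger::DuplicateRequest
    true
  end
status_before = ledger.query("txn-42")
ledger.complete("txn-42", true)
status_after = ledger.query("txn-42")
```

The resubmission is rejected with an in-progress error, and the client can then poll `query` until the status becomes `:ok` or `:failed`.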

Leon
 
Ara.T.Howard

> What does your patch fix/improve?

a little core dump ;-)


-a