client-server parallelised number crunching


Hans Georg Schaathun

I wonder if anyone has any experience with this ...

I try to set up a simple client-server system to do some number
crunching, using a simple ad hoc protocol over TCP/IP. I use
two Queue objects on the server side to manage the input and the output
of the client process. A basic system running seemingly fine on a single
quad-core box was surprisingly simple to set up, and it seems to give
me a reasonable speed-up of a factor of around 3-3.5 using four client
processes in addition to the master process. (If anyone wants more
details, please ask.)
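
For concreteness, here is a minimal sketch of the kind of arrangement I
mean; the details (the names task_queue, result_queue and handle_client,
the port, and the pickle-over-TCP exchange) are illustrative rather than
the actual code:

    import pickle
    import queue      # the Queue module on Python 2
    import socket
    import threading

    task_queue = queue.Queue()    # jobs waiting to be handed to a client
    result_queue = queue.Queue()  # results coming back for the master to process

    def handle_client(conn):
        # One networking thread per connected client process.
        try:
            while True:
                task = task_queue.get()
                conn.sendall(pickle.dumps(task))
                # NB: a single recv() is not guaranteed to return a whole
                # pickle; a real protocol needs message framing.
                result_queue.put(pickle.loads(conn.recv(1 << 20)))
                # If the client dies between get() and put(), the task is
                # silently lost, which is the problem discussed below.
        finally:
            conn.close()

    def serve(port=5000):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("", port))
        srv.listen(5)
        while True:
            conn, _addr = srv.accept()
            threading.Thread(target=handle_client, args=(conn,), daemon=True).start()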

Now, I would like to use remote hosts as well, more precisely, student
lab boxen which are rather unreliable. By experience I'd expect to
lose roughly 4-5 jobs in 100 CPU hours on average. Thus I need some
way of detecting lost connections and requeuing unfinished tasks,
avoiding any serious delays in this detection. What is the best way to
do this in Python?

It is, of course, possible for the master thread, upon processing the
results, to requeue the tasks for any missing results, but it seems
to me to be a cleaner solution if I could detect disconnects and
requeue the tasks from the networking threads. Is that possible
using Python sockets?

Somebody will probably ask why I am not using one of the multiprocessing
libraries. I have tried at least two, and got trapped by the overhead
of passing complex pickled objects across. Doing it myself has at least
helped me clarify what can be parallelised effectively. Now,
understanding the parallelisable subproblems better, I could try again,
if I could trust these libraries to handle lost clients robustly.
Whether I can, I don't know.

Any ideas?
TIA
 

Chris Angelico

: It is, of course, possible for the master thread upon processing the
: results, to requeue the tasks for any missing results, but it seems
: to me to be a cleaner solution if I could detect disconnects and
: requeue the tasks from the networking threads.  Is that possible
: using python sockets?
:
: Somebody will probably ask why I am not using one of the multiprocessing
: libraries.  I have tried at least two, and got trapped by the overhead
: of passing complex pickled objects across.

If I were doing this, I would devise my own socket-layer protocol and
not bother with pickling objects at all. The two ends would read and
write the byte stream and interpret it as data.
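
A minimal sketch of such a hand-rolled byte-stream protocol, assuming a
4-byte length prefix per message (just one possible framing, not
necessarily what Chris has in mind):

    import struct

    def send_msg(sock, payload):
        # Prefix each message with a 4-byte big-endian length, then the raw bytes.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_exact(sock, n):
        # Read exactly n bytes, or return None if the peer disconnects first.
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                return None
            buf += chunk
        return buf

    def recv_msg(sock):
        # Read one length-prefixed message; None signals a closed connection.
        header = recv_exact(sock, 4)
        if header is None:
            return None
        (length,) = struct.unpack("!I", header)
        return recv_exact(sock, length)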

But question: Why are you doing major number crunching in Python? On
your quad-core machine, recode in C and see if you can do the whole
job without bothering the unreliable boxen at all.

Chris Angelico
 

Dan Stromberg

: I wonder if anyone has any experience with this ...
:
: I try to set up a simple client-server system to do some number
: crunching, using a simple ad hoc protocol over TCP/IP.  I use
: two Queue objects on the server side to manage the input and the output
: of the client process.  A basic system running seemingly fine on a single
: quad-core box was surprisingly simple to set up, and it seems to give
: me a reasonable speed-up of a factor of around 3-3.5 using four client
: processes in addition to the master process.  (If anyone wants more
: details, please ask.)
:
: Now, I would like to use remote hosts as well, more precisely, student
: lab boxen which are rather unreliable.  By experience I'd expect to
: lose roughly 4-5 jobs in 100 CPU hours on average.  Thus I need some
: way of detecting lost connections and requeue unfinished tasks,
: avoiding any serious delays in this detection.  What is the best way to
: do this in python?
:
: It is, of course, possible for the master thread upon processing the
: results, to requeue the tasks for any missing results, but it seems
: to me to be a cleaner solution if I could detect disconnects and
: requeue the tasks from the networking threads.  Is that possible
: using python sockets?
:
: Somebody will probably ask why I am not using one of the multiprocessing
: libraries.  I have tried at least two, and got trapped by the overhead
: of passing complex pickled objects across.  Doing it myself has at least
: helped me clarify what can be parallelised effectively.  Now,
: understanding the parallelisable subproblems better, I could try again,
: if I can trust that these libraries can robustly handle lost clients.
: That I don't know if I can.

You probably should assign a unique identifier to each piece of work,
and implement two timeouts - one on your socket, using select or poll
or similar, and one for the pieces of work based on the identifier.

http://gengnosis.blogspot.com/2007/01/level-triggered-and-edge-triggered.html
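
A rough sketch of that two-timeout idea, under assumptions of my own: a
pending dict keyed by task id, settimeout() standing in for select/poll,
and send_msg/encode_task as placeholders for whatever wire format is used:

    import time
    import uuid

    TASK_TIMEOUT = 600.0   # seconds a task may stay outstanding before requeuing
    pending = {}           # task_id -> (task, deadline); guard with a lock if threaded

    def dispatch(conn, task):
        # Send one task to a client and remember when it is due back.
        task_id = uuid.uuid4().hex
        pending[task_id] = (task, time.time() + TASK_TIMEOUT)
        conn.settimeout(30.0)                        # socket-level timeout (or select/poll)
        send_msg(conn, encode_task(task_id, task))   # placeholders for the wire format

    def requeue_stale(todo_queue):
        # Run periodically: push tasks whose deadline has passed back onto the queue.
        now = time.time()
        for task_id, (task, deadline) in list(pending.items()):
            if now > deadline:
                del pending[task_id]
                todo_queue.put(task)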
 

Dan Stromberg

: But question: Why are you doing major number crunching in Python? On
: your quad-core machine, recode in C and see if you can do the whole
: job without bothering the unreliable boxen at all.

Hmm, or try Cython or PyPy. ^_^

Here's that graph again:
http://stromberg.dnsalias.org/~dstromberg/backshift/performance/

I'd suggest that rewriting an entire software system in C because of
one inner loop is overkill. Better to rewrite just the inner loop
(if needed, after profiling) and leave the rest in Python.
 

Chris Angelico

: Hmm, or try Cython or PyPy.  ^_^

Sure, or that. I automatically think in terms of coding in C++ for
performance, but that's because I'm fluent in it. If you're not, then
yep, PyPy or Cython will do better.

Chris Angelico
 

Hans Georg Schaathun

: But question: Why are you doing major number crunching in Python? On
: your quad-core machine, recode in C and see if you can do the whole
: job without bothering the unreliable boxen at all.

The reason is very simple. I cannot afford the time to code it in C.
Furthermore, the work is research and the system is experimental,
making the legibility of the code paramount.

: I'd suggest that rewriting an entire software system in C because of
: one inner loop, is overkill. 'better to rewrite just the inner loop
: (if needed after profiling), and leave the rest in Python.

Well, that's the other reason. The most intense number crunching in
this part of the project consists of basic vector and matrix operations
done in numpy. AFAIU that means I use exactly the same libraries
underneath as I would have done in C. Please correct me if I am wrong.

I could run a profiler and squeeze out some performance by coding
another couple of components in C, but I cannot afford the programming
time whereas I can afford to waste the CPU cycles. And once I get the
client/server parallelisation working, I shall be able to reuse it at
negligible cost on other subproblems, whereas the profiling and C
reimplementation would cost almost as much time for every subsystem.

Does that answer your question, Chris?
 

Hans Georg Schaathun

: Without knowledge of what you're doing it's hard to comment
: intelligently,

I need to calculate map( foobar, L ) where foobar() is a pure function
with no dependency on the global state, L is a list of tuples, each
containing two numpy arrays, currently 500-1000 floats each + a scalar
or two. The result is a pair of floats.

The foobar() function is sufficiently heavy to demonstrably merit
parallelisation.
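
In outline, with a toy stand-in for the real foobar() (the shapes and the
arithmetic below are made up, only the structure matters), the serial
computation being parallelised looks like this:

    import numpy as np

    def foobar(task):
        # Stand-in for the real pure function: takes (array_a, array_b, scalar),
        # returns a pair of floats.
        a, b, s = task
        return float(np.dot(a, b) * s), float(np.linalg.norm(a - b))

    L = [(np.random.rand(800), np.random.rand(800), 0.5) for _ in range(1000)]
    results = list(map(foobar, L))   # the computation to be farmed out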

The CPU-s I have available to spread the load further are not clustered.
They are prone to crash without warning and I do not have root access.
I don't have exclusive use. I do not even have physical access, so I
cannot use a liveCD. (They are, however, equipped with a batch queue
system (torque).)

: but I'd try something like CHAOS or OpenSSI to see if
: you can't get what you need for free, if that doesn't do it then try
: dropping a liveCD with Hadoop on it in each machine and running it
: that way. If that can't work, try MPI. If you've gotten that far and
: nothing does the trick then you're probably going to have to give more
: details.

TANSTAFL :)
There is always the learning curve

If I understand it correctly, openSSI requires root access; is that
right? For CHAOS I need more details to be able to google; I found
a fractals toolbox, but that did not seem relevant :)

MPI I have tried before. Unless there is a new, massively more
sophisticated MPI library around now, I would certainly have to
do my own code to cope with lost clients.

Hadoop sounds interesting. I had encountered it before, but did not
think about it. However, the liveCD is clearly not an option. Thanks
for the tip; I'll read up on map-reduce at least.

-- Hans Georg
 

geremy condra

: : Without knowledge of what you're doing it's hard to comment
: : intelligently,
:
: I need to calculate map( foobar, L ) where foobar() is a pure function
: with no dependency on the global state, L is a list of tuples, each
: containing two numpy arrays, currently 500-1000 floats each + a scalar
: or two.  The result is a pair of floats.
:
: The foobar() function is sufficiently heavy to merit demonstratably
: parallellisation.

This sounds like a Hadoop job, with the caveat that you still have to
get your objects across the network somehow. Have you tried xdrlib or
the struct module? I suspect either would save you some time.
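
For example, a rough sketch with struct, assuming the arrays are float64;
pack_task/unpack_task are hypothetical helpers of my own, not part of
either module:

    import struct
    import numpy as np

    def pack_task(a, b, scalar):
        # Flatten two float64 arrays and a scalar into one byte string.
        return (struct.pack("!IId", len(a), len(b), scalar)
                + a.astype(np.float64).tobytes()
                + b.astype(np.float64).tobytes())

    def unpack_task(buf):
        n_a, n_b, scalar = struct.unpack_from("!IId", buf)
        offset = struct.calcsize("!IId")
        # frombuffer returns read-only views; call .copy() if they must be writable.
        a = np.frombuffer(buf, dtype=np.float64, count=n_a, offset=offset)
        b = np.frombuffer(buf, dtype=np.float64, count=n_b, offset=offset + 8 * n_a)
        return a, b, scalar
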
: The CPU-s I have available to spread the load further are not clustered.
: They are prone to crash without warning and I do not have root access.
: I don't have exclusive use.  I do not even have physical access, so I
: cannot use a liveCD.  (They are, however, equipped with a batch queue
: system (torque).)

Hmm. I guess I'd boil it down to this: if you have the ability to
install software on them, give Hadoop a try. If, OTOH, you can't
disturb normal lab operation at all and need a lot of CPU power, you
should probably start weighing what your time is worth against the
cost of firing up a few EC2 instances and being done with it; I use it
for cryptanalytic work with a similar structure all the time, and you
can get a hell of a cluster going over there for about $15-20 an hour.
If you're an academic (it sounds like you are) you may also be able to
use things like PlanetLab and Emulab, which are free and reasonably
easy to use.
: : but I'd try something like CHAOS or OpenSSI to see if
: : you can't get what you need for free, if that doesn't do it then try
: : dropping a liveCD with Hadoop on it in each machine and running it
: : that way.  If that can't work, try MPI. If you've gotten that far and
: : nothing does the trick then you're probably going to have to give more
: : details.
:
: TANSTAFL :)
: There is always the learning curve
:
: If I understand it correctly, openSSI requires root access; is that
: right?  For CHAOS I need more details to be able to google; I found
: a fractals toolbox, but that did not seem relevant :)

OpenSSI and CHAOS are both Single System Image clustering solutions;
they're pretty cool, but you pretty much need to be able to run a live
CD to make it worth your time.
: MPI I have tried before.  Unless there is a new, massively more
: sophisticated MPI library around now, I would certainly have to
: do my own code to cope with lost clients.

Sandia labs has some neat work in this area, but if hadoop fits your
computational model it will be much easier on you in terms of
implementation.
: Hadoop sounds interesting.  I had encountered it before, but did not
: think about it.  However, the liveCD is clearly not an option.  Thanks
: for the tip; I'll read up on map-reduce at least.

Np, hope it solves things for you ;)

Geremy Condra
 

Thomas Rachel

On 26.04.2011 21:55, Hans Georg Schaathun wrote:
: Now, I would like to use remote hosts as well, more precisely, student
: lab boxen which are rather unreliable. By experience I'd expect to
: lose roughly 4-5 jobs in 100 CPU hours on average. Thus I need some
: way of detecting lost connections and requeue unfinished tasks,
: avoiding any serious delays in this detection. What is the best way to
: do this in python?

As far as I understand, you acquire a job, send it to a remote host via
a socket and then wait for the answer. Is that correct?

In this case, I would put running jobs together with the respective
socket in a "running queue". If you detect a broken connection, put that
job into the "todo" queue again.

: ... if I could detect disconnects and
: requeue the tasks from the networking threads. Is that possible
: using python sockets?

Of course, why not? It might depend on some settings you set (keepalive
etc.); but generally you should get an exception when trying a
communication over a disconnected connection (over a disconnection? ;-))
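
On the settings side, a sketch of the knobs involved; the values are
arbitrary choices of mine, and the TCP_KEEP* fine tuning shown is
Linux-specific:

    import socket

    def tune_socket(sock):
        # Time out blocking recv()/send() calls instead of hanging forever.
        sock.settimeout(30.0)
        # Ask the kernel to probe idle connections so that dead peers are noticed.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        if hasattr(socket, "TCP_KEEPIDLE"):      # the fine tuning is Linux-only
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)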

When going over the network, avoid pickling. Better to use your own protocol.


Thomas
 

Hans Georg Schaathun

: This sounds like a hadoop job, with the caveat that you still have to
: get your objects across the network somehow. Have you tried xdrlib or
: the struct module? I suspect either would save you some time.

Packing the objects up does not appear as a problem at the moment.
Pickled objects work just fine as long as I stick to relatively
simple data types and avoid any redundant information from my own
data structure. When the CPU-s run at about 85% with a naive com's
format, I am not going to waste any time to speed up com's.

The problem would occur when a client holding a task crashes.

: Hmm. I guess I'd boil it down to this: if you have the ability to
: install software on them, give hadoop a try. If, OTOH you can't
: disturb normal lab operation at all and need a lot of CPU power, you
: should probably start weighing what your time is worth against the
: cost of firing up a few EC2 instances and being done with it- I use it
: for cryptanalytic work with a similar structure all the time, and you
: can get a hell of a cluster going over there for about $15-20 an hour.

That is an interesting thought for future work, but with the trial
and error I do at the moment, the cost of EC2 would be quite
significant, especially when I have so many idle CPU-s at my fingertips.

: If you're an academic (it sounds like you are) you may also be able to
: use things like planetlab and emulab, which are free and reasonably
: easy to use.

I am.

Based on other answers, I should be able to hack together the necessary
error checking to handle lost clients with my current solution using
the standard socket and threading API-s, with the added benefit that
familiarity with sockets and threads will be useful for teaching in
the foreseeable future.

It seems that all the tools you suggest will require much more reading
with less potential for reusing the knowledge. It will be very useful
when I can commit more resources for a longer period of time, and
certainly worth remembering for any fully-funded project in the future.
 

Hans Georg Schaathun

: As far as I understand, you acquire a job, send it to a remote host via
: a socket and then wait for the answer. Is that correct?

That's correct. And the client initiates the connection. At the
moment, I use one thread per connection, and don't really want to
spend the time to figure out how to do it singlethreadedly.

: In this case, I would put running jobs together with the respective
: socket in a "running queue". If you detect a broken connection, put that
: job into the "todo" queue again.

Probably not a queue. Maybe a dictionary. After all I need to look up
the job later.

: Of course, why not? It might depend on some settings you set (keepalive
: etc.); but generally you should get an exception when trying a
: communication over a disconnected connection (over a disconnection? ;-))

There are several challenges there and more than one solution.
One concern of mine, which is not necessarily justified, is that I
do not quite know if I know of all the possible error cases.

I suppose your solution works if I run a relatively short timeout
on the server and send regular ping messages from the client.
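
Roughly along these lines, assuming the framing discussed earlier and an
arbitrary PING message type of my own choosing:

    import threading

    def pinger(sock, stop_event, interval=20.0):
        # Client side: run in a background thread while foobar() is crunching.
        while not stop_event.wait(interval):
            send_msg(sock, b"PING")            # framing as sketched earlier

    def read_result(conn):
        # Server side: more than 60 s of silence is treated as a lost client.
        conn.settimeout(60.0)                  # recv then raises socket.timeout
        while True:
            msg = recv_msg(conn)
            if msg is None:
                raise OSError("client disconnected")
            if msg != b"PING":                 # ignore keep-alives, return the result
                return msg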
 

Chris Angelico

: : As far as I understand, you acquire a job, send it to a remote host via
: : a socket and then wait for the answer. Is that correct?
:
: That's correct.  And the client initiates the connection.  At the
: moment, I use one thread per connection, and don't really want to
: spend the time to figure out how to do it singlethreadedly.

Threads work quite happily in Python if they're I/O bound, which they
will be if they're waiting for the client to do something.

ChrisA
 

Hans Georg Schaathun

: > That's correct.  And the client initiates the connection.  At the
: > moment, I use one thread per connection, and don't really want to
: > spend the time to figure out how to do it singlethreadedly.
:
: Threads work quite happily in Python if they're I/O bound, which they
: will be if they're waiting for the client to do something.

Quite. I was referring to some tutorials and documentation recommending
to use non-blocking sockets and select() within a single thread. I
cannot say that I understand why, but I can imagine the benefit with
heavy traffic.
 

Chris Angelico

: Quite.  I was referring to some tutorials and documentation recommending
: to use non-blocking sockets and select() within a single thread.  I
: cannot say that I understand why, but I can imagine the benefit with
: heavy traffic.

Others will doubtless correct me if I'm wrong, but as I see it, select
is the better option if your code is mostly stateless, while threading
is easier if you want to maintain state and processing order. For
instance, with my MUD, I have one thread dedicated to each client; if
the client sends a series of commands, they will automatically queue
and be processed sequentially. On the other hand, if you run a server
where all connections are identical, it's easier to use select() and a
single thread; every time select returns, you process whichever
sockets can be processed, and move on.

I don't know if the considerations become different when you have
insane numbers of simultaneous connections; how well does your process
cope with ten thousand threads? A couple of million? In Python, it'll
probably end up pretty similar; chances are you won't be taking much
advantage of multiple CPUs/cores (because the threads will all be
waiting for socket read, or the single thread will mostly be waiting
in select()), so it's mainly a resource usage issue. Probably worth
testing with your actual code.
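
For reference, a bare-bones version of the select() pattern those
tutorials have in mind: a single thread, one readiness check across all
sockets, and handle_chunk as a placeholder for the application logic:

    import select
    import socket

    def serve_select(port=5000):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("", port))
        srv.listen(5)
        srv.setblocking(False)
        clients = []
        while True:
            readable, _, _ = select.select([srv] + clients, [], [])
            for sock in readable:
                if sock is srv:                   # new connection
                    conn, _addr = srv.accept()
                    conn.setblocking(False)
                    clients.append(conn)
                else:                             # data (or EOF) from a client
                    data = sock.recv(4096)
                    if not data:
                        clients.remove(sock)
                        sock.close()
                    else:
                        handle_chunk(sock, data)  # placeholder for real handling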

Chris Angelico
 

Hans Georg Schaathun

: thousand threads? a couple of million? In Python, it'll probably end
: up pretty similar; chances are you won't be taking much advantage of
: multiple CPUs/cores (because the threads will all be waiting for
: socket read, or the single thread will mostly be waiting in select()),
: so it's mainly a resource usage issue. Probably worth testing with
: your actual code.

For my own application, the performance issue is rather negligible.
I don't have more than about 50 idle CPU-s which I can access easily,
and even if I had, it would always stop at 6-7000 function calls
to evaluate. In the current test runs using 4 clients and one master
on a quad-core, the master never uses more than around 7% of a core,
and that includes some simple post-processing as well as com's.

In short, my philosophy is that it works, so why change it?

But I am aware that some more technically adept programmers think
otherwise, and I am quite happy with that :)
 
