Amit N
Hi guys,
I tend to ramble, and I am afraid none of you busy experts will bother
reading my long post, so I will try to summarize it first:
1. I have a script that processes ~10GB of data daily and runs for a long
time, so I need to parallelize it on a multi-CPU/multicore system. I am
trying to decide on a module/toolkit that would help me create a
multiprocessing solution, but there are so many of them that I can't decide
what to use. I am looking for a cross-platform solution, although right now
it has to work on Windows first, so many of the fork-based modules are out.
I am hoping people with experience using any of these will chime in with
tips. The main things I look for in a toolkit are maturity and no extra
dependencies; plus, a wide user community is always good.
POSH / parallelpython / mpi4py / pyPar / Kamaelia / Twisted... I am so confused.
2. The processing involves multiple steps that each input file has to go
through. I am trying to decide between a batch-mode design and a pipelined
design for concurrency. In the batched design, all files go through one
processing step (in parallel) before the next step is started. In a
pipelined design, each file is taken through all steps to the end, so
multiple files are in parallel pipelines at the same time. I can't decide
which is better. I guess I am asking for experienced eyes to look at the
alternatives for things that I, making my very first concurrent design,
won't see.
DETAILS:
I have been trying to choose a design for this project but am stricken by my
usual case of analysis paralysis.
I decided to learn Python about three weeks ago specifically for this
project, since it needed parsing and text processing, not realizing that I
would also need concurrency. I am having the same trouble deciding which
parser generator to use, but I will ask about parsing in a separate thread
to keep this focused.
The first version of the script was slow, so I tried a multithreaded
version, naively expecting a 2x speedup. I barely got a 5% improvement, and
only then learned about the GIL. I don't have much time invested in this
yet, so I could still switch to another language, but I am not sure which
other scripting languages have real multithreading. Perl? I had chosen
Python over Perl for readability and maintainability and am not ready to
give that up yet. I know about Stackless/IronPython/Jython, but I want to
stick to CPython, so I am going to try to figure this out.
Even after deciding to go for an SMP solution, I still don't know which
toolkit to use. The subprocess module should allow spawning new processes,
but I am not sure how to get status/error codes back from those. I guess
this is why people made those parallel processing modules, which might help
by taking care of these things. I think my application is fairly simple and
should be easy to parallelize.
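From what I can tell from the subprocess docs, exit codes do come back; a
minimal sketch of what I mean (the filter name and arguments here are made
up):

import subprocess

# subprocess.call() blocks and returns the child's exit status directly:
status = subprocess.call(["filterA", "input.dat", "output.dat"])
if status != 0:
    print("FilterA failed with exit code %d" % status)

# Popen gives more control: wait() returns the same exit status
# (also stored in p.returncode), and stderr can be captured too.
p = subprocess.Popen(["filterA", "input.dat", "output.dat"],
                     stderr=subprocess.PIPE)
errors = p.stderr.read()
status = p.wait()

Is that all there is to it, or am I missing something?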
THE TASK:
About 800 files of 10-15 MB each are generated daily and need to be
processed. The processing consists of several steps that every file must go
through:
-Uncompress
-FilterA
-FilterB
-Parse
-Possibly compress parsed files for archival
All files have to be run through both filters. The two filters are
independent of each other and produce output files that need separate
parsers, so the filters can in fact run in parallel, and so can the
subsequent parsers. Furthermore, multiple files can be running in parallel
inside each step, e.g. four files being uncompressed at the same time. I am
using the Python library for uncompressing and will be doing the parsing in
Python too, but the two filters are external console programs that I spawn
in the system shell with subprocess.call(). I guess I can forget about
communicating with those?
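For the two filters specifically, I picture something like this: start both
with Popen so they run at the same time, then wait on each. The command-line
arguments are made up, since only I know my filters' interface:

import subprocess

def run_filters(infile):
    # Popen returns immediately, so both external filters run concurrently;
    # wait() then collects each one's exit status.
    pa = subprocess.Popen(["filterA", infile, infile + ".a"])
    pb = subprocess.Popen(["filterB", infile, infile + ".b"])
    ra = pa.wait()
    rb = pb.wait()
    if ra != 0 or rb != 0:
        raise RuntimeError("filter failed on %s (A=%d, B=%d)"
                           % (infile, ra, rb))
    return infile + ".a", infile + ".b"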
The first method that came to mind was to finish each step on all files
before going to the next. So all files are uncompressed first, using
multiple processes in parallel. Then all files are filtered in parallel,
etc. I guess I would need some sort of queuing system here, to submit files
to the CPUs properly?
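If I went the batched way, I imagine something like the sketch below,
assuming the Pool API of the processing package I mention under
EVALUATIONS; uncompress_one, filter_one and parse_one are placeholders for
my per-file step functions:

from processing import Pool  # the PyPI 'processing' package

def run_batched(files, nworkers=4):
    pool = Pool(processes=nworkers)
    # Finish each step on all files (in parallel) before the next step.
    uncompressed = pool.map(uncompress_one, files)
    filtered = pool.map(filter_one, uncompressed)
    return pool.map(parse_one, filtered)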
The other way would be to have each individual file run through all the
steps, with multiple such "pipelines" running simultaneously in parallel.
It feels like this method will lose cache performance because the code for
all the steps will be loaded at the same time, but I am not sure if I
should be worrying about that. It does have the advantage of "fast
first-out": anything waiting for the results won't have to wait until the
very end, but can start receiving data incrementally from the start (a kind
of streaming?). Pipelined mode may also make it quick to rerun an
individual file in case it had an error.
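Under the same assumptions, the pipelined version might look like the
sketch below. I am assuming processing's Pool offers an imap_unordered
method; if it doesn't, plain map would work, just without the streaming.
process_one reuses the placeholder step functions and the run_filters()
helper from above:

from processing import Pool

def process_one(path):
    # One file through every step, start to finish.
    raw = uncompress_one(path)
    a_out, b_out = run_filters(raw)  # the two filters run concurrently
    return parse_one(a_out), parse_one(b_out)

def run_pipelined(files, nworkers=4):
    pool = Pool(processes=nworkers)
    # imap_unordered (assumed) yields each result as soon as its file
    # finishes, which gives the "fast first-out" streaming behaviour.
    for result in pool.imap_unordered(process_one, files):
        yield result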
So what's the better method?
EVALUATIONS:
POSH - Doesn't seem mature; it was meant as a proof of concept only. People
have reported bugs/problems using it. POSIX only.
delegate/forkmap/pprocess - fork-based, POSIX only.
ParallelPython - Seems to meet all my criteria and is cross-platform. I
will be trying this one first.
remoteD - Claims to be platform-independent, but I don't think so; the code
uses os.fork only. Last updated in 2004 (v0.8).
processing - In beta (v0.33) but looks promising and is cross-platform. It
mimics the threading API using processes. http://www.python.org/pypi/processing
MPI-based modules (probably overkill for my application):
pyPar - Mature, cross-platform. Depends on Numeric Python and needs a C
compiler.
pyMpi - POSIX only. Alpha status. From Lawrence Livermore National
Laboratory. It modifies the interpreter itself to make it multi-node.
mpi4py - Another MPI binding; I know little about it yet.
LINKS & DISCUSSIONS
http://wiki.python.org/moin/ParallelProcessing
http://blog.ianbicking.org/gil-of-doom.html
http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html
http://groups.google.com/group/comp.lang.python/browse_thread/thread/1f5d927d34f8f323/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/332083cdc8bc44b/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/13da24f2d6dc24a9/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/f822ec289f30b26a/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/902dbddfc31b8891
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d8fa9ad770c17c70/