Demian Brecht
Hi all,
Some work that I'm doing atm is in some serious need of
parallelization. As such, I've been digging into the multiprocessing
module more than I've had to before and I had a few questions come up
as a result:
(Running 2.7.5+ on OSX)
1. From what I've read, a new Python interpreter instance is kicked
off for every worker. My immediate assumption was that the file the
code lives in would be reloaded for every instance. After some
digging, this is obviously not the case (print __name__ at the top of
the file only yields a single output line). So I'm assuming there's
some optimization that passes the bytecode around within the
interpreter? How exactly does this work? (I couldn't really find much
in the docs about it; am I just not looking in the right place?)
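For what it's worth, here's a minimal sketch of what I'm seeing (Python 3 syntax, explicitly forcing the POSIX-only "fork" start method so the behavior is deterministic; the default on 2.7/OS X is fork as well). The module-level print fires once, yet the workers are genuinely separate processes:

```python
import multiprocessing
import os

def report(_):
    # Runs in a worker process. Under the "fork" start method the
    # child inherits the parent's memory image, so this module is
    # never re-imported in the child.
    return os.getpid()

# Force "fork" (POSIX only) so the child inherits memory rather than
# re-importing the module, which is what "spawn" would do.
ctx = multiprocessing.get_context("fork")

print("module executed in pid", os.getpid())  # printed exactly once
with ctx.Pool(2) as pool:
    pids = pool.map(report, range(4))
print("distinct worker pids:", sorted(set(pids)))
```

With two workers handling four tasks, you get at most two distinct worker pids, none of them the parent's.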
2. For cases using methods such as map_async/wait, once the bytecode
has been passed into the child process, `target` is called `n` times
until the current queue is empty. Is this correct?
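To make question 2 concrete, here's a small sketch of the pattern I mean (Python 3 syntax, fork start method assumed): each worker loops, pulling the next item off the shared task queue and calling `target` on it until the queue is drained.

```python
from multiprocessing import get_context

def square(n):
    # Called once per queued item; each worker keeps pulling the
    # next task until the shared task queue is empty.
    return n * n

ctx = get_context("fork")
with ctx.Pool(2) as pool:
    result = pool.map_async(square, range(6))
    result.wait()        # block until every task has run
    print(result.get())  # results come back in input order
```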
3. Because __main__ is only run in the root process, if you're using
global, READ-ONLY objects, such as, say, a database connection, then
it might be better from a performance standpoint to initialize them
in main and rely on the references being passed down to the child
processes correctly. I've read some blogs suggesting that you should
instead create a new database connection within your child process
targets (or code called by the targets). That seems less than optimal
to me if my assumption is correct.
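One middle ground I've come across (hedging here, since I haven't benchmarked it) is the Pool `initializer` argument: the connection is created once per worker process rather than once per task, which avoids both the per-task setup cost and sharing one connection across processes. A sketch using sqlite3 purely as a stand-in for a real database:

```python
import sqlite3
from multiprocessing import get_context

_conn = None  # one connection per worker process, not per task

def init_worker():
    # Runs exactly once in each worker when the pool starts; every
    # task that worker later executes reuses this connection.
    global _conn
    _conn = sqlite3.connect(":memory:")

def query(n):
    # Illustrative no-op query: just echoes the parameter back.
    return _conn.execute("SELECT ?", (n,)).fetchone()[0]

ctx = get_context("fork")
with ctx.Pool(2, initializer=init_worker) as pool:
    results = pool.map(query, range(4))
print(results)
```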
4. Related to 3: read-only objects that are initialized prior to being
passed into a subprocess are safe to reuse as long as they are
treated as immutable. Any objects that do need to be mutated and seen
across processes should use one of the shared memory features.
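Here's the kind of thing I mean by the shared memory features (a minimal sketch with `Value`; a plain global counter would be copied into each child, and the parent's copy would never see the children's writes):

```python
from multiprocessing import get_context

def bump(total, n):
    # total is a shared ctypes integer; the built-in lock guards the
    # read-modify-write so concurrent children don't lose updates.
    with total.get_lock():
        total.value += n

ctx = get_context("fork")
total = ctx.Value("i", 0)  # "i" = signed int, initialized to 0
procs = [ctx.Process(target=bump, args=(total, n)) for n in range(5)]
for p in procs:
    p.start()
for p in procs:
    p.join()
print(total.value)  # 0 + 1 + 2 + 3 + 4
```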
Is this more or less correct, or am I just off my rocker?
Thanks,