Multiple scripts versus single multi-threaded script

JL

What is the difference between running multiple Python scripts and a single multi-threaded script? What are the pros and cons of each approach? Right now, my preference is to run multiple separate Python scripts because it is simpler.
 
Roy Smith

JL said:
What is the difference between running multiple Python scripts and a single
multi-threaded script? What are the pros and cons of each approach? Right
now, my preference is to run multiple separate Python scripts because it is
simpler.

First, let's take a step back and think about multi-threading vs.
multi-processing in general (i.e. in any language).

Threads are lighter-weight. That means it's faster to start a new
thread (compared to starting a new process), and a thread consumes fewer
system resources than a process. If you have lots of short-lived tasks
to run, this can be significant. If each task will run for a long time
and do a lot of computation, the cost of startup becomes less of an
issue because it's amortized over the longer run time.
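
A quick, unscientific way to see that startup cost is to time a batch of
do-nothing threads against a batch of do-nothing processes. This is a sketch
for illustration, not from the original posts, and the gap varies a lot by
OS, as later replies point out:

import time
import threading
import multiprocessing

def tiny_task():
    # deliberately does nothing; we only want to measure startup/teardown
    pass

def time_workers(worker_class, count=200):
    start = time.time()
    workers = [worker_class(target=tiny_task) for _ in range(count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.time() - start

if __name__ == '__main__':   # guard required by multiprocessing on Windows
    print("threads:   %.3f s" % time_workers(threading.Thread))
    print("processes: %.3f s" % time_workers(multiprocessing.Process))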

Threads can communicate with each other in ways that processes can't.
For example, file descriptors are shared by all the threads in a
process, so one thread can open a file (or accept a network connection),
then hand the descriptor off to another thread for processing. Threads
also make it easy to share large amounts of data because they all have
access to the same memory. You can do this between processes with
shared memory segments, but it's more work to set up.
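
For example, an acceptor thread can pass accepted sockets to worker threads
through a shared queue, since every thread sees the same file descriptors.
A minimal sketch, not from the original posts; the port and thread count are
arbitrary, and the module is spelled Queue on Python 2:

import queue
import socket
import threading

connections = queue.Queue()

def worker():
    # each worker pulls a ready connection off the queue and handles it
    while True:
        conn = connections.get()
        try:
            conn.sendall(b"handled by another thread\n")
        finally:
            conn.close()

def acceptor(server):
    # accept in one thread, hand the socket object to whichever worker is free
    while True:
        conn, addr = server.accept()
        connections.put(conn)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 8888))    # arbitrary local port for the sketch
server.listen(5)

for _ in range(4):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

acceptor(server)    # run the accept loop in the main thread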

The downside to threads is that all of this sharing makes them much
more complicated to use properly. You have to be aware of how all the
threads are interacting, and mediate access to shared resources. If you
do that wrong, you get memory corruption, deadlocks, and all sorts of
(extremely) difficult to debug problems. A lot of the really hairy
problems (i.e. things like one thread continuing to use memory which
another thread has freed) are solved by using a high-level language like
Python which handles all the memory allocation for you, but you can
still get deadlocks and data corruption.

So, the full answer to your question is very complicated. However, if
you're looking for a short answer, I'd say just keep doing what you're
doing using multiple processes and don't get into threading.
 
Chris Angelico

JL said:
What is the difference between running multiple Python scripts and a single
multi-threaded script? What are the pros and cons of each approach? Right
now, my preference is to run multiple separate Python scripts because it is
simpler.

(Caveat: The below is based on CPython. If you're using IronPython,
Jython, or some other implementation, some details may be a little
different.)

Multiple threads can share state easily by simply referencing each
other's variables, but the cost of that is that they'll never actually
execute simultaneously. If you want your scripts to run in parallel on
multiple CPUs/cores, you need multiple processes. But if you're doing
something I/O bound (like servicing sockets), threads work just fine.

As to using separate scripts versus the multiprocessing module, that's
purely a matter of what looks cleanest. Do whatever suits your code.
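
For reference, a single script driving multiple worker processes with the
multiprocessing module looks roughly like this. A sketch for illustration;
the worker function and inputs are made up, and because each worker is a
separate process the chunks really can run on separate cores despite the GIL:

import multiprocessing

def cpu_bound(n):
    # placeholder for real work: sum of squares up to n
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    inputs = [2000000, 3000000, 4000000, 5000000]
    pool = multiprocessing.Pool()           # one worker process per CPU by default
    results = pool.map(cpu_bound, inputs)   # runs in parallel across the workers
    pool.close()
    pool.join()
    print(results)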

ChrisA
 
Chris Angelico

Roy Smith said:
The downside to threads is that all of this sharing makes them much
more complicated to use properly. You have to be aware of how all the
threads are interacting, and mediate access to shared resources. If you
do that wrong, you get memory corruption, deadlocks, and all sorts of
(extremely) difficult to debug problems. A lot of the really hairy
problems (i.e. things like one thread continuing to use memory which
another thread has freed) are solved by using a high-level language like
Python which handles all the memory allocation for you, but you can
still get deadlocks and data corruption.

With CPython, you don't have any headaches like that; you have one
very simple protection, a Global Interpreter Lock (GIL), which
guarantees that no two threads will execute Python code
simultaneously. No corruption, no deadlocks, no hairy problems.

ChrisA
 
Roy Smith

Chris Angelico said:
With CPython, you don't have any headaches like that; you have one
very simple protection, a Global Interpreter Lock (GIL), which
guarantees that no two threads will execute Python code
simultaneously. No corruption, no deadlocks, no hairy problems.

ChrisA

Well, the GIL certainly eliminates a whole range of problems, but it's
still possible to write code that deadlocks. All that's really needed
is for two threads to try to acquire the same two resources, in
different orders. I'm running the following code right now. It appears
to be doing a pretty good imitation of a deadlock. Any similarity to
current political events is purely intentional.

import threading
import time

lock1 = threading.Lock()
lock2 = threading.Lock()

class House(threading.Thread):
    def run(self):
        print "House starting..."
        lock1.acquire()
        time.sleep(1)
        lock2.acquire()
        print "House running"
        lock2.release()
        lock1.release()

class Senate(threading.Thread):
    def run(self):
        print "Senate starting..."
        lock2.acquire()
        time.sleep(1)
        lock1.acquire()
        print "Senate running"
        lock1.release()
        lock2.release()

h = House()
s = Senate()

h.start()
s.start()

Similarly, I can have data corruption. I can't get memory corruption in
the way you can get in a C/C++ program, but I can certainly have one
thread produce data for another thread to consume, and then
(incorrectly) continue to mutate that data after it relinquishes
ownership.

Let's say I have a Queue. A producer thread pushes work units onto the
Queue and a consumer thread pulls them off the other end. If my
producer thread does something like:

work = {'id': 1, 'data': "The Larch"}
my_queue.put(work)
work['id'] = 3

I've got a race condition where the consumer thread may get an id of
either 1 or 3, depending on exactly when it reads the data from its end
of the queue (more precisely, exactly when it uses that data).
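
A runnable sketch of that race, assembled for illustration rather than taken
from Roy's post (the module is spelled Queue on Python 2): the consumer
almost always reports 3 even though an id of 1 was put on the queue, because
the producer's later mutation reaches through the shared reference; on an
unlucky run it could report 1 instead.

import queue
import threading

my_queue = queue.Queue()

def producer():
    work = {'id': 1, 'data': "The Larch"}
    my_queue.put(work)    # the consumer now holds a reference to the SAME dict
    work['id'] = 3        # race: this mutation may or may not be seen

def consumer():
    work = my_queue.get()
    print("consumer saw id = %d" % work['id'])

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()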

Here's a somewhat different example of data corruption between threads:

import threading
import random
import sys

sketch = "The Dead Parrot"

class T1(threading.Thread):
    def run(self):
        current_sketch = str(sketch)
        while 1:
            if sketch != current_sketch:
                print "Blimey, it's changed!"
                return

class T2(threading.Thread):
    def run(self):
        global sketch
        sketches = ["Piranha Brothers",
                    "Spanish Inquisition",
                    "Lumberjack"]
        while 1:
            sketch = random.choice(sketches)

t1 = T1()
t2 = T2()
t2.daemon = True

t1.start()
t2.start()

t1.join()
sys.exit()
 
Chris Angelico

Roy Smith said:
Well, the GIL certainly eliminates a whole range of problems, but it's
still possible to write code that deadlocks. All that's really needed
is for two threads to try to acquire the same two resources, in
different orders. I'm running the following code right now. It appears
to be doing a pretty good imitation of a deadlock. Any similarity to
current political events is purely intentional.

Right. Sorry, I meant that the GIL protects you from all that
happening in the lower level code (even lower than the Senate, here),
but yes, you can get deadlocks as soon as you acquire locks. That's
nothing to do with threading; you can have the same issues with
databases, file systems, or anything else that lets you lock
something. It's a LOT easier to deal with deadlocks or data corruption
that occurs in pure Python code than in C, since Python has awesome
introspection facilities... and you're guaranteed that corrupt data is
still valid Python objects.

As to your corrupt data example, though, I'd advocate a very simple
system of object ownership: as soon as the object has been put on the
queue, it's "owned" by the recipient and shouldn't be mutated by
anyone else. That kind of system generally isn't hard to maintain.
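
One mechanical way to follow that rule (a sketch for illustration, not
something Chris posted) is to put a copy on the queue, so nothing the
producer does afterwards can reach the consumer's data:

import copy
import queue    # "Queue" on Python 2

my_queue = queue.Queue()

work = {'id': 1, 'data': "The Larch"}
my_queue.put(copy.deepcopy(work))   # the consumer owns this copy outright
work['id'] = 3                      # harmless: only the producer's copy changes

The cheaper alternative is pure discipline: once the object has gone onto the
queue, the producer simply never touches it again.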

ChrisA
 
Dave Angel

Chris Angelico said:
With CPython, you don't have any headaches like that; you have one
very simple protection, a Global Interpreter Lock (GIL), which
guarantees that no two threads will execute Python code
simultaneously. No corruption, no deadlocks, no hairy problems.

ChrisA

The GIL takes care of the gut-level interpreter issues like reference
counts for shared objects. But it does not avoid deadlock or hairy
problems. I'll just show one, trivial, problem, but many others exist.

If two threads process the same global variable as follows,
myglobal = myglobal + 1

Then you have no guarantee that the value will really get incremented
twice. Presumably there's a mutex/critsection function in the threading
module that can make this safe, but once you use it in two different
places, you raise the possibility of deadlock.
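
The primitive Dave presumes does exist: threading.Lock. A minimal sketch of
the lost-update problem and the locked fix; the time.sleep(0) just widens the
window so the effect shows up reliably, and the counts are arbitrary:

import threading
import time

counter = 0
counter_lock = threading.Lock()

def unsafe_increment(times):
    global counter
    for _ in range(times):
        value = counter        # read...
        time.sleep(0)          # ...invite a thread switch here...
        counter = value + 1    # ...write back, possibly clobbering another update

def safe_increment(times):
    global counter
    for _ in range(times):
        with counter_lock:     # only one thread at a time does the read/write
            counter = counter + 1

threads = [threading.Thread(target=unsafe_increment, args=(1000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# prints far less than 4000; with safe_increment it is always exactly 4000
print("total: %d" % counter)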

On the other hand, if you're careful to have the thread use only data
that is unique to that thread, then it would seem to be safe. However,
you still have the same risk if you call some library that wasn't
written to be thread safe. I'll assume that print() and suchlike are
safe, but some third party library could well use the equivalent of a
global variable in an unsafe way.
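
For the data-unique-to-the-thread approach, the stdlib has threading.local(),
which gives each thread its own independent copy of an attribute. A minimal
sketch; the names are just for illustration:

import threading

state = threading.local()

def worker(label):
    state.name = label     # private to the current thread
    print("%s sees %s" % (threading.current_thread().name, state.name))

for label in ("alpha", "beta", "gamma"):
    threading.Thread(target=worker, args=(label,)).start()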
 
Roy Smith

Chris Angelico said:
As to your corrupt data example, though, I'd advocate a very simple
system of object ownership: as soon as the object has been put on the
queue, it's "owned" by the recipient and shouldn't be mutated by
anyone else.

Well, sure. I agree with you that threading in Python is about a
zillion times easier to manage than threading in C/C++, but there are
still things you need to think about when using threading in Python
which you don't need to think about if you're not using threading at
all. Transfer of ownership when you put something on a queue is one of
those things.

So, I think my original statement:
if you're looking for a short answer, I'd say just keep doing what
you're doing using multiple processes and don't get into threading.

is still good advice for somebody who isn't sure they need threads.

On the other hand, for somebody who is interested in learning about
threads, Python is a great platform to learn because you get to
experiment with the basic high-level concepts without getting bogged
down in pthreads minutiae. And, as Chris pointed out, if you get it
wrong, at least you've still got valid Python objects to puzzle over,
not a smoking pile of bits on the floor.
 
Chris Angelico

Roy Smith said:
So, I think my original statement:

if you're looking for a short answer, I'd say just keep doing what
you're doing using multiple processes and don't get into threading.

is still good advice for somebody who isn't sure they need threads.

On the other hand, for somebody who is interested in learning about
threads, Python is a great platform to learn because you get to
experiment with the basic high-level concepts without getting bogged
down in pthreads minutiae. And, as Chris pointed out, if you get it
wrong, at least you've still got valid Python objects to puzzle over,
not a smoking pile of bits on the floor.

Agree wholeheartedly to both halves. I was just explaining a similar
concept to my brother last night, with regard to network/database
request handling:

1) The simplest code starts, executes, and finishes, with no threads,
fork(), or other confusions or shared state or anything. Execution can
be completely predicted by eyeballing the source code. You can pretend
that you have a dedicated CPU core that does nothing but run your
program.

2) Threaded code adds a measure of complexity that you have to get
your head around. Now you need to concern yourself with preemption,
multiple threads doing things in different orders, locking, shared
state, etc, etc. But you can still pretend that the execution of one
job will happen as a single "thing", top down, with predictable
intermediate state, if you like. (Python's threading and multiprocessing
modules both follow this style, they just have different levels of
shared state.)

3) Asynchronous code adds significantly more "get your head around"
complexity, since you now have to retain state for multiple
jobs/requests in the same thread. You can't use local variables to
keep track of where you're up to. Most likely, your code will do some
tiny thing, update the state object for that request, fire off an
asynchronous request of your own (maybe to the hard disk, with a
callback when the data's read/written), and then return, back to some
main loop.
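
A tiny caricature of style #3 (a sketch for illustration, not Chris's code):
each job's progress lives in an explicit state object, every step does one
small piece of work, and then control goes back to the main loop.

import collections

class Request(object):
    def __init__(self, name):
        self.name = name
        self.step = 0              # explicit state instead of local variables

ready = collections.deque()

def handle(request):
    # do one tiny piece of work, record where we got to, and return
    request.step += 1
    print("%s finished step %d" % (request.name, request.step))
    if request.step < 3:
        ready.append(request)      # real async code would wait on I/O here

ready.extend([Request("job-A"), Request("job-B")])
while ready:                       # the main loop interleaves all the jobs
    handle(ready.popleft())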

Now imagine you have a database written in style #1, and you have to
drag it, kicking and screaming, into the 21st century. Oh look, it's
easy! All you have to do is start multiple threads doing the same job!
And then you'll have some problems with simultaneous edits, so you put
some big fat locks all over the place to prevent two threads from
doing the same thing at the same time. Even if one of those threads
was handling something interactive and might hold its lock for some
number of minutes. Suboptimal design, maybe, but hey, it works right?
That's what my brother has to deal with every day, as a user of said
database... :|

ChrisA
 
Jeremy Sanders

Roy said:
Threads are lighter-weight. That means it's faster to start a new
thread (compared to starting a new process), and a thread consumes fewer
system resources than a process. If you have lots of short-lived tasks
to run, this can be significant. If each task will run for a long time
and do a lot of computation, the cost of startup becomes less of an
issue because it's amortized over the longer run time.

This might be true on Windows, but I think on Linux process overheads are
pretty similar to threads, e.g.
http://stackoverflow.com/questions/807506/threads-vs-processes-in-linux

Combined with the lack of a GIL-conflict, processes can be pretty efficient.

Jeremy
 
Grant Edwards

Roy Smith said:
Threads are lighter-weight. That means it's faster to start a new
thread (compared to starting a new process), and a thread consumes
fewer system resources than a process.

That's true, but the extent to which it's true varies considerably
from one OS to another. Starting processes is typically very cheap on
Unix systems. On Linux a thread and a process are actually both
started by the same system call, and the only significant difference
is how some of the new page-table entries are set up (they're
copy-on-write instead of shared).

On other OSes, starting a process is _way_ more expensive/slow than
starting a thread. That was very true for VMS, so one suspects it
might also be true for its stepchild MS-Windows.
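
On the Unix side, the classic cheap-process idiom is a bare os.fork(); a
minimal, Unix-only sketch, not from the original post:

import os

pid = os.fork()                    # child gets a copy-on-write view of memory
if pid == 0:
    print("child  pid %d" % os.getpid())
    os._exit(0)                    # child exits without running cleanup
else:
    os.waitpid(pid, 0)             # parent waits for the child to finish
    print("parent pid %d" % os.getpid())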
 
