Execution speed question

Suresh Pillai

I am performing simulations on networks (graphs). I have a question on
speed of execution (assuming very ample memory for now). I simplify the
details of my simulation below, as the question I ask applies more
generally than my specific case. I would greatly appreciate general
feedback in terms of computing and of course considerations specific to
implementation in Python.

The nodes in my network may be ON or OFF. The network starts off with
all nodes in the OFF state. I loop through the nodes. For each node
that is OFF, I consider some probability of it turning ON based on the
states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
WHICH ONES TO TURN ON.

So my question is whether it is faster to

1. loop through a list of ALL nodes and check for OFF nodes using ifs

or to

2. loop through a container of OFF nodes and remove from this when they
turn ON

The second would mean looping through fewer nodes, especially as the
simulation progresses, but how does the cost of removal compare with
cheap ifs, and would the extra memory usage affect performance?

I can appreciate that the cost of the if check, the number of nodes, and
the type of container I use will come into the answer.

In my case, the ifs are cheap boolean queries (whether the node is ON or
OFF). The number of nodes is very large: millions for sure, maybe tens
of millions. If considering (2), take note of my BOLD text above, which
means I can't remove nodes as I iterate through them in the main loop.

I naturally started coding with (2), but couldn't decide on the best data
structure for Python. A set seemed ideal for speedy removal, but then I
can't remove elements from it while iterating over it. An ordered list?
Some creative solution with numpy arrays?
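
To make the numpy idea concrete, a minimal sketch of what I mean
(untested; the 50% rule is a stand-in for my real neighbour-based
calculation):

import numpy as np

n = 10 ** 7
state = np.zeros(n, dtype=bool)               # all nodes start OFF

off_ids = np.flatnonzero(~state)              # indices of OFF nodes
chosen = off_ids[np.random.random(off_ids.size) < 0.5]  # stand-in rule
state[chosen] = True                          # flip only after deciding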

There is also the complication that since we are in interpreted Python,
what is theoretically the best data structure may not in reality be
optimal unless it is a built-in object or coded externally in a
compiled module.

Of course, I will start experimenting to see what the execution
difference is, but I would appreciate some suggestions from others
regarding which approach is best, and also on the best data structure
for (2).

I'm not a newbie, so you can get technical with me Python-wise and
algorithm-wise. I realise it is a 'basic' question, but it is something
that I have always wondered about (cheap ifs versus extra structure), and
with the number of nodes I am considering, it actually becomes an issue.

Many Thanks,
Suresh
 
alex23

The nodes in my network may be ON or OFF.  The network starts off with
all nodes in the OFF state.  I loop through the nodes.  For each node
that is OFF, I consider some probability of it turning ON based on the
states of its neighbours.  I MUST GO THROUGH ALL NODES BEFORE DECIDING
WHICH ONES TO TURN ON.

So my question is whether it is faster to

1. loop through a list of ALL nodes and check for OFF nodes using ifs

I'd recommend using 'filter' and list comprehensions.
>>> import random
>>> class Node(object):
...     def __init__(self):
...         self.on = False
...     def toggle(self):
...         self.on = random.choice([True, False])
...
>>> nodes = [Node() for i in range(0, 10000)]
>>> for node in nodes:
...     node.toggle()
...
>>> len(filter(lambda node: node.on, nodes))
5050
 
Jeff

I'd recommend using 'filter' and list comprehensions.

Look at using reduce(). You can collect information about all of the
nodes without necessarily building a large, intermediate list in the
process.
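
For instance, counting the ON nodes folds everything down to a single
number with no intermediate list (a sketch, assuming nodes carry an .on
flag as in alex23's example; True counts as 1 when summed):

on_count = reduce(lambda total, node: total + node.on, nodes, 0)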

You might get some ideas from here:
http://en.wikipedia.org/wiki/Antiobjects
 
Iain King

I am performing simulations on networks (graphs). I have a question on
speed of execution (assuming very ample memory for now). I simplify the
details of my simulation below, as the question I ask applies more
generally than my specific case. I would greatly appreciate general
feedback in terms of computing and of course considerations specific to
implementation in Python.

The nodes in my network may be ON or OFF. The network starts off with
all nodes in the OFF state. I loop through the nodes. For each node
that is OFF, I consider some probability of it turning ON based on the
states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
WHICH ONES TO TURN ON.

So my question is whether it is faster to

1. loop through a list of ALL nodes and check for OFF nodes using ifs

or to

2. loop through a container of OFF nodes and remove from this when they
turn ON

or 3. build a new list every iteration instead of deleting from the old
one:

while processing:
    new_off_list = []
    for x in off_list:
        if goes_on(x):
            on_list.append(x)
        else:
            new_off_list.append(x)
    off_list = new_off_list
    generation += 1

Iain
 
Iain King

I am performing simulations on networks (graphs). I have a question on
speed of execution (assuming very ample memory for now). I simplify the
details of my simulation below, as the question I ask applies more
generally than my specific case. I would greatly appreciate general
feedback in terms of computing and of course considerations specific to
implementation in Python.
The nodes in my network may be ON or OFF. The network starts off with
all nodes in the OFF state. I loop through the nodes. For each node
that is OFF, I consider some probability of it turning ON based on the
states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
WHICH ONES TO TURN ON.
So my question is whether it is faster to
1. loop through a list of ALL nodes and check for OFF nodes using ifs
2. loop through a container of OFF nodes and remove from this when they
turn ON

or 3. build a new list every iteration instead of deleting from the old
one:

while processing:
    new_off_list = []
    for x in off_list:
        if goes_on(x):
            on_list.append(x)
        else:
            new_off_list.append(x)
    off_list = new_off_list
    generation += 1

Iain

I was curious to what extent the different methods varied in time, so
I checked it out. There are three procedures here: test_every, which
matches your (1); destructive, which matches your (2); and constructive,
which is (3) as I've outlined above.

On varying the size of the dataset I get this (probability a node goes
on = 50%):

Length of initial list: 100000
Test every: 1.16085492357
Destructive: 2.592310272
Constructive: 0.850312458886

Length of initial list: 200000
Test every: 2.48013843287
Destructive: 9.20894689718
Constructive: 1.73562198439

Length of initial list: 400000
Test every: 5.00652267447
Destructive: 44.9696004134
Constructive: 3.51687329373

Length of initial list: 800000
Test every: 9.67657648655
Destructive: 220.57583941
Constructive: 7.06614485537


and changing the probability that a node goes on (dataset size =
200000):


Probability goes on: 1/2
Test every: 2.24765364513
Destructive: 9.28801971614
Constructive: 1.62770773816

Probability goes on: 1/4
Test every: 4.77387350904
Destructive: 13.4432467571
Constructive: 3.45467140006

Probability goes on: 1/8
Test every: 11.0514899721
Destructive: 18.4026878278
Constructive: 6.86778036177

Probability goes on: 1/16
Test every: 22.5896021593
Destructive: 25.7784044083
Constructive: 13.8631404605

Probability goes on: 1/32
Test every: 49.7667941179
Destructive: 39.3652502735
Constructive: 27.2527219598

Probability goes on: 1/64
Test every: 91.0523955153
Destructive: 65.7747103963
Constructive: 54.4087322936

Code:

import random
from timeit import Timer

SIZE = 100000
MAX = 2

def goes_on(x):
    global MAX
    return random.randint(1, MAX) == 1

def test_every():
    global SIZE
    print "Test every:",
    nodes = range(SIZE)
    is_on = [False for x in xrange(SIZE)]
    count = SIZE
    while count:
        for i, x in enumerate(nodes):
            if not is_on[i] and goes_on(x):
                is_on[i] = True
                count -= 1

def destructive():
    global SIZE
    print "Destructive:",
    off_list = range(SIZE)
    on_list = []
    count = SIZE
    while count:
        for i in xrange(len(off_list) - 1, -1, -1):
            x = off_list[i]
            if goes_on(x):
                on_list.append(x)
                del off_list[i]
                count -= 1

def constructive():
    global SIZE
    print "Constructive:",
    off_list = range(SIZE)
    on_list = []
    count = SIZE
    while count:
        new_off_list = []
        for x in off_list:
            if goes_on(x):
                on_list.append(x)
                count -= 1
            else:
                new_off_list.append(x)
        off_list = new_off_list

#SIZE = 200000
while True:
    print "Length of initial list:", SIZE
    #print "Probability goes on: 1/%d" % MAX
    print Timer("test_every()", "from __main__ import test_every").timeit(1)
    print Timer("destructive()", "from __main__ import destructive").timeit(1)
    print Timer("constructive()", "from __main__ import constructive").timeit(1)
    print
    SIZE *= 2
    #MAX *= 2



Conclusions:

On size, (2) really doesn't like bigger datasets, its time growing
roughly quadratically as the list gets longer, while (1) and (3) happily
increase linearly. (3) is faster.

On probability it's (1) that loses out, while (2) and (3) cope fine.
(3) is once again faster.

I think (2)'s poor performance is being amplified by how Python
handles lists and list deletions; the effect may be mitigated in other
languages, or by using other data constructs in Python (like a
dictionary or a user-made list class). If you were short on memory
then (2) would have an advantage, but as it is, (3) is the clear
winner.
I'm a fan of list comprehensions, and it feels like they could be nice
here, but since we are making two lists at once I don't see how
to... anyone see how to use them (or 'map' if you want to be old
school)?
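
One possibility, sketched but untested: record each decision once, then
build both lists from the recorded pairs, so goes_on still runs only
once per node:

decided = [(x, goes_on(x)) for x in off_list]
on_list.extend(x for x, went_on in decided if went_on)
off_list = [x for x, went_on in decided if not went_on]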

Iain
 
alex23

Look at using reduce().  You can collect information about all of the
nodes without necessarily building a large, intermediate list in the
process.

From the OP's description, I assumed there'd be a list of all nodes,
from which he wishes to derive a 2nd list of specific nodes. reduce()
applies "a function of two arguments cumulatively to the items of a
sequence, from left to right, so as to reduce the sequence to a single
value", which doesn't seem to me to be what the OP was asking for. I
could understand using map() across the filter'd list, or embedding
the conditional check within the map'd function and ignoring filter(),
but at no point does the OP ask to perform any kind of function based
on two nodes...

I may have misunderstood your point, though :) Could you provide a
quick code sample to clarify?
 
Suresh Pillai

That's a good comparison for the general question I posed. Thanks.
Although I do believe lists are less than ideal here and a different data
structure should be used.

To be more specific to my case:
As mentioned in my original post, I also have the specific condition that
one does not know which nodes to turn ON until after all the
probabilities are calculated (let's say we take the top m, for example).
In this case, the second and third will perform worse, as the second
will require a removal from the list after the fact and the third will
require another loop through the nodes to build the new list.
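
In sketch form (hypothetical names, untested; prob(node) stands in for
the actual neighbour-based calculation):

import heapq

# score every OFF node first, then turn on only the top m;
# nlargest consumes the whole iterable before anything is switched
for node in heapq.nlargest(m, off_nodes, key=prob):
    node.on = True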
 
Fredrik Lundh

Iain said:
I think (2)'s poor performance is being amplified by how Python
handles lists and list deletions; the effect may be mitigated in other
languages

Delete is O(n) (or "O(n/2) on average", if you prefer), while append is
amortized O(1).

Unless I'm missing something, your example keeps going until it's
flagged *all* nodes as "on", which, obviously, kills performance for the
first version as the probability goes down. The OP's question was about
a single pass (but he did mention "as the simulation progresses", so I
guess it's fair to test a complete simulation.)

Btw, if the nodes can be enumerated, I'd probably do something like:

node_list = ... get list of nodes ...
random.shuffle(node_list)

start = 0
end = len(node_list)
step = end / MAX

while start < end:

    for i in xrange(start, start + step):
        ... switch on node_list[i] ...

    ... do whatever you want to do after a step ...

    # prepare for next simulation step
    start += step
    step = max((len(node_list) - start) / MAX, 1)

which is near O(n) overall, and mostly constant wrt. the probability for
each pass (where the probability is 1:MAX).

Might need some tuning; tweak as necessary.

</F>
 
Suresh Pillai

Unless I'm missing something, your example keeps going until it's
flagged *all* nodes as "on", which, obviously, kills performance for the
first version as the probability goes down. The OP's question was about
a single pass (but he did mention "as the simulation progresses", so I
guess it's fair to test a complete simulation.)

I was referring to multiple passes, as in Iain's test cases. Although not
necessarily till all nodes are ON, let's say up to a large proportion at
least.
 
Iain King

That's a good comparison for the general question I posed. Thanks.
Although I do believe lists are less than ideal here and a different data
structure should be used.

To be more specific to my case:
As mentioned in my original post, I also have the specific condition that
one does not know which nodes to turn ON until after all the
probabilities are calculated (let's say we take the top m, for example).
In this case, the second and third will perform worse, as the second
will require a removal from the list after the fact and the third will
require another loop through the nodes to build the new list.

So you need to loop through twice regardless? i.e. loop once to
gather data on off nodes, do some calculation to work out what to turn
on, then loop again to turn on the relevant nodes? If so, then I
think the functions above remain the same, becoming the 2nd loop.
Every iteration you do a first loop over the off_nodes (or them all
for (1)) to gather the data on them, perform your calculation, and
then perform one of the above functions (minus the setup code at the
beginning; basically starting at the 'for') as a second loop, with the
goes_on function now returning a value based on the calculation
(rather than the calculation itself as I had it). Performance should
be similar.

Iain
 
Matthew Fitzgibbons

Suresh said:
That's a good comparison for the general question I posed. Thanks.
Although I do believe lists are less than ideal here and a different data
structure should be used.

To be more specific to my case:
As mentioned in my original post, I also have the specific condition that
one does not know which nodes to turn ON until after all the
probabilities are calculated (let's say we take the top m, for example).
In this case, the second and third will perform worse, as the second
will require a removal from the list after the fact and the third will
require another loop through the nodes to build the new list.

It seems like the probability calculation applies to all three equally,
and can therefore be ignored for the simulations. You said that your
algorithm must be a two-stage process: (A) calculate the probabilities
then (B) turn on some nodes. Iain's simulations assume (A) is already
done. He just addressed the performance of (B). Part (A) is invariant
for all his simulations, because your requirement forces it to be.

As for different data structures, it largely depends on how you need to
access the data. If you don't need to index the data, just loop through
it, you might try a linked list. The performance hit in (2) is coming
from the list del; using a linked list would make the removal constant
rather than O(n), and may even execute faster than (3) as well.
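
A minimal sketch of the idea (hypothetical and untested; a real node
would also carry the payload and neighbour data):

class LinkedNode(object):
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def unlink(node):
    # splice the node out in O(1); nothing shifts, unlike list del
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None

Using a dummy sentinel node at each end would avoid the None checks in
unlink.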

-Matt
 
Iain King

It seems like the probability calculation applies to all three equally,
and can therefore be ignored for the simulations.

The probability affects (1) more. My reasoning for this is: as the
probability gets lower, the number of times you have to loop over the
list increases. (1) always loops over the full list, but with each
successive iteration (2) and (3) are looping over smaller and smaller
lists. In the end this adds up, with (1) becoming slower than (2),
even though it starts out quicker.

Iain
 
Matthew Fitzgibbons

Iain said:
The probability affects (1) more. My reasoning for this is: as the
probability gets lower, the number of times you have to loop over the
list increases. (1) always loops over the full list, but with each
successive iteration (2) and (3) are looping over smaller and smaller
lists. In the end this adds up, with (1) becoming slower than (2),
even though it starts out quicker.

Iain

I meant the _calculation_ of the probability affects all three equally,
not the value itself. As your simulations show, different probabilities
affect the algorithms differently; I'm talking about the algorithm to
arrive at the probability value.

-Matt
 
Terry Reedy

Suresh said:
I am performing simulations on networks (graphs). I have a question on
speed of execution (assuming very ample memory for now). I simplify the
details of my simulation below, as the question I ask applies more
generally than my specific case. I would greatly appreciate general
feedback in terms of computing and of course considerations specific to
implementation in Python.

The nodes in my network may be ON or OFF. The network starts off with
all nodes in the OFF state. I loop through the nodes. For each node
that is OFF, I consider some probability of it turning ON based on the
states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
WHICH ONES TO TURN ON.

If the nodes do not have to be processed in any particular order, then
you could keep them either in a dict, with the value being On or
Off (True/False, plus connection data), or in a pair of sets, one for On
and one for Off. The advantage of the dict is that the items would be
fixed and only their values would change, but you needlessly scan through
On items. The advantage of the set pair is that you only scan through
Off items but have to move some from Off to On. I will not guess which
would be faster over a complete run, or how this will compare with using
lists.
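
In sketch form, the set pair might look like this (untested; turn_on
stands in for the probability step):

off_nodes = set(all_nodes)
on_nodes = set()

while off_nodes:
    # decide first, as required, without mutating during iteration
    switching = set(n for n in off_nodes if turn_on(n))
    off_nodes -= switching
    on_nodes |= switching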

tjr
 
Eric Wertman

The number of nodes is very large: millions for sure, maybe tens
of millions. If considering (2), take note of my BOLD text above, which
means I can't remove nodes as I iterate through them in the main loop.

Since your use of 'node' is pretty vague and I don't have a good sense
of what tests you are running and how long they would take, I'm only
speculating, but a single loop might be the wrong way to go about
that.

If you are going to be frequently running tests and switching nodes
on/off, have you considered a separate set of processes to do both?
For example:

A set of some number of "tester" threads that loop through and test,
recording their results (somewhere).

You could then have a separate loop that runs every so often, checks
all the current test values, and runs through the nodes once,
switching them on or off.


I know it's not exactly what you asked, but depending on what your
nodes are exactly, you can avoid a lot of other problems down the
road. What if your single loop dies or gets hung on a test? With a
separate approach you'll have a lot more resilience, too: if there's
some problem with a single tester or node, it won't keep the rest of
the program from continuing to function.
 
Suresh Pillai

So you need to loop through twice regardless? i.e. loop once to gather
data on off nodes, do some calculation to work out what to turn on, then
loop again to turn on the relevant nodes? If so, then I think the
functions above remain the same, becoming the 2nd loop. Every iteration
you do a first loop over the off_nodes (or them all for (1)) to gather
the data on them, perform your calculation, and then perform one of the
above functions (minus the setup code at the beginning; basically
starting at the 'for') as a second loop, with the goes_on function now
returning a value based on the calculation (rather than the calculation
itself as I had it). Performance should be similar.

Iain

If I do settle on an explicit loop to remove the nodes turned ON, then I
realised this weekend that I could do this in the next iteration of the
simulation (the first loop above) and save some iteration overhead (the
if checking will still be there, of course).

And thanks for pointing out that constructing a new list, for long lists,
is faster than simple removal. It's obvious but I never really thought
of it; good tip.
 
Suresh Pillai

As for different data structures, it largely depends on how you need to
access the data. If you don't need to index the data, just loop through
it, you might try a linked list. The performance hit in (2) is coming
from the list del; using a linked list would make the removal constant
rather than O(n), and may even execute faster than (3) as well.

-Matt

Yes, this was my first inclination. So my question, as alluded to in my
original post, is whether there are C-compiled modules for linked lists,
doubly linked lists, ordered lists ... (the standard data structures)
somewhere, to get the extra performance out of them.

With Python we have all built up creative ways of using the native
structures for efficiency reasons. This project was the first time (due
to its extreme use of resources) that I've had to worry about these
minute considerations: not just native vs. new structure, but also
native vs. Python-level construct vs. compiled module.

[P.S. The linked list does compare well with (3) as expected.]
 
Suresh Pillai

or 3. build a new list every iteration instead of deleting from the old
one:

while processing:
    new_off_list = []
    for x in off_list:
        if goes_on(x):
            on_list.append(x)
        else:
            new_off_list.append(x)
    off_list = new_off_list
    generation += 1

Iain

Or 4, since the order of my nodes doesn't matter: swap the node to be
deleted with the last node in the list and then remove the last node of
the list. This is the fastest to date, if using native structures, for
low numbers of nodes being deleted per cycle (definitely if only
deleting one).
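
In code, the swap-and-pop looks something like this (i being the index
of the node to delete):

def swap_remove(lst, i):
    # overwrite slot i with the last element, then drop the tail: O(1)
    lst[i] = lst[-1]
    lst.pop()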
 
Suresh Pillai

If the nodes do not have to be processed in any particular order, then
you could keep them either in a dict, with the value being On or
Off (True/False, plus connection data), or in a pair of sets, one for On
and one for Off. The advantage of the dict is that the items would be
fixed and only their values would change, but you needlessly scan through
On items. The advantage of the set pair is that you only scan through
Off items but have to move some from Off to On. I will not guess which
would be faster over a complete run, or how this will compare with using
lists.

tjr

Thanks for the reply. As mentioned in my original post, sets came to
mind straight away, doing it the way you suggest. I alluded to, but
didn't directly ask:

Since I am doing A LOT of loops over the nodes and the number of nodes is
also huge, my concern using sets is that in order to iterate over the set
in each step of my simulation, the set items need to be converted to a
list every time. So while removal from a set is much cheaper than, say,
from a list, what about this conversion overhead in order to iterate over
the items?

The dict suggestion is good. Originally I had my nodes as objects, with
a networkx object for the graph (which is a dict). Since efficiency is
the most important thing for this piece of code, I may decide to forget
about abstract nodes and put all attributes in a dict as you suggest.
Too many permutations, which is why I made the original post, hoping
wiser Python coders could eliminate a few possibilities. :)
 
bearophileHUGS

Suresh Pillai:
Or 4, since the order of my nodes doesn't matter: swap the node to be
deleted with the last node in the list and then remove the last node of
the list. This is the fastest to date, if using native structures, for
low numbers of nodes being deleted per cycle (definitely if only
deleting one).

Using Psyco this suggestion may lead to code as fast as it gets in
Python :)

Bye,
bearophile
 
