Time-consuming loops over lists

querypk
Can someone help me improve this block of code? It just converts a list
of data into tokens based on the range each value falls into, but it
takes a long time. Can someone tell me what I can change to improve it?

def Tkz(tk, data):
    no_of_bins = 10
    tkns = []
    dmax = max(data)+1
    dmin = min(data)
    rng = ceil(abs((dmax - dmin)/(no_of_bins*1.0)))
    rngs = zeros(no_of_bins+1)
    for i in xrange(no_of_bins+1):
        rngs[i] = dmin + (rng*i)
    for i in xrange(len(data)):
        for j in xrange(len(rngs)-1):
            if data[i] in xrange(rngs[j], rngs[j+1]):
                tkns.append(str(tk)+str(j))
    return tkns
 
Diez B. Roggisch
> Can someone help me improve this block of code? It just converts a
> list of data into tokens based on the range each value falls into,
> but it takes a long time.
>
>     if data[i] in xrange(rngs[j], rngs[j+1]):


That's a bummer: you create a list and then search linearly in it -
where all you want to know is

if rngs[j] <= data[i] < rngs[j+1]

Attached is a script that contains both your old version and my
enhanced one - and shows that the results are equal. Running your
version takes ~35s, where mine uses ~1s!
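A side note for readers today: the membership test was slow because Python 2's xrange scans linearly for `in`; Python 3's range answers integer membership in O(1), but the chained comparison is still the clearer test and also works for floats. A minimal sketch of the equivalence in modern Python:

```python
def in_bin_membership(x, lo, hi):
    # Linear scan under Python 2's xrange; O(1) under Python 3's range,
    # but only ever true for integers.
    return x in range(lo, hi)

def in_bin_comparison(x, lo, hi):
    # Two comparisons, constant time, and also correct for floats.
    return lo <= x < hi

# The two tests agree on every integer input.
for x in range(-5, 15):
    assert in_bin_membership(x, 0, 10) == in_bin_comparison(x, 0, 10)
```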

Another optimization, which I'm too lazy to do now, would be a sort of
"tree search" of data in rngs - as the ranges are ordered, you could
find the proper one in log_2(len(rngs)) steps instead of len(rngs)/2.
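The "tree search" described here is what the stdlib bisect module gives you directly; a small sketch over illustrative bin edges (dmin, rng, and no_of_bins are made-up example values, not from the thread):

```python
import bisect

# Illustrative bin edges: 11 edges -> 10 bins, as in the thread's code.
dmin, rng, no_of_bins = 20, 5, 10
rngs = [dmin + rng*i for i in range(no_of_bins + 1)]   # [20, 25, ..., 70]

def find_bin(x):
    # bisect_right finds the insertion point in O(log2(len(rngs)))
    # comparisons; subtracting 1 yields j with rngs[j] <= x < rngs[j+1].
    return bisect.bisect_right(rngs, x) - 1

assert find_bin(20) == 0   # first bin is [20, 25)
assert find_bin(47) == 5   # [45, 50)
assert find_bin(69) == 9   # last bin is [65, 70)
```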

Additional improvements can be achieved when data is sorted - but that
depends on your application and the actual size of data.

Diez

from math import *
from Numeric import *
from random import *

def Tkz2(tk, data):
    no_of_bins = 10
    tkns = []
    dmax = max(data)+1
    dmin = min(data)
    rng = ceil(abs((dmax - dmin)/(no_of_bins*1.0)))
    rngs = zeros(no_of_bins+1)
    for i in xrange(no_of_bins+1):
        rngs[i] = dmin + (rng*i)
    for i in xrange(len(data)):
        for j in xrange(len(rngs)-1):
            if rngs[j] <= data[i] < rngs[j+1]:
                tkns.append(str(tk)+str(j))
    return tkns

def Tkz(tk, data):
    no_of_bins = 10
    tkns = []
    dmax = max(data)+1
    dmin = min(data)
    rng = ceil(abs((dmax - dmin)/(no_of_bins*1.0)))
    rngs = zeros(no_of_bins+1)
    for i in xrange(no_of_bins+1):
        rngs[i] = dmin + (rng*i)
    for i in xrange(len(data)):
        for j in xrange(len(rngs)-1):
            if data[i] in xrange(rngs[j], rngs[j+1]):
                tkns.append(str(tk)+str(j))
    return tkns

data = range(20, 12312)
shuffle(data)
res1 = Tkz('A', data)
res2 = Tkz2('A', data)

print res1 == res2
 
querypk

Wow, I didn't know that a single statement like that could make such a
difference. Thank you very much - that really improves the performance.
 
Peter Otten

> Can someone help me improve this block of code? It just converts a
> list of data into tokens based on the range each value falls into,
> but it takes a long time.
>
> def Tkz(tk, data):
>     no_of_bins = 10
>     tkns = []
>     dmax = max(data)+1
>     dmin = min(data)
>     rng = ceil(abs((dmax - dmin)/(no_of_bins*1.0)))
>     rngs = zeros(no_of_bins+1)
>     for i in xrange(no_of_bins+1):
>         rngs[i] = dmin + (rng*i)
>     for i in xrange(len(data)):
>         for j in xrange(len(rngs)-1):
>             if data[i] in xrange(rngs[j], rngs[j+1]):
>                 tkns.append(str(tk)+str(j))
>     return tkns


Use bisect(), e.g. with a slightly modified function signature:

from __future__ import division
import bisect
from math import ceil

def tkz(tk, data, no_of_bins=10):
    dmax = max(data) + 1
    dmin = min(data)
    rng = ceil((dmax - dmin)/no_of_bins)
    rngs = [dmin + rng*i for i in xrange(1, no_of_bins+1)]
    tokens = [tk + str(i) for i in xrange(no_of_bins)]
    return [tokens[bisect.bisect(rngs, v)] for v in data]

if __name__ == "__main__":
    print tkz("token_", [5, 7, 8, 9, 70, 200])

What are the tokens for, by the way? I'd recommend using the indices
directly if possible.

Peter
 
Andrea Griffini

> Another optimization im too lazy now would be to do sort of a "tree
> search" of data in rngs - as the ranges are ordered, you could find
> the proper one in log_2(len(rngs)) instead of len(rngs)/2.


I don't see a "break", so why the "/2"? Also, IIUC the ranges are more
than just ordered... they're all of equal width, computed by

for i in xrange(no_of_bins+1):
    rngs[i] = dmin + (rng*i)

so my guess is that instead of searching with

for j in xrange(len(rngs)-1):
    if rngs[j] <= data[i] < rngs[j+1]:

one could just do

j = int((data[i] - dmin)/rng)

The code with this change gets roughly twice as fast.
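Andrea's formula can be checked against the linear scan: with evenly spaced edges the two always pick the same bin. A small verification sketch in modern Python (the setup mirrors the thread's script; the data range and variable names are just illustrative):

```python
import math
import random

# Rebuild the thread's setup: 10 equal-width bins over shuffled integers.
data = list(range(20, 1000))
random.shuffle(data)
no_of_bins = 10
dmax, dmin = max(data) + 1, min(data)
rng = int(math.ceil((dmax - dmin) / no_of_bins))
rngs = [dmin + rng*i for i in range(no_of_bins + 1)]

for x in data:
    # Direct O(1) computation of the bin index...
    j_direct = (x - dmin) // rng
    # ...agrees with the linear scan over the edges.
    j_scan = next(j for j in range(len(rngs) - 1)
                  if rngs[j] <= x < rngs[j + 1])
    assert j_direct == j_scan
```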

Andrea
 
Diez B. Roggisch

> I don't see a "break", so why the "/2"?

That was the assumption of an equal distribution of the data. In
O-notation this would be O(n), of course.


Diez
 
John Machin

Diez said:
> Attached is a script that does contain your old and my enhanced
> version - and shows that the results are equal. Running only your
> version takes ~35s, where mine uses ~1s!!!
[snip]
Make these changes/additions:
=================================================
# from Numeric import *
# +1 for Overkill-of-the-Year
def zeros(n):
    return n * [0.0]

def Tkz3(tk, data, no_of_bins):
    # indent 5 ???????
    # change other funcs to have no_of_bins as an arg
    tkns = []
    dmax = max(data)+1
    dmin = min(data)
    rng = int(ceil(abs((dmax - dmin)/(no_of_bins*1.0))))
    for datum in data:
        j = (datum - dmin) // rng
        tkns.append(str(tk)+str(j))
    return tkns

# (assumes: import time; results = []; nbins = 10; data as before)
for func in (Tkz, Tkz2, Tkz3):
    t0 = time.time()
    result = func('A', data, nbins)
    t1 = time.time()
    results.append(result)
    print "function %s took %.2f seconds to produce %d tokens" % \
        (func.__name__, t1 - t0, len(result))

allsame = results[0] == results[1] == results[2]
print "\nIdentical results:", allsame
=======================================================
=======================================================
and get this:

C:\junk>ranges.py
C:\junk\ranges.py:46: DeprecationWarning: integer argument expected, got float
  if data[i] in xrange(rngs[j], rngs[j+1]):
function Tkz took 12.13 seconds to produce 12292 tokens
function Tkz2 took 0.08 seconds to produce 12292 tokens
function Tkz3 took 0.02 seconds to produce 12292 tokens

Identical results: True

==================================================
using Python 2.4.1 (win32) on a 3.2 GHz Intel P4
==================================================

Notes: (1) The OP's code, as well as being deliciously O(N**2), depends
on the data being *integers* -- otherwise it goes rapidly pear-shaped;
see the deprecation warning above, and try running it with data =
[0.00001*x for x in range(20, 12312)]. Consequently the use of math.*
is a /mild/ overkill.

(2) The OP's code produces bins of equal nominal width, but depending
on the range, the population of the last bin may be much smaller than
that of the other bins. Another way of doing it would be to choose bin
widths so that the populations were more evenly spread. In any case,
determining the appropriate bin could still be done by formula rather
than by searching.
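John's point (2) - equal-population rather than equal-width bins - can be sketched by placing the edges at quantiles of the sorted data. This is not code from the thread, just an illustration, and the names and sample data are made up:

```python
import bisect

def quantile_edges(data, no_of_bins):
    # Interior bin edges taken at evenly spaced ranks of the sorted data,
    # so each bin receives roughly len(data)/no_of_bins values.
    s = sorted(data)
    return [s[len(s)*k // no_of_bins] for k in range(1, no_of_bins)]

# Heavily skewed data: equal-width bins would leave most bins empty.
data = list(range(100)) + list(range(1000, 1010))
edges = quantile_edges(data, 4)
bins = [bisect.bisect_right(edges, v) for v in data]
counts = [bins.count(b) for b in range(4)]
assert counts == [27, 28, 27, 28]   # populations stay nearly equal
```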

Cheers,
John
 
Andrea Griffini

> That was the assumption of an equal distribution of the data. In
> O-notation this would be O(n), of course.

It was a joke ... the issue is that there was no break
statement :) i.e. your code kept searching even after
finding the proper range!

Actually that break (with 10 bins) is not terribly important, because
the cost of the comparison is small compared to the cost of the append.
The timings I got are:

your code                  1.26 sec
adding break               0.98 sec
direct index computation   0.56 sec

10 bins are so few that with just low-level speedups (i.e. precomputing
a list of the ranges and of the str(j) strings) the code that does a
linear scan requires just 0.60 seconds.

Hand-optimizing the direct computation code, the execution time gets
down to 0.3 seconds; the inner loop I used (with js the precomputed
list of str(j) strings and tkns preallocated) is:

for i, x in enumerate(data):
    j = int((x - dmin)/rng)
    tkns[i] = tks + js[j]

with data = range(20, 123120)
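Andrea's fragment leans on scaffolding that isn't shown (the tks prefix, the js list, the preallocated tkns); a runnable modern-Python reconstruction, with that scaffolding assumed as his post describes it:

```python
import math
import random

# Same shape of data as Andrea's test case.
data = list(range(20, 123120))
random.shuffle(data)
no_of_bins = 10
dmin, dmax = min(data), max(data) + 1
rng = int(math.ceil((dmax - dmin) / no_of_bins))

tks = 'A'                                  # the str(tk) prefix
js = [str(j) for j in range(no_of_bins)]   # precomputed str(j) strings
tkns = [None] * len(data)                  # preallocated result list

# Andrea's hand-optimized inner loop.
for i, x in enumerate(data):
    j = int((x - dmin) / rng)
    tkns[i] = tks + js[j]
```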


Andrea
 
Peter Otten

> Can someone help me improve this block of code? It just converts a
> list of data into tokens based on the range each value falls into,
> but it takes a long time.
[snip]


On second thought, with bins of equal size you have an option that is even
faster than bisect:

from __future__ import division
from math import ceil

def bins_calc(tk, data, no_of_bins=10):
    dmax = max(data) + 1
    dmin = min(data)
    delta = dmax - dmin
    rng = int(ceil(delta/no_of_bins))
    tokens = [tk + str(i) for i in xrange(no_of_bins)]
    return [tokens[(v-dmin)//rng] for v in data]
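A quick check in modern Python (where true division is the default) that the arithmetic version agrees with the bisect version; the helper names and sample values are illustrative, not from the thread:

```python
import bisect
import math

def bins_calc(tk, data, no_of_bins=10):
    # Arithmetic binning: equal-width bins located by integer division.
    dmax = max(data) + 1
    dmin = min(data)
    rng = int(math.ceil((dmax - dmin) / no_of_bins))
    tokens = [tk + str(i) for i in range(no_of_bins)]
    return [tokens[(v - dmin) // rng] for v in data]

def bins_bisect(tk, data, no_of_bins=10):
    # Same bins located by binary search over the upper edges.
    dmax = max(data) + 1
    dmin = min(data)
    rng = math.ceil((dmax - dmin) / no_of_bins)
    rngs = [dmin + rng*i for i in range(1, no_of_bins + 1)]
    tokens = [tk + str(i) for i in range(no_of_bins)]
    return [tokens[bisect.bisect(rngs, v)] for v in data]

data = [5, 7, 8, 9, 70, 200]
assert bins_calc("token_", data) == bins_bisect("token_", data)
```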

Peter
 
