Best match searching

D

Daniel Pryde

Hi everyone.
I was wondering if anyone might be able to help me out here. I'm currently
looking to find the quickest way to find a best fit match in a large array.
My problem is that I have an array of, say, 600*400, which contains a value
at each point, and I need to find the value in that array which is closest
to the input value. It's basically some euclidean distances that I've
calculated, and I need to be able to find the best matches over a large
number of input values, so I was looking for a speedy solution to this.
The array is implemented simply using a list of lists. The only solution I
could think of is to use some for statements, but that seems to take a
while, even for just one search.
Any hints or tips would be really helpful. :)

Daniel

_________________________________________________________________
Express yourself with cool new emoticons http://www.msn.co.uk/specials/myemo
 
A

anton muhin

Daniel said:
Hi everyone.
I was wondering if anyone might be able to help me out here. I'm
currently looking to find the quickest way to find a best fit match in a
large array.
My problem is that I have an array of, say, 600*400, which contains a
value at each point, and I need to find the value in that array which is
closest to the input value. It's basically some euclidean distances that
I've calculated, and I need to be able to find the best matches over a
large number of input values, so I was looking for a speedy solution to
this.
The array is implemented simply using a list of lists. The only solution
I could think of is to use some for statements, but that seems to take a
while, even for just one search.
Any hints or tips would be really helpful. :)

Daniel

_________________________________________________________________
Express yourself with cool new emoticons
http://www.msn.co.uk/specials/myemo
I'm not an expert, but I'd use numarray module. The most simple
example can look like:

a = numarray.array(....) # An array to look in
value = x # Value to look for
c = a.copy() # Create a copy
print numarray.abs(a.copy() - value).argmin() # Finds minimal absolute
difference

If performance of numarray is still too slow, I'd resort to simple
hasing scheme: you can split your inputs in reasonable ranges and look
for closest match only in this ranges.

hth,
anton.
 
C

Christos TZOTZIOY Georgiou

Hi everyone.
I was wondering if anyone might be able to help me out here. I'm currently
looking to find the quickest way to find a best fit match in a large array. [snip]
It's basically some euclidean distances that I've
calculated, and I need to be able to find the best matches over a large
number of input values, so I was looking for a speedy solution to this.

These might be helpful:

http://groups.google.com.gr/[email protected]

or

http://groups.google.com.gr/[email protected]&rnum=1

groups.google.com is your friend :)
 
P

Paul McGuire

Daniel Pryde said:
Hi everyone.
I was wondering if anyone might be able to help me out here. I'm currently
looking to find the quickest way to find a best fit match in a large array.
My problem is that I have an array of, say, 600*400, which contains a value
at each point, and I need to find the value in that array which is closest
to the input value.
Any hints or tips would be really helpful. :)

Daniel
I think inverting your matrix to a dictionary may be more "Pythonic". The
included snippet constructs an array of values, then creates a dictionary of
which values are located at which positions in the array (also handles
possibility of duplicates in the array). If you change ROWS and COLS to
your problem values of 400 and 600, the matrix-to-dictionary inversion takes
a bit of time - but the searches are blindingly fast.

-- Paul


# mat2Dict.py
from random import random
import pprint

ROWS = 4
COLS = 6

# create matrix
print "creating data matrix"
matrix = [ [ random() for i in range(COLS) ] for j in range(ROWS) ]

print "create dictionary of values"
valdict = {}
for i,row in enumerate(matrix):
for j,elem in enumerate(row):
if valdict.has_key(elem):
valdict[elem].append( (i,j) )
else:
valdict[elem] = [ (i,j) ]

# uncomment this line to view the dictionary of values
# pprint.pprint( valdict )

matvals = valdict.keys()
matvals.sort()
print len(matvals),"unique values in matrix"

def findClosestVal( searchval ):
if searchval <= matvals[0]:
return matvals[0]
if searchval >= matvals[-1]:
return matvals[-1]

# do binary search - see
http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary
hi = len(matvals)
lo = -1
while (hi-lo) > 1:
loc = ( hi + lo ) / 2
if matvals[loc] > searchval:
hi = loc
else:
lo = loc

if abs(searchval-matvals[lo]) < abs(searchval-matvals[hi]):
return matvals[lo]
else:
return matvals[hi]

# look up some values
for v in range(10):
vv = v/10.
closestVal = findClosestVal( vv )
print vv, "->", closestVal, "occurring at", valdict[closestVal]
 
T

Terry Reedy

Daniel Pryde said:
Hi everyone.
I was wondering if anyone might be able to help me out here. I'm currently
looking to find the quickest way to find a best fit match in a large array.
My problem is that I have an array of, say, 600*400, which contains a value
at each point, and I need to find the value in that array which is closest
to the input value. It's basically some euclidean distances that I've
calculated, and I need to be able to find the best matches over a large
number of input values, so I was looking for a speedy solution to this.
The array is implemented simply using a list of lists. The only solution I
could think of is to use some for statements, but that seems to take a
while, even for just one search.
Any hints or tips would be really helpful. :)

If the values in the array have any structure or constraints, try to
exploit that. If they are ordered in both directions, use binary search on
one axis and then trace the zigzag frontier between lower and height
values. If there are just a 'few' values (as in 0-255), then a dict might
work, though changing values would complicate this.

Terry J. Reedy
 
B

Brian Kelley

Daniel said:
The array is implemented simply using a list of lists. The only solution
I could think of is to use some for statements, but that seems to take a
while, even for just one search.
Ahh this brings back memories of my image processing days :)

I've implemented a couple of approaches.
1) using just one list to hold the data
2) using a list of lists
3) using numeric

I've optimized the search function for 1 and 2 a little bit by defining
the scoring function on the fly like this and binding input during the
function definition:

input = 0.5
def euclidean(x, input=input): return abs(x-input)

We can also use a bounding box technique such that if a minimum x is
found and the x < input then we can reject all points less than x
without having to perform the euclidean distance. Conversely if x >
input then we can reject all points greater than x. This really only
works if x is a single value and not a vector.

Don't forget that when looking for minimums using a euclidean distance
you don't have to perform the square root. (x1*x1+x2*x2) will find the
correct minimum as well as opposed to the more time consuming
math.sqrt(x1*x1+x2*x2)

Here are the results looking for data closest to an input value of 0.5
using a pretty loaded pentium 4 2ghz. The result is for a single search
against a random array of 600x400.

time row col value
0.130000114441 396 328 0.500001351171 single list
0.119999885559 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

So numeric is the clear winner here. However, when psyco is used
http://psyco.sourceforge.net/

time row col value
0.0400000810623 396 328 0.500001351171 single list
0.0499999523163 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

psyco actually wins here but more interesting is that the timing order
is reversed! I'll have to do a lot more trials to generate good
profiling, but I think the results are pretty good here.

Here is the test code, feel free to rip it apart as you will.
import Numeric, random, time

# create data
data = [random.random() for x in xrange(600*400)]

# test list of list
lists = []
for i in range(400): # rows
lists.append(data[i*600:(i+1)*600]) # columns

matrix = Numeric.array(lists)

def testarray(data=data):
t1 = time.time()
# straight forward approach
mn = 10e99
index = None
lx = -10e99 # keep track of the best found
hx = 10e99 # approaching from the bottom and top

input = 0.5
def euclidean(x, input=input):
return abs(x-input)

for i,x in enumerate(data):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = i

t2 = time.time()
row = index/600
col = index%600
print t2-t1, row, col, data[index], "single list"

def testlists(list=list):
mn = 10e99
index = None
input = 0.5
lx = -10e99 # keep track of the best found
hx = 10e99 # approaching from the bottom and top

def euclidean(x, input=input):
return abs(x-input)
t1 = time.time()
for r, row in enumerate(lists):
for c, x in enumerate(row):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = r,c

t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "list of lists"

def testmatrix(matrix=matrix):
t1 = time.time()
mins = abs(matrix-0.5)
mn = 10e99
index = None
for r,c in enumerate(Numeric.argmin(mins)):
y = mins[r,c]
if y < mn:
mn = y
index = r,c
t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "matrix"

# list of lists approach
testarray()
testlists()
testmatrix()

print "*"*44
print "try with psyco"
try:
import psyco
psyco.full()
testarray()
testlists()
testmatrix()
except ImportError:
print "psyco not available"
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,174
Messages
2,570,940
Members
47,484
Latest member
JackRichard

Latest Threads

Top