Best algorithm...

ahmadcorp.m · Feb 5, 2008

Hi,

Need some ideas for this problem:

If you had a file containing boundaries [A,B] (up to 25 million) and a
file containing a list of points P (up to 10 billion). What would be
the fastest way to determine which of the listed boundaries contain a
point P?

Standard ugly std::count_if and map inserts and lookup are far to
slow for this problem. Any other ideas?

(problem is a simpler version of a derivative default map issue I'm
having)

Thanks for any help.

Phil Endecott · Feb 5, 2008

If you had a file containing boundaries [A,B] (up to 25 million) and a
file containing a list of points P (up to 10 billion). What would be
the fastest way to determine which of the listed boundaries contain a
point P?

The 25 million points will happily fit in memory.

Make a binary tree where each node represents an interval min to max,
and the two sub-trees represent intervals min to mid and mid to max.
Then store intervals that cross mid in the node. Pseudo-code:

class Node {
point_t min;
point_t mid;
point_t max;

Node_ptr leftnode;
Node_ptr rightnode;

set< pair<point_t,point_t> > cross_mid;

public:
Node(point_t min_, point_t max_):
min(min_), mid((min_+max_)/2), max(max_),
leftnode(new Node(min,mid)), // you need to postpone this until
rightnode(new Node(mid,max)) // it's needed
{}

void insert(pair<point_t,point_t> b) {
if (b.second<mid) left->insert(b)
else if (b.first>mid) right->insert(b)
else cross_mid.insert(b);
}

set< pair<point_t,point_t> > find(point_t p) {
return cross_mid.find(p) // left as an exercise

union ( (p<mid) ? leftnode : rightnode ) -> find(p);
}
};

Like all binary trees, this suffers if it becomes unbalanced. Consider
inserting in a random order, perhaps. (Are your files ordered?)

The subdivision will work best if the ranges are fairly evenly
distributed. In this case, the inserts should be O(N log N) and the
lookups each O(log N + M). If the ranges are strongly clustered some
other approach might be better.

Phil.

ahmadcorp.m · Feb 5, 2008

Thanks for the info. I have also considered a binary tree, however I
forgot to mention a key point that might not work for a binary tree.

We need to determine all applicable boundary values for a specific
point. Eg. point 3 is contained within 2000 of the 25 million
boundaries specified.

A binary tree would imply multiple searches to find all applicable
boundaries. Or some trickery based on the depth of node on a tree.

Phil Endecott · Feb 5, 2008

Thanks for the info. I have also considered a binary tree, however I
forgot to mention a key point that might not work for a binary tree.

We need to determine all applicable boundary values for a specific
point. Eg. point 3 is contained within 2000 of the 25 million
boundaries specified.

A binary tree would imply multiple searches to find all applicable
boundaries.

No, why?

As you descend the tree, you accumulate all the ranges that include your
point.

Dario Saccavino · Feb 6, 2008

Thanks for the info. I have also considered a binary tree, however I
forgot to mention a key point that might not work for a binary tree.

We need to determine all applicable boundary values for a specific
point. Eg. point 3 is contained within 2000 of the 25 million
boundaries specified.

After wiki-ing a while, I've found the following:

http://en.wikipedia.org/wiki/Interval_tree
http://en.wikipedia.org/wiki/Segment_tree

Phil's solution is very similar to "Alternative 1" in the first
article and I think it's the most appropriate for your problem.

Dario

ahmadcorp.m · Feb 6, 2008

After wiki-ing a while, I've found the following:

http://en.wikipedia.org/wiki/Interval_treehttp://en.wikipedia.org/wiki/Segment_tree

Phil's solution is very similar to "Alternative 1" in the first
article and I think it's the most appropriate for your problem.

Dario

Thanks for the references. The algorithm is much clearer now. Thanks
to all. Immensely appreciate it.

Best algorithm/data structure combination for word lookup?	8	Jul 27, 2007
Performance of list.index - how to speed up a silly algorithm?	1	Apr 29, 2010
Which is a better way to implement this algorithm?	19	Apr 20, 2008
Innovation, my TSP algorithm and factoring, timelines	38	Aug 23, 2008
high traffic/availability application and gnu_cxx::hash_map problem -better to use tr1/unordered_map	1	Jan 5, 2011
C Containers Library vs STL	14	Aug 3, 2011
Data type problem in encryption algorithm	11	Nov 25, 2005
Optimizing list processing	17	Dec 11, 2013

Best algorithm...

ahmadcorp.m

Phil Endecott

ahmadcorp.m

Phil Endecott

Dario Saccavino

ahmadcorp.m

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads