Best algorithm...

A

ahmadcorp.m

Hi,

Need some ideas for this problem:

If you had a file containing boundaries [A,B] (up to 25 million) and a
file containing a list of points P (up to 10 billion). What would be
the fastest way to determine which of the listed boundaries contain a
point P?

Standard ugly std::count_if and map inserts and lookup are far to
slow for this problem. Any other ideas?

(problem is a simpler version of a derivative default map issue I'm
having)

Thanks for any help.
 
P

Phil Endecott

If you had a file containing boundaries [A,B] (up to 25 million) and a
file containing a list of points P (up to 10 billion). What would be
the fastest way to determine which of the listed boundaries contain a
point P?

The 25 million points will happily fit in memory.

Make a binary tree where each node represents an interval min to max,
and the two sub-trees represent intervals min to mid and mid to max.
Then store intervals that cross mid in the node. Pseudo-code:

class Node {
point_t min;
point_t mid;
point_t max;

Node_ptr leftnode;
Node_ptr rightnode;

set< pair<point_t,point_t> > cross_mid;

public:
Node(point_t min_, point_t max_):
min(min_), mid((min_+max_)/2), max(max_),
leftnode(new Node(min,mid)), // you need to postpone this until
rightnode(new Node(mid,max)) // it's needed
{}

void insert(pair<point_t,point_t> b) {
if (b.second<mid) left->insert(b)
else if (b.first>mid) right->insert(b)
else cross_mid.insert(b);
}

set< pair<point_t,point_t> > find(point_t p) {
return cross_mid.find(p) // left as an exercise :)
union ( (p<mid) ? leftnode : rightnode ) -> find(p);
}
};


Like all binary trees, this suffers if it becomes unbalanced. Consider
inserting in a random order, perhaps. (Are your files ordered?)

The subdivision will work best if the ranges are fairly evenly
distributed. In this case, the inserts should be O(N log N) and the
lookups each O(log N + M). If the ranges are strongly clustered some
other approach might be better.


Phil.
 
A

ahmadcorp.m

Thanks for the info. I have also considered a binary tree, however I
forgot to mention a key point that might not work for a binary tree.

We need to determine all applicable boundary values for a specific
point. Eg. point 3 is contained within 2000 of the 25 million
boundaries specified.

A binary tree would imply multiple searches to find all applicable
boundaries. Or some trickery based on the depth of node on a tree.
 
P

Phil Endecott

Thanks for the info. I have also considered a binary tree, however I
forgot to mention a key point that might not work for a binary tree.

We need to determine all applicable boundary values for a specific
point. Eg. point 3 is contained within 2000 of the 25 million
boundaries specified.

A binary tree would imply multiple searches to find all applicable
boundaries.

No, why?

As you descend the tree, you accumulate all the ranges that include your
point.
 
D

Dario Saccavino

Thanks for the info. I have also considered a binary tree, however I
forgot to mention a key point that might not work for a binary tree.

We need to determine all applicable boundary values for a specific
point. Eg. point 3 is contained within 2000 of the 25 million
boundaries specified.

After wiki-ing a while, I've found the following:

http://en.wikipedia.org/wiki/Interval_tree
http://en.wikipedia.org/wiki/Segment_tree

Phil's solution is very similar to "Alternative 1" in the first
article and I think it's the most appropriate for your problem.

Dario
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,183
Messages
2,570,967
Members
47,516
Latest member
ChrisHibbs

Latest Threads

Top