Q: How do I keep objects unique in a priority_queue?


Rich S.

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so


------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

// note: std::priority_queue has no reserve(); if capacity matters,
// pass a pre-reserved vector to the queue's constructor

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until obj is NOT greater
than the top and then pushing obj, but I'm not convinced that's the best way.

Thanks for any help.

Regards.
 

mlimber

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

[snip old message]

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M
 

Kai-Uwe Bux

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so

[snip old message]

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until obj is NOT greater
than the top and then pushing obj, but I'm not convinced that's the best way.

If you care about uniqueness, then maybe std::set is the way to go. Consider
something like this:

#include <set>

template < typename T, unsigned long N >
struct top_N {

    std::set<T> elements;

    void insert ( T const & t ) {
        elements.insert( t );
        if ( elements.size() > N ) {
            elements.erase( elements.begin() );
        }
    }

}; // struct top_N


Best

Kai-Uwe Bux
 

Kristo

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

[snip old message]
This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.
I was thinking of running a loop around the pop until obj is NOT greater
than the top and then pushing obj, but I'm not convinced that's the best way.

That's too much work IMHO, and probably inefficient. Just let the set
do that for you.

Kristo
 

Mark P

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

[snip old message]

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.
 

niklasb

Kristo said:
Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

[snip old message]
This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.

I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much less tested it, so use at your own risk and all that.

const std::size_t N = 2000;  // number of records to keep
std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
    // Is obj one of the highest so far?
    if (highest.size() < N || obj > *highest.begin())
    {
        // Insert it into the set; this has no effect if the set
        // already contains an identical value.
        highest.insert(obj);

        // If there are too many elements, discard the lowest one.
        if (highest.size() > N)
        {
            highest.erase(highest.begin());
        }
    }
}
 

Rich S.

mlimber said:
Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M
Yes, duplicates in my data, which I expected. I switched to a set-based
solution and it works well so far.

Thanks for the tip about unique that'll be handy one day.
 

Rich S.

Kristo said:
[snip old message]

I switched to a set-based solution and it works pretty well.

Thanks
 

Rich S.

Mark P said:
I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.

I switched to a set but was thinking about a map solution for speed.
 

Rich S.

Kai-Uwe Bux said:
If you care about uniqueness, then maybe std::set is the way to go.

[snip code]

Thanks, I did switch to a set solution and it works well.
 
