Selecting Container for "Top 20" Application

M

Mike Copeland

I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.
None of the available STL containers seems well suited for this
application. For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available; insertion at some point is difficult;
etc. Replacement of the "end value" and sorting might work, but it's
tedious and slow...
- vector, while having insert & erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.
- queue/deque, set, and stack seem inappropriate for this application,
and the others (map, list, etc.) are completely unusable here.
Am I missing something? Is/are there reasonably efficient ways to
use array or vector that are appropriate? TIA
 
S

Stefan Ram

Am I missing something? Is/are there reasonably efficient ways to

Interfaces. Do not code to an implementation. Code to an interface.

By »interface« I mean: (abstract) pure-virtual (base) class.

Write that interface to contain the functions you dream of
(the perfect container of your dreams, that is perfectly
suited to your needs)- as if they were already implemented.
Do not think about how to implement them at this time.

Next, implement (using a derived class with implementations
for the function signatures of the interface) them in a way
that optimizes not run-time, but your development time. If
you believe in testing or documentation, at this point, you
now might write documentation or tests for these classes.

Now, you can write your actual program using the dream
container.

Only when you then should observe that it is too slow, you
then can derive another class from the interface with a more
run-time efficient implementation.
 
J

Jorgen Grahn

Interfaces. Do not code to an implementation. Code to an interface.

Are you sure you're replying to the right posting? I cannot see how
run-time polymorphism would help Mr Copeland -- and I cannot see that
he mistakenly tries to code to an implementation.

/Jorgen
 
J

Jorgen Grahn

I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.

Do you have all 300K records in a container already for some other
purpose? Or do you e.g. read them from file and really only want the
top 20?

Does ">300,000" mean "slightly more than that" or "the program should
deal gracefully with more entries than will fit into memory"? Because
that's feasible, and sometimes desirable.

In either case, it's not a matter of finding a container, but finding
a suitable algorithm+container pair.

Personally I think I'd settle for:
- accumulating into a std::set<Record>
- trim the set as soon as it grows to size 21
- if the set is at size 20, never try to insert a
record less than the current min (pointless)

That can handle any input data set size. The only drawback I can see
is it's slow if the input is already sorted. (Depending on what you
want it for, that can be a showstopper.)

/Jorgen
 
C

Carlo Milanesi

Mike said:
I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.
None of the available STL containers seems well suited for this
application.

In the standard library there are several sorting algorithms.
Look here for what is more appropriate to you:
http://en.wikibooks.org/wiki/Optimizing_C++/General_optimization_techniques/Sorting#Partial_sorting

If your data is in the vector v, use:
std::nth_element(v.begin(), v.begin() + 20, v.end());
to get in the first 20 places the smallest 20 items of the vector,
unsorted; use the slower:
std::partial_sort(v.begin(), v.begin() + 20, v.end());
if you want them sorted.
 
K

kfrank29.c

Hello Mike!

I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.

To clarify, let me add some details I am assuming about
the problem you are trying to solve.

You are reading 300,000 records sequentially from some
"external" source (e.g., a file or database), but not
storing them. The records, as they are read in, are
unordered.

You need to find and store in memory the top 20, as given
by some sort order.

Please correct me if my assumptions are wrong, especially
assuming that you don't need to store the entire 300,000
records.

For the sake of simplicity, I'll assume that all 300,000
records are distinct, and that your sort order is a "total"
order, that is, that for any two (distinct) records, either
A < B or B < A.
None of the available STL containers seems well suited for this
application.

I think std::set would work for this. Note, std::set
is explicitly a sorted container (unlike the new
std::unordered_set, a hash table.).
For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available; insertion at some point is difficult;
etc. Replacement of the "end value" and sorting might work, but it's
tedious and slow...

Agreed. std::array doesn't sound appropriate.
- vector,

From the perspective of your problem, array and vector are
basically the same, so I don't think vector is appropriate.
while having insert & erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.

Some more comments:

But if you only need to store a maximum of 20 elements, the
size expansions won't matter. (And even if it did, you could
reserve the needed capacity.)
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.

Agreed. Resorting upon insertion sounds suboptimal.
- queue/deque, set, and stack seem inappropriate for this application,

Queue and stack are container adapters that live on top
of, for example, a std::deque. They and deque don't give
efficient insertion into the middle, and suffer the same
problems as array and vector. (Plus queue and stack hide
part of the interface provided by vector and deque that
you would want.)

But, as I mentioned above, std::set is different.
and the others (map, list, etc.) are completely unusable here.

Well, I wouldn't say "completely unusuable," just
not really appropriate. You could use list, similar
to vector (with similar disadvantages), and you could
use map as if it were a set, simply ignoring the value
component of the key / value pair.
Am I missing something? Is/are there reasonably efficient ways to
use array or vector that are appropriate? TIA

I don't see a good way to use array or vector, but
std:set makes sense to me (given my assumptions,
above), as follows:

Loop over your records:

Record r = ...;
if (myTopTwentySet.size() < 20) myToTwentySet.insert (r);
else {
if (r > *myTopTwentySet.begin()) {
myTopTwentySet.erase (myTopTwentySet.begin());
myToTwentySet.insert (r);
}
}

When you're done looping over your records, myTopTwentySet
will contain the 20 largest records (assuming you had
at least 20 records), and, if you care, they will be in
sorted order, that is, myTopTwentySet.begin() will point
to the twentieth largest record, and myTopTwentySet.rbegin()
will point to the largest.

Note that almost all of the time you will only be doing
the test:

if (r > *myTopTwentySet.begin()) {...}

which will not be true, so the details of inserting and
resorting are not really that important as long as you
have a cheap way of keeping track of the smallest of your
current top twenty. For this, std::set is convenient, and
probably nearly optimal, but you won't lose that much by
"rolling your own" (out of array or vector or what-not).

The basic analysis says 300,000 times you test your current
record against your current twentieth best -- cheap, and
(except in perverse cases) -- I'm guessing here -- something
like 20 + log (300,000) times you do something more expensive,
namely replace / resort.


Good luck.


K. Frank
 
S

Stefan Ram

Jorgen Grahn said:
Are you sure you're replying to the right posting? I cannot see how
run-time polymorphism would help Mr Copeland -- and I cannot see that
he mistakenly tries to code to an implementation.

You are right: The run-time part was not necessary. I translated
from Java too literally. Compile-time polymorphism is sufficient here!

Otherwise, I had the impression that the OP worried about
efficiency too early, because he worried that ::std::vector
might be slow. So I stand by the rest of my post.
 
J

Jorgen Grahn

You are right: The run-time part was not necessary. I translated
from Java too literally. Compile-time polymorphism is sufficient here!

Otherwise, I had the impression that the OP worried about
efficiency too early, because he worried that ::std::vector
might be slow. So I stand by the rest of my post.

Yes, and I agree with that part. Too much worrying, and too little
experimentation.

/Jorgen
 
K

K. Frank

Hi Robert!

What am I missing here? A priority queue is the obvious structure.

I think std::priority_queue works. It's also a sorted
container (technically a container adapter), which is
the main point. The only disadvantage I can see depends
on what you want to do with the top 20 after you have
them. The priority_queue hides some of the interface of
its underlying container, and that might be inconvenient.


Thanks.


K. Frank
 
K

K. Frank

Hello Sam!

...
std::set, std::multiset, std::map, or std::multimap will always iterate in
key order.

Put your records into one of these four, whichever one's appropriate,
specifying a comparison functor that order on your key value.

The disadvantage of this is you end up paying the cost
of sorting (and also storing) the entire set of 300,000
records.
Then iterate over the first twenty elements of your container.

This will indeed give you what you want (but at the extra
cost mentioned above).


Best.


K. Frank
 
M

Marcel Müller

I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have>300,000 data records that I must scan
to find and store the highest 20 values.

300.000 applications or 300.000 hits from a smaller set of applications.

300.000 applications:
That's easy. You need a sorted array with 20 elements. As soon as an
application reaches Top 21 it is finally out of scope. If the counter of
an application increases, you can check whether it fits into the top 20
set by simply comparing against the last element. Due to the nature of
this problem, the counters can never decrease. So you never need to look
for the 21st element that may become the 20th.

300.000 hits:
Group by the application key and then go as above. For the group by a
hash_map said:
None of the available STL containers seems well suited for this
application. For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available;

Why do you need delete/erase?
insertion at some point is difficult;

It's O(n).
- vector, while having insert& erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.

Moving around 20 elements should be slow?


By the way: your problem is easily solvable by a recursive, distributed
algorithm. As set of top 20 can always be created from the top 20 of any
subsets. So you can partition your data as you like and take the top 20.
At the end all you need is a merge sort of the different top 20 lists
for the subsets to get the final result.


Marcel
 
Ö

Öö Tiib

I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.

None of the available STL containers seems well suited for this
application. For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available; insertion at some point is difficult;
etc. Replacement of the "end value" and sorting might work, but it's
tedious and slow...

- vector, while having insert & erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.

Have you measured how slow is to push_back() all 300K records to vector
and then to call std::sort once to sort it?
- queue/deque, set, and stack seem inappropriate for this application,
and the others (map, list, etc.) are completely unusable here.

You describe that you have set of records from what you want to get top
20. I do not understand why std::set is not appropriate. Other option I
use on such cases is priority_queue.
Am I missing something? Is/are there reasonably efficient ways to
use array or vector that are appropriate? TIA

priority_queue is adapter around vector or deque.

You said in some other thread that you do not have profiler. If you want
to analyze performance of application and compare different containers in
context of your application then you perhaps should find a profiler.
 
Ö

Öö Tiib

Other option I use on such cases is priority_queue.

Forgot to elaborate how I typically use it:

fill priority_queue with 20 elements (typically iterators to other container)
while there are more elements to consider
push one more element
pop bottom element

As result you have 20 top elements in priority_queue.
 
P

Paul N

I am looking for an STL container that will efficiently handle a "Top

20" list. Specifically, I have >300,000 data records that I must scan

to find and store the highest 20 values.

None of the available STL containers seems well suited for this

application. For example:

- array, even though fixed in size, seems cumbersome to use: no delete

or erase function is available; insertion at some point is difficult;

etc. Replacement of the "end value" and sorting might work, but it's

tedious and slow...

- vector, while having insert & erase functions, incurs repeated size

expansion (expensive!) and truncation to maintain a fixed size.

Appending a new value if it's greater than the 20th element, followed by

sorting and truncation might be less work, but it might be very slow to

execute.

- queue/deque, set, and stack seem inappropriate for this application,

and the others (map, list, etc.) are completely unusable here.

Am I missing something? Is/are there reasonably efficient ways to

use array or vector that are appropriate? TIA

Can't you use something like a linked list? How about the following:

For first 20 records:

allocate space
read in record
run through list and insert record at appropriate point

You then allocate a further space, and set three pointers - head to point at the first of your 20 records, tail to point at the last, and spare to point at the spare one.

Now, until end of file:
read a record into spare
compare spare with tail
- if spare is less, it's not in the top 20 so repeat, but if not:
run through list inserting record at correct place
set tail to point at the record which is now 20th
set spare to point at the old tail, which can now be over-written

Then at the end just deallocate spare. You only need 21 allocations no matter how many records you have (if you're really trying you could allocate one big chunk for all of them in one go) because you reuse the space. Insertions are quick because for a linked list you don't move anything, you just change pointers. And you only need to run through the list when you've got arecord that is in the top 20 so far.
 
N

Norman J. Goldstein

I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.
None of the available STL containers seems well suited for this
application. For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available; insertion at some point is difficult;
etc. Replacement of the "end value" and sorting might work, but it's
tedious and slow...
- vector, while having insert & erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.
- queue/deque, set, and stack seem inappropriate for this application,
and the others (map, list, etc.) are completely unusable here.
Am I missing something? Is/are there reasonably efficient ways to
use array or vector that are appropriate? TIA
Well, I have read all the comments. Here are mine:

Use a balanced binary tree.
For the first 20 elements, do insert().
For any further elements, do find() and replace() .
You write the replace() method to just change an existing
value with the new one. This way, the set stays at 20 elements,
and rebalancing is not needed on the internal tree. You will need
at most 5 comparisons for each find() and will need to disambiguate
left and right branches when dong the replace -- always replace
the next smallest element.

I don't know if any std structure allows this fine control.
I checked std::set, and find() returns a set::end iterator, rather
than where it would have inserted the new element.
 
N

Norman J. Goldstein

Use a balanced binary tree.
For the first 20 elements, do insert().
For any further elements, do find() and replace() .
You write the replace() method to just change an existing
value with the new one. This way, the set stays at 20 elements,
and rebalancing is not needed on the internal tree. You will need
at most 5 comparisons for each find() and will need to disambiguate
left and right branches when dong the replace -- always replace
the next smallest element.

I don't know if any std structure allows this fine control.
I checked std::set, and find() returns a set::end iterator, rather
than where it would have inserted the new element.
Sorry, please ignore this response of mine. Wrong algirmthm
 
J

Jorgen Grahn

....

Not to pick on you specifically Paul (the post had to go somewhere!),
but has the state of programming knowledge really deteriorated so much
that almost no one knows what a priority queue is?

Surely you can't use std::priority_queue without placing all 300K
entries in it? I think that's the main reason people are avoiding
it.

Perhaps that space requirement doesn't matter to the OP, but
it's tempting to try to avoid it, when it's obvious that it
/is/ avoidable.

/Jorgen
 
Ö

Öö Tiib

Surely you can't use std::priority_queue without placing all 300K
entries in it? I think that's the main reason people are avoiding
it.

Sorry if I misunderstood some sarcasm but if result of algorithm is 20
elements then why you need more than 20 elements in priority queue?
Perhaps that space requirement doesn't matter to the OP, but
it's tempting to try to avoid it, when it's obvious that it
/is/ avoidable.

Priority queue does only heapify those 20 elements instead of sorting.
 
S

Sebastian Pfützner

std::vector and std::nth_element without any doubts :)

Not if performance is important. :)

If n is the element count and m the number of top-elements you have:

Filling vector: O(n)
Finding top-elements via nth_element: m*O(n)

Filling heap: O(n)
Finding top-elements via heap-extract: m*O(log n)

So a heap can be the better choice. It would be interesting to measure,
if this is true with n=300k and m=20.

Boost has some implementations of heaps with various complexities of
different operations.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,965
Messages
2,570,148
Members
46,710
Latest member
FredricRen

Latest Threads

Top