Hello Mike!
I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.
To clarify, let me add some details I am assuming about
the problem you are trying to solve.
You are reading 300,000 records sequentially from some
"external" source (e.g., a file or database), but not
storing them. The records, as they are read in, are
unordered.
You need to find and store in memory the top 20, as given
by some sort order.
Please correct me if my assumptions are wrong, especially
assuming that you don't need to store the entire 300,000
records.
For the sake of simplicity, I'll assume that all 300,000
records are distinct, and that your sort order is a "total"
order, that is, that for any two (distinct) records, either
A < B or B < A.
None of the available STL containers seems well suited for this
application.
I think std::set would work for this. Note, std::set
is explicitly a sorted container (unlike the new
std::unordered_set, a hash table.).
For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available; insertion at some point is difficult;
etc. Replacement of the "end value" and sorting might work, but it's
tedious and slow...
Agreed. std::array doesn't sound appropriate.
From the perspective of your problem, array and vector are
basically the same, so I don't think vector is appropriate.
while having insert & erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.
Some more comments:
But if you only need to store a maximum of 20 elements, the
size expansions won't matter. (And even if it did, you could
reserve the needed capacity.)
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.
Agreed. Resorting upon insertion sounds suboptimal.
- queue/deque, set, and stack seem inappropriate for this application,
Queue and stack are container adapters that live on top
of, for example, a std::deque. They and deque don't give
efficient insertion into the middle, and suffer the same
problems as array and vector. (Plus queue and stack hide
part of the interface provided by vector and deque that
you would want.)
But, as I mentioned above, std::set is different.
and the others (map, list, etc.) are completely unusable here.
Well, I wouldn't say "completely unusuable," just
not really appropriate. You could use list, similar
to vector (with similar disadvantages), and you could
use map as if it were a set, simply ignoring the value
component of the key / value pair.
Am I missing something? Is/are there reasonably efficient ways to
use array or vector that are appropriate? TIA
I don't see a good way to use array or vector, but
std:set makes sense to me (given my assumptions,
above), as follows:
Loop over your records:
Record r = ...;
if (myTopTwentySet.size() < 20) myToTwentySet.insert (r);
else {
if (r > *myTopTwentySet.begin()) {
myTopTwentySet.erase (myTopTwentySet.begin());
myToTwentySet.insert (r);
}
}
When you're done looping over your records, myTopTwentySet
will contain the 20 largest records (assuming you had
at least 20 records), and, if you care, they will be in
sorted order, that is, myTopTwentySet.begin() will point
to the twentieth largest record, and myTopTwentySet.rbegin()
will point to the largest.
Note that almost all of the time you will only be doing
the test:
if (r > *myTopTwentySet.begin()) {...}
which will not be true, so the details of inserting and
resorting are not really that important as long as you
have a cheap way of keeping track of the smallest of your
current top twenty. For this, std::set is convenient, and
probably nearly optimal, but you won't lose that much by
"rolling your own" (out of array or vector or what-not).
The basic analysis says 300,000 times you test your current
record against your current twentieth best -- cheap, and
(except in perverse cases) -- I'm guessing here -- something
like 20 + log (300,000) times you do something more expensive,
namely replace / resort.
Good luck.
K. Frank