Logic check please

R

Rich S.

I'm reading through records in a file creating bitset objects from them.
Each bitset object has a score and I'm trying to collect just the top 2000
scores from over a million records.

I tihnk the best way to go is to create a vector, add the objects to it in
ascending order while keeping track of the size and lowest score. If the
size reaches the 2000 mark then evaluate the score and if it's lower than
the lowest score, discard the object otherwise traverse the vector and
insert it at its appropriate space and finally chop off the beginning of the
vector so that we're still in the 2000 range.

When I say best way I mean the way I thought of given my somewhat limited
experience with stl containers. I was also thinking about determining if the
object belongs in the 2000 set based upon it's score and then tossing it in
the end of the vector and doing a vector sort but I figured I'd skip the
sort and insert it myself at the proper position.

I didn't see anything that sets a vectors size as fixed and only inserts if
above this range type of thing but I guess that's where you override the
allocater? I haven't been down that road but nows a good time to jump in I
guess.

Thanks

This convoluted method below if you care to read and process is what I came
up with.

iSize = m_BitSetVector.size();

// if the vector is empty then just add it and collect the score
if(iSize == 0)
{
iLowestTotal = currentSet.GetScore();
m_BitSetVector.push_back(currentSet);
continue;
}


bool bAtFullCapacity = false;
if(m_BitSetVector.size() > m_iTotalScoresToKeep)
{
bAtFullCapacity = true;

// if the score is less than the lowest in the vector and
// we're at full capacity then continue
if(currentSet.GetScore() < m_BitSetVector.begin()->GetScore())
continue;
}

BitSetVector::iterator p;
bool bInserted = false;

for(p = m_BitSetVector.begin(); p != m_BitSetVector.end(); ++p)
{
if(currentSet.GetScore() < p->GetScore())
{
// insert at this location
m_BitSetVector.insert(p, currentSet);
bInserted = true;

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}

break;
}
}

// if it's not inserted yet then
if(!bInserted)
{
m_BitSetVector.push_back(currentSet);

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}
}
 
N

niklasb

Rich said:
I'm reading through records in a file creating bitset objects from them.
Each bitset object has a score and I'm trying to collect just the top 2000
scores from over a million records.

A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}
 
G

Greg

Rich said:
I'm reading through records in a file creating bitset objects from them.
Each bitset object has a score and I'm trying to collect just the top 2000
scores from over a million records.

I tihnk the best way to go is to create a vector, add the objects to it in
ascending order while keeping track of the size and lowest score. If the
size reaches the 2000 mark then evaluate the score and if it's lower than
the lowest score, discard the object otherwise traverse the vector and
insert it at its appropriate space and finally chop off the beginning of the
vector so that we're still in the 2000 range.

When I say best way I mean the way I thought of given my somewhat limited
experience with stl containers. I was also thinking about determining if the
object belongs in the 2000 set based upon it's score and then tossing it in
the end of the vector and doing a vector sort but I figured I'd skip the
sort and insert it myself at the proper position.

I didn't see anything that sets a vectors size as fixed and only inserts if
above this range type of thing but I guess that's where you override the
allocater? I haven't been down that road but nows a good time to jump in I
guess.

Thanks

This convoluted method below if you care to read and process is what I came
up with.

iSize = m_BitSetVector.size();

// if the vector is empty then just add it and collect the score
if(iSize == 0)
{
iLowestTotal = currentSet.GetScore();
m_BitSetVector.push_back(currentSet);
continue;
}


bool bAtFullCapacity = false;
if(m_BitSetVector.size() > m_iTotalScoresToKeep)
{
bAtFullCapacity = true;

// if the score is less than the lowest in the vector and
// we're at full capacity then continue
if(currentSet.GetScore() < m_BitSetVector.begin()->GetScore())
continue;
}

BitSetVector::iterator p;
bool bInserted = false;

for(p = m_BitSetVector.begin(); p != m_BitSetVector.end(); ++p)
{
if(currentSet.GetScore() < p->GetScore())
{
// insert at this location
m_BitSetVector.insert(p, currentSet);
bInserted = true;

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}

break;
}
}

// if it's not inserted yet then
if(!bInserted)
{
m_BitSetVector.push_back(currentSet);

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}
}

Just call std::nth_element() to group the 2,000 highest items together
at the end of their container.

Here's a program to illustrate:

#include <vector>
#include <algorithm>
#include <iostream>

int main()
{
const int kTop2K = 2000;

std::vector<int> numberV;

// add numbers 1 - 1,000,000
for (int i=0; i < 10000000; i++)
numberV.push_back( i);

// shuffle the vector's contents
std::random_shuffle( numberV.begin(), numberV.end());

// visually confirm items are in random order
std::cout << "Last 2,000 elements (shuffled) " << std::endl;
for ( int i= numberV.size() - kTop2K; i < numberV.size() ; i++)
std::cout << numberV << " ";
std::cout << std::endl << std::endl;

assert( numberV.size() >= kTop2K);

// use nth_element with the nth element being the 2,000th
// position from the end of the container

std::nth_element(numberV.begin(),
numberV.end()-kTop2K,
numberV.end());

// the top 2,000 items have been moved after the nth element
// all lower ranked items are are now before nth element.
// Note that neither range has been completely sorted

std::cout << "Last 2,000 elements (final): " << std::endl;

// visually inspect results.
// all numbers listed should be > 998,000

for (int i=numberV.size() - kTop2K; i < numberV.size() ; i++)
std::cout << numberV << " ";
}

Greg
 
R

Rich S.

A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}


I tried it and it worked but the only down side is that when I go to dump
the 2000 they're are in ascending order and I wish them to be in descending
order. There must be a way to dump the p queue into a vector quickly and
sort that and then dump it.

Thanks, the logic is much cleaner now.
 
R

Rich S.

Greg said:
Rich said:
I'm reading through records in a file creating bitset objects from them.
Each bitset object has a score and I'm trying to collect just the top
2000
scores from over a million records.

I tihnk the best way to go is to create a vector, add the objects to it
in
ascending order while keeping track of the size and lowest score. If the
size reaches the 2000 mark then evaluate the score and if it's lower than
the lowest score, discard the object otherwise traverse the vector and
insert it at its appropriate space and finally chop off the beginning of
the
vector so that we're still in the 2000 range.

When I say best way I mean the way I thought of given my somewhat limited
experience with stl containers. I was also thinking about determining if
the
object belongs in the 2000 set based upon it's score and then tossing it
in
the end of the vector and doing a vector sort but I figured I'd skip the
sort and insert it myself at the proper position.

I didn't see anything that sets a vectors size as fixed and only inserts
if
above this range type of thing but I guess that's where you override the
allocater? I haven't been down that road but nows a good time to jump in
I
guess.

Thanks

This convoluted method below if you care to read and process is what I
came
up with.

iSize = m_BitSetVector.size();

// if the vector is empty then just add it and collect the score
if(iSize == 0)
{
iLowestTotal = currentSet.GetScore();
m_BitSetVector.push_back(currentSet);
continue;
}


bool bAtFullCapacity = false;
if(m_BitSetVector.size() > m_iTotalScoresToKeep)
{
bAtFullCapacity = true;

// if the score is less than the lowest in the vector and
// we're at full capacity then continue
if(currentSet.GetScore() < m_BitSetVector.begin()->GetScore())
continue;
}

BitSetVector::iterator p;
bool bInserted = false;

for(p = m_BitSetVector.begin(); p != m_BitSetVector.end(); ++p)
{
if(currentSet.GetScore() < p->GetScore())
{
// insert at this location
m_BitSetVector.insert(p, currentSet);
bInserted = true;

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}

break;
}
}

// if it's not inserted yet then
if(!bInserted)
{
m_BitSetVector.push_back(currentSet);

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}
}

Just call std::nth_element() to group the 2,000 highest items together
at the end of their container.

Here's a program to illustrate:

#include <vector>
#include <algorithm>
#include <iostream>

int main()
{
const int kTop2K = 2000;

std::vector<int> numberV;

// add numbers 1 - 1,000,000
for (int i=0; i < 10000000; i++)
numberV.push_back( i);

// shuffle the vector's contents
std::random_shuffle( numberV.begin(), numberV.end());

// visually confirm items are in random order
std::cout << "Last 2,000 elements (shuffled) " << std::endl;
for ( int i= numberV.size() - kTop2K; i < numberV.size() ; i++)
std::cout << numberV << " ";
std::cout << std::endl << std::endl;

assert( numberV.size() >= kTop2K);

// use nth_element with the nth element being the 2,000th
// position from the end of the container

std::nth_element(numberV.begin(),
numberV.end()-kTop2K,
numberV.end());

// the top 2,000 items have been moved after the nth element
// all lower ranked items are are now before nth element.
// Note that neither range has been completely sorted

std::cout << "Last 2,000 elements (final): " << std::endl;

// visually inspect results.
// all numbers listed should be > 998,000

for (int i=numberV.size() - kTop2K; i < numberV.size() ; i++)
std::cout << numberV << " ";
}

Greg


That's good and I'll definately keep that snippet for later use but I don't
think I want to store all of the objects in memory at the same time because
I'll quickly run out of memory.
 
K

Karl Heinz Buchegger

Rich S. said:
I tried it and it worked but the only down side is that when I go to dump
the 2000 they're are in ascending order and I wish them to be in descending
order. There must be a way to dump the p queue into a vector quickly

A loop ?

std::vector<T> Result;
Result.reserve( N );

while( !q.empty() )
Result.push_back( q.top() );
q.pop();
}
and
sort that

No need to sort it.
The items poped from the queue already are sorted. Just iterate
through the vector in reverse order.
and then dump it.

If you don't want to create that vector, you could eg. use
a recursive function to flip the order:

void Dump( priority_queue< T, std::vector<int>, MyLess<int> > q )
{
if( q.empty() )
return;

T Value = q.top();
q.pop();

Dump( q );
cout << Value << endl;
}

but I doubt, that this is faster, clearer or uses less memory space then
the vector.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,996
Messages
2,570,238
Members
46,826
Latest member
robinsontor

Latest Threads

Top