Speeding up access to STL vectors?

S

Steve555

Hi

My vector ( typedef vector<movieRatingPair*> ratingsV ) contains
this struct:

typedef struct{
unsigned short movieID;
unsigned short rating;
} movieRatingPair;

....and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Many thanks for any advice.

Steve


(*)There are nearly half million instances of these vectors. (1 per
customer which are stored in an stl map, keyed by their userID.)
In total, therefore, there are 100+ million records. It's effectively
a read-only database, with the data being initially fetched from disk,
and then the data MUST be kept in 2.5Gb of free memory for quick
access/analysis.
Unfortunately, using maps instead of vectors means each record is 32
bytes * 100+million = 3Gb
 
K

Kai-Uwe Bux

Steve555 said:
Hi

My vector ( typedef vector<movieRatingPair*> ratingsV ) contains
this struct:

typedef struct{
unsigned short movieID;
unsigned short rating;
} movieRatingPair;

a) You can ditch the typedef:

struct movieRatingPair {
unsigned short movieID;
unsigned short rating;
};

will do.

b) Is there a reason to use vector<movieRatingPair*> instead of
vector<movieRatingPair>? Note that dynamic allocation of objects usually
has some overhead: (1) the pointer (2) some size information about the
allocated object. You might end up using 12 bytes (4 in the pointer and 8
in the pointee) instead of 4 (which is what two unsigned shorts might give
you, but that is platform specific).
...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Provide a functor to compare movieRatingPair (or movieRatingPair*) by the
movieID. Then, sort the vector using std::sort() and then test for entries
with std::binary_search(). You can also find iterators to the entries with
std::lower_bound() and std::upper_bound().


[snip]


Best

Kai-Uwe Bux
 
J

Juha Nieminen

Steve555 said:
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID.

std::binary_search()? (Or if you want to get hold of the found
element, std::lower_bound().)
 
P

peter koch

Hi

My vector (   typedef   vector<movieRatingPair*> ratingsV    )  contains

Why do you have a pointer of vectors? You probably should have a
vector of MovieRatingPair.
this struct:

typedef struct{
        unsigned short  movieID;
        unsigned short  rating;

} movieRatingPair;

Why do you use a C-notation? In C++ the preferred notation is
struct movieRatingPair
{
...
};
...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.
There is std::binary_search, std::lower_bound and std::upper_bound for
that. Using that, having much better cache-coherency when searching
values instead of pointers and using far less memory (when not newing
your elements) should combine to give you a rather huge speed-up (I
would estimate a factor of more than ten).

/Peter
 
S

Steve555

  std::binary_search()? (Or if you want to get hold of the found
element, std::lower_bound().)

Thanks Juha, but as I said to Kai, I can't find any documentation on
how to use std::binary_search() on a vector that contains a struct.
 
S

Steve555

Why do you have a pointer of vectors? You probably should have a
vector of MovieRatingPair.




Why do you use a C-notation? In C++ the preferred notation is
struct movieRatingPair
{
  ...};




There is std::binary_search, std::lower_bound and std::upper_bound for
that. Using that, having much better cache-coherency when searching
values instead of pointers and using far less memory (when not newing
your elements) should combine to give you a rather huge speed-up (I
would estimate a factor of more than ten).

/Peter

Thanks, I like your estimate! Unfortunately I've got really stuck on
figuring out how to use binary_search
with a vector that contains structs. Any hints anyone can give me
using my data types would be greatly appreciated.
 
S

Steve555

Steve555 said:
My vector (   typedef vector<movieRatingPair*> ratingsV    )  contains
this struct:
typedef struct{
unsigned short        movieID;
unsigned short        rating;
} movieRatingPair;

a) You can ditch the typedef:

  struct movieRatingPair {
    unsigned short movieID;
    unsigned short rating;
  };

will do.

b) Is there a reason to use vector<movieRatingPair*> instead of
vector<movieRatingPair>? Note that dynamic allocation of objects usually
has some overhead: (1) the pointer (2) some size information about the
allocated object. You might end up using 12 bytes (4 in the pointer and 8
in the pointee) instead of 4 (which is what two unsigned shorts might give
you, but that is platform specific).
...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.
I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)
Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Provide a functor to compare movieRatingPair (or movieRatingPair*) by the
movieID. Then, sort the vector using std::sort() and then test for entries
with std::binary_search(). You can also find iterators to the entries with
std::lower_bound() and std::upper_bound().

[snip]

Best

Kai-Uwe Bux

Thanks Kai but I'm completely lost with getting binary_search to work
with a vector containing structs. They are already sorted upon
creation, so I didn't understand how re-sorting them would help.
 
P

peter koch

Thanks, I like your estimate! Unfortunately I've got really stuck on
figuring out how to use binary_search
with a vector that contains structs. Any hints anyone can give me
using my data types would be greatly appreciated.

One way would be to make your structs comparable, but to me this seems
wrong as you seemingly only want to compare part of the struct.

So write a function or a functor that compares the movieID's of two
movieRatingPairs. The functor way would be:

struct lessmovieIDs
{
bool operator()(movieRatingPair const& lhs,movieRatingPair const&
rhs)
// returns true iff the movieID of lhs is less than the movieID of
rhs
{
...
}
};

/Peter
 
K

Kai-Uwe Bux

Steve555 said:
Steve555 said:
My vector (   typedef vector<movieRatingPair*> ratingsV    )  contains
this struct:
typedef struct{
unsigned short        movieID;
unsigned short        rating;
} movieRatingPair;

a) You can ditch the typedef:

struct movieRatingPair {
unsigned short movieID;
unsigned short rating;
};

will do.

b) Is there a reason to use vector<movieRatingPair*> instead of
vector<movieRatingPair>? Note that dynamic allocation of objects usually
has some overhead: (1) the pointer (2) some size information about the
allocated object. You might end up using 12 bytes (4 in the pointer and 8
in the pointee) instead of 4 (which is what two unsigned shorts might
give you, but that is platform specific).
...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.
I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)
Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Provide a functor to compare movieRatingPair (or movieRatingPair*) by the
movieID. Then, sort the vector using std::sort() and then test for
entries with std::binary_search(). You can also find iterators to the
entries with std::lower_bound() and std::upper_bound().

[snip]

Best

Kai-Uwe Bux

Thanks Kai but I'm completely lost with getting binary_search to work
with a vector containing structs. They are already sorted upon
creation, so I didn't understand how re-sorting them would help.

It wouldn't. I missed the part about the vector being sorted already.

Here is a bit of code that might help you:

#include <vector>
#include <algorithm>
#include <cassert>

struct rating {

unsigned short ID;
unsigned short value;

};

rating make_rating ( unsigned short id, unsigned short val ) {
rating result;
result.ID = id;
result.value = val;
return ( result );
}

bool compare_by_ID ( rating lhs, rating rhs ) {
return ( lhs.ID < rhs.ID );
}

int main ( void ) {
std::vector< rating > r;
r.push_back( make_rating( 1, 23 ) );
r.push_back( make_rating( 13, 10 ) );
r.push_back( make_rating( 24, 1 ) );
assert( std::binary_search( r.begin(), r.end(),
make_rating( 13, 0 ), &compare_by_ID ) );
assert( ! std::binary_search( r.begin(), r.end(),
make_rating( 10, 0 ), &compare_by_ID ) );
}


Best

Kai-Uwe Bux
 
S

Steve555

It's no different from a vector that contains primitive types, once you
define a comparison operator.

     /* Compares values by movie_id.  Ratings are ignored. */
     bool operator<(movie_rating_pair lhs, movie_rating_pair rhs) {
         return lhs.movie_id < rhs.movie_id;
     }

     ...

     movie_rating_pair sought = { 837, 3 };
     if (std::binary_search(pairs.begin(), pairs.end(), sought)) {
         ...
     }

Thanks a lot chaps, those code examples really helped.
Using binary_search ( or lower_bound) and using values rather than
pointers, the time for 10 million queries has dropped from 92 to 6.5
seconds!
(You were right Peter)
And I'm using almost 1Gb less memory.
Now that's what I call a result!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,825
Latest member
VernonQuy6

Latest Threads

Top