Speeding up access to STL vectors?

Steve555 · Dec 5, 2008

Hi

My vector ( typedef vector<movieRatingPair*> ratingsV ) contains
this struct:

typedef struct{
unsigned short movieID;
unsigned short rating;
} movieRatingPair;

....and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Many thanks for any advice.

Steve

(*)There are nearly half million instances of these vectors. (1 per
customer which are stored in an stl map, keyed by their userID.)
In total, therefore, there are 100+ million records. It's effectively
a read-only database, with the data being initially fetched from disk,
and then the data MUST be kept in 2.5Gb of free memory for quick
access/analysis.
Unfortunately, using maps instead of vectors means each record is 32
bytes * 100+million = 3Gb

Kai-Uwe Bux · Dec 5, 2008

Steve555 said:
Hi

My vector ( typedef vector<movieRatingPair*> ratingsV ) contains
this struct:

typedef struct{
unsigned short movieID;
unsigned short rating;
} movieRatingPair;

a) You can ditch the typedef:

struct movieRatingPair {
unsigned short movieID;
unsigned short rating;
};

will do.

b) Is there a reason to use vector<movieRatingPair*> instead of
vector<movieRatingPair>? Note that dynamic allocation of objects usually
has some overhead: (1) the pointer (2) some size information about the
allocated object. You might end up using 12 bytes (4 in the pointer and 8
in the pointee) instead of 4 (which is what two unsigned shorts might give
you, but that is platform specific).

...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Provide a functor to compare movieRatingPair (or movieRatingPair*) by the
movieID. Then, sort the vector using std::sort() and then test for entries
with std::binary_search(). You can also find iterators to the entries with
std::lower_bound() and std::upper_bound().

[snip]

Best

Kai-Uwe Bux

Juha Nieminen · Dec 5, 2008

Steve555 said:
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID.

std::binary_search()? (Or if you want to get hold of the found
element, std::lower_bound().)

joecook · Dec 5, 2008

Hi

Unfortunately, using maps instead of vectors means each record is 32
bytes * 100+million = 3Gb

You might also consider looking at the functions make_heap(), and the
related ones.

Joe

peter koch · Dec 5, 2008

Hi

My vector ( typedef vector<movieRatingPair*> ratingsV ) contains

Why do you have a pointer of vectors? You probably should have a
vector of MovieRatingPair.

this struct:

typedef struct{
unsigned short movieID;
unsigned short rating;

} movieRatingPair;

Why do you use a C-notation? In C++ the preferred notation is
struct movieRatingPair
{
...
};

...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

There is std::binary_search, std::lower_bound and std::upper_bound for
that. Using that, having much better cache-coherency when searching
values instead of pointers and using far less memory (when not newing
your elements) should combine to give you a rather huge speed-up (I
would estimate a factor of more than ten).

/Peter

Steve555 · Dec 5, 2008

std::binary_search()? (Or if you want to get hold of the found
element, std::lower_bound().)

Thanks Juha, but as I said to Kai, I can't find any documentation on
how to use std::binary_search() on a vector that contains a struct.

Steve555 · Dec 5, 2008

Why do you have a pointer of vectors? You probably should have a
vector of MovieRatingPair.

Why do you use a C-notation? In C++ the preferred notation is
struct movieRatingPair
{
...};

There is std::binary_search, std::lower_bound and std::upper_bound for
that. Using that, having much better cache-coherency when searching
values instead of pointers and using far less memory (when not newing
your elements) should combine to give you a rather huge speed-up (I
would estimate a factor of more than ten).

/Peter

Thanks, I like your estimate! Unfortunately I've got really stuck on
figuring out how to use binary_search
with a vector that contains structs. Any hints anyone can give me
using my data types would be greatly appreciated.

Steve555 · Dec 5, 2008

Steve555 said:
Steve555 said:

Hi

Click to expand...

My vector ( typedef vector<movieRatingPair*> ratingsV ) contains
this struct:

Click to expand...

typedef struct{
unsigned short movieID;
unsigned short rating;
} movieRatingPair;

Click to expand...

a) You can ditch the typedef:

struct movieRatingPair {
unsigned short movieID;
unsigned short rating;
};

will do.

b) Is there a reason to use vector<movieRatingPair*> instead of
vector<movieRatingPair>? Note that dynamic allocation of objects usually
has some overhead: (1) the pointer (2) some size information about the
allocated object. You might end up using 12 bytes (4 in the pointer and 8
in the pointee) instead of 4 (which is what two unsigned shorts might give
you, but that is platform specific).

...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

Click to expand...

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Click to expand...

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Click to expand...

Provide a functor to compare movieRatingPair (or movieRatingPair*) by the
movieID. Then, sort the vector using std::sort() and then test for entries
with std::binary_search(). You can also find iterators to the entries with
std::lower_bound() and std::upper_bound().

[snip]

Best

Kai-Uwe Bux

Thanks Kai but I'm completely lost with getting binary_search to work
with a vector containing structs. They are already sorted upon
creation, so I didn't understand how re-sorting them would help.

peter koch · Dec 5, 2008

Thanks, I like your estimate! Unfortunately I've got really stuck on
figuring out how to use binary_search
with a vector that contains structs. Any hints anyone can give me
using my data types would be greatly appreciated.

One way would be to make your structs comparable, but to me this seems
wrong as you seemingly only want to compare part of the struct.

So write a function or a functor that compares the movieID's of two
movieRatingPairs. The functor way would be:

struct lessmovieIDs
{
bool operator()(movieRatingPair const& lhs,movieRatingPair const&
rhs)
// returns true iff the movieID of lhs is less than the movieID of
rhs
{
...
}
};

/Peter

Kai-Uwe Bux · Dec 5, 2008

Steve555 said:
Steve555 said:

Hi

Click to expand...

My vector ( typedef vector<movieRatingPair*> ratingsV ) contains
this struct:

Click to expand...

typedef struct{
unsigned short movieID;
unsigned short rating;
} movieRatingPair;

Click to expand...

a) You can ditch the typedef:

struct movieRatingPair {
unsigned short movieID;
unsigned short rating;
};

will do.

b) Is there a reason to use vector<movieRatingPair*> instead of
vector<movieRatingPair>? Note that dynamic allocation of objects usually
has some overhead: (1) the pointer (2) some size information about the
allocated object. You might end up using 12 bytes (4 in the pointer and 8
in the pointee) instead of 4 (which is what two unsigned shorts might
give you, but that is platform specific).

...and on average will contain 200 such structs;
The structs are always stored in the vector in ascending order of
movieID.

Click to expand...

I need to find a quick way to check if the vector contains a specific
movieID. I can not use a map instead of a vector, because the extra
few bytes they use creates a problem.(*)

Click to expand...

Simply iterating through the vector is proving a bit slow; I guess I
can write some binary search code, but just wanted to check I wasn't
re-inventing the wheel, or simply missing a much better way of dealing
with the whole thing.

Click to expand...

Provide a functor to compare movieRatingPair (or movieRatingPair*) by the
movieID. Then, sort the vector using std::sort() and then test for
entries with std::binary_search(). You can also find iterators to the
entries with std::lower_bound() and std::upper_bound().

[snip]

Best

Kai-Uwe Bux

Click to expand...

Thanks Kai but I'm completely lost with getting binary_search to work
with a vector containing structs. They are already sorted upon
creation, so I didn't understand how re-sorting them would help.

It wouldn't. I missed the part about the vector being sorted already.

Here is a bit of code that might help you:

#include <vector>
#include <algorithm>
#include <cassert>

struct rating {

unsigned short ID;
unsigned short value;

};

rating make_rating ( unsigned short id, unsigned short val ) {
rating result;
result.ID = id;
result.value = val;
return ( result );
}

bool compare_by_ID ( rating lhs, rating rhs ) {
return ( lhs.ID < rhs.ID );
}

int main ( void ) {
std::vector< rating > r;
r.push_back( make_rating( 1, 23 ) );
r.push_back( make_rating( 13, 10 ) );
r.push_back( make_rating( 24, 1 ) );
assert( std::binary_search( r.begin(), r.end(),
make_rating( 13, 0 ), &compare_by_ID ) );
assert( ! std::binary_search( r.begin(), r.end(),
make_rating( 10, 0 ), &compare_by_ID ) );
}

Best

Kai-Uwe Bux

Steve555 · Dec 5, 2008

It's no different from a vector that contains primitive types, once you
define a comparison operator.

/* Compares values by movie_id. Ratings are ignored. */
bool operator<(movie_rating_pair lhs, movie_rating_pair rhs) {
return lhs.movie_id < rhs.movie_id;
}

...

movie_rating_pair sought = { 837, 3 };
if (std::binary_search(pairs.begin(), pairs.end(), sought)) {
...
}

Thanks a lot chaps, those code examples really helped.
Using binary_search ( or lower_bound) and using values rather than
pointers, the time for 10 million queries has dropped from 92 to 6.5
seconds!
(You were right Peter)
And I'm using almost 1Gb less memory.
Now that's what I call a result!

Sorting an STL map	1	Oct 19, 2013
STL Vector Access	1	Feb 6, 2011
Best way to search through STL maps?	10	Nov 29, 2009
Copying from maps to lists or vectors	5	Apr 17, 2007
Modify STL multiset	2	Dec 7, 2010
how to achieve thread safety at element level in STL data structureslike hash map, vectors	7	Dec 4, 2007
A map of varying vectors	1	Mar 21, 2006
STL Container Choice	11	Mar 28, 2012

Speeding up access to STL vectors?

Steve555

Kai-Uwe Bux

Juha Nieminen

joecook

peter koch

Steve555

Steve555

Steve555

peter koch

Kai-Uwe Bux

Steve555

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads