Comparing Multiple Strings More Efficiently


Brad

I think there are better (more efficient) ways to do this, but I've
only ever needed this simple approach:

void compare(const std::vector<std::string>& strings)
{
    // Normally, this is randomly generated. It is a C style string.
    char the_string[4] = "abc";

    std::vector<std::string>::const_iterator send(strings.end());

    for (std::vector<std::string>::const_iterator sit = strings.begin();
         sit != send; ++sit)
    {
        if (the_string == *sit)
            std::clog << "MATCH:\t" << *sit << std::endl;
    }
}

Basically, I'm comparing a randomly generated string to a vector of
strings by iterating through the vector (as seen above). Here is the
problem: The vector may have one or millions of strings and execution
time slows considerably when I go above a few thousand strings.

Is there a more efficient way to approach this? My code works fine on
smaller vectors, I just never expected to do millions and it may not
be possible to do that many as quickly as a few, but I wanted to ask.
Perhaps there is something in algorithm that I could use. Just point
me to it. I prefer standard C++ wherever possible.

Thanks,

Brad
 

Vaclav Haisman

Brad wrote, On 21.6.2010 18:07:
I think there are better (more efficient) ways to do this, but I've
only ever needed this simple approach:

void compare(const std::vector<std::string>& strings)
{
    // Normally, this is randomly generated. It is a C style string.
    char the_string[4] = "abc";

    std::vector<std::string>::const_iterator send(strings.end());

    for (std::vector<std::string>::const_iterator sit = strings.begin();
         sit != send; ++sit)
    {
        if (the_string == *sit)
            std::clog << "MATCH:\t" << *sit << std::endl;
    }
}

Basically, I'm comparing a randomly generated string to a vector of
strings by iterating through the vector (as seen above). Here is the
problem: The vector may have one or millions of strings and execution
time slows considerably when I go above a few thousand strings.

Is there a more efficient way to approach this? My code works fine on
smaller vectors, I just never expected to do millions and it may not
be possible to do that many as quickly as a few, but I wanted to ask.
Perhaps there is something in algorithm that I could use. Just point
me to it. I prefer standard C++ wherever possible.
Your description seems rather too cut down, hiding the real reasons for this
question. For a good decision it is important to know how much the strings
vector mutates, etc.

Anyway, try taking a look at Trie <http://en.wikipedia.org/wiki/Trie>
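For readers unfamiliar with the structure Vaclav points to, here is a minimal sketch of a trie in C++ (not code from the thread; the lower-case-ASCII alphabet, node layout, and class name are assumptions for illustration):

```cpp
#include <array>
#include <memory>
#include <string>

// Minimal trie over 'a'..'z'. Lookup cost is proportional to the
// length of the key, independent of how many strings are stored.
class Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> next;
        bool is_word = false;
    };
    Node root_;
public:
    void insert(const std::string& s) {
        Node* n = &root_;
        for (char c : s) {                     // assumes lower-case ASCII
            auto& slot = n->next[c - 'a'];
            if (!slot) slot = std::make_unique<Node>();
            n = slot.get();
        }
        n->is_word = true;
    }
    bool contains(const std::string& s) const {
        const Node* n = &root_;
        for (char c : s) {
            const auto& slot = n->next[c - 'a'];
            if (!slot) return false;
            n = slot.get();
        }
        return n->is_word;
    }
};
```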
 

Brad

Brad wrote, On 21.6.2010 18:07:
I think there are better (more efficient) ways to do this, but I've
only ever needed this simple approach:
void compare(const std::vector<std::string>& strings)
{
    // Normally, this is randomly generated. It is a C style string.
    char the_string[4] = "abc";

    std::vector<std::string>::const_iterator send(strings.end());

    for (std::vector<std::string>::const_iterator sit = strings.begin();
         sit != send; ++sit)
    {
        if (the_string == *sit)
            std::clog << "MATCH:\t" << *sit << std::endl;
    }
}
Basically, I'm comparing a randomly generated string to a vector of
strings by iterating through the vector (as seen above). Here is the
problem: The vector may have one or millions of strings and execution
time slows considerably when I go above a few thousand strings.
Is there a more efficient way to approach this? My code works fine on
smaller vectors, I just never expected to do millions and it may not
be possible to do that many as quickly as a few, but I wanted to ask.
Perhaps there is something in algorithm that I could use. Just point
me to it. I prefer standard C++ wherever possible.

Your description seems rather too cut down, hiding the real reasons for this
question. For a good decision it is important to know how much the strings
vector mutates, etc.

Anyway, try taking a look at Trie <http://en.wikipedia.org/wiki/Trie>

Thanks for the replies. The vector does not change. The reason for 1
or 1 million is that users may supply one comparison string as an
argument or a text file that I read line by line. So whether it is 1
or 1 million, I push the string (or strings) into a vector and then
compare them. I'll try the suggestions provided.

Thanks again,

Brad
 

Jorgen Grahn

I think there are better (more efficient) ways to do this, but I've
only ever needed this simple approach:

void compare(const std::vector<std::string>& strings)
{
    // Normally, this is randomly generated. It is a C style string.
    char the_string[4] = "abc";

    std::vector<std::string>::const_iterator send(strings.end());

    for (std::vector<std::string>::const_iterator sit = strings.begin();
         sit != send; ++sit)
    {
        if (the_string == *sit)
            std::clog << "MATCH:\t" << *sit << std::endl;
    }
}

Basically, I'm comparing a randomly generated string to a vector of
strings by iterating through the vector (as seen above).

Or to be more precise, you're doing std::find(a, b, the_string)!=b.
"Compare" doesn't really describe it well.
Here is the
problem: The vector may have one or millions of strings and execution
time slows considerably when I go above a few thousand strings.

Is there a more efficient way to approach this? My code works fine on
smaller vectors, I just never expected to do millions and it may not
be possible to do that many as quickly as a few, but I wanted to ask.
Perhaps there is something in algorithm that I could use. Just point
me to it. I prefer standard C++ wherever possible.

(Then why didn't you use std::find above?)
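The equivalence Jorgen points out, sketched as a standalone function (illustrative, not code from the thread):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// The hand-written loop amounts to a linear std::find over the
// vector; same O(n) cost, but the intent is clearer.
bool matches(const std::vector<std::string>& strings,
             const std::string& needle)
{
    return std::find(strings.begin(), strings.end(), needle)
           != strings.end();
}
```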

If you cannot gain anything by precalculating something over 'strings'
(which you could if the same 'strings' is searched for many different
values of 'the_string'), then I don't think there's anything you can do.

If you wanted to find 'the_string' as a substring in any of 'strings',
then there are algorithms which precalculate on 'the_string' and thus
save time at each iteration. One such algorithm is Knuth-Morris-Pratt,
I think. And I'd expect a regex library to use such techniques.
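As a later-standard illustration of the precompute-on-the-pattern idea: C++17 ships std::boyer_moore_searcher. It is a different algorithm than Knuth-Morris-Pratt, but the principle is the same, namely paying a setup cost once per pattern rather than once per comparison:

```cpp
#include <algorithm>
#include <functional>
#include <string>

// Build the searcher once per pattern; in real use, reuse it across
// many haystacks so the precomputation is amortized. Requires C++17.
bool contains_substring(const std::string& haystack,
                        const std::string& pattern)
{
    std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
    return std::search(haystack.begin(), haystack.end(), searcher)
           != haystack.end();
}
```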

And by the way, when you are multiposting like you did to
comp.lang.c++.moderated, *please say so* so we don't have to waste our
time replying. You already got better answers there.

/Jorgen
 

Brad

Are you using vector reserve? You should be.

Stephen Howe

Yes, I am. reserve speeds things up a lot. I've used the sorted
binary_search idea (thanks Paavo) and it is much faster now. One
string requires about 25 seconds. One million requires 120 seconds.
This is acceptable performance and many times faster than before.

Apologies for the comp.lang.c++ moderated post. I did not realize my
post was accepted there so after waiting a few hours I posted here.

Thanks again,

Brad
 

Brad

If the vector of strings is more or less constant during the program
life-time, then you can keep it sorted (see std::sort()) and use
std::binary_search() or std::lower_bound() instead of the loop.

If the vector frequently changes, then you can use std::set<std::string>
instead, which keeps the strings sorted all the time automatically, and
use the find() member function of std::set() for lookup.

hth
Paavo
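Paavo's first suggestion, sketched (the function names are illustrative, not from the thread):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sort once, after the vector has been filled...
void prepare(std::vector<std::string>& strings)
{
    std::sort(strings.begin(), strings.end());
}

// ...then each lookup is O(log n) instead of a full linear scan.
// binary_search requires the range to be sorted.
bool lookup(const std::vector<std::string>& strings,
            const std::string& needle)
{
    return std::binary_search(strings.begin(), strings.end(), needle);
}
```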

For the sake of comparison, I tried boost::unordered_set as well. It
is slightly slower on smaller sets, but significantly faster on larger
sets:

1 string: unordered_set (36 seconds), binary_search (28 seconds)
1,000,000 strings: unordered_set (60 seconds), binary_search (127 seconds)

Either approach is much faster than my initial loop, but unordered_set
seems to be best (for this problem at least) when the set grows very
large.
 

Jorgen Grahn

Apologies for the comp.lang.c++ moderated post. I did not realize my
post was accepted there so after waiting a few hours I posted here.

No problem, but you should have mentioned it.

There *is* a longer delay over there, so if you have to have whole
ping-pong style dialogues to get to the core of the problem, c.l.c++
will typically be much faster.

/Jorgen
 
