J
James Aguilar
Hey all,
I have a bit of code that implements a suffix array, which is essentially an
array of iterators or indexes into a string. The array starts out like this
String: ABAB
Array: 0123
But then gets sorted
Array: 2031
2 AB
0 ABAB
3 B
1 BAB
All of this is well and good, but I have a little problem. Currently, I am
using the STL's sort algorithm and providing a custom comparison function,
which ends up sorting this vector<string::iterator> alphabetically. Here is
the comparison and other associated functions:
bool itLess(string::const_iterator it1, string::const_iterator it2, int
skip)
{
advanceIterators(it1, it2, skip);
return (*it1 < *it2 || *it1 == '\0');
}
int advanceIterators(string::const_iterator &it1, string::const_iterator
&it2,
int skip)
{
string::const_iterator ref(it1);
advanceNoCheck(it1, it2, skip);
while (*it1 == *it2 && !(*it1 == '\0')) {
++it1;
++it2;
}
return it1 - ref;
}
void advanceNoCheck(string::const_iterator &it1, string::const_iterator
&it2,
int skip)
{
if (skip) {
it1 += skip;
it2 += skip;
}
}
The way the sorting works is that one method goes through the text string
and makes an iterator for each position, putting it into the array.
Afterwards, I call 'sort(m_array.begin(), m_array.end(), &itLess).' It
takes entirely too long.
I have profiled my code with gprof (I know, it's platform specific, but you
should be able to read the output). The code was compiled with all
optimizations turned on, and run on a text of 10 million characters that did
not have a high incidence of repetition. Here is the output of the
profiler:
% cumulative self self total
time seconds seconds calls Ks/call Ks/call name
int advanceIterators(string::const_iterator &, string::const_iterator &):
98.92 1196.55 1196.55 302747982 0.00 0.00
bool itLess(string::const_iterator, string::const_iterator):
0.42 1201.59 5.04 282747983 0.00 0.00
std::__unguarded_partition([template junk]):
0.21 1204.10 2.51 1074094 0.00 0.00
void advanceNoCheck(string::const_iterator &, string::const_iterator &):
0.09 1205.20 1.10 302747982 0.00 0.00
Clearly, my advanceIterators (and hence, all of my iterator comparison) is
far too slow. If I run my code without optimizations (that is, if I do not
inline iterator:perator ++ and iterator:perator *), you would see that
most of the time spent in advanceIterators is actually spent in those two
methods. Does anyone know a faster way to sort string::const_iterators
lexicographically?
- JFA1
I have a bit of code that implements a suffix array, which is essentially an
array of iterators or indexes into a string. The array starts out like this
String: ABAB
Array: 0123
But then gets sorted
Array: 2031
2 AB
0 ABAB
3 B
1 BAB
All of this is well and good, but I have a little problem. Currently, I am
using the STL's sort algorithm and providing a custom comparison function,
which ends up sorting this vector<string::iterator> alphabetically. Here is
the comparison and other associated functions:
bool itLess(string::const_iterator it1, string::const_iterator it2, int
skip)
{
advanceIterators(it1, it2, skip);
return (*it1 < *it2 || *it1 == '\0');
}
int advanceIterators(string::const_iterator &it1, string::const_iterator
&it2,
int skip)
{
string::const_iterator ref(it1);
advanceNoCheck(it1, it2, skip);
while (*it1 == *it2 && !(*it1 == '\0')) {
++it1;
++it2;
}
return it1 - ref;
}
void advanceNoCheck(string::const_iterator &it1, string::const_iterator
&it2,
int skip)
{
if (skip) {
it1 += skip;
it2 += skip;
}
}
The way the sorting works is that one method goes through the text string
and makes an iterator for each position, putting it into the array.
Afterwards, I call 'sort(m_array.begin(), m_array.end(), &itLess).' It
takes entirely too long.
I have profiled my code with gprof (I know, it's platform specific, but you
should be able to read the output). The code was compiled with all
optimizations turned on, and run on a text of 10 million characters that did
not have a high incidence of repetition. Here is the output of the
profiler:
% cumulative self self total
time seconds seconds calls Ks/call Ks/call name
int advanceIterators(string::const_iterator &, string::const_iterator &):
98.92 1196.55 1196.55 302747982 0.00 0.00
bool itLess(string::const_iterator, string::const_iterator):
0.42 1201.59 5.04 282747983 0.00 0.00
std::__unguarded_partition([template junk]):
0.21 1204.10 2.51 1074094 0.00 0.00
void advanceNoCheck(string::const_iterator &, string::const_iterator &):
0.09 1205.20 1.10 302747982 0.00 0.00
Clearly, my advanceIterators (and hence, all of my iterator comparison) is
far too slow. If I run my code without optimizations (that is, if I do not
inline iterator:perator ++ and iterator:perator *), you would see that
most of the time spent in advanceIterators is actually spent in those two
methods. Does anyone know a faster way to sort string::const_iterators
lexicographically?
- JFA1