std::map performance

S

Siemel Naran

jmoy said:
Thanks everybody for your interest. To eliminate possible overhead
from std::string, I used char * (code below). But the gain is only
about 5%. Profiling shows that both in the earlier and latter versions
most of the time is taken by the function _Rb_tree::lower_bound which
is called by map::lower_bound (in the original case this is in turn
called by map::eek:perator[]).

Take a look at lower_bound in tree.h (or some name like _tree.h or
stl_tree.h). It's pretty straightforward. Then compare it to your own
version and see what's different. Also, what does your profiler say about
functions that lower_bound calls, as maybe the bottleneck is here?
By the way my handcoded binary tree is extremely naive and I agree
that it will have a much worse worst case performance. But an average
case performance hit of 100% seems too high a cost for insurance.
Right.

Just to check whether the slowdown may be due to function call
overhead, I made my insert() an inline function. This actually led to
an increase in running time! Why might this be happening? Can it be
due to cache effects?

Maybe instruction page caching. But just a wild guess.


Instead of the code below, you could try tom_usenet's reading into a string
idea, which seems like it does the same thing, but looks simpler.
struct Cstring_less
{
bool operator()(const char *s, const char *t) const{
return strcmp(s,t)<0;
}
};
typedef std::map<const char *,long,Cstring_less> Map;
Map words;

const long maximum=std::numeric_limits<long>::max();

void insert(const char *s)
{
Map::iterator p=words.lower_bound(s);
if (p==words.end()||strcmp(p->first,s)!=0){//Not found
char *dup=new char[strlen(s)+1];
strcpy(dup,s);
std::pair<const char * const,long> tuple(dup,1L);
p=words.insert(p,tuple);
}else{//Found
if (p->second<maximum)
p->second++;
}
}
 
H

Howard Hinnant

Siemel Naran said:
P.J. Plauger said:
23.3.1.2 map element access

T& operator[](const key_type& x);

Returns: (*((insert(make_pair( x, T()))).first)).second.

Function operator[] has only to behave as if it were written this way.
Agreed. What part of the above can you eliminate if a particularly
thorough validation suite is checking for full conformance?

All of it? If T::T() has side effects then the STL way eliminates this.
But just as the return value optimization can ignore side effects in the
copy constructor, so should operator[]. Other than that, I can't think of
anything else. Do you know? (I may not check newsgroups till Tuesday or
Wednesday.)

<nod> See:

http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#334

for the LWG's intent.

-Howard
 
P

P.J. Plauger

Siemel Naran said:
P.J. Plauger said:
23.3.1.2 map element access

T& operator[](const key_type& x);

Returns: (*((insert(make_pair( x, T()))).first)).second.

Function operator[] has only to behave as if it were written this way.
Agreed. What part of the above can you eliminate if a particularly
thorough validation suite is checking for full conformance?

All of it? If T::T() has side effects then the STL way eliminates this.
But just as the return value optimization can ignore side effects in the
copy constructor, so should operator[]. Other than that, I can't think of
anything else. Do you know? (I may not check newsgroups till Tuesday or
Wednesday.)

<nod> See:

http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#334

for the LWG's intent.

Exactly. It took a defect report to clarify the LWG's intent because
the bald statement in the C++ Standard leaves next to no wiggle room.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
J

jmoy

I am a C programmer graduating to C++. As an exercise I wrote a
program to count the number of times that different words occur in a
text file. Though a hash table might have been a better choice, I
chose to use std::map. However, the program runs at less than half the
speed of an equivalent program that uses ordinary handcoded binary
trees. The result is not changed very much if I replace std::string as
the key with a simple string class of my own.

I ultimately figured out why this is so. std::map::find() is a general
algorithm that has to seperately check whether (key1<key2),
(key1>key2) or (key1==key2). Actually since one of these has to be
true it only needs to check the first two. Thus it calls strcmp()
twice for each node it traverses. However my handcoded version knows
about strcmp() which it calls only once and then just tests the saved
return value. Since strcmp() is about the most expensive operation
that takes place in the inner loop, this explains the factor of two
slowdown.

My measurements show that in reality strcmp() is called about 1.6
times more in the std::map version. This is because std::map's trees
are more balanced and hence fewer nodes have to be traversed on
average for each call to find().

Given this, shouldn't my implementation have provided a specialization
of std::map for the case where the key is a std::string or char*?
 
P

P.J. Plauger

jmoy said:
(e-mail address removed) (jmoy) wrote in message

I ultimately figured out why this is so. std::map::find() is a general
algorithm that has to seperately check whether (key1<key2),
(key1>key2) or (key1==key2). Actually since one of these has to be
true it only needs to check the first two. Thus it calls strcmp()
twice for each node it traverses.

Nope.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
J

jmoy

P.J. Plauger said:
Nope.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Oops, I was wrong and you are right. However my basic point remains.
The actual loop in lower_bound from my implementation (GNU libstdc++3)
is

while (__x != 0)
if (!_M_key_compare(_S_key(__x), __k))
__y = __x, __x = _S_left(__x);
else
__x = _S_right(__x);

return iterator(__y);

While you are right that the loop does not call strcmp() twice on each
node, it pays a cost for doing so by going deeper into the tree
unnecessarily even when _S_key(__x)==k (where an optimal
implementation would have returned) and calling strcmp() on the deeper
nodes. Either way the inefficiency lies in throwing away the
information contained in the three-valued (-ve,0,+ve) strcmp() by
converting it to the two-valued (true,false) predicate key_compare().

To reiterate, on my test data strcmp() is called 249860018 times by
the stl::map version and 150069159 times by the handcoded version and
I don't think that there is a bug in my counter code.
 
S

Siemel Naran

P.J. Plauger said:
Exactly. It took a defect report to clarify the LWG's intent because
the bald statement in the C++ Standard leaves next to no wiggle room.

Howard, thanks for the link.

Silly question. What is LWG?
 
S

Siemel Naran

jmoy said:
Oops, I was wrong and you are right. However my basic point remains.
The actual loop in lower_bound from my implementation (GNU libstdc++3)
is

while (__x != 0)
if (!_M_key_compare(_S_key(__x), __k))
__y = __x, __x = _S_left(__x);
else
__x = _S_right(__x);

return iterator(__y);

While you are right that the loop does not call strcmp() twice on each
node, it pays a cost for doing so by going deeper into the tree
unnecessarily even when _S_key(__x)==k (where an optimal
implementation would have returned) and calling strcmp() on the deeper
nodes. Either way the inefficiency lies in throwing away the
information contained in the three-valued (-ve,0,+ve) strcmp() by
converting it to the two-valued (true,false) predicate key_compare().

To reiterate, on my test data strcmp() is called 249860018 times by
the stl::map version and 150069159 times by the handcoded version and
I don't think that there is a bug in my counter code.

Interesting. One question I have is: Is std::map inherently inefficient
because the compare function returns true or false rather than an integer?
With true and false, how do we test for equality? Seems we have to call
(!less(a,b) && !less(b,a)), which means 2 calls to the less function. I'm
sure there were performance oriented people on the standards committee, and
they thought through the issue.

At the same, comment on my other favorite function: std::sort. Is the
version using a less argument returning true or false less efficient than
the version return an integer? I sure hope not!

Jmoy, to test the your premise, please consider trying the following
experiment -- though maybe better experiments exist. Make your stream of
words contain only unique words. Then the balanced binary tree should have
fewer calls to std::strcmp than your hand coded tree, and should be faster.
Now some detail. You say:
To reiterate, on my test data strcmp() is called 249860018 times by
the stl::map version and 150069159 times by the handcoded version and

I don't know how you would generate 150,069,159 words (maybe by counters
like when we count from 1 to 9 then 10 then 11 to 99 then 100 then 101 to
199 etc), and also worry that the std::map<std::string word, int count>
might cause your system to run out of memory. Maybe you could just test on
fewer words.
 
H

Howard Hinnant

Jeff Flinn said:
Library Working Group(?)

Right. When the committee meets, it spends most of its time divided up
into working groups. The current active working groups are Core,
Library and Evolution. Core and Library have an ongoing issues lists
posted at:

http://www.open-std.org/jtc1/sc22/wg21/

Issues have various status indicating what has (or has not) been done
with them, from New (nobody has looked at it), to TC (it is now an
official part of the standard).

-Howard
 
J

jmoy

Siemel Naran said:
Interesting. One question I have is: Is std::map inherently inefficient
because the compare function returns true or false rather than an integer?
With true and false, how do we test for equality? Seems we have to call
(!less(a,b) && !less(b,a)), which means 2 calls to the less function. I'm
sure there were performance oriented people on the standards committee, and
they thought through the issue.

As far as I understand, std::map is inherently inefficient whenever
computing (k1<k2) and (k2<k1) together is significantly cheaper than
computing them seperately. As for why it was designed this way, I have
no guesses and I would like to know what other people on this group
think.
At the same, comment on my other favorite function: std::sort. Is the
version using a less argument returning true or false less efficient than
the version return an integer? I sure hope not!

Not for the implementations of sort that I can think of, but I will
have to check.
Jmoy, to test the your premise, please consider trying the following
experiment -- though maybe better experiments exist. Make your stream of
words contain only unique words. Then the balanced binary tree should have
fewer calls to std::strcmp than your hand coded tree, and should be faster.

Agreed. But this is a very special case. And if I rewrote my handcoded
version to use balanced trees it would be as efficient as std::map in
this case and more efficient in all other cases.
Now some detail. You say:

I don't know how you would generate 150,069,159 words (maybe by counters
like when we count from 1 to 9 then 10 then 11 to 99 then 100 then 101 to
199 etc), and also worry that the std::map<std::string word, int count>
might cause your system to run out of memory. Maybe you could just test on
fewer words.

My test data--which consists of the contents of a fairly large web
site--contains about 130,000 unique words. 150,069,159 is the total
number of times strcmp() is called. The peak memory usage of the
program (code+data) is only about 7MB which is not much for my desktop
machine.
 
P

P.J. Plauger

As far as I understand, std::map is inherently inefficient whenever
computing (k1<k2) and (k2<k1) together is significantly cheaper than
computing them seperately. As for why it was designed this way, I have
no guesses and I would like to know what other people on this group
think.

You need to make both tests at most once per search, and then only
for a container that requires unique keys. So the deeper the tree,
the less extra overhead -- log(N)+1 vs. log(N) ain't no thang.

As for why it was designed that way, there's a certain minimalism
about a strict weak ordering, which is based purely on the ability
to test whether one key is less than another. A predicate that
returns a three-way result may have to perform two tests internally
to develop the ternary result. Yes, you luck out if all you have to
do is a numeric subtract, but that is far from the general case.

So comparing the number of tests you see doesn't mean much if
upwards of half are hidden in one case.

Moreover, tests are often cheap, so simply counting tests is
often a poor indicator of overall performance.

On top of everything else, if you *really* want a fast binary
tree you probably shouldn't even try to balance it. Go read
Knuth. His very first observation is that trees almost always
grow up to be reasonably balanced with no heroics at all in
the building. You want a (nearly) balanced tree in a professional
library to ensure against pathological cases -- as eagerly
supplied by the authors of validation suites --and to guarantee
worst-case behavior in the field.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
S

Siemel Naran

P.J. Plauger said:
You need to make both tests at most once per search, and then only
for a container that requires unique keys. So the deeper the tree,
the less extra overhead -- log(N)+1 vs. log(N) ain't no thang.

The OP reports his hand coded tree is twice as fast. That's big!
As for why it was designed that way, there's a certain minimalism
about a strict weak ordering, which is based purely on the ability
to test whether one key is less than another. A predicate that
returns a three-way result may have to perform two tests internally
to develop the ternary result. Yes, you luck out if all you have to
do is a numeric subtract, but that is far from the general case.

So comparing the number of tests you see doesn't mean much if
upwards of half are hidden in one case.

Moreover, tests are often cheap, so simply counting tests is
often a poor indicator of overall performance.

On top of everything else, if you *really* want a fast binary
tree you probably shouldn't even try to balance it. Go read
Knuth. His very first observation is that trees almost always
grow up to be reasonably balanced with no heroics at all in
the building. You want a (nearly) balanced tree in a professional
library to ensure against pathological cases -- as eagerly
supplied by the authors of validation suites --and to guarantee
worst-case behavior in the field.

Great answer. It seems a stream of words is inherently random, so you don't
need balancing.
 
S

Siemel Naran

Howard Hinnant said:
Right. When the committee meets, it spends most of its time divided up
into working groups. The current active working groups are Core,
Library and Evolution. Core and Library have an ongoing issues lists
posted at:

What is the Evolution group about?
 
P

P.J. Plauger

The OP reports his hand coded tree is twice as fast. That's big!

Moderately. But note what I said next:


Here, the alternate form lucks out because the predicate derives
from strcmp, which is expensive to compute (for longish strings)
but which gives a ternary result essentially for free. Mazeltov.

If a factor of two is so important to you that you think it's
worth reimplementing a red-black tree then go ahead. My bet is
you'll spend more time debugging, and chasing mysterious runtime
crashes, than you'll ever save by avoiding that "big" factor of two.
But don't let me stop you.
Great answer. It seems a stream of words is inherently random, so you don't
need balancing.

Indeed. If you're going to hand optimize, it's always best to first
try to do *less* before you try to do *more*.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
D

David A. Ferguson

Here, the alternate form lucks out because the predicate derives
from strcmp, which is expensive to compute (for longish strings)
but which gives a ternary result essentially for free. Mazeltov.

Is it really luck? Most of my comparable classes are composed
of strings and numbers which both give ternary results for free.
Even the second tier classes use the lower classes for comparison
and hence have the free ternary result.

My experience is not as broad as others but it appears that the
'for free' ternary result is more the rule than the exception.

Every design decision has trade offs and there is no free lunch,
but I have often though that ternary based STL would have been
slightly better.

Of course it is water under the bridge now and as you aptly pointed
out a factor of x2 is hardly worth reinventing the wheel.

Cheers...David
 
P

P.J. Plauger

Is it really luck? Most of my comparable classes are composed
of strings and numbers which both give ternary results for free.
Even the second tier classes use the lower classes for comparison
and hence have the free ternary result.

My experience is not as broad as others but it appears that the
'for free' ternary result is more the rule than the exception.

Hasn't been my experience. But more to the point, ternary wins
significantly only if:

1) the tree isn't very deep

2) the tree tolerates only unique entries

3) the cost of the ternary compare is rather less than twice the
cost of a single less-than compare, and

4) the cost of compares dominates the cost of fiddling nodes
in a tree, whether to erase, insert, or find
Every design decision has trade offs and there is no free lunch,
but I have often though that ternary based STL would have been
slightly better.

Having written and rewritten the STL containers and algorithms
several times now, in two or three different dialects, I strongly
believe in the elegance of strict weak ordering. I say this
knowing full well that there are cases, such as the one in this
thread, where certain optimizations can improve performance.
Of course it is water under the bridge now and as you aptly pointed
out a factor of x2 is hardly worth reinventing the wheel.

My sentiment, obviously.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
J

jmoy

P.J. Plauger said:
You need to make both tests at most once per search, and then only
for a container that requires unique keys. So the deeper the tree,
the less extra overhead -- log(N)+1 vs. log(N) ain't no thang.
I don't think so. When you are traversing a tree with unique keys to
find a particular key, if you do both the tests you can terminate the
traversal early when you find an exact match. The cost of not doing so
will be higher for a deeper tree. This cost is zero for two corner
cases: an input consisting entirely of distinct words and an input
consisting entirely of repetitions of a single word. The worst case
will lie somewhere in between. Though I have not done a formal
analysis I think that the extra cost in the worst case will not be an
additive but a multiplicative one.
In short: not log(N)+1 but C.log(N) [my guess]
... Yes, you luck out if all you have to
do is a numeric subtract, but that is far from the general case.
I think this case is quite general. First, this is the case for all
fundamental types. As my example shows, this is also the case for
strings, and they are quite frequently used as keys. Moreover, this is
even the case when comparison for a complicated type is based on
lexicographic ordering and comparison of simple types: for example
comparing two names by comparing the strings representing the last
names and if they are equal then comparing the strings representing
the first names. The last I think is by far the most general case:
even the comparison of strings is just lexicographic ordering of
arrays of characters.
So comparing the number of tests you see doesn't mean much if
upwards of half are hidden in one case.
See above.
Moreover, tests are often cheap, so simply counting tests is
often a poor indicator of overall performance.

Something being cheap is not good enough reason for using it
wastefully. Where a comparison based on ternary tests is optimal (and
I have argued above that it is likely to be generally so), using it
imposes no overheads on other parts of std::map, so why should we not
use it. More so since while an individual comparison is cheap, taken
together comparisons make up a large proportion of the running times
of find and insert which in turn are likely to be the most frequently
used operations on map.
On top of everything else, if you *really* want a fast binary
tree you probably shouldn't even try to balance it. Go read
Knuth.

My handcoded version was unbalanced (thanks in equal parts to my
laziness and Knuth's book). But that is beside the point anyway.
 
J

jmoy

P.J. Plauger said:
My sentiment, obviously.

The factor of x2 in my experiments is an overstatement, since for most
applications inserting and searching maps in a binary tree will take
only a small proportion of their time. But even then I think that the
STL map should allow users to provide a ternary comparison criteria,
perhaps with template magic to make binary comparisons the default in
order to maintain compatibility. Then we can reinvent the wheel for
situations where it matters while making the change invisible for
users who don't care.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,169
Messages
2,570,920
Members
47,464
Latest member
Bobbylenly

Latest Threads

Top