History of and support for std::basic_string::back()


James Kanze

[...]
If you want to allow CoW implementations (such as the one used by
g++), then non-const operator[] (and non-const begin() and end()) must
be able to invalidate iterators, at least the first time they are
called. (And IMHO, if the standard bans CoW implementations, it is
broken.) This was recognized in previous versions of the standard.
CoW requires multi-threaded synchronization deep in the innards of
std::string, where it would be totally on the wrong level.

It's possible, at least in theory, to use atomic types for the
counters. And the synchronization is there so that the client
code is unaware of the sharing.
In this regard, CoW seems like a typical premature optimization attempt
which may easily become a pessimization on new hardware (thinking NUMA),
so why are you so keen on supporting it?

It's existing practice. It was introduced because it was needed
in some applications. I used it in my pre-standard String
class, because I'd had a performance problem that it solved.
And this is a library; you can't tell the user to implement it
if his profiler shows that he needs it.
If banning CoW coincidentally eliminates obscure UB in things like
var[0] == var[1], then I'm all for it.

In C++98, there was no problem with var[0] == var[1]. There was
in the CD2 preceding the standard, but the French national body
spotted it, and raised the issue in its comments. The resulting
wording was far from perfect, but it did make the intent fairly
clear.
 

James Kanze

As far as I have understood, this is not really the case, unfortunately.
Lock-less synchronization might be better than locks, but it still has to
use proper memory barriers to avoid reordering by the hardware. Memory
barriers can become quite expensive, especially on NUMA.

The synchronization certainly isn't free. But the alternative
is deep copy, and most memory allocators will require
synchronization as well, and come at a much larger cost than the
atomic counting which can be used in CoW. (In our application,
at least, CoW is a definite win.)
 

Öö Tiib

Ok, I see your point now. You are saying that the C++ standard library
can customize itself according to the hardware (while compiled or
installed or loaded into a process). An interesting idea, but seems very
tricky in practice. Do you have any links to any C++ library
implementations actually doing this?

I did not mean something absurdly clever like that. My vision was
similar to the checked iterators that have been in gcc and msvc for ages.
Toggled by macro. No binary compatibility. If you want to see how
something really optimizes itself to the platform, then the Qt and Boost
sources are nice reading material.

The C++ standard library is mostly free of those concerns, since it is
part of the implementation and the implementation targets a particular
platform. It comes with the compiler and may contain whatever
platform-specific voodoo it likes, as long as it fulfills the interface
requirements given by the standard.

I think the standard does not even explicitly say that a text file named
"string" must actually exist somewhere. We write "#include <string>" to
use 'std::basic_string'.
The problem is that the string header is typically header-only, which
means it is compiled into application code. In libstdc++, the atomic
operations on refcounter are directly in the header file. On the other
hand, this issue affects ABI, so all code loaded in a process must agree
whether it uses CoW or not. But not all software is compiled directly on
the target machine; even on Linux, it is common to have some shared
libraries compiled elsewhere. In particular, an application compiled on
another Linux box must work together with the libstdc++ installed on the
target machine. IOW, deciding such issues at compile time is far too
early. And we do not need a new ABI-breaking compiler switch, especially
for something so mundane that, as you rightly note, it should not deserve
any programmer attention at all. Note that things like -arch are easier
because they don't affect ABI.

Yes, libstdc++ is a very nice effort; it tries to implement a portable
standard library, for free, in C++. The issues of developing a portable
standard library are orthogonal to the question of whether there may be
a standard library on some platform that implements CoW strings.

Note that some parts of the C++11 standard library cannot be implemented
in plain C++ with no support from the platform or compiler (like <thread>
or <atomic>) anyway. Some other parts can be written (like <string>), but
that does not mean those must be implemented in C++.

I understand the term "ABI" as "application binary interface", that is,
the interface between an application and the operating system, or between
one application and another. Unfortunately C++ does not define anything
of it; it does not even have modules yet. Don't you see we have nothing?
Our API with the environment is: the parameters of 'main', 'cin', 'cout',
'cerr', 'clog', 'system("pause")', and the return value of 'main'. ;-(

About cross-compiling, I did not understand your point. We *always* do
it for embedded systems. However, we use the compiler for the target
platform *and* the library for the target platform together. How else?
One could use a non-inlined function or a global static to make this
decision on run-time. But non-inlined functions and global statics are
again known performance hazards. Besides, if libstdc++ introduced such a
feature, it would break ABI compatibility with previous libstdc++
versions.

These are again concerns that C++ does not deal with at all. :( Even
name mangling (to allow linkage with other things) is mentioned nowhere.
So if you have two compilers whose compiled std::string can somehow be
exchanged between modules, then that is achieved with efforts outside
of C++. Please do not say that the lack of CoW in string somehow helps us
out of that hole here, because it does not.
FWIW, LLVM seems to have given up CoW strings. From
http://libcxx.llvm.org/ : "For example, it is generally accepted that
building std::string using the "short string optimization" instead of
using Copy On Write (COW) is a superior approach for multicore machines
(particularly in C++11, which has rvalue references). Breaking ABI
compatibility with old versions of the library was determined to be
critical to achieving the performance goals of libc++."

Yes, LLVM is again a great effort. Since most of the money in it comes
from Apple's pockets, it goes where Apple wants it to go. Apple wants it
to work well with its legacy Objective-C libraries, and so it has to
work. It should not affect everyone else whether string in C++ compilers
for the OS X/iOS platforms has CoW, short strings, or neither. It should
be Apple's concern to measure and to decide.
The C++ standard library is supposed to provide a good general purpose
implementation of all its features. There are always specific corner
cases where it does not work, there is no silver bullet for everything.
It seems however that massive multithreading will become the new norm,
not a corner case. And a CoW string class looks exactly like an extra
custom library for a specific usage case (large strings, copied often),
probably needing extra care in thread passing.

The standard library must be present with the implementation and work
with the implementation. It should deal with platform-specific problems
internally, in itself. It should not delegate those back to developers.
 

James Kanze

[...]
Ok, I see your point now. You are saying that the C++ standard library
can customize itself according to the hardware (while compiled or
installed or loaded into a process). An interesting idea, but seems very
tricky in practice. Do you have any links to any C++ library
implementations actually doing this?

It's not dynamic customization, and it isn't only according to
the hardware. Different programs use strings in different ways.
CoW can be an important optimization for many of them. An
implementation has to "guess" what it thinks is the best
solution for what it thinks are the most typical utilisations of
the class. CoW is certainly the right choice for some uses, and
if the library authors think that those uses represent the
majority of its clients, they will (or should) use CoW.
The problem is that the string header is typically header-only, which
means it is compiled into application code. In libstdc++, the atomic
operations on refcounter are directly in the header file. On the other
hand, this issue affects ABI, so all code loaded in a process must agree
whether it uses CoW or not.

All libraries must agree. You can break both g++ and VC++
simply by changing a few options in your compiler. Which is
a shame, but that's the current situation.
FWIW, LLVM seems to have given up CoW strings. From
http://libcxx.llvm.org/ : "For example, it is generally accepted that
building std::string using the "short string optimization" instead of
using Copy On Write (COW) is a superior approach for multicore machines
(particularly in C++11, which has rvalue references).

Saying something is "generally accepted" is often an excuse for
not doing it right. (I'm not saying this is the case here; LLVM
might feel that their customers are best supported by the short
string optimization.) But globally, the short string optimization
only applies when the client code uses a lot of short strings.
(And even then, it depends on how they are used---at least as
implemented in VC++, it requires an if for every access, which
means that code which does a lot of indexing into strings will
run slower.)

With regards to the assertion: it is generally accepted that
implementing CoW correctly using atomic counters (rather than
mutexes) requires a great deal of skill, and that it is easy to
get wrong. It has nothing to do with what is better for the
users of the library, and everything to do with the fact that
implementing lock free algorithms requires special skills, which
often aren't (or at least weren't) present in the teams writing
libraries. The argument against CoW is that it is too easy to
get wrong. (The g++ implementation has one small bug, for
example, although I'm willing to bet that no one has ever
actually encountered it.)
Breaking ABI compatibility with old versions of the library
was determined to be critical to achieving the performance
goals of libc++."

A lesson taught by experience, no doubt.
The C++ standard library is supposed to provide a good general purpose
implementation of all its features. There are always specific corner
cases where it does not work, there is no silver bullet for everything.
It seems however that massive multithreading will become the new norm,
not a corner case. And a CoW string class looks exactly like an extra
custom library for a specific usage case (large strings, copied often),
probably needing extra care in thread passing.

Actually, it is the small string optimization which seems to be
the corner case. For most applications I've seen, it is
a pessimization.

The C++ standard is supposed to provide implementers enough
liberty to implement what they think best for their user
community. If some implementer thinks that systematic deep copy
(with or without sso) is best for their community, perhaps
because there are no cheap atomic counters, then they should be
free to do so. If they think CoW is best, they should have that
liberty as well.
 

Öö Tiib

The target platform is x86_64. This does not mention the number of cores or
NUMA configuration, nor should it, in my mind. The implementation should
take care that the standard library and the compiled user programs work
reasonably well, regardless of the number of cores and such, on the
machine where they are actually executed.

You are correct if exactly the same compiled binary must run on a large
multi-core x86_64 system and on whatever old thing has the AMD64
instruction set (fully specified August 2000). Such a binary can't
be optimal for everything under such wall-to-wall requirements.
However ... you've got quite an unusual goal there.

Usually we put some lower limit on the target architecture where our
program works. If it happens to need NUMA, then we build for NUMA.
If it is fine on a single old AMD64 core, then we can even compile it
as single-threaded. Why should we care if most cores are idle on some
huge box, if it runs fine on one of them?
 

Jorgen Grahn

You are correct if exactly the same compiled binary must run on a large
multi-core x86_64 system and on whatever old thing has the AMD64
instruction set (fully specified August 2000). Such a binary can't
be optimal for everything under such wall-to-wall requirements.
However ... you've got quite an unusual goal there.

In my experience that's how it usually ends up. E.g. you don't have
hundreds of Debian Linux binary distributions; you have one for x86,
one for x86_64 aka amd64 and so on.

My main desktop is an early (~2003) single-core AMD64. I'm happy
it hasn't been declared obsolete (because it works very well) but
at the same time I can't help wondering how much better it would
run if libraries and compilers didn't take SMP into account.

/Jorgen
 

James Kanze

All the synchronization finally comes down to the hardware level, and
the hardware ultimately decides how large an overhead it causes.
Presumably the hardware makes this as fast as possible, so if something
is not needed on the given hardware, it would not cause any overhead.

The hardware has to deal with compromises too. It's trivial to
design hardware where the cost of synchronization is zero (and
this has been the case for most of the hardware I've worked on,
throughout most of my career). Just use a single core, and
don't pipeline anything. Most hardware is designed to optimize
the cases which don't require synchronization, however, with the
result that synchronization ends up being expensive.
On the Intel architecture, I think this used to be the cost of the LOCK
assembler prefix. On some hardware the LOCK prefix would be just a no-op,
and the only overhead would be a number of unneeded LOCK prefix bytes in
the executable code, which make it marginally slower to pump through the
CPU. Interestingly enough, Intel has defined LOCK as redundant for some
instructions like XCHG, meaning that even this overhead is not there
any more.

No. It assumes that the only use of these instructions will be
cases where the lock is needed. As implemented on most Intel processors
(I think---I've not checked all of the documentation), the LOCK
implies a memory fence: all preceding writes will finish before
starting the instruction it controls, and all writes in the
instruction will finish before the next instruction is executed.
(If it doesn't, then you still need the fences in a lock free
algorithm.)
 
