PATCH: Speed up direct string concatenation by 20+%!

L

Larry Hastings

This is such a long posting that I've broken it out into sections.
Note that while developing this patch I discovered a Subtle Bug
in CPython, which I have discussed in its own section below.

____________
THE OVERVIEW

I don't remember where I picked it up, but I remember reading years
ago that the simple, obvious Python approach for string concatenation:
x = "a" + "b"
is slow. Obviously it's fine for the simple case above, but if you're
adding together twenty-five things:
x = a + b + c + d + ... + y
the performance is terrible, as the interpreter creates temporary
objects
for each intermediate result. (a + b, then (a + b) + c,
then ((a + b) + c) + d, and so on.)

The suggested idiom for fast string concatenation was this:
x = "".join([a, b, c, d, ... y])
This works fine. But I don't like this approach, not only because it's
kind of ugly and inobvious, but because it violates the "one obvious
way
to do it" principle.

I mentioned all this to a friend of mine (Mike Thornburgh), and told
him matter-of-factly that I'd like to fix this, but the design of
Python
simply precluded it. He said "no it doesn't!" and proceeded to
describe
in great detail an approach to permit fast string concatenation using
+.
I have implemented what he suggested--with only slight changes--and
present it to you below. I only had to touch four files, and only did
major surgery on one. The resulting Python passes the test suite as
well
as my unmodified locally-built Python 2.5. (I didn't install the
external
source trees for stuff like SQLite, so it fails those tests.)

Note that times have changed; using Python 2.5, a + b + c isn't *that*
much slower than "".join([]). But is is still slower. Or, was--with
my
patch, a + b + c becomes *the fastest method* for string concatenation.
Hooray!

(It didn't pay off as *much* as I'd hoped, but it's still an
improvement.)

_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:

* String concatenation using + is now the fastest way to
concatenate
strings (that I know of).

* Throw off the shackles of "".join([]), you don't need it
anymore.

* Did I mention it was faster?

Downsides to this approach:

* Changes how PyStringObjects are stored internally; ob_sval is
no
longer a char[1], but a char *. This makes each StringObject
four bytes larger.

* Adds another memory dereference in order to get the value of
a string, which is a teensy-weensy slowdown.

* Would force a recompile of all C modules that deal directly
with string objects (which I imagine is most of them).

* Also, *requires* that C modules use the PyString_AS_STRING()
macro, rather than casting the object and grabbing ob_sval
directly. (I was pleased to see that the Python source
was very good about using this macro; if all Python C
modules are this well-behaved, this point is happily moot.)

* On a related note, the file Mac/Modules/MacOS.c implies
that there are Mac-specific Python scripts that peer
directly into string objects. These would have to be
changed to understand the new semantics.

* String concatenation objects are 36 bytes larger than
string objects, and this space will often go unreclaimed
after the string is rendered.

* When rendered, string concatenation objects storing long
strings will allocate a second buffer from the heap to
store the string. So this adds some minor allocation
overhead (though this is offset by the speed gain from
the approach overall).

* Will definitely need some heavy review before it could
go in, in particular I worry I got the semantics surrounding
"interned" strings wrong.


Internally, a "string concatenation" object is the same as a "string"
object with two extra fields: an array of references to string objects
(and string concatenation objects), and a count. Also, ob_sstate now
stores a third flag, indicating whether or not the string is a string
concatenation object. External users don't need to worry about these
details, they just use PyString_AS_STRING() and all is well with their
world.

I've already added some optimizations to the approach:

* If you're adding a string to an existing string concatenation
object whose reference count is 1, who isn't full, and who
hasn't been rendered yet, it adds the string directly to the
existing concatenation object (rather than creating a new
one).
This works whether the existing object is on the right or the
left side.

* The general case is a + b + c ..., where you're only adding
strings to the *end*. This will build a degenerate tree
where
the first slot points to the child. The tree for
concatenating
the first twenty-two letters of the alphabet would look like
this:

[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| p q r s t u v
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| i j k l m n o
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
v v v v v v v v
a b c d e f g h

The rendering function is by necessity recursive. However,
it is optimized for this shape; when the tree looks like
this,
it will be rendered purely iteratively--no recursion.

* If, after rendering, the resulting string will be small
enough
to fit inside the extra space normally used by the
concatenation
object, the final string will be copied on top of this
otherwise
now-wasted space.

______________
THE BENCHMARKS

Benchmark 1:
def add(a, b, c, ... t): return a + b + c + ... + t
for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
Benchmark 2:
for i in range(10000000): x = "aaa" + "bbb" + "ccc" + ... +
"ttt"
Benchmark 3:
for i in range(10000000): x = "".join(["aaa", "bbb", "ccc",
...., "ttt"])
Benchmark 4:
for i in range(10000000):
x = "aaa"
x += "bbb"
x += "ccc"
...
x += "ttt"
Benchmark 5:
for i in range(10000000):
array = []
array.append("aaa")
array.append("bbb")
array.append("ccc")
...
array.append("ttt")
x = "".join(array)
Benchmark 6:
for i in range(10000000):
x = cStringIO.StringIO()
x.write("aaa")
x.write("bbb")
x.write("ccc")
...
x.write("ttt")

Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | %
improvement
1 | 32.1s | 32.8s | 26.7s |
22.8%
2 | 18.7s | 18.7s | 15.1s |
23.8%
3 | 16.1s | 16.2s | 16.6s |
-1.1%
4 | 64.6s | 64.0s | 57.1s |
12.1%
5 | 74.1s | 79.4s | 76.7s |
3.5%
6 | 126.0s | too slow, didn't bother!

______________
THE SUBTLE BUG

While developing this patch, I discovered a subtle bug in Python.
It happened only on one gruesome test in the middle of test_descr.py.
The problem was: '' == '' was returning False.

Starting on line 1116 of stringobject.c we have this:

if (op == Py_EQ) {
/* Supporting Py_NE here as well does not save
much time, since Py_NE is rarely used. */
if (a->ob_size == b->ob_size
&& (a->ob_sval[0] == b->ob_sval[0]
&& memcmp(a->ob_sval, b->ob_sval,
a->ob_size) == 0)) {
result = Py_True;
} else {
result = Py_False;
}
goto out;
}

The bug is with the "a->ob_sval[0] == b->ob_sval[0]". This is
apparently
a speed optimization; check that the first byte is the same before
proceeding with the memcmp(). But it does this even when *both* strings
are *zero length*! I doubt the Python source tree has an official
position on what the first byte of a zero-length string should be, but
I assert that it should be undefined. In practice, it's seemingly
always
set to the same value in the released Python, but with my changes it's
now possible for them to be different.

I think it's sinful to compare the first bytes of zero-length strings.
Therefore I suggest this bugfix:

if (a->ob_size == b->ob_size
/* >> change >> */ && ( (!a->ob_size || a->ob_sval[0] ==
b->ob_sval[0])
&& memcmp(a->ob_sval, b->ob_sval,

I've already incorporated this fix in my code.

__________
THE FUTURE

If this patch is accepted, it paves the way for another cheap
optimization: string slices could be done without copying.
Add another field into the PyString structure: a reference
to another PyString *. If non-NULL, it means the local ob_sval
actually points into the ob_sval owned by that other PyString *.
It'd be another four bytes, but we'd only need to incur that
when creating a slice; we could set a bit in ob_sstate indicating
"this is a slice", and if so we'd have to copy-on-write ob_sval,
and free the reference when the object was destroyed.

______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere? Put it up on a web page? Post some diff output? (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again. I would be happy to submit that separately. (What can I
say, old habits die hard. I can't stand the newer Microsoft IDEs.)


Cheers,


/larry/

p.s. No, I haven't looked at a Unicode port yet. If there's interest
in
this patch at all, I might take a stab at it.
 
S

Steve Holden

Larry said:
This is such a long posting that I've broken it out into sections.
Note that while developing this patch I discovered a Subtle Bug
in CPython, which I have discussed in its own section below.
[...]
______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere? Put it up on a web page? Post some diff output? (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again. I would be happy to submit that separately. (What can I
say, old habits die hard. I can't stand the newer Microsoft IDEs.)


Cheers,


/larry/

p.s. No, I haven't looked at a Unicode port yet. If there's interest
in
this patch at all, I might take a stab at it.
A fairly short reply: you should diff your source against the current
SVN repository and lodge that diff as a patch on SourceForge. It
wouldn't hurt to let one or two developers know, thought they *might*
pick it up without.

Your suggested bug isn't, I think a real bug in the current
implementation because as I understand it Python strings do always
include a trailing null byte to act as a terminator when the string is
passed to C code in extensions.

Nice idea, though. You might also see how it does on tasks like

s = ""
for i in range(100000):
s += "a"

since I believe that's the main culprit in increasing the O(). Also note
that some optimisation has recently been performed on string
concatenation, which I presume your code has already taken into account.

regards
Steve
 
R

Robin Becker

Larry Hastings wrote:
______
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:
.........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real code.
 
R

Rob Williscroft

Larry Hastings wrote in @m7g2000cwm.googlegroups.com in comp.lang.python:
_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

On the python 3k list there is a thread about stings becoming "views",
I think this is similar to what you are doing here.

<url:http://search.gmane.org/?
query=string+view&author=&group=gmane.comp.python.python-
3000.devel&sort=relevance&DEFAULTOP=and&xP=string.view.
&xFILTERS=Gcomp.python.python-3000.devl---A>

http://tinyurl.com/kadco

Rob.
 
F

Felipe Almeida Lessa

28 Sep 2006 19:07:23 -0700 said:
THE BENCHMARKS

Benchmark 1:
def add(a, b, c, ... t): return a + b + c + ... + t
for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
[snip]

What about "a + b"? Or "a + b + c"? Does it have a large overhead on
small string concatenations?
 
C

Carl Friedrich Bolz

Robin said:
Larry Hastings wrote:
______
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.
I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Cheers,

Carl Friedrich
 
C

Carl Friedrich Bolz

Robin said:
Larry Hastings wrote:
______
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.
I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real
code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Cheers,

Carl Friedrich
 
C

Carl Friedrich Bolz

Robin said:
Larry Hastings wrote:
______
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.
I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real
code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Cheers,

Carl Friedrich
 
C

cfbolz

Robin said:
Larry Hastings wrote:
______
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too (at least if you don't do additional things).
I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real code.

I think there are quite a lot of string additions around, it's just
that people get told to use join all the time, so they are not written
using "+".

Cheers,

Carl Friedrich Bolz
 
C

Carl Friedrich Bolz

Carl said:
no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.

code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Sorry for the multiple post :-(
 
L

Larry Hastings

Steve said:
you should diff your source against the current
SVN repository and lodge that diff as a patch on SourceForge.

Okay, I'll try to do that today.

Your suggested bug isn't, I think a real bug in the current
implementation because as I understand it Python strings do always
include a trailing null byte to act as a terminator when the string
is passed to C code in extensions.

Ah! Excellent point. So the Python source tree *does* have an
official position on what the first byte of a zero-length string should
be. I'll fix my code so that is ensured.

Nice idea, though. You might also see how it does on tasks like

s = ""
for i in range(100000):
s += "a"

Sure. Here are the results, but with range (10000000):
Python 2.5 release: 31.0s
Python 2.5 locally built: 30.7s
Python 2.5 concat: 4.4s
% Improvement: *697%*!

Woah! Thanks for suggesting a benchmark where it really shines :)

The real big win here is storing eight pointers per concatenation
object. That means 1/8th as many objects created/destroyed.

In testing your suggested benchmark, it overflowed the stack
when freeing the concatenation tree. So I changed deallocation
of string concatenation objects so it's iterative down the
left-hand side like rendering; now it's happy again. Thanks
again for suggesting the test.

It would still blow up if you ran
s = ""
for i in range(10000000):
s = "a" + s
because it's only clever about recursion down the left-hand side.
So my advice is, don't. :)
Also note that some optimisation has recently been performed on string
concatenation, which I presume your code has already taken into account.

Totally. I'm just glad there was something left for me to contribute!


Rob said:
On the python 3k list there is a thread about stings becoming "views",
I think this is similar to what you are doing here.

As I understand it, sort-of. Definitely moreso in my section on THE
FUTURE--adding another subtype of string for slices which point into
other strings. I believe views are more sophisticated however.

What about "a + b"? Or "a + b + c"? Does it have a large overhead on
small string concatenations?

It's *slightly* slower for two:

def addTwoThings(a, b):
return a + b
for i in range(10000000):
x = addTwoThings("aaa", "bbb")

Python 2.5 release: 6.219s
Python 2.5 locally built: 6.219s
Python 2.5 concat: 6.234s
% improvement: -0.03%

But starts paying off already, even with three:

def addThreeThings(a, b, c):
return a + b + c
for i in range(10000000):
x = addThreeThings("aaa", "bbb", "ccc")

Python 2.5 release: 7.9s
Python 2.5 locally built: 7.7s
Python 2.5 concat: 7.2s
% improvement: 6.94%

Note that I was lazy and only did one benchmark run apiece.

Also, keep in mind that memory utilization will be a little higher
with the string concatenation objects.

no, I think it depends on strings being immutable.

Exactly.


/larry/
 
F

Fredrik Lundh

Larry said:
Sure. Here are the results, but with range (10000000):

ten million? what hardware are you running this on?
Python 2.5 release: 31.0s
Python 2.5 locally built: 30.7s
Python 2.5 concat: 4.4s
% Improvement: *697%*!

Woah! Thanks for suggesting a benchmark where it really shines :)

what's in "s" when that loop is done?

</F>
 
W

William Heymann

It would still blow up if you ran
s = ""
for i in range(10000000):
s = "a" + s


This is a pretty small change but I would suggest xrange instead of range.
That way you don't allocate that large list just to throw all the items away.

In this case xrange will probably be faster then range also.
 
L

Larry Hastings

Fredrik said:
ten million? what hardware are you running this on?

Athlon 64 x2 4400+ (aka 2.2GHz), 3GB of RAM, Windows XP.

what's in "s" when that loop is done?

It's equivalent to " 'a' * 10000000 ". (I shan't post it here.)


Also, I misspoke earlier; it's not an improvement of 697%, but of 597%.
To be precise, it takes about 1/7 the time of the original.

Cheers,


/larry/
 
F

Fredrik Lundh

Larry Hastings wrote:

It's equivalent to " 'a' * 10000000 ". (I shan't post it here.)

but what *is* it ? an ordinary PyString object with a flattened buffer,
or something else ?

</F>
 
L

Larry Hastings

William said:
This is a pretty small change but I would suggest xrange instead of range.

Good point! Since I was calling range() during the benchmark, it was
timed too. Switching to xrange() will mean less overhead.

I re-ran this benchmark (again):
s = ""
for i in range(100000):
s += "a"

The result:
Python 2.5 release: 30.1s
Python 2.5 locally built: 30.4s
Python 2.5 concat: 3.95s
Improvement: *669%*! (1/7.7 the time.)

So xrange(10000000) was adding between 0.4s and 0.9s overhead, and
losing it makes my approach look even better. Thanks!


/larry/
 
L

Larry Hastings

Fredrik said:
but what *is* it ? an ordinary PyString object with a flattened buffer,
or something else ?

At the exact moment that the loop is done, it's a
PyStringConcatenationObject * which points to a deep one-sided tree of
more PyStringConcatenationObject * objects. Its ob_sval is NULL, which
means that the first time someone asks for its value (via the macro
PyString_AS_STRING()) it will be computed. When it's computed, the
interpreter will allocate a buffer of 10000001 bytes and walk the tree,
filling the buffer with ten million 'a's followed by a zero. It'll
then dereference all its children.

The PyStringConcatenationObject struct is a child of PyStringObject,
and external users can ignore the difference as long as they use the
macros in stringobject.h (e.g. using PyString_AS_STRING(), rather than
casting to PyStringObject and using ob_sval directly).

Sorry for misunderstanding the nature of your question the first time,


/larry/
 
F

Fredrik Lundh

At the exact moment that the loop is done, it's a
PyStringConcatenationObject * which points to a deep one-sided tree of
more PyStringConcatenationObject * objects. Its ob_sval is NULL, which
means that the first time someone asks for its value (via the macro
PyString_AS_STRING()) it will be computed. When it's computed, the
interpreter will allocate a buffer of 10000001 bytes and walk the tree,
filling the buffer with ten million 'a's followed by a zero. It'll
then dereference all its children.

so what does the benchmark look like if you actually do this ?

</F>
 
L

Larry Hastings

Fredrik said:
so what does the benchmark look like if you actually do this ?

Okay, timing this:
x = ""
for i in range(100000):
x += "a"
t = x[1] # forces the concat object to render

The result:
Python 2.5 release: 30.0s
Python 2.5 locally built: 30.2s
Python 2.5 concat: 4.3s
Improvement: 600% (1/7 of the time)


/larry/
 
C

Carl Friedrich Bolz

Larry Hastings wrote:
[snip]
The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.
[snip]

Sounds cool. If I understand it correctly you can write even meaner (and
probably even more useless in practive) benchmarks where you safe some
of the intermediate results of the concatenation, thus also explointing
the better space behaviour:


all = []

s = ""

for i in range(1000):
s = s + (str(i) + " ") * 1000
all.append(s)


This should take around 2GB of RAM (so maybe you shouldn't even run it
:) ) on an unpatched CPython but a lot less with your patch.

<sidenode>
This is exactly the sort of experiment that is extremely easy
to do with PyPy. In fact, some other PyPyers and me wrote a very similar
optimization, which can be compiled in if wanted (or not). The whole
implementation took roughly 100 lines of code, of which 65 are in a
separate file not touching anything else, the rest being minimally
invasive changes. We also implemented a similar optimization for string
slicing (only if the slice has a substantial length and a step of 1).
</sidenode>

Cheers,

Carl Friedrich Bolz
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,969
Messages
2,570,161
Members
46,709
Latest member
AustinMudi

Latest Threads

Top