which is better, string concatenation or substitution?


John Salerno

My initial feeling is that concatenation might take longer than
substitution, but that it is also easier to read:


def p(self, paragraph):
    self.source += '<p>' + paragraph + '</p>\n\n'

vs.

def p(self, paragraph):
    self.source += '<p>%s</p>\n\n' % paragraph



Is there a preference between these two ways?
 

Leif K-Brooks

John said:
My initial feeling is that concatenation might take longer than
substitution

Doesn't look that way:

leif@ubuntu:~$ python -m timeit "'<p>%s</p>\n\n' % 'foobar'"
1000000 loops, best of 3: 0.6 usec per loop
leif@ubuntu:~$ python -m timeit "'<p>' + 'foobar' + '</p>\n\n'"
1000000 loops, best of 3: 0.358 usec per loop

but that it is also easier to read:

I prefer string formatting for readability, but it's a matter of
personal preference.
 

Leif K-Brooks

fuzzylollipop said:
neither, .join() is the fastest

Please quote what you're replying to.

No, it's the slowest:

leif@ubuntu:~$ python -m timeit "'<p>%s</p>\n\n' % 'foobar'"
1000000 loops, best of 3: 0.607 usec per loop
leif@ubuntu:~$ python -m timeit "'<p>' + 'foobar' + '</p>\n\n'"
1000000 loops, best of 3: 0.38 usec per loop
leif@ubuntu:~$ python -m timeit "''.join(['<p>', 'foobar', '</p>\n\n'])"
1000000 loops, best of 3: 0.817 usec per loop
 

Duncan Booth

Leif said:
fuzzylollipop said:
neither, .join() is the fastest

Please quote what you're replying to.

No, it's the slowest:

leif@ubuntu:~$ python -m timeit "'<p>%s</p>\n\n' % 'foobar'"
1000000 loops, best of 3: 0.607 usec per loop
leif@ubuntu:~$ python -m timeit "'<p>' + 'foobar' + '</p>\n\n'"
1000000 loops, best of 3: 0.38 usec per loop
leif@ubuntu:~$ python -m timeit "''.join(['<p>', 'foobar', '</p>\n\n'])"
1000000 loops, best of 3: 0.817 usec per loop

If you are only concatenating a few strings, then straight concatenation
will be faster, but when combining many strings, repeated concatenation
can be much slower than join. In the OP's original example:

def p(self, paragraph):
    self.source += '<p>' + paragraph + '</p>\n\n'

it is the concatenation to self.source which could become the
bottleneck; it doesn't really matter how the text of the paragraph is
assembled.

For most purposes use what looks clearest at the time: it isn't worth the
hassle of obfuscating your code until you've identified a real CPU hog. On
the other hand, the str.join idiom is sufficiently common in Python that
sometimes it wins on clarity and simplicity as well, e.g. if you build a
list of lines to join then you don't have to repeat '\n' on the end of each
component line.
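A minimal sketch of that idiom (the line contents here are just placeholders):

```python
# join inserts the separator *between* items, so no line needs to
# carry its own '\n' and nothing is repeated on each component.
lines = ['first line', 'second line', 'third line']
text = '\n'.join(lines)
assert text == 'first line\nsecond line\nthird line'
```

Note there is also no stray newline after the last line, since the separator only goes between items.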

BTW, be careful using timeit. I nearly got caught out running your tests:

C:\Python25>python -m timeit "''.join(['<p>', 'foobar', '</p>\n\n'])"
1000000 loops, best of 3: 0.872 usec per loop

C:\Python25>python -m timeit "'<p>' + 'foobar' + '</p>\n\n'"
10000000 loops, best of 3: 0.049 usec per loop

C:\Python25>python -m timeit "'<p>%s</p>\n\n' % 'foobar'"
10000000 loops, best of 3: 0.0495 usec per loop

C:\Python25>cd \python24

C:\Python24>python -m timeit "''.join(['<p>', 'foobar', '</p>\n\n'])"
1000000 loops, best of 3: 1.05 usec per loop

C:\Python24>python -m timeit "'<p>' + 'foobar' + '</p>\n\n'"
1000000 loops, best of 3: 0.359 usec per loop

Spot the really fast concatenations in Python 2.5, which now detects the
constant strings and concatenates them once only, at compile time. It does
the same for the string formatting, which leaves poor old join as the only
one actually doing any work in these tests.
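One way around that pitfall (a sketch using the timeit module directly) is to supply the paragraph via the setup string, so the compiler cannot fold the whole expression into a single constant:

```python
import timeit

# Supplying the paragraph as a setup variable keeps the peephole
# optimizer from pre-computing the whole expression at compile time.
setup = "para = 'foobar'"
statements = [
    "'<p>' + para + '</p>\\n\\n'",
    "'<p>%s</p>\\n\\n' % para",
    "''.join(['<p>', para, '</p>\\n\\n'])",
]
for stmt in statements:
    best = min(timeit.repeat(stmt, setup=setup, number=100000, repeat=3))
    print('%-40s %.3f' % (stmt, best))
```

The same effect is available on the command line with `python -m timeit -s "para = 'foobar'" ...`.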
 

Roy Smith

John Salerno said:
My initial feeling is that concatenation might take longer than
substitution, but that it is also easier to read:


def p(self, paragraph):
    self.source += '<p>' + paragraph + '</p>\n\n'

vs.

def p(self, paragraph):
    self.source += '<p>%s</p>\n\n' % paragraph



Is there a preference between these two ways?

One may be marginally faster, but they both require copying the source
string, and are thus both O(n). If you're just doing one or a small fixed
number of these, it really doesn't matter. Pick whichever one you think is
easier to read.

On the other hand, if you're doing a lot of them (i.e. in a loop), the
entire loop will now be O(n^2), which is a killer. If that's the case,
what you want to do is accumulate the individual substrings in a list, then
join the list elements all at once:

parts = []
for paragraph in foo:
    parts.append('<p>')
    parts.append(paragraph)
    parts.append('</p>\n\n')
    # or maybe instead of that ...
    # parts += ['<p>', paragraph, '</p>\n\n']

self.source = "".join(parts)

This only requires a single copy, and thus you're back to being O(n), which
beats the heck out of O(n^2).
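A rough sketch of the difference (numbers will vary by machine and Python version, and CPython can sometimes resize a string in place, so the gap is most visible for large n):

```python
import timeit

def build_by_concat(n):
    s = ''
    for i in range(n):
        s += '<p>x</p>\n\n'   # may copy the accumulated string each time
    return s

def build_by_join(n):
    parts = []
    for i in range(n):
        parts.extend(['<p>', 'x', '</p>\n\n'])
    return ''.join(parts)     # one copy at the end

# Both produce identical output ...
assert build_by_concat(100) == build_by_join(100)

# ... but they scale very differently as n grows.
for n in (1000, 10000):
    for f in (build_by_concat, build_by_join):
        t = min(timeit.repeat(lambda: f(n), number=5, repeat=3))
        print('%s n=%d: %.4f sec' % (f.__name__, n, t))
```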
 

John Salerno

Roy said:
One may be marginally faster, but they both require copying the source
string, and are thus both O(n).

Sorry, I'm not familiar with the O(n) notation.
If you're just doing one or a small fixed
number of these, it really doesn't matter. Pick whichever one you think is
easier to read.

Thanks guys. I have a handful of methods that each do this task once per
call, so I suppose this counts as not a lot, at least not at one time.
And it seems like a good point that the real problem could be constantly
concatenating with self.source, rather than the smaller pieces being put
together.
 

John Salerno

Duncan said:
If you build a
list of lines to join then you don't have to repeat '\n' on the end of each
component line.

How would that work? Wouldn't the last line in the list still need the
newlines?
 

Roy Smith

John Salerno said:
Sorry, I'm not familiar with the O(n) notation.

OK, here's a quick tutorial to "big-oh" notation. It's a way of
measuring algorithmic complexity. The O stands for "on the Order
of".

For any algorithm, if you process n things (in the case of the strings
we're talking about, n would be the total number of characters in all
the strings), you can compute the number of steps it takes to complete
all the processing as some function of n.

For example, let's say a given algorithm took 4n^2 + 5n + 17 steps to
complete. It doesn't take much experimentation to prove to yourself
that for all but the very smallest values of n, the constant 17 is
completely insignificant. In fact, n doesn't have to get very big
before the 5n term becomes insignificant too. To a reasonable
approximation, you could say that the algorithm takes 4n^2 steps and
be close enough. For small values of n, this will be a bad
approximation, but it doesn't really matter for small values of n.
For large values of n (think hundreds, thousands or even millions),
it's just fine.

So, the first rule of O() notation is that when looking at how fast an
algorithm runs, you need only consider the highest order term
(i.e. the one with the biggest exponent).

But, it gets even better. Let's compare two algorithms, the one
above, which takes (approximately) 4n^2 steps, and another one which
takes 10n steps, for various values of n:

  n    10n    4n^2
---  -----  ------
  1     10       4
  2     20      16
  3     30      36
  4     40      64
  5     50     100
  6     60     144
  7     70     196
  8     80     256
  9     90     324
 10    100     400
 11    110     484
 12    120     576
 13    130     676
 14    140     784
 15    150     900
 16    160    1024
 17    170    1156
 18    180    1296
 19    190    1444
 20    200    1600
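The table is just the two polynomials evaluated side by side, so it can be regenerated in a few lines:

```python
# Print n, 10n and 4n^2 for n = 1..20, as in the table above.
print('%3s %6s %6s' % ('n', '10n', '4n^2'))
for n in range(1, 21):
    print('%3d %6d %6d' % (n, 10 * n, 4 * n * n))
```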

Notice that it doesn't take long for the fact that n^2 grows a lot
faster than n to swamp the fact that 10 is bigger than 4. So the next
step in simplification is to say that not only don't the lower-order
terms matter, but the coefficient on the highest order term doesn't
even matter. For a large enough n, all that matters is the exponent
(for the moment, I'm making a further simplification that all
algorithms can be described by polynomials with integer exponents).

So, we get things like:

O(n^0), which is almost always written as O(1). This is a "constant
time" algorithm, one which takes the same amount of steps to execute
no matter how big the input is. For example, in python, you can
write, "x = 'foo'". That assignment statement takes the same amount
of time no matter how long the string is. All of these execute in the
same number of steps:

x = ''
x = 'foo'
x = 'a very long string with lots and lots of characters'

We can say that "assignment is constant time", or "assignment is
O(1)".

The next step up is O(n). This is "linear time" algorithm. It takes
a number of steps which is directly proportional to the size of the
input. A good example might be "if 'x' in 'bar'". The obvious way to
implement this (and, I assume, the way it is implemented in Python),
is to loop over the string, comparing 'x' to each character in 'bar',
one by one. This takes as many steps as there are characters in the
string.
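A sketch of that O(n) scan in pure Python (CPython's real implementation is in C, but the shape is the same):

```python
# Check membership by comparing the needle to each character in
# turn: up to n comparisons for a string of length n.
def contains(haystack, needle):
    for ch in haystack:
        if ch == needle:
            return True
    return False

assert contains('bar', 'a')
assert not contains('bar', 'x')
```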

Next comes O(n^2), also known as "quadratic time". This means if your
input is of size n, the algorithm will take n^2 steps to run.
Quadratic algorithms are generally considered very slow, and to be
avoided if at all possible.

Once you get used to thinking like this, it's easy to look at a piece
of code and say, "oh, that's quadratic", or "that's linear", and
instantly know which is faster for some large input. And once
you've started thinking like that, you've made one of the big jumps
from thinking like a coder to thinking like a computer scientist (or,
if you prefer, like a software engineer).

Google "algorithmic complexity" or "big o notation" (perhaps spelled
"big oh") for more info. I would normally recommend Wikipedia, but I
just took a quick look at their "Big O notation" article, and noticed
it's full of formal mathematical gobbledygook which just serves to
obscure what is really a fairly easy concept.
 

Ted

Thank you Roy.

It seems if you lurk here long enough you eventually get all your
questions answered without even asking!
;-)
 

John Salerno

Roy said:
OK, here's a quick tutorial to "big-oh" notation.

Wow, that was fantastic (and interesting)! Did you just write that? That
should be saved somewhere! Mind if I post it on my website? (don't
worry, no one goes there anyway) :)
 

Roy Smith

John Salerno said:
Wow, that was fantastic (and interesting)! Did you just write that? That
should be saved somewhere! Mind if I post it on my website? (don't
worry, no one goes there anyway) :)

Sure, feel free. Once you post something on usenet, anybody can
pretty much do anything they want with it. I do prefer if people
don't use what I write to line birdcages with, but have no control
over that either :)
 

John Salerno

Roy said:
Sure, feel free. Once you post something on usenet, anybody can
pretty much do anything they want with it. I do prefer if people
don't use what I write to line birdcages with, but have no control
over that either :)

Thanks! And don't worry about that last part, no birds here. :)
 

Casey Hawthorne

Roy Smith said:
O(n^0), which is almost always written as O(1). This is a "constant
time" algorithm, one which takes the same amount of steps to execute
no matter how big the input is. For example, in python, you can
write, "x = 'foo'". That assignment statement takes the same amount
of time no matter how long the string is. All of these execute in the
same number of steps:

x = ''
x = 'foo'
x = 'a very long string with lots and lots of characters'

We can say that "assignment is constant time", or "assignment is
O(1)".

Constant time if converted to byte code or compiled?

O(n) in string length if being interpreted?
 

Roy Smith

Casey Hawthorne said:
Constant time if converted to byte code or compiled?

O(n) in string length if being interpreted?

Compiled, byte-code, or interpreted has nothing to do with it.

Assignment would be O(n) if it involved copying the string (as it might in
C++ using std::string, for example), but in Python, assignment simply
involves binding a new name to an existing object.
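A quick sketch of that distinction:

```python
# Rebinding a name in Python copies nothing: both names end up
# referring to the very same string object, however long it is.
s = 'a long string ' * 100000
t = s                 # O(1): just binds the name 't'
assert t is s         # same object, no characters copied
```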
 

bruno at modulix

Ted said:
Thank you Roy.

It seems if you lurk here long enough you eventually get all your
questions answered without even asking!
;-)

+1 QOTW

<OT>
please avoid top-posting, and please avoid posting back a long message
just to add three lines.
</OT>
 
