Python and STL efficiency

M

Mc Osten

Tim N. van der Leeuw said:
I'm curious though, if on the OP's machine the slowed-down Python
version is still faster than the C++ version.

I tested both on my machine (my other post in the thread)
 
F

Fredrik Lundh

Mc Osten said:
In fact Python here is faster. Suppose it has a really optimized set
class...

Python's memory allocator is also quite fast, compared to most generic
allocators...

</F>
 
M

Mc Osten

Fredrik Lundh said:
Python's memory allocator is also quite fast, compared to most generic
allocators...

In fact also in the two "slow" versions Python outperforms C++.
I didn't notice it in the first place.
 
T

Tim N. van der Leeuw

Mc said:
In fact also in the two "slow" versions Python outperforms C++.
I didn't notice it in the first place.

But your C++ program outputs times in seconds, right? So all
compilations except for the first two give results in less than a
second, right? (meaning the optimizations of your standard-compilation
give worst results than -O3?)

BTW, I don't quite understand your gcc optimizations for the first 2
compiles anyways: two -O options with different values. Doesn't that
mean the 2nd -O takes preference, and the compilation is at -O2 instead
of -O3?

Why both -O3 and -O2 at the command-line?

Cheers,

--Tim

 
T

Tim N. van der Leeuw

Mc said:
In fact also in the two "slow" versions Python outperforms C++.
I didn't notice it in the first place.

Well, I guess I'm getting really obsessed with this. But anyways. I
installed MinGW on my Windows-XP (sp2) laptop. It is g++ version 3.4.5
-- ancient, yes, but on windows it's the latest available.

I compiled Mc Osten's C++ program (tweaked the output a little) and ran
it; ran his version of the python code too.
Oh boy; yes indeed the slow python is faster than the fast C++
version... Must be something really awful happening in the STL
implementation that comes with GCC 3.4!

Here's the output from my console:

LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ g++ -O3 -march=pentium-m -o SpeedTest SpeedTest.cpp

LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ ./SpeedTest.py
Begin Test
Number of unique string objects: 4
so long...
What do you know
fool
chicken crosses road
Number of unique string objects: 40000
so long...
What do you know
fool
chicken crosses road
Fast - Elapsed: 0.037574 seconds
Slow - Elapsed: 0.081520 seconds

LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ ./SpeedTest.exe
Begin Test
What do you know?
chicken crosses road
fool
so long...
What do you know?
chicken crosses road
fool
so long...
Fast - Elapsed: 2.089 seconds
Slow - Elapsed: 6.303 seconds

LeeuwT@nlshl-leeuwt ~/My Documents/Python


Cheers,

--Tim
 
J

Jeremy Sanders

Mc said:
Here some results (I know that the fpoint optimizations are useless...
it's is my "prebuilt" full optimization macro :) ):

Interesting. The opimisation makes no difference to the speed of the C++ one
for me. I just get

xpc17:~> g++4 -O2 test2.cpp
xpc17:~> ./a.out
What do you know?
chicken crosses road
fool
so long...
What do you know?
chicken crosses road
fool
so long...
Elapsed 2.11
Elapsed 1.11

(This is with an Althon 64 4600+ running Linux).

Unfortunately the Python on this computer doesn't have set as it is too old,
so I can't compare it.
 
M

Mc Osten

But your C++ program outputs times in seconds, right? So all
compilations except for the first two give results in less than a
second, right? (meaning the optimizations of your standard-compilation
give worst results than -O3?)

Yes. It's in seconds but the benchmark that are one order of magnitudo
less than the others have of a different "size" (100000 instead of
1000000). That is cut and paste from my terminal... I think it's a
mess. I do it all again from scratch.
BTW, I don't quite understand your gcc optimizations for the first 2
compiles anyways: two -O options with different values. Doesn't that
mean the 2nd -O takes preference, and the compilation is at -O2 instead
of -O3?
Why both -O3 and -O2 at the command-line?

I forgot I put -O2 in my $FAST_FLAGS. I don't know what I was thinking
about.

This the correct version

$ g++ -Os -pipe -march=pentium-m -msse3 -fomit-frame-pointer
-mfpmath=sse -o set_impl set_impl.cpp

$ ./set_impl
What do you know?
chicken crosses road
fool
so long...
What do you know?
chicken crosses road
fool
so long...
Elapsed 6.3
Elapsed 2.1

$ g++ -O2 -pipe -march=pentium-m -msse3 -fomit-frame-pointer
-mfpmath=sse -o set_impl set_impl.cpp
$ ./set_impl
What do you know?
chicken crosses road
fool
so long...
What do you know?
chicken crosses road
fool
so long...
Elapsed 5.8
Elapsed 1.7

$ g++ -O3 -pipe -march=pentium-m -msse3 -fomit-frame-pointer
-mfpmath=sse -o set_impl set_impl.cpp
$ ./set_impl
What do you know?
chicken crosses road
fool
so long...
What do you know?
chicken crosses road
fool
so long...
Elapsed 5.79
Elapsed 1.72

$ g++ -pipe -march=pentium-m -msse3 -fomit-frame-pointer -mfpmath=sse
-o set_impl set_impl.cpp
$ ./set_impl
What do you know?
chicken crosses road
fool
so long...
What do you know?
chicken crosses road
fool
so long...
Elapsed 7.12
Elapsed 2.98

$ python -O set_impl.py
so long...
What do you know
fool
chicken crosses road
so long...
What do you know
fool
chicken crosses road
Elapsed: 1.370000 seconds
Elapsed: 3.800000 seconds
 
M

Mc Osten

Tim N. van der Leeuw said:
Oh boy; yes indeed the slow python is faster than the fast C++
version... Must be something really awful happening in the STL
implementation that comes with GCC 3.4!

And the Python version does the very same number of iterations than the
C++ one? I suppose they are looping on arrays of different sizes, just
like my "first version".
 
T

Tim N. van der Leeuw

Mc said:
And the Python version does the very same number of iterations than the
C++ one? I suppose they are looping on arrays of different sizes, just
like my "first version".

Hmmm.. You're quite right. The C++ version had an array size 100.000
(your version), the Python version still had an array size 10.000 (as
in my modified copy of the original version).

When fixing the Python version to have 100.000 items, like the C++
version, the Python timings are:

Begin Test
Number of unique string objects: 4
so long...
What do you know
fool
chicken crosses road
Number of unique string objects: 400000
so long...
What do you know
fool
chicken crosses road
Fast - Elapsed: 0.512088 seconds
Slow - Elapsed: 1.139370 seconds

Still twice as fast as the fastest GCC 3.4.5 compiled version!

Incidentally, I also have a version compiled with VC++ 6 now... (not
yet w/VC++ 7) .. Compiled with release-flags and maximum optimization
for speed, here's the result of VC++ 6:

LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ ./SpeedTest_VC.exe
Begin Test
What do you know?
chicken crosses road
fool
so long...
What do you know?
chicken crosses road
fool
so long...
Fast - Elapsed: 4.481 seconds
Slow - Elapsed: 4.842 seconds

So you can see that it's 'slow' version of the code is faster than the
'slow' version compiled with GCC, but the 'fast' code is barely faster
than the 'slow' code! And the 'fast' version compiled with GCC is much
faster than the 'fast' version compiled with VC++ 6!

My conclusion from that is, that the vector<> or set<> implementations
of GCC are far superior to those of VC++ 6, but that memory allocation
for GCC 3.4.5 (MinGW version) is far worse than that of MSCRT / VC++ 6.
(And Python still smokes them both).

Cheers,

--Tim


 
T

Tim N. van der Leeuw

Tim said:
Mc said:
And the Python version does the very same number of iterations than the
C++ one? I suppose they are looping on arrays of different sizes, just
like my "first version".

Hmmm.. You're quite right. The C++ version had an array size 100.000
(your version), the Python version still had an array size 10.000 (as
in my modified copy of the original version).

When fixing the Python version to have 100.000 items, like the C++
version, the Python timings are:
[...]
Fast - Elapsed: 0.512088 seconds
Slow - Elapsed: 1.139370 seconds

Still twice as fast as the fastest GCC 3.4.5 compiled version!

Incidentally, I also have a version compiled with VC++ 6 now... (not
yet w/VC++ 7) .. Compiled with release-flags and maximum optimization
for speed, here's the result of VC++ 6:

LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ ./SpeedTest_VC.exe [...]
Fast - Elapsed: 4.481 seconds
Slow - Elapsed: 4.842 seconds
[...]

And the results of IronPython (1.0rc2) are just in as well:

IronPython 1.0.60816 on .NET 2.0.50727.42
Copyright (c) Microsoft Corporation. All rights reserved.Begin Test
Number of unique string objects: 4
What do you know
so long...
chicken crosses road
fool
Number of unique string objects: 400000
What do you know
so long...
chicken crosses road
fool
Fast - Elapsed: 1.287923 seconds
Slow - Elapsed: 4.516272 seconds

And for Python 2.5:
LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ /cygdrive/c/Python25/python.exe SpeedTest.py
Begin Test
Number of unique string objects: 4
so long...
What do you know
fool
chicken crosses road
Number of unique string objects: 400000
so long...
What do you know
fool
chicken crosses road
Fast - Elapsed: 0.440619 seconds
Slow - Elapsed: 1.095341 seconds

LeeuwT@nlshl-leeuwt ~/My Documents/Python

But beware! For Python2.5 I had to change the code slightly, because it
already realized that the expression

'%s' % 'something'

will be a constant expression, and evaluates it once only... so I had
to replace '%s' with a variable, and I got the timings above which show
Python2.5 to be slightly faster than Python2.4.

(Next step would be to create a VB version and a Java version of the
same program, oh and perhaps to try a version that would work with
Jython... perhaps somehow w/o the 'set')

Cheers,

--Tim
 
S

skip

Tim> But beware! For Python2.5 I had to change the code slightly,
Tim> because it already realized that the expression

Tim> '%s' % 'something'

Tim> will be a constant expression, and evaluates it once only... so I
Tim> had to replace '%s' with a variable, and I got the timings above
Tim> which show Python2.5 to be slightly faster than Python2.4.

Shouldn't you then get rid of any compiler optimizations your C++ compiler
does? Why penalize 2.5 because it recognizes a useful optimization?

Tim> (Next step would be to create a VB version and a Java version of
Tim> the same program, oh and perhaps to try a version that would work
Tim> with Jython... perhaps somehow w/o the 'set')

I don't recall the example exactly, but couldn't you just create a set class
that uses a dict under the covers and only implement the methods you need
for the test?

Skip
 
M

Maric Michaud

Le mardi 22 août 2006 12:55, Mc Osten a écrit :
In fact Python here is faster. Suppose it has a really optimized set
class...

Maybe I'm missing something but the posted c++codes are not equivalent IMO to
what python is doing. I discarded the "slow" version, and tried to get the
equivalent in c++ of :

"""
#!/usr/bin/env python

size = 1000000

def f():
a = []
for i in range(size):
a.append('What do you know')
a.append('so long...')
a.append('chicken crosses road')
a.append('fool')
b = set(a)
for s in b:
print s

import time
from time import clock

f_start = clock()
f()
f_end = clock()

print "Elapsed: %f seconds" % (f_end - f_start)
"""

I came at first with the following, which is still slower than the python
version :

"""
void print_occurence_of_unique_strings(){
vector<string> a;
const string& s1 = "What do you know?" ;
const string& s2 = "so long..." ;
const string& s3 = "chicken crosses road";
const string& s4 = "fool" ;
for (long int i=0; i<SIZE ; ++i){
a.push_back(s1);
a.push_back(s2);
a.push_back(s3);
a.push_back(s4);
}
set<string> b(a.begin(), a.end());
copy(b.begin(), b.end(),
ostream_iterator<string>(cout, "\n"));
}
"""

Here, all strings, while passed by reference to the vector, are copied one by
one.
Then, I tried this, it just overcome the performance of python code, but not
in the proportion I expected :

"""
void print_occurence_of_unique_strings_compare_by_adresses(){
vector<string*> a;
string s1 = "What do you know?";
string s2 = "so long...";
string s3 = "chicken crosses road";
string s4 = "fool";
for (long int i=0; i<SIZE ; ++i){
a.push_back(&s1);
a.push_back(&s2);
a.push_back(&s3);
a.push_back(&s4);
}
set<string> b;
for (vector<string*>::iterator it=a.begin(); it!=a.end(); it++)
b.insert(**it);
copy(b.begin(), b.end(), ostream_iterator<string>(cout, "\n"));
}
"""

The problem here, is that the strings in the set are compared by value, which
is not optimal, and I guess python compare them by adress ("s*n is s*n" has
the same complexity than "s*n == s*n" in CPython, right ?).

so, finally, the code in c++, about ten times faster than the equivalent in
python, must be :

"""
void print_occurence_of_unique_strings_compared_by_address(){
cout << "print_occurence_of_unique_strings_compared_by_address" << endl;
vector<string*> a;
string s1 = "What do you know?";
string s2 = "so long...";
string s3 = "chicken crosses road";
string s4 = "fool";
for (long int i=0; i<SIZE ; ++i){
a.push_back(&s1);
a.push_back(&s2);
a.push_back(&s3);
a.push_back(&s4);
}
set<string*> b(a.begin(), a.end());
set<string> c; // well ordered set (b is ordered by address)
for (set<string*>::iterator it=b.begin(); it!=b.end(); it++)
c.insert(**it);
copy(c.begin(), c.end(), ostream_iterator<string>(cout, "\n"));
}
"""

the result on my box is :

maric@redflag2 mar aoû 22 22:24:23:~$ g++ -O3 -o testcpp testcpp.cpp
maric@redflag2 mar aoû 22 22:24:29:~$ ./testcpp
print_occurence_of_strings
What do you know?
chicken crosses road
fool
so long...
print_occurence_of_unique_strings
What do you know?
chicken crosses road
fool
so long...
print_occurence_of_unique_strings_compared_by_address
What do you know?
chicken crosses road
fool
so long...
strings : 1.89
unique strings : 0.87
compared by address : 0.18
maric@redflag2 mar aoû 22 22:24:38:~$ python2.4 testpython.py
so long...
What do you know
fool
chicken crosses road
Elapsed: 1.680000 seconds
maric@redflag2 mar aoû 22 22:24:51:~$ g++ -v
Using built-in specs.
Target: i486-linux-gnu
Configured
with: ../src/configure -v --enable-languages=c,c++,java,fortran,objc,obj-c++,ada,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-4.1-1.4.2.0/jre --enable-mpfr --with-tune=i686 --enable-checking=release
i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20060613 (prerelease) (Debian 4.1.1-5)

I've joined the full c++ file as an attachment.

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
 
T

Tim N. van der Leeuw

Tim> But beware! For Python2.5 I had to change the code slightly,
Tim> because it already realized that the expression

Tim> '%s' % 'something'

Tim> will be a constant expression, and evaluates it once only... so I
Tim> had to replace '%s' with a variable, and I got the timings above
Tim> which show Python2.5 to be slightly faster than Python2.4.

Shouldn't you then get rid of any compiler optimizations your C++ compiler
does? Why penalize 2.5 because it recognizes a useful optimization?
The point is that I was trying to create 400.000 string instances. The
extra optimization in 2.5 required an extra trick for that.
The idea is to compare a C++ version which creates 400.000 string
instances, with a Python version which creates 400.000 string
instances; then reduce those 400.000 instances to a set of only 4
unique strings.
(So I cannot just create a list with strings generated from numbers 1 -
400.000, and I didn't want to change the original code too much, so I
just added a trick to make Python allocate a new string each time
round.)

I agree that Python2.5 recognized a useful optimization, and didn't
wish to penalize it for that, however the optimalization was defeating
the purpose of my code in the first place!

Cheers,

--Tim
 
F

Fredrik Lundh

Maric said:
The problem here, is that the strings in the set are compared by value, which
is not optimal, and I guess python compare them by adress ("s*n is s*n" has
the same complexity than "s*n == s*n" in CPython, right ?).
wrong.

> timeit -s"s='x'; n=1000" "s*n is n*s"
1000000 loops, best of 3: 1.9 usec per loop
> timeit -s"s='x'; n=1000" "s*n == n*s"
100000 loops, best of 3: 4.5 usec per loop

</F>
 
T

Tim N. van der Leeuw

Maric said:
Le mardi 22 août 2006 12:55, Mc Osten a écrit :

Maybe I'm missing something but the posted c++codes are not equivalent IMO to
what python is doing. I discarded the "slow" version, and tried to get the
equivalent in c++ of :

Your C++ version got me the following timings (using gcc 3.4.5 as the
compiler, MinGW version, with -O6):

LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ ./testcpp.exe
print_occurence_of_strings
What do you know?
chicken crosses road
fool
so long...
print_occurence_of_unique_strings
What do you know?
chicken crosses road
fool
so long...
print_occurence_of_unique_strings_compared_by_address
What do you know?
chicken crosses road
fool
so long...
strings : 2.135
unique strings : 1.103
compared by address : 0.21


For reference, Python's best time was 0.39 seconds on the same computer
(in the 'fast' version, using only 4 unique string instances).

Hmmm... Can we conclude now that carefully crafted C++ code is about
twice as fast as casually and intuitively written Python code? ;) (Just
kidding here of course)

NB: Your code now tests for address-equality. Does it also still test
for string-equality? It looks to me that it does, but it's not quite
clear to me.

Cheers,

--Tim
 
M

Mc Osten

Tim N. van der Leeuw said:
NB: Your code now tests for address-equality. Does it also still test
for string-equality? It looks to me that it does, but it's not quite
clear to me.

It does it.

set<string*> b(a.begin(), a.end());
set<string> c; // well ordered set (b is ordered by address)
for (set<string*>::iterator it=b.begin(); it!=b.end(); it++)
c.insert(**it);
copy(c.begin(), c.end(), ostream_iterator<string>(cout, "\n"));

When we populate the first set, we get rid of all strings with same
object id/address (it test equality of pointers). Then we populate
another set (and the default equality test is on strings).

However, I would have written the code using a proper compare function
rather than using two sets. In this particular case the number of
elements of the first set is negligible in respect of the initial vector
size, thus copying it again does not take a lot of time.
But such code is optimized for the problem itself: in the real world I
suppose we would have passed set a proper comparison function that
checks address and then string equality.
 
M

Mc Osten

Tim N. van der Leeuw said:
My conclusion from that is, that the vector<> or set<> implementations
of GCC are far superior to those of VC++ 6, but that memory allocation
for GCC 3.4.5 (MinGW version) is far worse than that of MSCRT / VC++ 6.
(And Python still smokes them both).

It would be interesting to test it with VC 8 (2005). I have it in my
Parallels vm, but it looks like something is wrong. The very same code
takes almost a minute, I suppose there is something wrong with it
(Python is almost as fast as the python 2.4 on MacOS).
 
M

Mc Osten

Tim N. van der Leeuw said:
And the results of IronPython (1.0rc2) are just in as well:

I can't test this one.
And for Python 2.5:
LeeuwT@nlshl-leeuwt ~/My Documents/Python
$ /cygdrive/c/Python25/python.exe SpeedTest.py
Begin Test
Number of unique string objects: 4
so long...
What do you know
fool
chicken crosses road
Number of unique string objects: 400000
so long...
What do you know
fool
chicken crosses road
Fast - Elapsed: 0.440619 seconds
Slow - Elapsed: 1.095341 seconds


What the heck... you have a Cray, haven't you?
$ /opt/misc/bin/python2.5 -O set_impl.py
so long...
What do you know
fool
chicken crosses road
so long...
What do you know
fool
chicken crosses road
Elapsed: 1.300000 seconds
Elapsed: 1.290000 seconds

Yes... good optimizer work. The 'slow' code here is faster than the fast
one.


$ python -O set_impl.py
so long...
What do you know
fool
chicken crosses road
so long...
What do you know
fool
chicken crosses road
Elapsed: 1.360000 seconds
Elapsed: 3.800000 seconds
(Next step would be to create a VB version and a Java version of the
same program, oh and perhaps to try a version that would work with
Jython... perhaps somehow w/o the 'set')

Ok. I can do the Java version. If I find a RealBasic Set class I can do
it. However, I don't remember anything about VB6, and have done nothing
with .Net.
But I don't think it is that interesting. Java strings are immutable
too: I expect it to outperform Python (unless Java Set class sucks). And
I don't see the point of taking in VB.
A good BASIC implentation is comparable with Pascal or C++ speedwise.
(At least this results from Great Language Shootout and Free Basic).
 
R

Ray

Tim said:
Incidentally, I also have a version compiled with VC++ 6 now... (not
yet w/VC++ 7) .. Compiled with release-flags and maximum optimization
for speed, here's the result of VC++ 6:
<snip>

OK, now I'm getting obsessed with this too ;-)

I'm using VC++ Express, I didn't care to tweak the optimizations, I
merely chose the "Release" configuration for the executable. It's
blazing fast, taking only 30+ ms each run.

Here's the code:

int main(){
DWORD begin = ::GetTickCount();
vector<string> a;
string c = "What do you know?";
string d = "so long...";
string e = "chicken crosses road";
string f = "fool";
for (long int i=0; i<10000 ; ++i){
a.push_back(c);
a.push_back(d);
a.push_back(e);
a.push_back(f);
}
set<string> b(a.begin(), a.end());
unique_copy(b.begin(), b.end(), ostream_iterator<string>(cout,
"\n"));
DWORD end = ::GetTickCount();
cout << "Ends in " << (end - begin) << " ms.";
}

And here's the result:

\TestSTL\release>TestSTL.exe
What do you know?
chicken crosses road
fool
so long...
Ends in 31 ms.

I tried the original version:

int main(){
DWORD begin = ::GetTickCount();
vector<string> a;
for (long int i=0; i<10000 ; ++i){
a.push_back("What do you know?");
a.push_back("so long...");
a.push_back("chicken crosses road");
a.push_back("fool");
}
set<string> b(a.begin(), a.end());
unique_copy(b.begin(), b.end(), ostream_iterator<string>(cout,
"\n"));
DWORD end = ::GetTickCount();
cout << "Ends in " << (end - begin) << " ms.";
}

And the result is only 50% slower:

\TestSTL\release>TestSTL.exe
What do you know?
chicken crosses road
fool
so long...
Ends in 47 ms.
 
C

could.net

That's to say,
python is still much faster?

I am a c++ newbie but I think c++ should be faster here.
Maybe someone can post this to the c++ maillist and they will tell how
to accelerate it.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,813
Latest member
lawrwtwinkle111

Latest Threads

Top