Python is slow?

S

sturlamolden

I have recently been playing with a kd-tree for solving the "post
office problem" in a 12-dimensional space. This is pure cpu bound
number crunching, a task for which I suspected Python to be
inefficient.

My prototype in Python 2.5 using NumPy required 0.41 seconds to
construct the tree from 50,000 samples. Unfortunately, searching it
felt a bit slow, finding the 11 nearest-neighbours of 1,000 points
took 29.6 seconds (and there were still 49,000 to go). Naturally, I
blamed this on Python. It would be 100 times faster if I used C++,
right?

After having a working Python prototype, I resorted to rewrite the
program in C++. The Python prototype took an hour to make, debug and
verify. The same thing in C++ took me almost a day to complete, even
with a working prototype as model. To my surprise, the resulting beast
of C++ required 64.3 seconds to construct the same kd-tree. Searching
the tree was not faster either, 1,000 points required 38.8 seconds. I
wasted a day, only to find my Python prototype being the faster.

We may conclude that I'm bad at programming C++, but I suspect that is
not the case here. Albeit micro-benchmarks may indicate that Python is
100-200 times slower than C++, they may not be applicable to the real
world. Python can be very efficient. And when combined with libraries
like NumPy, beating it's performance with hand-crafted C++ is
difficult. At least, my 10 years experience programming scientific
software in various languages was not sufficient to beat my own Python
prototype with C++.

That is not to say I have never seen C++ run a lot faster than Python.
But it tends to be very short pieces of CPU bound code, no more than a
function or two. But as the problem grows in complexity, C++
accumulates too much of its own bloat.
 
R

Robert Singer

I have recently been playing with a kd-tree for solving the "post
office problem" in a 12-dimensional space. This is pure cpu bound
number crunching, a task for which I suspected Python to be
inefficient.

Well, python is not a number crunching language. However much we would
like it to be (we would ? :). No scripting language is.
Developing time is shorter, I agree, but when you have, for example a
problem which takes 8,31 minutes to go through in optimized fortran
code (like the one I had the other day), then that hardly matters.
My prototype in Python 2.5 using NumPy required 0.41 seconds to
construct the tree from 50,000 samples. Unfortunately, searching it
felt a bit slow, finding the 11 nearest-neighbours of 1,000 points
took 29.6 seconds (and there were still 49,000 to go). Naturally, I
blamed this on Python. It would be 100 times faster if I used C++,
right?


Not necessarily.
Before resorting to rewriting the problem try psyco. It speeds up
things sometimes.
Also, (I'm not that familiar with python yet, so I don't know how to
do it in python), try finding the bottlenecks of your calculation. Are
the loops where most of the processing time is wasted, or disk
accessing, or ... ?
After having a working Python prototype, I resorted to rewrite the
program in C++. The Python prototype took an hour to make, debug and
verify. The same thing in C++ took me almost a day to complete, even
with a working prototype as model. To my surprise, the resulting beast
of C++ required 64.3 seconds to construct the same kd-tree. Searching
the tree was not faster either, 1,000 points required 38.8 seconds. I
wasted a day, only to find my Python prototype being the faster.



We may conclude that I'm bad at programming C++, but I suspect that is
not the case here. Albeit micro-benchmarks may indicate that Python is
100-200 times slower than C++, they may not be applicable to the real
world. Python can be very efficient. And when combined with libraries
like NumPy, beating it's performance with hand-crafted C++ is
difficult. At least, my 10 years experience programming scientific
software in various languages was not sufficient to beat my own Python
prototype with C++.

That is not to say I have never seen C++ run a lot faster than Python.
But it tends to be very short pieces of CPU bound code, no more than a
function or two. But as the problem grows in complexity, C++
accumulates too much of its own bloat.

Well, personally, I try to combine fortran (being a fortran programmer
by trade) with python (in the last few years), as I find fortran to
be, by two grades, more comfortable for solving scientific problems
then c (or python for that matter, although it has its merits).
Starting from ith his capabilities for "normal" array handling, to
optimisation and easy readability, to whatnot.


Best regards
Bob
 
G

George Sakkis

[...]
After having a working Python prototype, I resorted to rewrite the
program in C++. The Python prototype took an hour to make, debug and
verify. The same thing in C++ took me almost a day to complete, even
with a working prototype as model. To my surprise, the resulting beast
of C++ required 64.3 seconds to construct the same kd-tree. Searching
the tree was not faster either, 1,000 points required 38.8 seconds. I
wasted a day, only to find my Python prototype being the faster.
We may conclude that I'm bad at programming C++,

AFAICT, _everybody_ is bad at programming C++.

+1 QOTW
 
S

skip

Grant> AFAICT, _everybody_ is bad at programming C++.

Grant> One begins to suspect it's not the fault of the programmers.

+1 QOTW...

Skip
 
B

bearophileHUGS

sturlamolden:

CPython is generally slow (you can see this from the huge amount of
solutions invented to solve the speed problem, like Cython, Numpy,
Psyco, ShedSkin, Weave, Inline, SIP, Boost Python, SWIG, etc etc), but
for most of the usages Python is used for, it's not a significant
problem. I know that sounds like a tautology :)

Well written C++ code is generally faster or much faster than similar
Python code, but programming in Python is often simpler, and it
generally requires less time. So it may happen that to solve a problem
a Python program that runs in 1 hour that requires 1 hour to be
written allows you to find the solution in less time than a C++
program that runs in 5-10 minutes that requires you 3-4 hours to be
written :)

Note that C++ is just one option, but there are other languages
around, like CLisp or D (or even a modern JavaVM), that are often an
acceptable compromise between Python and C/C++.

So you can show us a reduced/minimal working version of your Python
code, so I/someone may find ways to speed it up, translate it to C or C
++ or CLisp or D, etc.

Note that I have written a kd-tree in both Python (with Psyco
compulsively) and D, this is the Psyco version:
http://code.activestate.com/recipes/572156/

Bye,
bearophile
 
S

sturlamolden

Well, python is not a number crunching language. However much we would
like it to be (we would ? :).
No scripting language is.

Not even Matlab, R, IDL, Octave, SciLab, S-PLUS or Mathematica?

Before resorting to rewriting the problem try psyco. It speeds up
things sometimes.

I did, Psyco did not help.

Also, (I'm not that familiar with python yet, so I don't know how to
do it in python), try finding the bottlenecks of your calculation.

I did use a profiler, there is no particular single bottle-neck.

Well, personally, I try to combine fortran (being a fortran programmer
by trade) with python

Good compilers are too expensive, and gfortran is not good enough yet.
 
S

sturlamolden

Well written C++ code is generally faster or much faster than similar
Python code, but programming in Python is often simpler, and it
generally requires less time. So it may happen that to solve a problem
a Python program that runs in 1 hour that requires 1 hour to be
written allows you to find the solution in less time than a C++
program that runs in 5-10 minutes that requires you 3-4 hours to be
written :)

I could easily sit down and wait half an hour. I could even wait a
week for the calculation to complete. But this is a program that will
be used many times, and not just by me.

Note that C++ is just one option, but there are other languages
around, like CLisp or D (or even a modern JavaVM), that are often an
acceptable compromise between Python and C/C++.

Yes, if I knew Lisp, I would have used SBCL. Java I can program. But
it is a pain in the ass for any scientific programming. F# and OCaml
look promising though.


Note that I have written a kd-tree in both Python (with Psyco
compulsively) and D, this is the Psyco version:http://code.activestate.com/recipes/572156/

Sure I could show you the code, Python and C++, if I had a place to
post it. It looks very different form yours though.
 
B

bearophileHUGS

sturlamolden:
F# and OCaml look promising though.<

I bet on the future of D and Haskell (and maybe Fortress) instead :)
We'll see.

Sure I could show you the code, Python and C++, if I had a place to post it.<

I think the Python version suffices. If it's not too much private you
may post the single minimal/reduced runnable Python module here, it
will be deleted in some time (if you want you can also use a private
paste):
http://codepad.org/

It looks very different form yours though.<

Is this a good or bad thing? ;-)

Bye,
bearophile
 
S

sturlamolden

I think the Python version suffices. If it's not too much private you
may post the single minimal/reduced runnable Python module here, it
will be deleted in some time (if you want you can also use a private
paste):http://codepad.org/

http://codepad.org/rh8GzzJT

Available 24 hours from now.

Is this a good or bad thing? ;-)

It's just facinating how different people working on similar problems
come up with different looking code.
 
R

Robert Kern

sturlamolden:

I think the Python version suffices. If it's not too much private you
may post the single minimal/reduced runnable Python module here, it
will be deleted in some time (if you want you can also use a private
paste):
http://codepad.org/

You could also drop it on the scipy.org wiki in the Cookbook category.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
J

J Peyret

On Sep 23, 8:31 am, (e-mail address removed) wrote:

Guys, this looks like a great data structure/algo for something I am
working on.

But... where do I find some definitions of the original BK-tree idea?
I looked through Amazon
and only a few books mention something like BK-Tree and these are
mostly conference minutes books, at ungodly prices.

I also did a quick Google on it and there isn't that much about the
subject.

http://blog.notdot.net/archives/30-Damn-Cool-Algorithms,-Part-1-BK-Trees.html

is the one I mostly saw referred.

So... 2 questions:

1. More bk-tree references? I can follow the code, but some
understanding of the background would be nice.

2. What, if any, is a good book to understand the basic of fuzzy/
string matching? Proximity/affinity problems? Or, more generally, a
good book on advanced algorithms?

No, I don't wanna read Knuth's just yet, something more modern/easy to
follow maybe? Something like 'Programming Collective Intelligence',
ISBN 0596529325, would be very nice, though it is perhaps a bit too
specific in its applications. Books using Java or C are fine. Lisp,
hmmm, well... I have trouble reading its notation, sorry.

Cheers

JLuc
 
R

Robert Kern

J said:
On Sep 23, 8:31 am, (e-mail address removed) wrote:

Guys, this looks like a great data structure/algo for something I am
working on.

But... where do I find some definitions of the original BK-tree idea?

Uh, actually we're talking about kd-trees, not BK-trees. kd-trees are for
searching through point sets in a k-dimensional space.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
R

Robert Singer

Not even Matlab, R, IDL, Octave, SciLab, S-PLUS or Mathematica?

No. And just to avoid eventual useless discussions which might arise,
I ment to say that *in general* compiled languages are faster. We can
always have discussions whether or not some newer scripting languages
like some from the above list, come close, but that usually is just
wasted time.

Specifically, I cannot say about R, IDL or S-PLUS, since I never used
those (not even heard of IDL till now). Octave and Mathematica have
been with me for such a short time (we had a few licences for
Wolfram's child for one year, but not my part of the company, so ...)
that I would rather not give my opinion about those.
I've used Matlab and Scilab for a longer time (still do actually -
Matlab for measurement data acquisition, and Scilab ... well, it just
sits on the disk somewhere actually), and although Matlab is quite
fast when disk I/O is involved, it still comes far.

I did use a profiler, there is no particular single bottle-neck.

You're talking about your c or your python version of the program?

There is always a bottleneck - that's just the part which works most
slowly. Try to find the part which takes the longest to execute, try
to put it differently. If it cannot be done, go to the next slowest
part.

Good compilers are too expensive, and gfortran is not good enough yet.
?
Gfortran is one of the better compilers on the market. There was, just
the other day, a nice discussion on comp.lang.fortran how it is
marvellous what a group of enthousiasts managed do in their time, what
commercial giants still didn't.
May I ask what are your main objections to it ?

Best regards
Bob
 
K

Kay Schluehr

On Sep 23, 8:31 am, (e-mail address removed) wrote:

Guys, this looks like a great data structure/algo for something I am
working on.

But... where do I find some definitions of the original BK-tree idea?

*geometric data structures*. Just google for it.
 
R

Robert Kern

sturlamolden said:
Yes, if I could figure out how to use it...

What's confusing? You do have to create a profile:

http://www.scipy.org/UserPreferences

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
S

sturlamolden

May I ask what are your main objections to it ?

1. gfortran is not Absoft.

2. If I program the same in C99 and Fortran 95, and compile with gcc
and gfortran, the C99 code runs a lot faster (I've only tested with
wavelet transforms).

3. gfortran is not Absoft.
 
S

sturlamolden

What's confusing? You do have to create a profile:

How do I add a new page to the wiki? I'm only able to edit the front
page of the cookbook. But it doesn't help to add link there if I have
no page to link. (I may be incredibly stupid though.)
 
R

Robert Kern

sturlamolden said:
How do I add a new page to the wiki? I'm only able to edit the front
page of the cookbook. But it doesn't help to add link there if I have
no page to link. (I may be incredibly stupid though.)

You just navigate to the URL you want:

http://www.scipy.org/Cookbook/KDTree

This will show you a page saying:

"""
This page does not exist yet. You can create a new empty page, or use one of the
page templates. Before creating the page, please check if a similar page already
exists.

Create new empty page
"""

The latter is a link that you can click.

Looking at the other Cookbook page sources, I notice that you should add the
following to the bottom of your page to categorize it appropriately.

"""
----
. CategoryCookbook
"""

Then put the link on the main Cookbook page (or do that first, then navigate
through the link to create the page).

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,002
Messages
2,570,261
Members
46,858
Latest member
FlorrieTuf

Latest Threads

Top