D
Dave Brueck
Below is some information I collected from a *small* project in which I wrote
a Python version of a Java application. I share this info only as a data
point (rather than trying to say this data "proves" something) to consider
the next time the "Python makes developers more productive" thread starts up
again.
Background
==========
An employee who left our company had written a log processor we use to read
records from text files (1 record per line), do a bunch of computations on
them, and load them into various tables in our database. Typically each day
there are 1000 to 2000 files with a total number of records being around 100
million. The nature of the work the processor's work (some of it is
summarizing data) results in about 1 database insert or update for about
every 20 "raw" log records.
After this employee left I inherited this chunk of code and the first thing I
did was create a fairly comprehensive test suite for it (in Python) because
there was no test suite and in the past we had had many problems with bugs
and flakiness in the code. As I did this I got pretty familiar with how the
code worked and noticed lots of places where the language seemed to "get in
the way" of what needed to be done. During this time I also began hearing
rumblings about how this processor would need to be extended to add more
functionality, so I thought I'd start rewriting it in Python on the side just
to see how it would turn out (the idea being that if I got it done and it was
good enough then I could simply implement the new functionality in the Python
version ).
Data
====
The Java version is for the Java 1.4 platform (uses the Sun JVM), and it
connects to an Oracle 9 database on our LAN via the thin JDBC driver provided
by Oracle. As far as complexity goes, I'd rate the problem itself as
non-trivial but not rocket science. Implementing it in Java turned out to
be fairly involved; a lot of the complexity there revolved around efficiently
maintaining data structures in memory that didn't get committed to the
database until the end of each separate file. The developer for the Java
version has about as much experience as a developer as I do, and he has about
as much experience with Java as I do with Python (IOW, I don't consider that
to be much of a differentiating factor - although the Java developer has
*way* more database experience than I do).
From the start of the design to the end of the implementation and initial
round of bug fixing took 3 weeks. An additional 2-3 weeks were spent
optimizing the code and fixing more bugs. The source code had 4700 lines in 9
files. When running, it would process an average of 1050 records per second
(where process = time to parse the file, do calculations, and insert them
into the database).
The Python version is for Python 2.1.3 and it connects to the same Oracle
database using DCOracle2. The implementation is very straightforward, nothing
fancy. It's also still pretty "pristine" as I haven't had any time to try and
optimize it yet. One of the biggest simplifying factors in the code was
how easy it was to have dictionaries map dynamically-defined and
arbitrarily-complex keys (made up of object tuples) to other objects. For
whatever reason this seemed to be a huge factor in making data management go
from a difficult problem to almost a no-brainer. If the Python version had
come first and the Java one second, then some of that approach could have
made it into the Java version, but in Java I don't find myself thinking about
problems quite the same way - IOW the Python version's approach wouldn't be
as obvious in Java or would seem to be too much up-front work (whereas the
approach that was used was easier up front but in the end became quite
complex and a limiting factor to adding new functionality IMO).
Because I wasn't working on this full-time, the development was spread out
over the course of two weeks (10 working days) at an average of just over 2
hours per day (for a total of not quite 3 full days of work). The source code
was less than 700 lines in 4 files. Most surprising to me was that it
processes an average of 1200 records per second! I had assumed that after I
got it working I'd need to spend time optimizing things to make up for Java
having a JIT compiler to speed things up, but for now I won't bother.
Both versions could be improved by splitting the log parsing/summarizing and
database work into two separate threads (watching the processes I see periods
of high CPU usage followed by near-idle time while the database churns away).
Currently the Java version averages 47% CPU utilization and the Python
version averages 51%.
Caveats
=======
There are a million reasons why this isn't an apple-to-apple comparison or why
somebody might read this and cry "foul!"; here's a few off the top of my
head:
- The second time you write something you can do it better - since development
is often part exploratory, once you're done you usually have a good idea of
how to do it better were you to do it again. I didn't write the first
version, but writing the test suite (and fixing the bugs it uncovered) made
me familiar with the weaknesses in the initial version.
- It's proprietary code; nobody else can see the two versions of source - Too
bad. ;-)
- I haven't bothered to do a "true" LOC count for both versions - I just did
"wc -l *.py" and "wc -l *.java" to get line counts - so comments and other
junk is included in the line totals.
- My development time didn't include much design time because my design was
mostly a reaction to the Java implementation.
- I used the thin Java driver (written in pure Java) and DCOracle 2 (uses
native driver on Linux) - this may affect performance some.
Conclusion
==========
Like I said before, I'd hesitate to read *too* much into this experience,
although I will say it's a more concrete and confirming example of what I
(and many others) have experienced before - that there are some big
productivity gains by using Python over some other languages. It's generally
hard to measure this sort of thing because it's rare to do a rewrite without
adding functionality, and the larger the project the more rare this becomes,
so even though this is a very small example it's still useful. I certainly
wouldn't have gotten "approval" to do the simple rewrite, but doing it in my
spare time and having a comprehensive test suite made it possible.
Anyway, for a very small development cost I ended up with a codebase 15% of
the size of the original, and an implementation that will be far easier to
extend (and easier to pass off to somebody else!). In order to get smaller
and cleaner code I had been willing to take a modest performance hit, but in
the end the new version was slightly faster too - icing on the cake!
-Dave
a Python version of a Java application. I share this info only as a data
point (rather than trying to say this data "proves" something) to consider
the next time the "Python makes developers more productive" thread starts up
again.
Background
==========
An employee who left our company had written a log processor we use to read
records from text files (1 record per line), do a bunch of computations on
them, and load them into various tables in our database. Typically each day
there are 1000 to 2000 files with a total number of records being around 100
million. The nature of the work the processor's work (some of it is
summarizing data) results in about 1 database insert or update for about
every 20 "raw" log records.
After this employee left I inherited this chunk of code and the first thing I
did was create a fairly comprehensive test suite for it (in Python) because
there was no test suite and in the past we had had many problems with bugs
and flakiness in the code. As I did this I got pretty familiar with how the
code worked and noticed lots of places where the language seemed to "get in
the way" of what needed to be done. During this time I also began hearing
rumblings about how this processor would need to be extended to add more
functionality, so I thought I'd start rewriting it in Python on the side just
to see how it would turn out (the idea being that if I got it done and it was
good enough then I could simply implement the new functionality in the Python
version ).
Data
====
The Java version is for the Java 1.4 platform (uses the Sun JVM), and it
connects to an Oracle 9 database on our LAN via the thin JDBC driver provided
by Oracle. As far as complexity goes, I'd rate the problem itself as
non-trivial but not rocket science. Implementing it in Java turned out to
be fairly involved; a lot of the complexity there revolved around efficiently
maintaining data structures in memory that didn't get committed to the
database until the end of each separate file. The developer for the Java
version has about as much experience as a developer as I do, and he has about
as much experience with Java as I do with Python (IOW, I don't consider that
to be much of a differentiating factor - although the Java developer has
*way* more database experience than I do).
From the start of the design to the end of the implementation and initial
round of bug fixing took 3 weeks. An additional 2-3 weeks were spent
optimizing the code and fixing more bugs. The source code had 4700 lines in 9
files. When running, it would process an average of 1050 records per second
(where process = time to parse the file, do calculations, and insert them
into the database).
The Python version is for Python 2.1.3 and it connects to the same Oracle
database using DCOracle2. The implementation is very straightforward, nothing
fancy. It's also still pretty "pristine" as I haven't had any time to try and
optimize it yet. One of the biggest simplifying factors in the code was
how easy it was to have dictionaries map dynamically-defined and
arbitrarily-complex keys (made up of object tuples) to other objects. For
whatever reason this seemed to be a huge factor in making data management go
from a difficult problem to almost a no-brainer. If the Python version had
come first and the Java one second, then some of that approach could have
made it into the Java version, but in Java I don't find myself thinking about
problems quite the same way - IOW the Python version's approach wouldn't be
as obvious in Java or would seem to be too much up-front work (whereas the
approach that was used was easier up front but in the end became quite
complex and a limiting factor to adding new functionality IMO).
Because I wasn't working on this full-time, the development was spread out
over the course of two weeks (10 working days) at an average of just over 2
hours per day (for a total of not quite 3 full days of work). The source code
was less than 700 lines in 4 files. Most surprising to me was that it
processes an average of 1200 records per second! I had assumed that after I
got it working I'd need to spend time optimizing things to make up for Java
having a JIT compiler to speed things up, but for now I won't bother.
Both versions could be improved by splitting the log parsing/summarizing and
database work into two separate threads (watching the processes I see periods
of high CPU usage followed by near-idle time while the database churns away).
Currently the Java version averages 47% CPU utilization and the Python
version averages 51%.
Caveats
=======
There are a million reasons why this isn't an apple-to-apple comparison or why
somebody might read this and cry "foul!"; here's a few off the top of my
head:
- The second time you write something you can do it better - since development
is often part exploratory, once you're done you usually have a good idea of
how to do it better were you to do it again. I didn't write the first
version, but writing the test suite (and fixing the bugs it uncovered) made
me familiar with the weaknesses in the initial version.
- It's proprietary code; nobody else can see the two versions of source - Too
bad. ;-)
- I haven't bothered to do a "true" LOC count for both versions - I just did
"wc -l *.py" and "wc -l *.java" to get line counts - so comments and other
junk is included in the line totals.
- My development time didn't include much design time because my design was
mostly a reaction to the Java implementation.
- I used the thin Java driver (written in pure Java) and DCOracle 2 (uses
native driver on Linux) - this may affect performance some.
Conclusion
==========
Like I said before, I'd hesitate to read *too* much into this experience,
although I will say it's a more concrete and confirming example of what I
(and many others) have experienced before - that there are some big
productivity gains by using Python over some other languages. It's generally
hard to measure this sort of thing because it's rare to do a rewrite without
adding functionality, and the larger the project the more rare this becomes,
so even though this is a very small example it's still useful. I certainly
wouldn't have gotten "approval" to do the simple rewrite, but doing it in my
spare time and having a comprehensive test suite made it possible.
Anyway, for a very small development cost I ended up with a codebase 15% of
the size of the original, and an implementation that will be far easier to
extend (and easier to pass off to somebody else!). In order to get smaller
and cleaner code I had been willing to take a modest performance hit, but in
the end the new version was slightly faster too - icing on the cake!
-Dave