For performance, write it in C - Part 3, Source code now available

T

Thomas E Enebo

Only if by "vanilla" you mean optimised Java.

Did you look at Isaac's implementation? It is almost exactly the same it
was before, but he is not printing out the latin squares as unicode (which
none of the other impls are doing to my knowledge) and it is writing to
a BufferedReader (which any Java programmer with a couple months of
experience would/should do).

I think of things like manual loop unrolling and other similiar techniques
as optimization techniques. The stuff he did was common sense and I think
closer to the C impl in action (these languages are not the same -- some
consideration should be taken in this).

My only question on this thread about Java is what Java is being used
for the test (used on the webpage -- a good addition to the page)? I ran
Java 1.4, Java 5 and Java 6 (beta but what the hell). The speedup was
surprising and exciting (obviously I do Java programming). I went from
1.1s to 0.84s between the versions.

-Tom
 
I

Isaac Gouy

Jon said:
Only if by "vanilla" you mean optimised Java.

What things in that program do you consider optimizations?

I also provided a Java program with the same inner loop code as the
OCaml program but it hasn't appeared on Peter Hickman's website yet.
 
I

Isaac Gouy

Peter said:
This is a good point but in all the posts where I have mentioned the
time taken I have used the real time (which is the best case from 10
runs - except for the first Perl version) all the way back to the first
and fourth versions in Perl (473 and 12 minutes). However it does not
affect the ordering except to push my improved C version ahead of
William James' revised Ocaml version by just 0.029 of a second.

This late in the proceedings it would only confuse the issue to start
saying that the first Perl version ran in 251 minutes when 473 minutes
has been mentioned several times.

Having repeatedly made a mistake in the past is no reason to continue
making the same mistake in the future!

(You're timing for Perl includes 3 or 4 hours of you surfing the web or
installing OCaml or whatever!)
 
I

Isaac Gouy

showing how it performs on a 2, 4, and 8 grid would show how it scaled...

-a

It scales badly

0.496s gcc 5x5 (without print statements)
30055.098s gcc 6x6 (without print statements)

imo If Peter has figured out what he's trying to do then it would be
smart to find a better way of doing it rather than scrabbling for small
percentage improvements.
 
I

Isaac Gouy

showing how it performs on a 2, 4, and 8 grid would show how it scaled...

-a

It scales badly

0.496s gcc 5x5 (without print statements)
30055.098s gcc 6x6 (without print statements)

imo If Peter has figured out what he's trying to do then it would be
smart to find a better way of doing it rather than scrabbling for small
percentage improvements.
 
A

ara.t.howard

It scales badly

0.496s gcc 5x5 (without print statements)
30055.098s gcc 6x6 (without print statements)

imo If Peter has figured out what he's trying to do then it would be
smart to find a better way of doing it rather than scrabbling for small
percentage improvements.

indeed - that's my point exactly.

with that kind scaling a better ruby version could kill the c one!

-a
 
J

Jon Harrop

Thomas said:
Did you look at Isaac's implementation? It is almost exactly the same
it
was before, but he is not printing out the latin squares as unicode (which
none of the other impls are doing to my knowledge) and it is writing to
a BufferedReader (which any Java programmer with a couple months of
experience would/should do).
Yes.

I think of things like manual loop unrolling and other similiar
techniques
as optimization techniques.

They are all optimisations.
The stuff he did was common sense and I think
closer to the C impl in action (these languages are not the same -- some
consideration should be taken in this).

These languages are not the same. However, using "ordinary" IO via
System.out.println or cout or print_endline is the "vanilla" way of
outputting in all of these languages. Of these languages, only Java gives
appalling performance unless you switch to buffers.

Even when you do use buffers in Java, it is still slower than unbuffered
C++, OCaml etc. on my machine.
My only question on this thread about Java is what Java is being used
for the test (used on the webpage -- a good addition to the page)?

I was using Sun's J2SE 1.5. I also tried GNU's GIJ, which is even slower
than OCaml's bytecode.
 
I

Isaac Gouy

Jon Harrop wrote:
-snip-
These languages are not the same. However, using "ordinary" IO via
System.out.println or cout or print_endline is the "vanilla" way of
outputting in all of these languages. Of these languages, only Java gives
appalling performance unless you switch to buffers.

I guess you'd fail a Java exam :)

(I'm not interested enough to find out, but I suspect stdout is at
minimum line-buffered.)

Even when you do use buffers in Java, it is still slower than unbuffered
C++, OCaml etc. on my machine.

No where near an order of magnitude slower - iirc you reported 1.85x C

I was using Sun's J2SE 1.5. I also tried GNU's GIJ, which is even slower
than OCaml's bytecode.

You failed the Java exam again :)
For Java without JIT use the -Xint command line option with the Sun JVM
like this

time java -Xint Latin > /dev/null


And if the program ran for any amount of time we of course be using

time java -server Latin > /dev/null
 
J

Jon Harrop

Isaac said:
(I'm not interested enough to find out, but I suspect stdout is at
minimum line-buffered.)

Sure, the naive implementations in other languages are doing silly things as
well (like the OCaml flushing after every line) but they are still faster
than the most optimised Java to date.
No where near an order of magnitude slower - iirc you reported 1.85x C

I get 0.604s unoptimised OCaml vs 1.740s unoptimised Java on x86_64. That's
2.9x slower.
You failed the Java exam again :)

For Java without JIT use the -Xint command line option with the Sun JVM
like this

time java -Xint Latin > /dev/null

That's as slow as ocamlc.
And if the program ran for any amount of time we of course be using

time java -server Latin > /dev/null

I already tried and it improved performance slightly.
 
C

Chad Perrin

Did you look at Isaac's implementation? It is almost exactly the same it
was before, but he is not printing out the latin squares as unicode (which
none of the other impls are doing to my knowledge) and it is writing to
a BufferedReader (which any Java programmer with a couple months of
experience would/should do).

Not to get embroiled in a nascent flamewar, or anything, but . . .

I don't think the fact that everyone in one language's community does
something that people in other languages' communities don't so they can
get better performance makes what they're doing any less an
optimization. It just makes it an exceedingly common optimization (and
might indicate that language implementation defaults should be changed).

Then again, maybe not.
 
I

Isaac Gouy

Jon said:
Sure, the naive implementations in other languages are doing silly things as
well (like the OCaml flushing after every line) but they are still faster
than the most optimised Java to date.

afaict none of the OCaml implementations work with the data created by
that Perl script - like the C and Java implementations - why is that?
I get 0.604s unoptimised OCaml vs 1.740s unoptimised Java on x86_64. That's
2.9x slower.


That's as slow as ocamlc.

as slow as? does that mean faster :)
 
K

Kristof Bastiaensen

indeed - that's my point exactly.

with that kind scaling a better ruby version could kill the c one!

Did you actually run the 6x6 version? I think my Ruby version might
actually be faster than the C one for large squares, since it only
computes normalised squares. A reasonable estimate indicates that
for 6x6 squares it will take about 23500 seconds on my computer. But I
didn't test it, because that's still longer than I am prepared to wait for
it :)

Kristof
 
J

Jon Harrop

Isaac said:
afaict none of the OCaml implementations work with the data created by
that Perl script - like the C and Java implementations - why is that?

The authors couldn't be bothered because it is easier to generate that data
in OCaml.
as slow as? does that mean faster :)

No, that means even slower.
 
P

Peter Hickman

Thomas said:
My only question on this thread about Java is what Java is being used
for the test (used on the webpage -- a good addition to the page)? I ran
Java 1.4, Java 5 and Java 6 (beta but what the hell). The speedup was
surprising and exciting (obviously I do Java programming). I went from
1.1s to 0.84s between the versions.

-Tom
[peterhickman]$ javac -version
javac 1.5.0_06

I think that's the current best version for the Mac. It does tend to lag
behind the rest of the world when it comes to releases.

I will stick a language versions page up tonight.
 
P

Peter Hickman

Actually this is a miss reading. The whole point of the first post was
to point out that the speed up from going from Perl (or any scripting
language) to C. I used the same Perl code to pre compute values to
create the C header file because I already had code at this point that
did the work and the impact of doing the pre compute was pretty much
insignificant to the overall run time. I then used to same technique to
get the Java implementation. I have never claimed that the C version was
the best possible implementations only that it delivered a massive
performance boost. I think that I have proven that.

There are optimisations that could be applied to the C version but they
are only going to get percentage improvements and are getting much
harder to reliably measure.

As to Ocaml versions, well they were submitted and proved to be faster
than my C version. C is quite a fragile language to develop in and so
any high level language that can deliver that performance is very
interesting. To me at least.

And all this Java stuff, well that comes from the post by Charles O Nutter:
- Write it in C is as valid as write it in Java (as someone else mentioned).
Java is at least as fast as C for most algorithms.

It was the "at least as fast as" bit that I disputed and the whole Java
sub thread comes from that.

The only requirement is that I can get an implementation of the language
in question on my Macintosh.
 
I

Isaac Gouy

Peter said:
Actually this is a miss reading.

What is a misreading?
The whole point of the first post was
to point out that the speed up from going from Perl (or any scripting
language) to C. I used the same Perl code to pre compute values to
create the C header file because I already had code at this point that
did the work and the impact of doing the pre compute was pretty much
insignificant to the overall run time. I then used to same technique to
get the Java implementation. I have never claimed that the C version was
the best possible implementations only that it delivered a massive
performance boost. I think that I have proven that.

Now I'm puzzled - isn't that just conventional wisdom? Isn't it
conventional wisdom that a program transformed by an optimizing
compiler will be faster than a program interpreted at runtime?

We can see that by looking at the same C program using both a compiler
and an interpreter like CINT.
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=cint&lang2=gcc

CINT scripts can call compiled code, which takes us that next step -
it's faster when the script glues together compiled functions, so it's
good to have a large library of highly-optimised compiled functions.

We can figure that out without a programming language flame war :)

There are optimisations that could be applied to the C version but they
are only going to get percentage improvements and are getting much
harder to reliably measure.

As to Ocaml versions, well they were submitted and proved to be faster
than my C version. C is quite a fragile language to develop in and so
any high level language that can deliver that performance is very
interesting. To me at least.

And all this Java stuff, well that comes from the post by Charles O Nutter:


It was the "at least as fast as" bit that I disputed and the whole Java
sub thread comes from that.

Here's what Charles O Nutter said
http://groups.google.com/group/ruby-talk-google/msg/1adf7bdc7ab8aae0

It's not obvious to me what he meant by "Write it in C is as valid as
write it in Java" but the points I'd take from X is at least as fast as
Y for most algorithms are

- it can be true that using the particular implementations of languages
X and Y on his computer X is at least as fast as Y, and using the same
or different implementations of languages X and Y on your computer X is
not as fast as Y

- we don't care about "most algorithms" just the particular ones we
deal with


Now I've read that old post, I understand why he called your Java
program bogus. With apology, it must have seemed that you were
deliberately writing bad Java to prove your point. ;-)

The only requirement is that I can get an implementation of the language
in question on my Macintosh.

D and Clean are other languages you might look at, but using a
different language won't make what you're trying to do any more
feasible.

In your original post, you listed grid sizes with the number of
permutations but you only listed N! - the permutations of a single row
1..N

afaict your approach is to enumerate every distinct latin square of
side N, from http://mathworld.wolfram.com/LatinSquare.html

size : squares
1 : 1! x 0! x 1
2 : 2! x 1! x 1
3 : 3! x 2! x 1
4 : 4! x 3! x 4
5 : 5! x 4! x 56
6 : 6! x 5! x 9408
7 : 7! x 6! x 16942080
8 : 8! x 7! x 535281401856
9 : 9! x 8! x 377597570964258816

I don't know much about algorithm analysis but I'd guess that
generating 5524751496156892842531225600 latin squares is a bad idea.

Maybe you can use some of the tricks from this paper "Enumerating
possible Sudoku grids" - the authors programs are available here
http://www.afjarvis.staff.shef.ac.uk/sudoku/bertram.html
 
T

Tim Bray

The conclusion of this lengthy and interesting thread is obvious: the
original claim, "For performance, write it in C", has been convincingly
debunked.

Along with all the other conventional wisdom about premature
optimization and profiling and 80/20 points, we have seen it
demonstrated that several other languages, of many different types, can
come so close to C's performance that it's difficult to measure the
difference.

The *real* lesson is: Achieving good performance is hard; it requires
analysis and experimentation, and any simple prescription (e.g. "Use
programming language X") is dangerously misleading.

And finally: In the real world, the only benchmark that matters is your
application.

-Tim
 
I

Isaac Gouy

Tim said:
The conclusion of this lengthy and interesting thread is obvious: the
original claim, "For performance, write it in C", has been convincingly
debunked.

Along with all the other conventional wisdom about premature
optimization and profiling and 80/20 points, we have seen it
demonstrated that several other languages, of many different types, can
come so close to C's performance that it's difficult to measure the
difference.

The *real* lesson is: Achieving good performance is hard; it requires
analysis and experimentation, and any simple prescription (e.g. "Use
programming language X") is dangerously misleading.

And finally: In the real world, the only benchmark that matters is your
application.

-Tim

I'll accept your *real* lesson, and add one more ;-)

Although "out of the box thinking" is a sickening cliche, our thinking
does often get stuck in box. In this case, we're stuck in a box
exhaustively generating latin squares.

Once we loosen some constraints and push back the problem boundaries,
maybe we'll see that there are so many latin squares for 9x9 that they
might as well be infinite.

Maybe we'll remember that we have techniques for dealing with infinite
sequences - lazy evaluation, streams, generators... let's assume that
we have a 9x9 lazy latin square generator that we can pause and resume.
Then the interesting question is how do we write generators that
produce different traversals - random traversal, lexicographic
traversal, ...

Now the performance question has changed - are the latin squares
streams being consumed faster than the squares are generated, or is the
bottleneck somewhere else?

Man, that "out of the box thinking" is scary stuff ! :)
 
P

Peter Hickman

Isaac said:
What is a misreading?

That all the implementations had to use the same Perl pre compute phase.
That was just the way I did mine and I think that people have taken it
as read that this is a requirement. Despite the Ruby Ocaml and C++
versions not doing this.
Now I'm puzzled - isn't that just conventional wisdom? Isn't it
conventional wisdom that a program transformed by an optimizing
compiler will be faster than a program interpreted at runtime?

It's the degree of speed up that I felt that people are overlooking,
lets take the following:

Ruby 10.060 (user + sys)
C 3.350
Ocaml 2.558

C doesn't just shave a few percentage points off, you get a massive
boost without having to do anything weird. C seems to be /too hard/ for
many people, especially people who have never used it.

However Ocaml is an eye opener.
D and Clean are other languages you might look at, but using a
different language won't make what you're trying to do any more
feasible.
True, but as I've said before C is a fragile language to work in if
you've come from a scripting background. A program will compile and then
crash with not the slightest hint at what went wrong, so any language
that will allow me to get that sort of performance with a better support
is a good thing. I will be learning Ocaml for this reason alone.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,211
Messages
2,571,092
Members
47,693
Latest member
david4523

Latest Threads

Top