Confirm my Performance Test Against Java?

Ben Christensen

I'm evaluating Ruby for use in a variety of systems that are planned by
default to be Java.

I've started down a path of doing various performance tests to see what
kind of impact will occur by using Ruby and in my first test the numbers
are very poor - so poor that I have to question if I'm doing something
wrong.

I've tried it on both Linux and Mac OSX and get similar performance
numbers on each - differences being hardware, but the ratio between the
results about the same.

Please take a look at my blog post on my test results and view the
source code and let me know if I'm doing something completely wrong with
the Ruby code or execution - or if these are accurate numbers.

http://benjchristensen.com/2009/08/18/initial-impressions-on-ruby-performance/

NOTE: This is not an attempt to start a flame war. This is a legitimate
effort to take a good look at Ruby and let the numbers speak for
themselves in making decisions for what types of applications I can
choose to use Ruby for without sacrificing the performance of a mature
platform such as Java.

Thank you.

Ben
 
pharrington

Well.... without having put a ton of thought into this... yes, Ruby
(*especially* 1.8 MRI) is slow. No one's going to argue that the Ruby
interpreter is one of the quicker kids around. If performance is the
#1 priority of whatever you'll be developing, Ruby doesn't fit your
needs, and no one will tell you it does. That's what Java (for the
most part) and C are still hanging around for.

What sort of software needs to be developed here?

Ask yourself: is it critical that my code always performs as fast as
possible? Or is the greater concern speed of development and project
maintainability?

Also, as to the benchmark... can you post your /tmp/file_test.txt?
Posting some benchmarky code isn't very useful if no one can replicate
your results. Reading the whole file into memory may be faster than
reading it line-by-line (but obviously the wrong thing to do if the
file's enormous, which.... 8 secs to read??? I'd better be moved to
tears by the size of it.) And I'm not entirely sure what it is you're
trying to benchmark here. Vague benchmarks are fairly useless, as the
code you're timing is never going to be anywhere close to the actual
code you'll write. Are you trying to just compare file reading times?
Benchmark that, and only that. Is there something specific string
manipulation-wise you want to measure? Then... measure that. Until
your code starts getting at least halfway specific, just doing a
line-by-line Java-to-Ruby conversion doesn't tell you anything, as the
code that happens is neither the most "elegant" *nor* fastest Ruby can do.
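
To illustrate the "benchmark that, and only that" advice, here is a minimal sketch that times file reading and tokenizing as two separate phases using Ruby's standard `benchmark` library. The sample data is an assumption made up for this sketch (one product-style line repeated 50,000 times in a temp file), since the real /tmp/file_test.txt was never posted; the timings it prints say nothing about the numbers in this thread.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical sample data; the thread's real input file is not available.
sample = Tempfile.new('file_test')
sample.write("Kingston 256MB SDRAM Memory Module - 133MHz PC133\n" * 50_000)
sample.close

lines = nil
tokens = 0

# Phase 1: file reading only.
read_s = Benchmark.realtime { lines = File.readlines(sample.path) }

# Phase 2: tokenizing only, on data that is already in memory.
split_s = Benchmark.realtime do
  lines.each { |line| tokens += line.split.length }
end

printf("read:  %.1f ms\n", read_s * 1000)
printf("split: %.1f ms\n", split_s * 1000)
puts "tokens: #{tokens}"
```

Keeping the phases apart shows which one dominates before any cross-language comparison is attempted.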
 
brabuhr

I'm evaluating Ruby for use in a variety of systems that are planned by
default to be Java.

I've started down a path of doing various performance tests to see what
kind of impact will occur by using Ruby and in my first test the numbers
are very poor - so poor that I have to question if I'm doing something
wrong.

Is this test case in any way representative of the tasks you will
actually be performing?

Test file 1:
Linux linux116.ctc.com 2.6.18-92.1.22.el5 #1 SMP Tue Dec 16 12:03:43
EST 2008 i686 i686 i386 GNU/Linux
java -version
java version "1.6.0_0"
IcedTea6 1.3.1 (6b12-Fedora-EPEL-5) Runtime Environment (build 1.6.0_0-b12)
OpenJDK Server VM (build 1.6.0_0-b12, mixed mode)
java FileReadParse
Starting to read file...
The number of tokens is: 1954
It took 16 ms
ruby -v file_read_parse.rb
ruby 1.8.6 (2007-09-24 patchlevel 111) [i386-linux]
Starting to read file ...
The number of tokens is: 1954
It took 4.951 ms

Test file 2:
java FileReadParse
Starting to read file...
The number of tokens is: 479623
It took 337 ms
ruby file_read_parse.rb
Starting to read file ...
The number of tokens is: 479623
It took 2526.455 ms
ruby file_read_parse-2.rb
Starting to read file ...
It took 588.065 ms
The number of tokens is: 479623
cat file_read_parse-2.rb
puts "Starting to read file ..."
start = Time.now

tokens = File.new("/tmp/file_test.txt").read.scan(/[^\s]+/)
count = tokens.size

stop = Time.now
puts "It took #{(stop - start) * 1000} ms"
puts "The number of tokens is: #{count}"
 
Reid Thompson

Is this test case in any way representative of the tasks you will
actually be performing?

If it is, then you should just do
$ time wc approach.txt
6836 78325 484114 approach.txt

real 0m0.041s
user 0m0.046s
sys 0m0.015s
 
Mike Sassak

[Note: parts of this message were removed to make it a legal post.]

Hi Ben,

The point everyone keeps bringing up--whether this benchmark is indicative
of what you will actually be doing with Ruby, and whether it is "fast
enough"--is worth considering for any project, but the fact remains that for
many things, Java is going to execute faster than Ruby. You can certainly
optimize Ruby code (and yes, writing Ruby extensions in C is actually pretty
easy), but that's not why many of us love Ruby. We love it because it allows
you to turn FileReadParse.java into this: http://gist.github.com/170466.
Now, in the spirit of good fun:

$ ruby file_read_parse_2.rb file_read_parse_2.rb
Starting to read file ...
The number of tokens is: 39.
It took 0.189 ms

$ ruby file_read_parse_2.rb FileReadParse.java
Starting to read file ...
The number of tokens is: 159.
It took 0.215 ms

See? :)

Good luck with Ruby, and don't be afraid to ask more questions!
Mike
 
Mike Sassak

Argh! That gist should be http://gist.github.com/170476. Sigh...

 
Joel VanderWerf

Mike said:
Argh! That gist should be http://gist.github.com/170476. Sigh...

And you can even, with another ounce of ruby-love, rewrite that as:

num = 0
ARGF.each do |l|
num += l.split.length
end

Then it also works with stdin or multiple filenames on the cmdline.

I'll leave it to others to #inject... ;)
 
brabuhr

If it is, then you should just do
$ time wc approach.txt
6836 78325 484114 approach.txt

:)

I got a little crazy; first the numbers (slower hardware this time):
Linux eXist 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 00:28:35 UTC
2009 i686 GNU/Linux
java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu11)
OpenJDK Client VM (build 14.0-b08, mixed mode, sharing)
java FileReadParse
Starting to read file...
The number of tokens is: 479623
It took 596 ms
/opt/matzruby/trunk/bin/ruby -v -rubygems file_read_parse.rb
ruby 1.9.2dev (2009-08-14 trunk 24539) [i686-linux]
Starting to read file ...
The number of tokens is: 479623
It took 1751.92544 ms
/opt/matzruby/trunk/bin/ruby -v -rubygems file_read_parse-3.rb
ruby 1.9.2dev (2009-08-14 trunk 24539) [i686-linux]
ffi_c.so: warning: method redefined; discarding old inspect
struct.rb:26: warning: method redefined; discarding old offset
variadic.rb:15: warning: method redefined; discarding old call
library.rb:78: warning: method redefined; discarding old fopen
library.rb:78: warning: method redefined; discarding old fgetc
Starting to read file ...
It took 4565.077896 ms
The number of tokens is: 479623
jruby -v -rubygems file_read_parse.rb
jruby 1.3.0 (ruby 1.8.6p287) (2009-06-03 5dc2e22) (OpenJDK Client VM
1.6.0_0) [i386-java]
Starting to read file ...
The number of tokens is: 479623
It took 2316.0 ms
jruby -v -rubygems file_read_parse-3.rb
jruby 1.3.0 (ruby 1.8.6p287) (2009-06-03 5dc2e22) (OpenJDK Client VM
1.6.0_0) [i386-java]
Starting to read file ...
It took 3117.0 ms
The number of tokens is: 479623

And the code:
cat file_read_parse-3.rb
require 'ffi'

module LibC
  extend FFI::Library

  # FILE *fopen(const char *path, const char *mode);
  attach_function :fopen, [ :string, :string ], :pointer

  # int fgetc(FILE *stream);
  attach_function :fgetc, [ :pointer ], :int
end

puts "Starting to read file ..."
start = Time.now

file = LibC.fopen("/tmp/file_test.txt", "r")
count = 0; in_word = false
while (c = LibC.fgetc(file)) != -1
  if 32 < c and c < 127
    unless in_word
      count += 1
      in_word = true
    end
  else
    in_word = false
  end
end

stop = Time.now
puts "It took #{(stop - start) * 1000} ms"
puts "The number of tokens is: #{count}"
 
Mike Sassak

And you can even, with another ounce of ruby-love, rewrite that as:

num = 0
ARGF.each do |l|
num += l.split.length
end

Then it also works with stdin or multiple filenames on the cmdline.

I'll leave it to others to #inject... ;)

Ha! I wrote it with inject initially, but then thought, "Nah... I don't want
to blow *too* many minds." :)
 
Ben Christensen

Thanks everyone for your responses.

Yes, this test is representative of some of the data processing that my
current applications do and that some future ones will need.

The file I'm using is 49MB in size unzipped - too large for me to upload
right now as I'm on a mobile cell network.

To provide context on the file, it contains data such as this:

Western Digital Caviar Special Edition Hard Drive - 80GB - 7200rpm -
Ultra ATA - IDE/EIDE - Internal
Kingston 256MB SDRAM Memory Module - 256MB (1 x 256MB) - 133MHz PC133 -
SDRAM - 144-pin
512Mo (1 x 512Mo) - 133MHz PC133 - SDRAM - 168 broches

Its stats are:

wc /tmp/file_test.txt
1778983 7764115 51084191 /tmp/file_test.txt

This is not a test of "file reading". The test is related to the
performance of iterating over large lists of data and performing
processing on them - such as indexing for searching, cleansing,
normalizing etc.
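
As a rough illustration of the kind of work described here (iterate over lines, normalize each token, index it for searching), the following is a minimal sketch and not the code from the blog post. The helper names `normalize` and `build_index` are hypothetical, and the three sample lines are shortened versions of the ones quoted above.

```ruby
require 'stringio'

# Hypothetical normalization step: lowercase and strip anything that is
# not a letter or digit.
def normalize(token)
  token.downcase.gsub(/[^a-z0-9]/, '')
end

# Build a tiny inverted index: normalized token => line numbers containing it.
def build_index(io)
  index = Hash.new { |hash, key| hash[key] = [] }
  io.each_line.with_index(1) do |line, line_no|
    line.split.each do |raw|
      token = normalize(raw)
      next if token.empty?
      # Record each line number at most once per token.
      index[token] << line_no unless index[token].last == line_no
    end
  end
  index
end

sample = StringIO.new(<<~DATA)
  Western Digital Caviar Special Edition Hard Drive - 80GB - 7200rpm
  Kingston 256MB SDRAM Memory Module - 256MB (1 x 256MB) - 133MHz PC133
  512Mo (1 x 512Mo) - 133MHz PC133 - SDRAM - 168 broches
DATA

index = build_index(sample)
p index['sdram']
p index['256mb']
```

Because `build_index` reads from any IO-like object one line at a time, the same method could be pointed at `File.open(path)` for a large file without loading it all into memory.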

This is a very small representation of the level of complexity and size
of data I would in reality be dealing with.

It seems however that the answer is that this is not what Ruby is well
suited for. Am I correct in that determination?

I will however be continuing my ongoing tests with SOAP/REST webservices
and more CRUD focused webapps, where I expect to see Ruby shine.
 
Ben Christensen

pharrington, in your response you stated:

"as the code that happens is neither the most "elegant" *nor* fastest
Ruby can do."

Can you please provide me a re-write of the Ruby code I used that is
elegant and fast so I can learn from you?

I consider myself quite advanced in Java (14 years of experience there)
but obviously do not have experience in Ruby for performance tuning and
optimization.

I would appreciate your demonstration of how to perform the task I have
attempted in Ruby using an appropriate "Ruby" approach that achieves the
highest performance possible and the "elegance" spoken of.

Thank you.

Ben
 
Josh Cheek

A version of Mike Sassak's gist

start = Time.now
printf "Starting to read file ...\nThe number of tokens is: %d.\nIt took %.2f ms\n",
       File.open(ARGV[0]) { |f| f.inject(0) { |a, l| a + l.split.length } },
       (Time.now - start) * 1000

I won't call it elegant, that seems subjective to me, but I do appreciate
brevity.
 
Matthew K. Williams

This is not a test of "file reading". The test is related to the
performance of iterating over large lists of data and performing
processing on them - such as indexing for searching, cleansing,
normalizing etc.

This is a very small representation of the level of complexity and size
of data I would in reality be dealing with.

It seems however that the answer is that this is not what Ruby is well
suited for. Am I correct in that determination?

Ben -- I've been working with Java since '96 (and taught Java for Sun for
a while, so I think I can understand where you may be coming from). At
this point, I prefer to write Ruby -- it's much more readable and lots
less *crufty* than Java, but Java still pays the bills.

I do have the following questions and/or things to consider --

1. How *often* are you going to be processing these files? If they are
batch style jobs, then does absolute speed matter over maintainability?

2. Are there any reasons to not keep the data in a database and then
perform queries, etc.?



If you're wanting to do things such as indexing and so forth, Ruby's
string handling far outshines, imho, Java's. Ruby's "collections" and
enumerables are far more robust as well. As a result, I can spend 5
minutes writing something that would take me 30 or even 60 minutes in
Java. Yes, ruby may not be faster in execution time -- of course, as the
results show, it depends on how you write it (in one instance it was
faster than java), but even if a run takes, say, 1 second longer, it'd
have to run 1500 times before the total of java's development and runtime
caught up with ruby's. And that's not including maintenance time. Then
factor in that developer time is usually far more expensive than cpu time,
and Ruby tends to come out in the lead.
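
The break-even figure above can be checked with a few lines of arithmetic. This sketch uses the numbers from the paragraph (5 minutes of Ruby development versus 30 minutes of Java, and a Ruby run assumed to take 1 second longer); the method name `break_even_runs` is made up for the illustration.

```ruby
# Number of runs after which the extra runtime has used up the saved
# development time.
def break_even_runs(dev_minutes_saved:, extra_seconds_per_run:)
  (dev_minutes_saved * 60.0 / extra_seconds_per_run).ceil
end

runs = break_even_runs(dev_minutes_saved: 30 - 5, extra_seconds_per_run: 1)
puts "Break-even after #{runs} runs"
```

With the 60-minute estimate instead of 30, the same calculation gives 3300 runs.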

What would be a far more fair assessment would be to factor in the amount
of time it takes to write a test, as well as the number of lines of code,
since size of code tends to increase complexity and also maintenance
costs. Then run the two and see which is better.

If you're processing these files in realtime to extract data, etc., then
perhaps you'd be better loading them into a database. However, if they're
batched, as I expect, by simply comparing "speed of execution" you're
looking at only one facet of the problem.

Matt
 
Robert Klemme


1.9* is significantly better. I did not try JRuby yet.

robert@fussel /cygdrive/c/Temp/frp
$ /cygdrive/c/Programme/Java/jdk1.6.0_14/bin/javac FileReadParse.java

robert@fussel /cygdrive/c/Temp/frp
$ java -cp . FileReadParse
Starting to read file...
The number of tokens is: 1122
It took 16 ms

robert@fussel /cygdrive/c/Temp/frp
$ allruby file_read_parse.rb
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-cygwin]
Starting to read file ...
The number of tokens is: 1122
It took 3.0 ms
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-cygwin]
Starting to read file ...
The number of tokens is: 1122
It took 2.0 ms

robert@fussel /cygdrive/c/Temp/frp
$ wc file_test.txt
190 1114 7579 file_test.txt

robert@fussel /cygdrive/c/Temp/frp
$


====================================================================


robert@fussel /cygdrive/c/Temp/frp
$ !w
wc file_test.txt x
95000 557000 3789500 file_test.txt
68970 404382 2751177 x
163970 961382 6540677 insgesamt

robert@fussel /cygdrive/c/Temp/frp
$ java -cp . FileReadParse
Starting to read file...
The number of tokens is: 561000
It took 359 ms

robert@fussel /cygdrive/c/Temp/frp
$ !a
allruby file_read_parse.rb
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-cygwin]
Starting to read file ...
The number of tokens is: 561000
It took 1395.0 ms
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-cygwin]
Starting to read file ...
The number of tokens is: 561000
It took 872.0 ms

robert@fussel /cygdrive/c/Temp/frp

robert@fussel /cygdrive/c/Temp/frp
$ /cygdrive/c/Programme/Java/jdk1.6.0_14/bin/java -server -cp .
FileReadParse
Starting to read file...
The number of tokens is: 561000
It took 515 ms

robert@fussel /cygdrive/c/Temp/frp
$

Cheers

robert
 
Josh Cheek

My previous version would probably be better like this:

start = Time.now
puts "Starting to read file ..."
puts "The number of tokens is: %d." %
     File.open(ARGV[0]) { |f| f.inject(0) { |a, l| a + l.split.length } },
     "It took #{(Time.now - start) * 1000} ms"

That way if the file is enormous, it prints the "starting to read file ..."
immediately.


 
Reid Thompson

$ java FileReadParse
Starting to read file...
The number of tokens is: 284717
It took 333 ms
rthompso@raker>~

$ ruby wcinline.rb uscities.txt
Starting to read file ...
284717
It took 211.72 ms
rthompso@raker>~

$ time wc uscities.txt
141989 284717 7449038 uscities.txt

real 0m0.333s
user 0m0.307s
sys 0m0.006s

$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) Server VM (build 14.1-b02, mixed mode)

$ ruby -v
ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-linux]

Not sure how Gentoo handles the java, but all other exes on the box are compiled
CFLAGS="-march=prescott -O2 -g -pipe" with splitdebug enabled
dual core
Linux raker 2.6.30-gentoo-r4 #2 SMP PREEMPT Wed Aug 5 11:51:00 EDT 2009 i686
Intel(R) Core(TM)2 CPU 6320 @ 1.86GHz GenuineIntel GNU/Linux


wcinline.rb quickly hacked from
http://en.literateprograms.org/Special:Downloadcode/Word_count_(C)
and
http://github.com/remogatto/ffi-inl...c0778e12218d3ffa83e3f823acaf/examples/ex_1.rb

$ cat wcinline.rb
require 'ffi-inliner'

module MyLib
extend Inliner
inline '#include <stdio.h>
#include<ctype.h>

int n;

void wc(const char *fname)
{
int ch;
int chars=0;
int words=0;
int lines=0;
int sp=1;
FILE *fp;

if(fname[0]!=055) fp=fopen(fname, "r");
else fp=stdin;
if(!fp) return; /* wc() is declared void, so it cannot return -1 */

while((ch=getc(fp))!=EOF) {
if(isspace(ch)) sp=1;
else if(sp) {
++words;
sp=0;
}
}

if(fname[0]!=055) fclose(fp);

printf("% 8d\n", words);
}'
end

class Foo
include MyLib
end

# get the start time
start = Time.now

puts "Starting to read file ..."

Foo.new.wc(ARGV[0])

puts "It took " + ((Time.now-start)*1000).to_s + " ms"
 
Charles Oliver Nutter

I've started down a path of doing various performance tests to see what
kind of impact will occur by using Ruby and in my first test the numbers
are very poor - so poor that I have to question if I'm doing something
wrong.

1.8.6 is pretty slow, compared to other impls. Ruby 1.9 and JRuby will
perform better, as shown by a few folks. JRuby on a Java 6 JVM with
--fast and --server should perform very well.

I'm also pretty confident that I can get JRuby within a few times Java
performance for non-numeric CPU-intensive tasks. Just not sure when it
will be a priority to make it happen.

- Charlie
 
brabuhr

1.8.6 is pretty slow, compared to other impls. Ruby 1.9 and JRuby will
perform better, as shown by a few folks. JRuby on a Java 6 JVM with
--fast and --server should perform very well.

And, of course JRuby adds other possibilities:

$ java FileReadParse
Starting to read file...
The number of tokens is: 234937
It took 2098 ms

$ java FileReadParse
Starting to read file...
The number of tokens is: 234937
It took 788 ms

$ ruby -v file_read_parse.rb
ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]
Starting to read file ...
The number of tokens is: 234937
It took 2666.646 ms

$ jruby -v file_read_parse.rb
jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d) (Java HotSpot(TM)
Client VM 1.5.0_16) [ppc-java]
Starting to read file ...
The number of tokens is: 234937
It took 3120.0 ms

$ jruby --fast --server -v file_read_parse.rb
jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d) (Java HotSpot(TM)
Client VM 1.5.0_16) [ppc-java]
Starting to read file ...
The number of tokens is: 234937
It took 2809.0 ms

$ jruby -v file_read_parse-2.rb
jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d) (Java HotSpot(TM)
Client VM 1.5.0_16) [ppc-java]
Starting to read file...
The number of tokens is: 234937
It took 593 ms

$ java FileReadParse
Starting to read file...
The number of tokens is: 234937
It took 588 ms

$ jruby -v file_read_parse-2.rb
jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d) (Java HotSpot(TM)
Client VM 1.5.0_16) [ppc-java]
Starting to read file...
The number of tokens is: 234937
It took 595 ms

$ cat file_read_parse-2.rb
require 'java'
java_import 'FileReadParse'

FileReadParse.new.do_stuff

:)
 
Ben Christensen

@Mike

Thank you for providing the Gist link to a file.
(http://gist.github.com/170476)

However, the changes don't improve the performance when I take into
account what was removed and I had in there on purpose. Take note of
item #2 below.

1) Object structure

The modified code removed all of the class/object structure, which I
purposefully had in there to simulate this being an object within a
larger project.

That being said, converting the lines of code we're discussing for
performance into a script means nothing to this discussion - but I
purposefully am writing the code in an OO style with classes as opposed
to scripts.

I was also purposefully making the Java and Ruby versions as similar to
each other as possible, so that the performance comparison could be done
with as little difference as possible in how the code is approached.

2) Counting versus Using the Tokens

In the modified code, it is now just counting the tokens:

num += l.split.length

Obviously that is faster than what I had in the original code. Again
however, I'm doing this on purpose.

Counting the number of tokens in and of itself is not all that I was
doing in the original code or in the Java version. To simulate more
closely what actually occurs in a functional system I am:

- assigning the array of tokens to a variable
- iterating the tokens to do something with each of them

In this case I'm just assigning each token to another variable and then
performing the count.

In a real world use I'd perform some function on the text, put it
somewhere, whatever.

This change accounts for the difference in time from "7965.289 ms" to
"4821.399 ms" when I run the original code and the modified code.

So yes, the modified code is "faster", but it's not doing the same thing
as the original and therefore not a valid comparison.
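
To make the distinction concrete, here is a minimal sketch of the two variants being compared. It is not the original benchmark code: the method names `count_only` and `visit_each` and the sample line are assumptions for illustration. Both produce the same count, but only the second keeps the token array and touches every token.

```ruby
require 'stringio'

# Variant A: count only; the token array is discarded immediately.
def count_only(io)
  num = 0
  io.each_line { |line| num += line.split.length }
  num
end

# Variant B: keep the token array and visit each token, which is what the
# description above says the original test does.
def visit_each(io)
  num = 0
  io.each_line do |line|
    tokens = line.split
    tokens.each do |token|
      _current = token # stand-in for real per-token work
      num += 1
    end
  end
  num
end

data = "Ultra ATA - IDE/EIDE - Internal\n" * 1_000
a = count_only(StringIO.new(data))
b = visit_each(StringIO.new(data))
puts "count_only=#{a} visit_each=#{b}"
```

Any timing comparison between Java and Ruby would need both sides to implement the same variant.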


What I gather therefore from looking at your changes, is that there
really isn't anything different for me to do in the code - that I am in
fact using the proper API calls and techniques and there is nothing
special.

For example, in Java there are 2 ways of doing this:

a) String.split - which uses REGEX and is much slower as it's intended
for pattern matching, not simple tokenization
b) StringTokenizer - intended for tokenization on a delimiter instead of
REGEX and much faster

Therefore, I'm using option (b) in Java. I was curious if I was
mistakenly using a slower technique of Ruby when in fact there was a
faster alternative.
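
One way to answer that question empirically is to time the candidate tokenizers against each other. In Ruby, `String#split` with no argument (or with a single space) splits on runs of whitespace without a user-supplied regular expression, while `split(/\s+/)` and `scan(/\S+/)` go through the regex engine. The sketch below uses the standard `benchmark` library; the sample line and the iteration count are assumptions, and no particular ordering of the results is claimed here.

```ruby
require 'benchmark'

line = 'Kingston 256MB SDRAM Memory Module - 256MB (1 x 256MB) - 133MHz PC133'
n = 100_000

variants = {
  'split (no argument)' => -> { line.split },
  'split(" ")'          => -> { line.split(' ') },
  'split(/\s+/)'        => -> { line.split(/\s+/) },
  'scan(/\S+/)'         => -> { line.scan(/\S+/) }
}

# All variants should agree on this input before their timings are compared.
results = variants.map { |_, fn| fn.call }

Benchmark.bm(20) do |bm|
  variants.each do |label, fn|
    bm.report(label) { n.times { fn.call } }
  end
end
```

The results will vary by Ruby implementation and version, so it is worth running this on each interpreter under evaluation.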
 
Ben Christensen

@Matthew K. Williams

-- 1. How *often* are you going to be processing these files? If they are
-- batch style jobs, then does absolute speed matter over maintainability?

The particular application I'm looking at in the future has a virtually
continuous feed of incoming data from multiple concurrent sources.

Thus I'm looking at what language the processing code would be in. My
default go to is Java - but I want to consider Ruby and not blindly just
use what I'm accustomed to before establishing what will likely be in
existence for the next 3-5 years.

In an existing system doing similar data processing, it is indeed a
batch process, but one we would prefer not to have at all. The prospect
of potentially doubling its run time isn't appealing, as it's already a
thorn in the side of operations, and one that hardware is thrown at to
alleviate.

In another system we horizontally cluster and shard data processing as
much as possible to parallelize the effort - and do as much as we can to
optimize performance. For example, daily jobs are required, but the
volume of data progressed to where the old system was taking days to
process a single job - hence the new system which now handles a job in
4-6 hours - and we're looking at other ways of reducing that further, but
so far their cost exceeds the business value.


-- 2. Are there any reasons to not keep the data in a database and then
-- perform queries, etc.?

SQL is far slower at handling this type of processing in most cases with
large volumes of data where the incremental inefficiencies of things
like REGEX and SQL really add up over 10s of millions of executions.

I have recently dealt with a large database (100+ GB) where to achieve
the necessary performance thresholds we finally had to revert to the use
of C to write UDFs in MySQL that could process the data efficiently
without needing to pull the data out of the database, process in Java
then re-insert, and therefore create huge IO burdens. It was an order of
magnitude or two faster using this approach rather than straight SQL
and/or pulling the data out to process externally.

This is a rare thing - this project was the first time I've ever had to
do that due to very unique needs of the project.

Generally however I have Java in asynchronous processes doing the data
processing and manipulation.

The analysis of Ruby performance doing these types of jobs was intended
to find what cost the adoption of Ruby would incur.

It appears that Ruby is not well suited to data processing type
applications from what I've seen and heard so far.

In another simple test I did where I was iterating over a large amount
of data, I was shocked at how poorly the Ruby implementation did. It
seems the looping itself was a very inefficient action in the Ruby
interpreter.

Hopefully this helps provide some context to my questions about Ruby in
regard to batch processing of data.
 
