ruby performance


Nan Li

Hello,
I am relatively new to both Ruby and Perl. I like a lot about Ruby,
but I found that Ruby is about 5-8 times slower than Perl when it comes
to large text processing. I don't know whether this is a well-known fact
or whether it just happens to me.

Thanks,
Nan
 

Robert Klemme

Nan said:
Hello,
I am relatively new to both Ruby and Perl. I like a lot about Ruby,
but I found that Ruby is about 5-8 times slower than Perl when it comes
to large text processing. I don't know whether this is a well-known fact
or whether it just happens to me.

It's known to be slower, although I'd doubt the factor you mention.
What piece of code did you benchmark?

Kind regards

robert
 

Kenosis

I concur. Please post your code so we can have a look. There are a few
key gotchas you need to look out for. Also, you could try re-benchmarking
with YARV to see if that makes any significant difference in your case.

Ken
 

Nan Li

I concur. Please post your code so we can have a look. There are a few
key gotchas you need to look out for. Also, you could try re-benchmarking
with YARV to see if that makes any significant difference in your case.

Ken

Here is how I did my test:

I have 3 files:
1) genLog.pl

my $key = 'Start Start Start Start';
my @s = ( 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );

for ( $i = 0; $i < 1024 * 1024; $i++ ) {
    print $key, "\n";
    foreach ( @s ) {
        print $_, "\n";
    }
}

2) test.pl
my $log = 'log';

my @block = ();

open( FD, $log );

while ( <FD> ) {
    chomp;
    if ( m/Start Start Start Start/ ) {
        push @block, $_;
    }
}

print scalar @block, "\n";

3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then
      block << line
    end
  }
}

puts block.size

I used genLog.pl to generate a large text file, and then timed test.pl
and test.rb. My test ran as follows:

[nan@athena test]$ perl genLog.pl > log
[nan@athena test]$ ls -lh log
-rw-rw-r-- 1 nan nan 78M Jun 27 00:25 log
[nan@athena test]$ time perl test.pl
1048576

real 0m3.469s
user 0m3.252s
sys 0m0.136s
[nan@athena test]$ time ruby test.rb
1048576

real 0m18.775s
user 0m16.525s
sys 0m0.336s

The Ruby program is about 6 times slower. The two scripts use the
same language constructs and the same algorithm, so the problem lies
either in the language itself or in the implementation of the language.
 

Robert Klemme

Nan said:
Here is how I did my test:

I have 3 files:
1) genLog.pl

my $key = 'Start Start Start Start';
my @s = ( 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );

for ( $i = 0; $i < 1024 * 1024; $i++ ) {
    print $key, "\n";
    foreach ( @s ) {
        print $_, "\n";
    }
}

2) test.pl
my $log = 'log';

my @block = ();

open( FD, $log );

while ( <FD> ) {
    chomp;
    if ( m/Start Start Start Start/ ) {
        push @block, $_;
    }
}

print scalar @block, "\n";

3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then

Reversing the RX and the string is usually more efficient.

      block << line
    end
  }
}

puts block.size

I used genLog.pl to generate a large text file, and then timed test.pl
and test.rb. My test ran as follows:

[nan@athena test]$ perl genLog.pl > log
[nan@athena test]$ ls -lh log
-rw-rw-r-- 1 nan nan 78M Jun 27 00:25 log
[nan@athena test]$ time perl test.pl
1048576

real 0m3.469s
user 0m3.252s
sys 0m0.136s
[nan@athena test]$ time ruby test.rb
1048576

real 0m18.775s
user 0m16.525s
sys 0m0.336s

The Ruby program is about 6 times slower.

I see only a factor of 5, but anyway, that's still too much. Did you do
just a single run, or did you run your scripts at least several times to
get statistically valid data? If not, I suggest you do each test 10 times
and see what happens.
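
Something like this would do it (a sketch using the standard Benchmark
library; the repetition count and the './test.rb' path are assumptions,
not from the thread):

require 'benchmark'

# Re-run the existing script ten times and print each wall-clock time,
# so a single cold-cache or busy-machine run does not skew the comparison.
# Each run also prints the script's own output (the match count).
10.times do |i|
  t = Benchmark.realtime { load './test.rb' }
  printf "run %2d: %.2fs\n", i + 1, t
end
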
The two scripts use the same language constructs and the same algorithm,
so the problem lies either in the language itself or in the implementation
of the language.

One difference is that you don't close the IO handle properly in the
Perl script. OTOH this test is quite artificial. If you just wanted to
count those lines, a simple scalar would have sufficed.
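
In Ruby that would be something like this (a sketch, not from the thread):

# Count the matching lines with a plain integer instead of
# accumulating every matching line in an array.
count = 0
File.open('log') do |f|
  f.each_line do |line|
    count += 1 if line =~ /Start Start Start Start/
  end
end
puts count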

Kind regards

robert
 

Martin DeMello

Robert Klemme said:
3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then

Reversing the RX and the string is usually more efficient.

I was pretty sure the problem was creating a regexp object from a
literal regexp each time, but oddly enough saying rx = /..../ before the
loop and rx =~ line inside made no difference. Does ruby already
optimise this case?
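
For reference, that variant would look roughly like this (a sketch; the
exact code wasn't posted):

rx = /Start Start Start Start/   # built once, outside the loop

block = []
File.open('log') do |f|
  f.each_line do |line|
    line.chomp!
    block << line if rx =~ line
  end
end
puts block.size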

martin
 

Robert Klemme

Martin said:
Robert Klemme said:
3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then
Reversing the RX and the string is usually more efficient.

I was pretty sure the problem was creating a regexp object from a
literal regexp each time, but oddly enough saying rx = /..../ before the
loop and rx =~ line inside made no difference. Does ruby already
optimise this case?

Yes. A regexp literal is compiled only once, so it's usually more
efficient to use the literal inside the code.

Cheers

robert
 

Kenosis

Ran some tests on my 2.8GHz Pentium D Dual Core, 2GB, 160 GB S-ATA II.
Things were much more like Robert expected: 3.27 times slower.

544-> time perl test.pl
1048576
3.215u 0.143s 0:03.37 99.4% 0+0k 0+0io 0pf+0w
545-> time ruby test.rb
1048576
10.532u 0.350s 0:10.98 99.0% 0+0k 0+0io 8pf+0w

Now then, changing the regexp to a precreated one ran SLOWER for me
(huh?)

549-> time ruby test1.rb
1048576
11.006u 0.323s 0:11.36 99.6% 0+0k 0+0io 0pf+0w

Just for grins, I presized the block array to the full size needed, but
this had no impact whatsoever. Hmmm....
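
Presumably something along these lines (a sketch; the actual presized
version wasn't posted, and the 1024 * 1024 size comes from genLog.pl):

# Presized variant: allocate all the slots up front and fill by index
# instead of growing the array with <<.
block = Array.new(1024 * 1024)
i = 0
File.open('log') do |f|
  f.each_line do |line|
    line.chomp!
    if line =~ /Start Start Start Start/
      block[i] = line
      i += 1
    end
  end
end
puts i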

I decided to run the profiler over it. Does it seem strange to you that
IO#each_line would (appear to?) take so long on a system with disk I/O
like mine when sequentially accessing a file?

ruby -r profile test.rb
1048576
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call   ms/call   name
 78.91    455.14    455.14        1 455140.00 576810.00  IO#each_line
 15.91    546.91     91.77  3145728      0.03      0.03  String#chomp!
  5.18    576.81     29.90  1048576      0.03      0.03  Array#<<
  0.00    576.81      0.00        2      0.00      0.00  IO#write
  0.00    576.81      0.00        1      0.00      0.00  Array#size
  0.00    576.81      0.00        1      0.00      0.00  Kernel.puts
  0.00    576.81      0.00        1      0.00      0.00  Fixnum#to_s
  0.00    576.81      0.00        1      0.00 576810.00  IO#open
  0.00    576.81      0.00        1      0.00      0.00  File#initialize
  0.00    576.81      0.00        1      0.00 576810.00  #toplevel

Ken


 

Kenosis

And, not that it's practical in all cases, but reading the file into
memory with IO.readlines and then processing the result with the block
provided cuts the time down to the run below (a sketch of that variant
follows the timing):

time ruby test4.rb
1048576
6.368u 0.807s 0:07.19 99.5% 0+0k 0+0io 0pf+0w
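
A sketch of what test4.rb presumably looked like (the original wasn't
posted):

# Slurping variant: read every line into memory first, then filter.
# Faster here, but memory use grows with the size of the file.
block = []
IO.readlines('log').each do |line|
  line.chomp!
  block << line if line =~ /Start Start Start Start/
end
puts block.size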

Seems like File.each_line has some issue?

Ken
 

Robert Klemme

Kenosis said:
Ran some tests on my 2.8GHz Pentium D Dual Core, 2GB, 160 GB S-ATA
II. Things were much more like Robert expected: 3.27 times slower.

544-> time perl test.pl
1048576
3.215u 0.143s 0:03.37 99.4% 0+0k 0+0io 0pf+0w
545-> time ruby test.rb
1048576
10.532u 0.350s 0:10.98 99.0% 0+0k 0+0io 8pf+0w

Now then, changing the regexp to a precreated one ran SLOWER for me
(huh?)

549-> time ruby test1.rb
1048576
11.006u 0.323s 0:11.36 99.6% 0+0k 0+0io 0pf+0w

Yes, that's generally so.
Just for grins, I presized the block array to the full size needed, but
this had no impact whatsoever. Hmmm....

I decided to run the profiler over it. Does it seem strange to you that
IO#each_line would (appear to?) take so long on a system with disk I/O
like mine when sequentially accessing a file?

No, because each_line is called once but invokes the block once per line,
so the time attributed to it in the profile includes all the per-line work,
not just the IO read time. (A rough way to check this is sketched after the
profile below.)
ruby -r profile test.rb
1048576
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call   ms/call   name
 78.91    455.14    455.14        1 455140.00 576810.00  IO#each_line
 15.91    546.91     91.77  3145728      0.03      0.03  String#chomp!
  5.18    576.81     29.90  1048576      0.03      0.03  Array#<<
  0.00    576.81      0.00        2      0.00      0.00  IO#write
  0.00    576.81      0.00        1      0.00      0.00  Array#size
  0.00    576.81      0.00        1      0.00      0.00  Kernel.puts
  0.00    576.81      0.00        1      0.00      0.00  Fixnum#to_s
  0.00    576.81      0.00        1      0.00 576810.00  IO#open
  0.00    576.81      0.00        1      0.00      0.00  File#initialize
  0.00    576.81      0.00        1      0.00 576810.00  #toplevel
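
For example, something like this would separate the two (a sketch, not
from the thread):

require 'benchmark'

# Rough split of where the time goes: iterate the lines with an empty
# block, then with the full per-line work, and compare.
iterate_only = Benchmark.realtime do
  File.open('log') { |f| f.each_line { } }
end

full_loop = Benchmark.realtime do
  block = []
  File.open('log') do |f|
    f.each_line do |line|
      line.chomp!
      block << line if line =~ /Start Start Start Start/
    end
  end
end

printf "iteration only: %.2fs  full loop: %.2fs\n", iterate_only, full_loop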

Cheers

robert
 

Robert Klemme

Kenosis said:
And, not that it's practical in all cases, but reading the file into
memory with IO.readlines and then processing the result with the block
provided cuts the time down to:

time ruby test4.rb
1048576
6.368u 0.807s 0:07.19 99.5% 0+0k 0+0io 0pf+0w

Seems like File.each_line has some issue?

Hm.... I don't think so. You should repeat your tests several times in
order to get meaningful results. In an application like this (i.e. the
whole file is read but not all of it needs to be kept in memory) I would
use the each_line or File.foreach approach regardless of your benchmarks,
because it scales better with regard to file size. You cannot slurp a 10GB
file into memory on a 32-bit system, but you can crunch it away line by line.
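
The File.foreach form of the same scan would be (a sketch):

# Streaming with File.foreach: only one line is held in memory at a time,
# so this works the same way on a 78MB log or a 10GB one.
block = []
File.foreach('log') do |line|
  block << line.chomp if line =~ /Start Start Start Start/
end
puts block.size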

Kind regards

robert
 

Kenosis

I totally agree. In a Windows environment the standard for getting
good stats is 50 runs, because what Windows and other services do in the
background is so totally unpredictable. I just ran my test that loads
the file into memory 15 times back to back with a big compile going on my
PC, and it's consistently 6.5 or so seconds, thanks to the dual core and
likely due to the file being cached. However, the build I'm running is
HUGE, so perhaps it's not totally cached.

As for scalability, it might be reasonable to check the file's size and,
if it is reasonable, load it all and process it; if it is too large, like
your 10GB "what if", then revert to File.each_line, thus getting the best
of both worlds: quick loading for smaller files and robust handling of
arbitrarily large files. Simply a design and code complexity trade-off,
it would seem to me :)
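
That dispatch might look roughly like this (a sketch; the threshold and
the helper name are invented for illustration):

# Slurp small files, stream large ones. The 256MB limit is an arbitrary
# example value, not a recommendation.
SLURP_LIMIT = 256 * 1024 * 1024

def matching_lines(path, pattern)
  if File.size(path) <= SLURP_LIMIT
    IO.readlines(path).grep(pattern)     # small file: read it all at once
  else
    matches = []
    File.foreach(path) { |line| matches << line if line =~ pattern }
    matches
  end
end

puts matching_lines('log', /Start Start Start Start/).size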

Respectfully,

Ken
 

Robert Klemme

Kenosis said:
I totally agree. In a Windows environment the standard for getting
good stats is 50 runs, because what Windows and other services do in the
background is so totally unpredictable. I just ran my test that loads
the file into memory 15 times back to back with a big compile going on my
PC, and it's consistently 6.5 or so seconds, thanks to the dual core and
likely due to the file being cached. However, the build I'm running is
HUGE, so perhaps it's not totally cached.

As for scalability, it might be reasonable to check the file's size and,
if it is reasonable, load it all and process it; if it is too large, like
your 10GB "what if", then revert to File.each_line, thus getting the best
of both worlds: quick loading for smaller files and robust handling of
arbitrarily large files. Simply a design and code complexity trade-off,
it would seem to me :)

Yuck. I usually don't bother to invest that much effort, though, because
my scripts are usually not that time-critical.

Kind regards

robert


PS: please don't top-post.
 

Minkoo Seo

Hi mathew.

Personally, I've tested IO.readlines against a C++ version of the file
reading and found that Ruby is quite slow even at simple file I/O. This
might be due to the Ruby interpreter or to some other overhead in Ruby.

But in some respects, software performance is about much more than just
running time. Code readability and writability, maintainability, security
and many other aspects count as part of performance.

So even if Ruby is slow, I don't mind, because writing C++ code involves
much more hassle than writing Ruby.

Sincerely,
Minkoo Seo
 

Robert Klemme

Minkoo said:
Hi mathew.

Personally, I've tested IO.readlines against a C++ version of the file
reading and found that Ruby is quite slow even at simple file I/O. This
might be due to the Ruby interpreter or to some other overhead in Ruby.

Care to post the code?
But in some respects, software performance is about much more than just
running time. Code readability and writability, maintainability, security
and many other aspects count as part of performance.

So even if Ruby is slow, I don't mind, because writing C++ code involves
much more hassle than writing Ruby.

Definitely!

Kind regards

robert
 
