ruby performance


Nan Li

Hello,
I am relatively new to both Ruby and Perl. I like a lot about Ruby,
but I found that Ruby is about 5-8 times slower than Perl when it comes
to large text processing. I don't know whether this is a well-known fact
or whether it just happens to me.

Thanks,
Nan
 

Robert Klemme

Nan said:
Hello,
I am relatively new to both Ruby and Perl. I like a lot about Ruby,
but I found that Ruby is about 5-8 times slower than Perl when it comes
to large text processing. I don't know whether this is a well-known fact
or whether it just happens to me.

It's known to be slower, although I'd doubt the factor you mention.
What piece of code did you benchmark?

Kind regards

robert
 

Kenosis

I concur. Please post your code so we can have a look. There are a few
key gotchas you need to look out for. Also, you could try re-benchmarking
with YARV to see if that makes any significant difference in your case.

Ken
 

Nan Li

I concur. Please post your code so we can have a look. There are a few
key gotchas you need to look out for. Also, you could try re-benchmarking
with YARV to see if that makes any significant difference in your case.

Ken

Here is how I did my test:

I have 3 files:
1) genLog.pl

my $key = 'Start Start Start Start';
my @s = ( 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );

for ( $i = 0; $i < 1024 * 1024; $i++ ) {
    print $key, "\n";
    foreach ( @s ) {
        print $_, "\n";
    }
}

2) test.pl
my $log = 'log';

my @block = ();

open( FD, $log );

while ( <FD> ) {
    chomp;
    if ( m/Start Start Start Start/ ) {
        push @block, $_;
    }
}

print scalar @block, "\n";

3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then
      block << line
    end
  }
}

puts block.size

I used genLog.pl to generate a large text file, and then timed test.pl
and test.rb. My test ran as follows:

[nan@athena test]$ perl genLog.pl > log
[nan@athena test]$ ls -lh log
-rw-rw-r-- 1 nan nan 78M Jun 27 00:25 log
[nan@athena test]$ time perl test.pl
1048576

real 0m3.469s
user 0m3.252s
sys 0m0.136s
[nan@athena test]$ time ruby test.rb
1048576

real 0m18.775s
user 0m16.525s
sys 0m0.336s

The Ruby program is about 6 times slower. The two scripts use the
same language constructs and the same algorithm, so the problem lies
either in the language itself or in the implementation of the language.
 

Robert Klemme

Nan said:
Here is how I did my test:

I have 3 files:
1) genLog.pl

my $key = 'Start Start Start Start';
my @s = ( 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );

for ( $i = 0; $i < 1024 * 1024; $i++ ) {
    print $key, "\n";
    foreach ( @s ) {
        print $_, "\n";
    }
}

2) test.pl
my $log = 'log';

my @block = ();

open( FD, $log );

while ( <FD> ) {
    chomp;
    if ( m/Start Start Start Start/ ) {
        push @block, $_;
    }
}

print scalar @block, "\n";

3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then

Reversing the RX and the string is usually more efficient.

      block << line
    end
  }
}

puts block.size

I used genLog.pl to generate a large text file, and then timed test.pl
and test.rb. My test ran as follows:

[nan@athena test]$ perl genLog.pl > log
[nan@athena test]$ ls -lh log
-rw-rw-r-- 1 nan nan 78M Jun 27 00:25 log
[nan@athena test]$ time perl test.pl
1048576

real 0m3.469s
user 0m3.252s
sys 0m0.136s
[nan@athena test]$ time ruby test.rb
1048576

real 0m18.775s
user 0m16.525s
sys 0m0.336s

The Ruby program is about 6 times slower.

I see only a factor of 5, but anyway, that's still too much. Did you do
just a single run, or did you run your scripts at least several times to
get statistically valid data? If not, I suggest you do each test 10 times
and see what happens.
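
Something like this would do it (a sketch using the standard Benchmark
library; the repetition count and the './test.rb' path are assumptions,
not from the thread):

require 'benchmark'

# Re-run the existing script ten times and print each wall-clock time,
# so a single cold-cache or busy-machine run does not skew the comparison.
# Each run also prints the script's own output (the match count).
10.times do |i|
  t = Benchmark.realtime { load './test.rb' }
  printf "run %2d: %.2fs\n", i + 1, t
end
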
The two scripts use the same language constructs and the same algorithm,
so the problem lies either in the language itself or in the implementation
of the language.

One difference is that you don't close the IO handle properly in the
Perl script. OTOH this test is quite artificial. If you just wanted to
count those lines, a simple scalar would have sufficed.
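
In Ruby that would be something like this (a sketch, not from the thread):

# Count the matching lines with a plain integer instead of
# accumulating every matching line in an array.
count = 0
File.open('log') do |f|
  f.each_line do |line|
    count += 1 if line =~ /Start Start Start Start/
  end
end
puts count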

Kind regards

robert
 

Martin DeMello

Robert Klemme said:
3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then

Reversing the RX and the string is usually more efficient.

I was pretty sure the problem was creating a regexp object from a
literal regexp each time, but oddly enough saying rx = /..../ before the
loop and rx =~ line inside made no difference. Does ruby already
optimise this case?
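
For reference, that variant would look roughly like this (a sketch; the
exact code wasn't posted):

rx = /Start Start Start Start/   # built once, outside the loop

block = []
File.open('log') do |f|
  f.each_line do |line|
    line.chomp!
    block << line if rx =~ line
  end
end
puts block.size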

martin
 

Robert Klemme

Martin said:
Robert Klemme said:
3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then
Reversing the RX and the string is usually more efficient.

I was pretty sure the problem was creating a regexp object from a
literal regexp each time, but oddly enough saying rx = /..../ before the
loop and rx =~ line inside made no difference. Does ruby already
optimise this case?

Yes. A regexp literal is compiled only once, so it's usually more
efficient to use the literal inside the code.

Cheers

robert
 

Kenosis

Ran some tests on my 2.8GHz Pentium D Dual Core, 2GB, 160 GB S-ATA II.
Things were much more like Robert expected: 3.27 times slower.

544-> time perl test.pl
1048576
3.215u 0.143s 0:03.37 99.4% 0+0k 0+0io 0pf+0w
545-> time ruby test.rb
1048576
10.532u 0.350s 0:10.98 99.0% 0+0k 0+0io 8pf+0w

Now then, changing the regexp to a precreated one ran SLOWER for me
(huh?)

549-> time ruby test1.rb
1048576
11.006u 0.323s 0:11.36 99.6% 0+0k 0+0io 0pf+0w

Just for grins, I presized the block array to the full size needed, but
this had no impact whatsoever. Hmmm....
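
Presumably something along these lines (a sketch; the actual presized
version wasn't posted, and the 1024 * 1024 size comes from genLog.pl):

# Presized variant: allocate all the slots up front and fill by index
# instead of growing the array with <<.
block = Array.new(1024 * 1024)
i = 0
File.open('log') do |f|
  f.each_line do |line|
    line.chomp!
    if line =~ /Start Start Start Start/
      block[i] = line
      i += 1
    end
  end
end
puts i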

I decided to run the profiler over it. Does it seem strange to you that
IO#each_line would (appear to?) take so long on a system with disk I/O
like mine when sequentially accessing a file?

ruby -r profile test.rb
1048576
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call   ms/call   name
 78.91    455.14    455.14        1 455140.00 576810.00  IO#each_line
 15.91    546.91     91.77  3145728      0.03      0.03  String#chomp!
  5.18    576.81     29.90  1048576      0.03      0.03  Array#<<
  0.00    576.81      0.00        2      0.00      0.00  IO#write
  0.00    576.81      0.00        1      0.00      0.00  Array#size
  0.00    576.81      0.00        1      0.00      0.00  Kernel.puts
  0.00    576.81      0.00        1      0.00      0.00  Fixnum#to_s
  0.00    576.81      0.00        1      0.00 576810.00  IO#open
  0.00    576.81      0.00        1      0.00      0.00  File#initialize
  0.00    576.81      0.00        1      0.00 576810.00  #toplevel

Ken


 

Kenosis

And, not that it's practical in all cases, but reading the file into
memory with IO.readlines and then processing the result with the block
provided cuts the time down to the run below (a sketch of that variant
follows the timing):

time ruby test4.rb
1048576
6.368u 0.807s 0:07.19 99.5% 0+0k 0+0io 0pf+0w
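
A sketch of what test4.rb presumably looked like (the original wasn't
posted):

# Slurping variant: read every line into memory first, then filter.
# Faster here, but memory use grows with the size of the file.
block = []
IO.readlines('log').each do |line|
  line.chomp!
  block << line if line =~ /Start Start Start Start/
end
puts block.size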

Seems like File.each_line has some issue?

Ken
 

Robert Klemme

Kenosis said:
Ran some tests on my 2.8GHz Pentium D Dual Core, 2GB, 160 GB S-ATA
II. Things were much more like Robert expected: 3.27 times slower.

544-> time perl test.pl
1048576
3.215u 0.143s 0:03.37 99.4% 0+0k 0+0io 0pf+0w
545-> time ruby test.rb
1048576
10.532u 0.350s 0:10.98 99.0% 0+0k 0+0io 8pf+0w

Now then, changing the regexp to a precreated one ran SLOWER for me
(huh?)

549-> time ruby test1.rb
1048576
11.006u 0.323s 0:11.36 99.6% 0+0k 0+0io 0pf+0w

Yes, that's generally so.
Just for grins, I presized the block array to the full size needed, but
this had no impact whatsoever. Hmmm....

I decided to run the profiler over it. Does it seem strange to you that
IO#each_line would (appear to?) take so long on a system with disk I/O
like mine when sequentially accessing a file?

No, because each_line is called once but invokes the block once per line,
so the time attributed to it in the profile includes all the per-line work,
not just the IO read time. (A rough way to check this is sketched after the
profile below.)
ruby -r profile test.rb
1048576
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call   ms/call   name
 78.91    455.14    455.14        1 455140.00 576810.00  IO#each_line
 15.91    546.91     91.77  3145728      0.03      0.03  String#chomp!
  5.18    576.81     29.90  1048576      0.03      0.03  Array#<<
  0.00    576.81      0.00        2      0.00      0.00  IO#write
  0.00    576.81      0.00        1      0.00      0.00  Array#size
  0.00    576.81      0.00        1      0.00      0.00  Kernel.puts
  0.00    576.81      0.00        1      0.00      0.00  Fixnum#to_s
  0.00    576.81      0.00        1      0.00 576810.00  IO#open
  0.00    576.81      0.00        1      0.00      0.00  File#initialize
  0.00    576.81      0.00        1      0.00 576810.00  #toplevel
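
For example, something like this would separate the two (a sketch, not
from the thread):

require 'benchmark'

# Rough split of where the time goes: iterate the lines with an empty
# block, then with the full per-line work, and compare.
iterate_only = Benchmark.realtime do
  File.open('log') { |f| f.each_line { } }
end

full_loop = Benchmark.realtime do
  block = []
  File.open('log') do |f|
    f.each_line do |line|
      line.chomp!
      block << line if line =~ /Start Start Start Start/
    end
  end
end

printf "iteration only: %.2fs  full loop: %.2fs\n", iterate_only, full_loop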

Cheers

robert
 

Robert Klemme

Kenosis said:
And, not that it's practical in all cases, but reading the file into
memory with IO.readlines and then processing the result with the block
provided cuts the time down to:

time ruby test4.rb
1048576
6.368u 0.807s 0:07.19 99.5% 0+0k 0+0io 0pf+0w

Seems like File.each_line has some issue?

Hm.... I don't think so. You should repeat your tests several times in
order to get meaningful results. In an application like this (i.e. the
whole file is read but not all of it needs to be kept in memory) I would
use the each_line or File.foreach approach regardless of your benchmarks,
because it scales better with regard to file size. You cannot slurp a 10GB
file into memory on a 32-bit system, but you can crunch it away line by line.
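
The File.foreach form of the same scan would be (a sketch):

# Streaming with File.foreach: only one line is held in memory at a time,
# so this works the same way on a 78MB log or a 10GB one.
block = []
File.foreach('log') do |line|
  block << line.chomp if line =~ /Start Start Start Start/
end
puts block.size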

Kind regards

robert
 

Kenosis

I totally agree. In a Windows environment the standard for getting
good stats is 50 runs, because what Windows and other services do in the
background is so totally unpredictable. I just ran my test that loads
the file into memory 15 times back to back with a big compile going on my
PC, and it's consistently 6.5 or so seconds, thanks to the dual core and
likely due to the file being cached. However, the build I'm running is
HUGE, so perhaps it's not totally cached.

As for scalability, it might be reasonable to check the file's size and,
if it is reasonable, load it all and process it; if it is too large, like
your 10GB "what if", then revert to File.each_line, thus getting the best
of both worlds: quick loading for smaller files and robust handling of
arbitrarily large files. Simply a design and code complexity trade-off,
it would seem to me :)
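
That dispatch might look roughly like this (a sketch; the threshold and
the helper name are invented for illustration):

# Slurp small files, stream large ones. The 256MB limit is an arbitrary
# example value, not a recommendation.
SLURP_LIMIT = 256 * 1024 * 1024

def matching_lines(path, pattern)
  if File.size(path) <= SLURP_LIMIT
    IO.readlines(path).grep(pattern)     # small file: read it all at once
  else
    matches = []
    File.foreach(path) { |line| matches << line if line =~ pattern }
    matches
  end
end

puts matching_lines('log', /Start Start Start Start/).size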

Respectfully,

Ken
 

Robert Klemme

Kenosis said:
I totally agree. In a Windows environment the standard for getting
good stats is 50 runs, because what Windows and other services do in the
background is so totally unpredictable. I just ran my test that loads
the file into memory 15 times back to back with a big compile going on my
PC, and it's consistently 6.5 or so seconds, thanks to the dual core and
likely due to the file being cached. However, the build I'm running is
HUGE, so perhaps it's not totally cached.

As for scalability, it might be reasonable to check the file's size and,
if it is reasonable, load it all and process it; if it is too large, like
your 10GB "what if", then revert to File.each_line, thus getting the best
of both worlds: quick loading for smaller files and robust handling of
arbitrarily large files. Simply a design and code complexity trade-off,
it would seem to me :)

Yuck. I usually don't bother to invest that much effort, though, because
my scripts are usually not that time-critical.

Kind regards

robert


PS: please don't top-post.
 

Minkoo Seo

Hi mathew.

Personally, I've tested IO.readlines against a C++ version of the file
reading and found that Ruby is quite slow even at simple file I/O. This
might be due to the Ruby interpreter or to some other overhead in Ruby.

But in some respects, software performance is about much more than just
running time. Code readability and writability, maintainability, security
and many other aspects count as part of performance.

So even if Ruby is slow, I don't mind, because writing C++ code involves
much more hassle than writing Ruby.

Sincerely,
Minkoo Seo
 

Robert Klemme

Minkoo said:
Hi mathew.

Personally, I've tested IO.readlines against a C++ version of the file
reading and found that Ruby is quite slow even at simple file I/O. This
might be due to the Ruby interpreter or to some other overhead in Ruby.

Care to post the code?
But in some respects, software performance is about much more than just
running time. Code readability and writability, maintainability, security
and many other aspects count as part of performance.

So even if Ruby is slow, I don't mind, because writing C++ code involves
much more hassle than writing Ruby.

Definitely!

Kind regards

robert
 
