Benchmark segfault [Was: Array#inject to create a hash versus Hash[*array.collect{}.flatten] ]

M

Michal Suchanek

So, I was tinkering with ways to build a hash out of transforming an
array, knowing the standard/idiomatic

id_list = [:test1, :test2, :test3]
id_list.inject({}) { |a,e| a[e]=e.object_id ; a }

I also decided to try something like this:

Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten]

and further (attempt to) optimize it via
Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten!]
and
Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!]

Running this via Benchmark#bmbm gives pretty interesting, and to me,
unexpected, results (on a 3.2 GHz P4, 1GB of RAM, FC5 with ruby 1.8.4)

require 'benchmark'
id_list = (1..1_000_000).to_a
Benchmark::bmbm do |x|
x.report("inject") { id_list.inject({}) { |a,e| a[e] = e.object_id ; a} }
x.report("non-bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten] }
x.report("bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten!] }
x.report("two-bang") { Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!] }
end

Rehearsal --------------------------------------------
inject 16.083333 0.033333 16.116667 ( 9.670747)
non-bang 1657.050000 1.800000 1658.850000 (995.425642)
bang 1593.716667 0.016667 1593.733333 (956.334565)
two-bang 1604.816667 1.350000 1606.166667 (963.803356)
-------------------------------- total: 4874.866667sec

user system total real
inject 5.183333 0.000000 5.183333 ( 3.102379)
non-bang zsh: segmentation fault ruby

Ow?

Also, I just thought of a similar way to accomplish the same thing:

x.report("zip") { Hash[ *id_list.zip(id_list.collect {|e| e.object_id})] }

Array#collect! won't work right with this, of course, but it seems to
have equally-bad performance. Is Array#inject just optimized for this,
or something?

The reason why you are seeing this (performance as well as timing) is
most likely caused by the different approach. When you use #inject you
just create one copy of the Array (the Hash). When you use #collect you
create at least one additional copy of the large array plus a ton of two
element arrays. That's way less efficient considering memory usage and
GC. You'll probably see much different results if your input array is
much shorter (try with 10 or 100 elements).

I also get a segfault (with 1.8.5 on Gentoo):

ruby -w test.rb
Rehearsal --------------------------------------------
inject 7.210000 0.870000 8.080000 ( 13.011414)
non-bang 4620.790000 179.730000 4800.520000 (6029.976497)
bang 4586.140000 200.530000 4786.670000 (5970.267190)
two-bang 4599.560000 268.080000 4867.640000 (6035.687979)
------------------------------- total: 14462.910000sec

user system total real
inject 10.740000 1.990000 12.730000 ( 15.500874)
non-bang Segmentation fault
hramrach@hp-tc2110:11(0) 06120053 16]~ $ ruby -v
ruby 1.8.5 (2006-12-04 patchlevel 2) [i686-linux]

Of course, it would be better to test with 1.8.6 but it takes quite
some time to retest ;-)

Thanks

Michal
 
R

Robert Klemme

So, I was tinkering with ways to build a hash out of transforming an
array, knowing the standard/idiomatic

id_list = [:test1, :test2, :test3]
id_list.inject({}) { |a,e| a[e]=e.object_id ; a }

I also decided to try something like this:

Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten]

and further (attempt to) optimize it via
Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten!]
and
Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!]

Running this via Benchmark#bmbm gives pretty interesting, and to me,
unexpected, results (on a 3.2 GHz P4, 1GB of RAM, FC5 with ruby 1.8.4)

require 'benchmark'
id_list = (1..1_000_000).to_a
Benchmark::bmbm do |x|
x.report("inject") { id_list.inject({}) { |a,e| a[e] = e.object_id ; a} }
x.report("non-bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten] }
x.report("bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten!] }
x.report("two-bang") { Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!] }
end

Rehearsal --------------------------------------------
inject 16.083333 0.033333 16.116667 ( 9.670747)
non-bang 1657.050000 1.800000 1658.850000 (995.425642)
bang 1593.716667 0.016667 1593.733333 (956.334565)
two-bang 1604.816667 1.350000 1606.166667 (963.803356)
-------------------------------- total: 4874.866667sec

user system total real
inject 5.183333 0.000000 5.183333 ( 3.102379)
non-bang zsh: segmentation fault ruby

Ow?

Also, I just thought of a similar way to accomplish the same thing:

x.report("zip") { Hash[ *id_list.zip(id_list.collect {|e| e.object_id})] }

Array#collect! won't work right with this, of course, but it seems to
have equally-bad performance. Is Array#inject just optimized for this,
or something?

The reason why you are seeing this (performance as well as timing) is
most likely caused by the different approach. When you use #inject you
just create one copy of the Array (the Hash). When you use #collect you
create at least one additional copy of the large array plus a ton of two
element arrays. That's way less efficient considering memory usage and
GC. You'll probably see much different results if your input array is
much shorter (try with 10 or 100 elements).

I also get a segfault (with 1.8.5 on Gentoo):

ruby -w test.rb
Rehearsal --------------------------------------------
inject 7.210000 0.870000 8.080000 ( 13.011414)
non-bang 4620.790000 179.730000 4800.520000 (6029.976497)
bang 4586.140000 200.530000 4786.670000 (5970.267190)
two-bang 4599.560000 268.080000 4867.640000 (6035.687979)
------------------------------- total: 14462.910000sec

user system total real
inject 10.740000 1.990000 12.730000 ( 15.500874)
non-bang Segmentation fault
hramrach@hp-tc2110:11(0) 06120053 16]~ $ ruby -v
ruby 1.8.5 (2006-12-04 patchlevel 2) [i686-linux]

Of course, it would be better to test with 1.8.6 but it takes quite
some time to retest ;-)

Thanks

Michal

I believe it's the star operator:

13:10:22 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 1000000
13:10:35 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 10000000
-e:1: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-cygwin]

Aborted (core dumped)
13:10:39 [Temp]: ruby -e 'def f(*a) end; f(*Array.new(ARGV.shift.to_i))'
1000000
13:11:09 [Temp]: ruby -e 'def f(*a) end; f(*Array.new(ARGV.shift.to_i))'
10000000
-e:1: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-cygwin]

Aborted (core dumped)
13:11:14 [Temp]: uname -a
CYGWIN_NT-5.2-WOW64 PDBXPWSRK38 1.5.24(0.156/4/2) 2007-01-31 10:57 i686
Cygwin
13:11:36 [Temp]: ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [i386-cygwin]

Kind regards

robert
 
F

Florian Frank

Robert said:
I believe it's the star operator:

13:10:22 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 1000000
13:10:35 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 10000000
-e:1: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-cygwin]

Aborted (core dumped)
13:10:39 [Temp]: ruby -e 'def f(*a) end; f(*Array.new(ARGV.shift.to_i))'
1000000
13:11:09 [Temp]: ruby -e 'def f(*a) end; f(*Array.new(ARGV.shift.to_i))'
10000000
-e:1: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-cygwin]

Aborted (core dumped)
13:11:14 [Temp]: uname -a
CYGWIN_NT-5.2-WOW64 PDBXPWSRK38 1.5.24(0.156/4/2) 2007-01-31 10:57 i686
Cygwin
13:11:36 [Temp]: ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [i386-cygwin]

The reason for this problem is, that the stack is used to pass A LOT of
arguments to the Hash.[] method. If the stack size of the executed process isn't
big enough (check your resource limits), this can lead to crashes.
 
R

Robert Klemme

Robert said:
I believe it's the star operator:

13:10:22 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 1000000
13:10:35 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 10000000
-e:1: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-cygwin]

Aborted (core dumped)
13:10:39 [Temp]: ruby -e 'def f(*a) end; f(*Array.new(ARGV.shift.to_i))'
1000000
13:11:09 [Temp]: ruby -e 'def f(*a) end; f(*Array.new(ARGV.shift.to_i))'
10000000
-e:1: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-cygwin]

Aborted (core dumped)
13:11:14 [Temp]: uname -a
CYGWIN_NT-5.2-WOW64 PDBXPWSRK38 1.5.24(0.156/4/2) 2007-01-31 10:57 i686
Cygwin
13:11:36 [Temp]: ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [i386-cygwin]

The reason for this problem is, that the stack is used to pass A LOT of
arguments to the Hash.[] method. If the stack size of the executed process isn't
big enough (check your resource limits), this can lead to crashes.

That was what I was guessing. However, I haven't enough insight to
verify. There is at least one point that causes me doubt: the array can
be handled and stored like any other array, so it should be on the heap
(or it is implicitly moved there).

Kind regards

robert
 
M

Mauricio Fernandez

Robert said:
I believe it's the star operator:

13:10:22 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 1000000
13:10:35 [Temp]: ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 10000000
-e:1: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i386-cygwin]
[...]
The reason for this problem is, that the stack is used to pass A LOT of
arguments to the Hash.[] method. If the stack size of the executed process
isn't
big enough (check your resource limits), this can lead to crashes.

That was what I was guessing. However, I haven't enough insight to
verify. There is at least one point that causes me doubt: the array can
be handled and stored like any other array, so it should be on the heap
(or it is implicitly moved there).

$ ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 2091000
-e:1:in `[]': stack level too deep (SystemStackError)
from -e:1
$ ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 2092000
Segmentation fault
$ ulimit -s
8192
$ ulimit -s 16384
$ ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 2092000
$ ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 4184000
-e:1:in `[]': stack level too deep (SystemStackError)
from -e:1
$ ruby -e 'Hash[*Array.new(ARGV.shift.to_i) ]' 4190000
Segmentation fault
 
F

Florian Frank

Robert said:
That was what I was guessing. However, I haven't enough insight to
verify. There is at least one point that causes me doubt: the array can
be handled and stored like any other array, so it should be on the heap
(or it is implicitly moved there).

Yes, the implementation of this is in eval.c's rb_call0, after that the
arguments are passed along via the argc/argv function arguments. Ruby also tries
to throw a SystemStackError in rb_call0 like Mauricio has shown, but this isn't
guaranteed to work in all cases.
 
N

Nobuyoshi Nakada

Hi,

At Wed, 13 Jun 2007 00:50:30 +0900,
Florian Frank wrote in [ruby-talk:255348]:
Yes, the implementation of this is in eval.c's rb_call0, after that the
arguments are passed along via the argc/argv function arguments. Ruby also tries
to throw a SystemStackError in rb_call0 like Mauricio has shown, but this isn't
guaranteed to work in all cases.

Actually, the definition of TMP_ALLOC. Try replacing #ifdef
C_ALLOCA with #if 1.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,149
Members
46,695
Latest member
StanleyDri

Latest Threads

Top