And so a test:
(REGEX) a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq
vs.
(INDEX) a.select{|e| a.index(e) !=3D a.rindex(e)}.uniq
I ran these on an array with 11,000 elements, 2,000 of which were
duplicates, as follows:
(1..10000).to_a + (2000..3000).to_a
The scan (REGEX) script finished in 0.21 seconds
The (INDEX) script ran in 28.77 seconds
The regex was over 100X faster!
Thanks for this. I was wondering if my method of benchmarking might be
flawed, and it sort of is in this case. But it makes sense here
because the scan just has to make one run through the string, whereas
each iteration of the select has multiple calls to an index which
might potentially traverse the entire array. As my benchmark shows
though, a lot of searches in a smaller array is faster using the
index.
Also another problem with the string scanning is that it depends on
the array elements being numbers (at least in the case above.)
I decided to benchmark again using your method and adding some more
implementations. Here we go (beware of wrapped lines):
require 'benchmark'
include Benchmark
[[[0,1,2,3,4,5,2,3], 50000],[(1..5000).to_a + (2000..2500).to_a,
1]].each do |a, iterations|
bm(12) do |bx|
bx.report('simonk') { iterations.times { a.select{|e| a.index(e)
!=3D a.rindex(e)}.uniq } }
bx.report('samk ') { iterations.times { a.select{|e|
a.grep(e).length > 1}.uniq } }
bx.report('paul ') { iterations.times { a.sort.join('
').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq } }
bx.report('jeff ') { iterations.times { a.uniq.inject(a.dup) {
|b,i| b.delete_at(b.index(i)) ; b } } }
bx.report('paolo ') { iterations.times { h =3D {};u =3D a.inject([])
{|res, x| h[x] ? res + [x] : h[x] =3D res}.uniq } }
bx.report('jegII ') { iterations.times { seen =3D
Hash.new(0);a.select { |e| (seen[e] +=3D 1) > 1 }.uniq } }
bx.report('simons') { iterations.times { uary =3D a.uniq;
a.map.delete_if{|i| x=3Duary.member?(i); uary.delete(i); x} } }
bx.report('robert') { iterations.times { a.inject(Hash.new(0))
{|h,e| h[e]+=3D1;h}.select {|k,v|v>1}.map {|x,y|x} } }
bx.report('david ') { iterations.times { a.uniq.find_all {|x|
a.find_all {|y| y =3D=3D x }.size > 1 } } }
bx.report('zach ') { iterations.times { t=3D[]; a.delete_if{ |e|
r=3D(not t.include? e); t.push(e); r }.uniq } }
bx.report('markvh') { iterations.times { t=3D[]; a.select{ |e|
r=3Dt.include?e; t<<e; r }.uniq } }
bx.report('martin') { iterations.times { h1 =3D {};h2 =3D {};a.each
{|i| h2
=3D true if h1;h1 =3D true };h2.keys } }
end
end
The results:
Small array, lots of iterations:
user system total real
simonk 1.344000 0.000000 1.344000 ( 1.344000)
samk 3.859000 0.000000 3.859000 ( 3.876000)
paul 3.532000 0.000000 3.532000 ( 3.547000)
jeff 1.640000 0.000000 1.640000 ( 1.641000)
paolo 2.656000 0.000000 2.656000 ( 2.672000)
jegII 1.704000 0.016000 1.720000 ( 1.719000)
simons 2.109000 0.000000 2.109000 ( 2.110000)
robert 3.172000 0.015000 3.187000 ( 3.203000)
david 5.078000 0.000000 5.078000 ( 5.079000)
zach 0.328000 0.000000 0.328000 ( 0.328000)
markvh 0.391000 0.000000 0.391000 ( 0.391000)
martin 0.390000 0.000000 0.390000 ( 0.484000)
Big array, one iteration:
user system total real
simonk 4.125000 0.000000 4.125000 ( 4.453000)
samk 17.125000 0.000000 17.125000 ( 17.283000)
paul 0.063000 0.000000 0.063000 ( 0.062000)
jeff 0.109000 0.000000 0.109000 ( 0.110000)
paolo 0.031000 0.000000 0.031000 ( 0.031000)
jegII 0.016000 0.000000 0.016000 ( 0.016000)
simons 2.219000 0.000000 2.219000 ( 2.265000)
robert 0.031000 0.000000 0.031000 ( 0.032000)
david 34.188000 0.016000 34.204000 ( 35.127000)
zach 2.156000 0.000000 2.156000 ( 2.203000)
markvh 0.015000 0.000000 0.015000 ( 0.016000)
martin 0.000000 0.000000 0.000000 ( 0.000000)
Overall Martin's was the best. As he suspected, David's was quite
slow. Sam's as well.
But it is interesting to see the varying results for each algorithm
based on how they are used.
Ryan