Dominik said:
> Here is my solution:
> The result:
> $ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample
> real 0m30.493s
> user 0m29.435s
> sys 0m0.681s
> on a Pentium M 1500MHz, 512MB RAM
$ time ruby Sampling.rb 5_000_000 1_000_000_000 > big_sample
real 5m20.404s
user 5m18.600s
sys 0m1.480s
on a 1.2 GHz Duron with 512 MB RAM.
The odd thing is that I could not see where to optimise to get near the
below-one-minute mark.
In the end I tried 5 approaches (one inspired by Dominik's idea with the
Hash, one by Joost's).
I wrote a simple class to hold the different methods I used, where
@count is the number of members to be reported and @limit is the upper
bound of the range. @numbers is an Array and @numberHash is - guess what - a Hash.
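A minimal skeleton for such a class could look like this (the class name
Sampling is taken from the file name, and the constructor is only a sketch
of the instance variables described above, not necessarily the original code):

class Sampling
  def initialize(count, limit)
    @count = count         # number of members to be reported
    @limit = limit         # upper bound of the range
    @numbers = []          # Array that will hold the sample
    @numberHash = {}       # Hash used by the hash-based methods
  end
end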
The obvious method (at least for me) was to insert random numbers into
an Array, checking whether the Array already includes each number and
redo-ing the iteration if it does.
def generate_with_array_include
  @numbers.clear
  @count.times do
    newNumber = rand(@limit)
    if @numbers.include?(newNumber)
      redo
    else
      @numbers << newNumber
    end
  end
end
A different way to do the same is
def generate_with_uniq_first
  @numbers.clear
  @count.times do
    @numbers << rand(@limit)
  end
  @numbers.uniq!
  while @numbers.length < @count
    @numbers << rand(@limit)
    @numbers.uniq!
  end
end
The idea behind this is that with a big difference between @count and
@limit there is only a slight chance that Kernel#rand will return duplicate
numbers. That still needs to be checked - so call Array#uniq! to delete the
duplicates. After that, simply loop until the Array is full (i.e. it has the
required number of elements), calling uniq! in each iteration. That performed
better than 'generate_with_array_include' (about 30 times faster).
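To put a number on that "slight chance": the expected number of duplicate
pairs among @count draws from @limit possible values is roughly
@count * (@count - 1) / (2 * @limit) - about 12.5 for 5_000 draws out of
1_000_000 - so nearly all draws are already unique and uniq! has very
little to remove.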
def generate_with_uniq_second
  @numbers.clear
  while @numbers.length < @count
    (@count - @numbers.length).times do
      @numbers << rand(@limit)
    end
    @numbers.uniq!
  end
end
This is even faster (it takes only half the time of
'generate_with_uniq_first') and looks better to me.
The 4th method uses the Hash approach Dominik chose.
def generate_with_hash_has_key
  @numberHash.clear
  @count.times do
    newNumber = rand(@limit)
    if @numberHash.has_key?(newNumber)
      redo
    else
      @numberHash[newNumber] = nil
    end
  end
  @numbers = @numberHash.keys
end
This is an adaptation of my initial algorithm, using a Hash instead of
an Array.
This method is slightly faster than my second try with uniq! - with
@count being 5_000 and @limit 1_000_000. With bigger numbers the
difference grows.
The last line stores the keys in the @numbers Array, so I didn't need to
change my output method - which looks like this:
def to_s
  @numbers.sort.join("\n")
end
The 5th method is inspired by Joost's Hash#length idea. After I read it I
thought "This is so simple - why did you spend half the day thinking
about how to reach their 'below 1 minute' timings?".
def generate_with_hash_length
  @numberHash.clear
  while @numberHash.length < @count
    @numberHash[rand(@limit)] = nil
  end
  @numbers = @numberHash.keys
end
And this is the winner (from what I tried).
Here are some timings (without output) from a smaller sample (look at
'generate_with_array_include' - I really didn't want to use that one with
bigger numbers):
ruby -v Sampling.rb 50_000 10_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
user system total real
generate_with_uniq_first 17.790000 0.020000 17.810000 ( 17.806102)
generate_with_uniq_second 0.950000 0.010000 0.960000 ( 0.956298)
generate_with_array_include 243.300000 0.090000 243.390000 (243.969158)
generate_with_hash_has_key 0.310000 0.000000 0.310000 ( 0.308837)
generate_with_hash_length 0.020000 0.000000 0.020000 ( 0.019657)
And here are some timings (also without output) from the top 3 methods
(using the original parameters):
ruby -v Sampling.rb 5_000_000 1_000_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
user system total real
generate_with_uniq_second 95.240000 1.070000 96.310000 ( 96.511207)
generate_with_hash_has_key 63.570000 0.970000 64.540000 ( 64.641186)
generate_with_hash_length 5.720000 0.000000 5.720000 ( 5.718732)
So even my best method using Array and uniq! turned out not to be well
suited. Well, in the end I discovered Benchmark's bmbm method. It
gives different results for the second run:
ruby -v Sampling.rb 5_000_000 1_000_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
Rehearsal --------------------------------------------------------------
generate_with_uniq_second 95.640000 0.850000 96.490000 ( 96.563648)
generate_with_hash_has_key 63.280000 1.270000 64.550000 ( 64.687427)
generate_with_hash_length 5.720000 0.000000 5.720000 ( 5.718150)
--------------------------------------------------- total: 166.760000sec
user system total real
generate_with_uniq_second 99.790000 0.740000 100.530000 (100.740020)
generate_with_hash_has_key 45.050000 0.830000 45.880000 ( 45.987040)
generate_with_hash_length 3.130000 0.000000 3.130000 ( 3.132459)
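For reference, such a bmbm comparison can be set up along these lines (the
labels and the setup here are assumptions, not the original benchmark script):

require 'benchmark'

s = Sampling.new(5_000_000, 1_000_000_000)
Benchmark.bmbm(28) do |b|
  b.report("generate_with_uniq_second")  { s.generate_with_uniq_second }
  b.report("generate_with_hash_has_key") { s.generate_with_hash_has_key }
  b.report("generate_with_hash_length")  { s.generate_with_hash_length }
end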
I would never have thought about using a Hash in this quiz. I really do
not need the value of any stored key. It is an Array problem: a list of
numbers, sorted, unique.
Why use a Hash?
Some of you did use it. Do you mind telling me why?
Anyway, it was a great quiz.
Final timings, now using 'generate_with_hash_length':
time ruby -d Sampling.rb 5_000_000 1_000_000_000 > big_sample_new
real 1m17.537s
user 0m58.200s
sys 0m1.180s
Stefan Mahlitz