[QUIZ] Sampling (#39)

Paolo Capriotti · Jul 17, 2005

Unless I'm mistaken, in the 5e6-from-1e9 sampling, the probability

that a sampling contains exactly one number from a given 200-numbers
interval is 200.0*(1/200)*(199.0/200)**199 =3D 0.3688. The probability
that this happens for 20 such 200-numbers intervals is
0.3688**20 =3D 2.1e-09.

I think you are wrong. A rough estimates of this probability gives me
1e-2171438.
Actually, it should be a bit smaller, but at least the order of
magnitude of the order of magnitude is right

Quite impressive...

Paolo

Kero · Jul 17, 2005

PS: Just for the record, if someone implemented a simplified approach which

I feel the exact opposite, as it turns out.

Often times, I find it more interesting when people find the more creative
solution. Clearly here, the more creative solution is orders of magnitude
faster than the "original sought for answer". Sure, this is a quiz, and you
want people to do something specific, but I also look at these kinds of
things as a way to flex my mental muscles.

I think it is a lovely way to catch people not implementing
specifications. Plenty of people that don't understand specifications,
but implement "something" anyway. Including their own tests. Go on,
point them to their faults. You'll have to find the fault first...
It usually gets worse when threading or random numbers are involved.

This quiz is being useful in more aspects then I could ever have dreamed

Which isn't to say I don't like "creative solutions". I work at a
research department, I make my living from them

Sometimes, the "best" solution
_is_ the fastest solution, and I for one don't think you should fault a
solution that was within the confines of the problem but didn't exactly
mimic the answer the question poser had in mind.

I think we got a clear definition of random by now. I doubt this
'binning' solution works when MEMBERS approaches LIMIT

(if 2 out of 3 always includes member 1 (or 3), you would get suspicous
after a while, wouldn't you?).

+--- Kero ------------------------- kero@chello@nl ---+
| all the meaningless and empty words I spoke |
| Promises -- The Cranberries |
+--- M38c --- http://members.chello.nl/k.vangelder ---+

ruby quiz · Jul 17, 2005

=20
I think you are wrong. A rough estimates of this
probability gives me
1e-2171438.
Actually, it should be a bit smaller, but at least
the order of
magnitude of the order of magnitude is right
Quite impressive...
=20
Paolo
=20
=20

I think he is wrong too (sorry. Missed the previous
post, so I don't know who it was) but I think I think
it's wrong for a different reason.

I got the probability of getting exactly one number in
a specific interval to be as follows:
Probability of one *specific* number in the interval
1/20
Probability of all the other 19 OUT of that interval
(19/20)**19
But we have 20 specific numbers to consider, so:
20 * 1/20 * (19/20)**19
or
(19.0/20)**19
which gives a remarkably similar 0.3773536 (don't know
if this is a co-incidence. My brain hurts

Where I think this goes wrong is in assuming that
expanding this to 20 intervals just needs a **20.=20

After we get exactly one number in the first interval,
we then are looking for exactly one number in the
second interval, which is only one interval out of 19.
In other words, the probability should
be(18.0/19)**18. and so forth. The overall probability
therefore is
(19.0/20)**19 * (18.0/19)**18 * ... * (2.0/1)**1
which my RubyShell informs me is 0.377353602535307e-8,
or thereabouts

Graeme

Send instant messages to your online friends http://uk.messenger.yahoo.co=
m=20

Mauricio Fernández · Jul 17, 2005

After we get exactly one number in the first interval,
we then are looking for exactly one number in the
second interval, which is only one interval out of 19.
In other words, the probability should
be(18.0/19)**18. and so forth. The overall probability
therefore is
(19.0/20)**19 * (18.0/19)**18 * ... * (2.0/1)**1
which my RubyShell informs me is 0.377353602535307e-8,
or thereabouts

I believed it was

PI(i*200, i=1..20) / 4000^20 = PI(i, i=1..20) / 20^20
=> 2.32019615953125e-08

Wybo Dekker · Jul 17, 2005

My final: < 27s, 7 lines, filesize average approaches 49_444_444

$ time ruby sample1 5_000_000 1_000_000_000 >t; wc -l sample1; wc -c t
elap 26.503 user 25.930 syst 0.542 CPU 99.88%
7 sample1
49444971 t

Or, chaining some operations: < 28s, 3 lines:

$ time ruby sample2 5_000_000 1_000_000_000 >t; wc -l sample2; wc -c t
elap 27.780 user 27.181 syst 0.592 CPU 99.97%
3 sample2
49445003 t

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 9
cpu MHz : 2395.242
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 4734.97

Paolo Capriotti · Jul 17, 2005

=20
I got the probability of getting exactly one number in
a specific interval to be as follows:
Probability of one *specific* number in the interval
1/20
Probability of all the other 19 OUT of that interval
(19/20)**19
But we have 20 specific numbers to consider, so:
20 * 1/20 * (19/20)**19
or
(19.0/20)**19
which gives a remarkably similar 0.3773536 (don't know
if this is a co-incidence. My brain hurts

When multiplying probabilities, one has to be careful about event
independence. In this particular case (hypergeometric distribution),
it turns out that the best way to compute this probability is to use
the very definition of uniform distribution of samplings:
probability =3D (# of samplings satisfying this property) / (total # of
samplings).
Since these numbers are huge, it is better to compute their logarithms
(and use stirling approximation to estimate binomial coefficients),
obtaining 1e-10875 as an upper bound.
The result that I posted before was instead the probability that _all_
numbers fall in each of the 200-numbers intervals (sorry for that, I
had misunderstood the question).
Here is the exact probability:
200**20 * binom(10**9 - 4000, 5*10**6 - 20) / binom(10**9, 5*10**6).

Paolo

Dominik Bathon · Jul 17, 2005

Here is my solution:

The result:

$ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample

real 0m30.493s
user 0m29.435s
sys 0m0.681s
$ ll big_sample
-rw-r--r-- 1 dba users 49443878 17. Jul 11:29 big_sample
$ wc -l big_sample
5000000 big_sample
$ uniq big_sample | wc -l
5000000
$ head big_sample
624
756
1293
1506
1607
1627
2406
2979
3217
3396
$ tail big_sample
999998726
999998795
999998805
999998901
999998957
999999083
999999114
999999193
999999353
999999514

on a Pentium M 1500MHz, 512MB RAM

The algorithm:

Nothing special. Just store the seen values in a hash, check if new value=
s =20
are already in the hash and when finished, return the sorted keys of the =
=20
hash. This needs about 19s on my machine.

The (real) problem:

Writing the results to stdout. As it turned out, this is very slow if you=
=20
just do x.each { |el| puts el }. After some experiments I figured out, =20
that garbage collection is the problem.
Each puts el generates two new Strings (didn't really check that): el.to_=
s =20
an "\n". If you do this 5000000 times the garbage collector is triggered =
=20
very often. It gets faster if you do something like that:

nl =3D "\n"
x.each { |el| print el, nl }

But the "real solution" is GC.disable. That has a downside of course. The=
=20
above run of sample.rb needs approx. 400MB of RAM. So, don't try this at =
=20
home if you have less than 512MB ;-)

I also have a debug/benchmark feature, that prints the time for each =20
phase, just do:

$ time ruby -d sample.rb 5_000_000 1_000_000_000 > big_sample
start at 3.79085540771484e-05
rand at 14.2052850723267
sort at 19.433424949646
print at 31.1665999889374

real 0m31.411s
user 0m30.318s
sys 0m0.757s

The code:

if $DEBUG
def ptime(evt)
$ptimeg ||=3D Time.now.to_f
STDERR.puts "#{evt} at #{Time.now.to_f - $ptimeg}"
end
else
def ptime(evt)
# noop
end
end

# the actuall sampling, just store the seen values in a hash and return t=
he
# sorted hash keys
def sample(cnt, lim)
x =3D {}
tmp =3D nil

for i in 0...cnt
while x.has_key?(tmp =3D rand(lim))
end
x[tmp] =3D true
end
ptime "rand"

x =3D x.keys.sort
ptime "sort"
x
end

# this is the key to success, but needs lots of ram
GC.disable

ptime "start"

x =3D sample(cnt=3DARGV[0].to_i, ARGV[1].to_i)

# creating the newline string only once saves 5s
nl =3D "\n"
i =3D 0
while i+10 <=3D cnt
# this is saves about 1s
print x, nl, x[i+1], nl, x[i+2], nl, x[i+3], nl, x[i+4],
nl, x[i+5], nl, x[i+6], nl, x[i+7], nl, x[i+8], nl, x[i+9], nl
i +=3D 10
end
for j in i...cnt
print x[j], nl
end
ptime "print"

Joost Diepenmaat · Jul 17, 2005

From: "Joost Diepenmaat said:
From: "Joost Diepenmaat said:

"Olaf Klischat" <[email protected]>

See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I used the same approach. I wasn't sure if it was legal
or not. But it was sure easy...

I'm sure it was

I'm now down to 1.24 seconds without "cheating":

Click to expand...

erm, make that 1 MINUTE 24 seconds. I'm not *that* good

Click to expand...

I *think* I've got a non-cheating one (I'll leave that to
y'all to decide later) that runs in:

$ time ruby sample-c.rb 5_000_000 1_000_000_000 > big_sample-c.txt
real 0m55.169s
user 0m53.860s
sys 0m1.210s

on the 2.4GHz Celeron system.

Hmmm...

I've had a few insights and I'm now down to

time ./sample2 5_000_000 1_000_000_000 >big.txt
real 0m44.134s
user 0m42.760s
sys 0m0.891s

also on a 2.5Ghz Celeron

head big.txt
22
40
246
382
591
683
704
789
1797
1857

tail big.txt
999997650
999997651
999997907
999998021
999998375
999998984
999999147
999999169
999999462
999999989

Joost.

Joost Diepenmaat · Jul 17, 2005

Here are *two* solutions.

The first one was my final attempt before looking at Dominik Bathon's
code:

$time ./sample2 5_000_000 1_000_000_000 >big.txt
done

real 0m45.417s
user 0m42.871s
sys 0m0.830s

That means that it's about 7 seconds faster than Dominik's code on my
machine

$ head big.txt
46
117
267
628
885
1152
1202
1293
1327
1528

$ tail big.txt
999998089
999998134
999998217
999998746
999998822
999998890
999998939
999999029
999999059
999999329

$ wc -l big.txt
5000000 big.txt

## CODE ##

#!/usr/bin/ruby -w

STDOUT.sync = false
GC.disable

num,ceil = ARGV.map { |s| s.to_i }
keep = {}
warn "init keeplist"

# just dumping the values in and then checking the length
# of the hash is *much* faster than checking if there already
# is a value in there.

while keep.length < num
keep[rand(ceil)] = true
end

warn "converting to array"

keep2 = keep.keys.sort

# note that, even with GC disabled, it's still faster to
# convert integers to strings before printing than to rely on
# the automatic conversion

warn "converting to strings"
keep3 = []
while i = keep2.shift
keep3 << i.to_s
end

warn "printing"

puts keep3

warn "done"

## END ##

I've also made a more optimized version based on some observations from
Dominik Bathon's solution:

$ time ./sample3 5_000_000 1_000_000_000 >big.txt
init keeplist
converting to array
writing

real 0m29.010s
user 0m27.775s
sys 0m0.990s

$ head big.txt
926
1118
1274
1607
1740
1870
1909
2059
2268
2291

$ tail big.txt
999997896
999998081
999998251
999998367
999998762
999998765
999998847
999999195
999999347
999999615

$ wc -l big.txt
5000000 big.txt

Warning: this code takes up almost 500 Mb to run!

## CODE ##

#!/usr/bin/ruby -w

STDOUT.sync = false
GC.disable

num,ceil = ARGV.map { |s| s.to_i }
keep = {}
warn "init keeplist"

# heavily unrolled loops

# shaves of quite a bit of time

smpl = num - 100
while keep.length <= smpl
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
keep[rand(ceil)] = true
end

while keep.length < num
keep[rand(ceil)] = true
end

warn "converting to array"

keep2 = keep.keys.sort
$,="\n"
e =""

warn "writing"
while keep2.length >= 100
STDOUT.write [
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
keep2.shift.to_s,
e];
end

puts keep2 unless keep2.empty?;
exit;

## END ##

Joost.

Reto Schuettel · Jul 17, 2005

Hi

Thats my solution. It isn't really fast, but that wasn't the target anywa=
y:

| #!/usr/bin/ruby
|=20
| def sample(members, max)=20
| results =3D []
|=20
| 0.upto(max -1) do |nr|
| nrs_available =3D (max - nr)
| nrs_needed =3D (members - results.length)
|=20
| # calculate the ratio between the needed members and the available =
nrs=20
| ratio =3D nrs_needed.to_f / nrs_available
|=20
| if (ratio + rand) >=3D 1
| results << nr=20
| end
| end
|=20
| return results
| end
|=20
| members =3D ARGV.shift.to_i
| max =3D ARGV.shift.to_i=20
|=20
| raise "MEMBERS must be smaller than MAX" unless members <=3D max
|=20
| puts sample(members, max)

With 10 members and a max of 100 the script produces the following distri=
bution
after 100K runs:
http://rafb.net/paste/results/XCxeNI68.nln.html (distribution in percen=
t)

Gruss
Reto Sch=FCttel

Bill Kelly · Jul 17, 2005

Hi,

On a 2.4 GHz Celeron, I got:

real 0m55.169s
user 0m53.860s
sys 0m1.210s

...after adding GC.disable, I get:

real 0m31.565s
user 0m30.500s
sys 0m1.060s

This was my final solution (sans GC.disable, since I'd
forgotten about that 'till reading Dominik's solution.)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/usr/bin/env ruby

num_samples = ARGV.shift.to_i
upper_bound = ARGV.shift.to_i

uniq = {}
data = []

warn "calc..."

num_samples.times do
r = rand(upper_bound)
if uniq[r]
num_samples.times do
r = 0 if (r += 1) >= upper_bound
break if uniq[r].nil?
end
end
data << uniq[r] = r
end

warn "sort..."
data.sort!

warn "stringify..."
data.map! {|n| n.to_s }

warn "join..."
res = data.join("\n")

warn "output..."
puts res

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I append the samples to the data array as they are produced,
because it seemed to be faster than doing Hash#keys afterward.
(It may have not been very significant... I don't remember.)

I fiddled with how to output the data for awhile... puts of
the entire string was lightning fast, and joining an array
of strings is pretty fast (considerably faster than joining
an array of fixnums, surprisingly.) Mapping the array of
fixnums to strings explicitly prior to calling #join, seemed
faster than letting #join do the conversion (though this
seems strange/counterintuitive to me.)

My method of "finding the next slot" in a loop on a collision
may be overly cheesy - I don't know. I did consider just
asking for a new sample until finding one that didn't collide
(I like Joost's "dumping the values in and checking the
length" approach.) However I've tended to avoid that approach,
since when the number of samples desired approaches the
upper_bound, e.g. 5_000_000 5_000_000 worst case, I've been
burned in the past by such algorithms working very hard to find
the last few empty slots. However, it clearly wasn't a problem
for the 5_000_000 1_000_000_000 parameters for this quiz - I
just wanted to explain why I did it differently.

Here's the output -

$ time ruby sample-c.rb 5_000_000 1_000_000_000 > big_sample-c.txt
calc...
sort...
stringify...
join...
output...

real 0m55.169s
user 0m53.860s
sys 0m1.210s

$ wc big_sample-c.txt
5000000 5000000 49445562 big_sample-c.txt

$ head big_sample-c.txt
13
41
870
1225
1281
1434
1649
1921
1991
3047

$ tail big_sample-c.txt
999997887
999998139
999998335
999998632
999998893
999998947
999999169
999999219
999999271
999999587

Thanks for the fun quiz! It's the first one I've completed.

Regards,

Bill

Stefan Mahlitz · Jul 17, 2005

Dominik said:
Here is my solution:

The result:

$ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample

real 0m30.493s
user 0m29.435s
sys 0m0.681s

on a Pentium M 1500MHz, 512MB RAM

$ time ruby Sampling.rb 5_000_000 1_000_000_000 > big_sample

real 5m20.404s
user 5m18.600s
sys 0m1.480s

on a 1.2 GHz Duron with 512 MB RAM.

The odd thing is, that I could not see where to optimise to get near the
below 1 minute border.

In the end I tried 5 approaches (one being inspired by Dominiks idea
with the Hash, one by Joost).

I wrote a simple class to hold the different methods I used, where
@count is the number of members to be reported, @limit is the upper
bound of the range. @numbers is an Array and @numberHash a - guess - Hash.

The obvious method (at least for me) was to insert a random number into
an Array, checking whether the Array did not include the number and
redo-ing if it was already included.

def generate_with_array_include
@numbers.clear
@count.times do
newNumber = rand(@limit)
if @numbers.include?(newNumber)
redo
else
@numbers << newNumber
end
end
end

A different way to do the same is

def generate_with_uniq_first
@numbers.clear
@count.times do
@numbers << rand(@limit)
end
@numbers.uniq!
while @numbers.length < @count
@numbers << rand(@limit)
@numbers.uniq!
end
end

The idea behind this is, that with a big difference between @count and
@limit there is a slight chance, that no duplicate numbers will be
returned by Kernel#rand. That needs to be checked - so call Array#uniq!
to delete the duplicates. After that simply loop until the Array is full
(means, it has the required number of elements), calling uniq! in each
iteration. That performed better than 'generate_with_array_include' (30
times faster).

def generate_with_uniq_second
@numbers.clear
while @numbers.length < @count
(@count - @numbers.length).times do
@numbers << rand(@limit)
end
@numbers.uniq!
end
end

This is even faster (takes only half the time as
'generate_with_uniq_first') and looks better to me.

The 4th method uses the Hash approach Dominik chose.

def generate_with_hash_has_key
@numberHash.clear
@count.times do
newNumber = rand(@limit)
if @numberHash.has_key?(newNumber)
redo
else
@numberHash[newNumber] = nil
end
end
@numbers = @numberHash.keys
end

This is an adaptation to using hash instead of Array from my initial
algorithm.

This method is slightly faster than my second try with uniq! - with
@count being 5_000 and @limit 1_000_000. With bigger numbers the
difference grows.

The last line stores the keys in the @numbers-Array, so I didn't need to
change my output method - which looks like that:

def to_s
@numbers.sort.join("\n")
end

The 5th method is inspired by Joosts Hash#length idea. After I read it I
thought "This is so simple - why did you spend half the day thinking
about how to reach their 'below 1 minute' timings?".

def generate_with_hash_length
while @numberHash.length < @count
@numberHash[rand(@limit)] = nil
end
@numbers = @numberHash.keys
end

And this is the winner (from what I tried).

Here are some timings (without output) from a smaller sample (look at
'generate_with_array_include', I really didn't want to use this with
bigger numbers)

ruby -v Sampling.rb 50_000 10_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
user system total real
generate_with_uniq_first 17.790000 0.020000 17.810000 ( 17.806102)
generate_with_uniq_second 0.950000 0.010000 0.960000 ( 0.956298)
generate_with_array_include 243.300000 0.090000 243.390000 (243.969158)
generate_with_hash_has_key 0.310000 0.000000 0.310000 ( 0.308837)
generate_with_hash_length 0.020000 0.000000 0.020000 ( 0.019657)

And here some timings (without output as well) from the top3 methods
(using the original parameters)

ruby -v Sampling.rb 5_000_000 1_000_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
user system total real
generate_with_uniq_second 95.240000 1.070000 96.310000 ( 96.511207)
generate_with_hash_has_key 63.570000 0.970000 64.540000 ( 64.641186)
generate_with_hash_length 5.720000 0.000000 5.720000 ( 5.718732)

So, even my best method using Array and uniq! turned out to be not
suited. Well, at the end I discovered the bmbm method of Benchmark. It
gives different results for the second run:

ruby -v Sampling.rb 5_000_000 1_000_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
Rehearsal --------------------------------------------------------------
generate_with_uniq_second 95.640000 0.850000 96.490000 ( 96.563648)
generate_with_hash_has_key 63.280000 1.270000 64.550000 ( 64.687427)
generate_with_hash_length 5.720000 0.000000 5.720000 ( 5.718150)
--------------------------------------------------- total: 166.760000sec

user system total real
generate_with_uniq_second 99.790000 0.740000 100.530000 (100.740020)
generate_with_hash_has_key 45.050000 0.830000 45.880000 ( 45.987040)
generate_with_hash_length 3.130000 0.000000 3.130000 ( 3.132459)

I would have never thought about using a Hash in this quiz. I really do
not need the value of any stored key. It is an Array-problem. List of
numbers. Sorted. (unique)

Why use a Hash?

Some of you did use it. Do you mind telling me why?

Anyway, it was a great quiz.

Final timings, now using 'generate_with_hash_length':

time ruby -d Sampling.rb 5_000_000 1_000_000_000 > big_sample_new

real 1m17.537s
user 0m58.200s
sys 0m1.180s

Stefan Mahlitz

Joost Diepenmaat · Jul 17, 2005

My method of "finding the next slot" in a loop on a collision
may be overly cheesy - I don't know. I did consider just
asking for a new sample until finding one that didn't collide
(I like Joost's "dumping the values in and checking the
length" approach.) However I've tended to avoid that approach,
since when the number of samples desired approaches the
upper_bound, e.g. 5_000_000 5_000_000 worst case, I've been
burned in the past by such algorithms working very hard to find
the last few empty slots. However, it clearly wasn't a problem
for the 5_000_000 1_000_000_000 parameters for this quiz - I
just wanted to explain why I did it differently.

I thought about that too, but I chose to ignore the problem because:

a) I assumed that if you want random samples from a set, you generally
want much less samples than the size of the total set.

b) As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).
I did a few runs that suggest that sample.size == total.size / 2 is
relatively slow on my implementation, but not impractal. Your milage
may vary if you use a crappy random number generator, ofcourse

Joost.

Joost Diepenmaat · Jul 17, 2005

The 5th method is inspired by Joosts Hash#length idea. After I read it I
thought "This is so simple - why did you spend half the day thinking
about how to reach their 'below 1 minute' timings?".

It took me 2 days to simplify it down to that

Joost.

Matthew D Moss · Jul 17, 2005

Here are three versions I wrote, in order. Definitely not fast, but
they work.

# Version 1
(k, n) = ARGV.map { |s| s.to_i }
n.times do |i|
r = rand * (n - i)
unless (k < r)
puts i
k -= 1
end
end

# Version 2
(k, n) = ARGV.map { |s| s.to_i }
puts (0...n).sort_by { rand }[0...k].sort!

# Version 3
(k, n) = ARGV.map { |s| s.to_i }
x = (0...n).to_a
k.times do |i|
r = i + rand(n - i)
x,x[r] = x[r],x
end
x[0...k].sort

Adam Sanderson · Jul 17, 2005

Here is my sample, it sounds like some other folks did a moving window
like this, and that some people also thought it wasn't truly random.
As far as I'm concerned, it's random enough

This was a fairly fun quiz. I was thinking of also trying to do
something along the lines of picking a random number within the range,
and then picking from within the resulting ranges. It would avoid one
having to sort the results, and you wouldn't get collisions. Maybe
I'll whip it up later.

------------------------------------------------
Tiny bit of code
------------------------------------------------
# Moving window Sample
# Adam Sanderson
max = ARGV.pop.to_i
samples = ARGV.pop.to_i

step = max / samples
min = -step

samples.times{
puts( (min += step) + rand(step) )
}

------------------------------------------------
Results
------------------------------------------------
netghost@mu /cygdrive/c/Projects
$ time ruby number_sample.rb 5_000_000 1_000_000_000 > num_res.txt

real 0m36.757s
user 0m0.015s
sys 0m0.000s

netghost@mu /cygdrive/c/Projects
$ tail num_res.txt
999998016
999998349
999998503
999998767
999998840
999999156
999999281
999999530
999999779
999999987

netghost@mu /cygdrive/c/Projects
$ head num_res.txt
40
263
560
640
925
1003
1288
1428
1626
1901

Ezra Zygmuntowicz · Jul 17, 2005

Here was my original naive solution which got me down to 28 seconds
but as people have noted was not really random:

#!/usr/local/bin/ruby
members, limit, index = ARGV[0].to_i, ARGV[1].to_i, 0
member_range = limit / members
0.upto(members-1) do
res = rand member_range
puts res + index
index += member_range
end

That was pretty fast but wasn't a real solution. Here is my final
solution:

#!/usr/local/bin/ruby

GC.disable

members, limit, samples = ARGV[0].to_i, ARGV[1].to_i, {}

loop do
break unless samples.length < members
samples[rand(limit)] = 1
end

result = samples.keys.sort

puts result

It looks like I went pretty much the same route that Joost went.
But I didn't optimize it as much as he did. The results are:

ezras-powerbook-g4-17:~/Desktop ez$ time ./sample.rb 5_000_000
1_000_000_000 >big_sample.txt

real 0m47.496s
user 0m44.523s
sys 0m2.013s
ezras-powerbook-g4-17:~/Desktop ez$ wc big_sample.txt
5000000 5000000 49443742 big_sample.txt
ezras-powerbook-g4-17:~/Desktop ez$ head big_sample.txt
162
312
314
400
564
731
868
1086
1235
1344
ezras-powerbook-g4-17:~/Desktop ez$ tail big.txt
999998406
999998718
999998808
999998843
999998848
999999159
999999307
999999776
999999847
999999864

This was a fun quiz. I really enjoyed it. Thanks James

-Ezra Zygmuntowicz
WebMaster
Yakima Herald-Republic Newspaper
(e-mail address removed)
509-577-7732

James Edward Gray II · Jul 17, 2005

Here is my solution:

I need a little revenge, since you slaughtered my initial quiz
time...

Neo:~/Documents/Ruby/Ruby Quiz/solutions/Dominik Bathon$ time ./
sample 999_999 1_000_000 > /dev/null

real 0m28.151s
user 0m27.846s
sys 0m0.298s
Neo:~/Documents/Ruby/Ruby Quiz/solutions/Dominik Bathon$ cd ../James\
Edward\ Gray\ II/
Neo:~/Documents/Ruby/Ruby Quiz/solutions/James Edward Gray II$ time ./
sample 999_999 1_000_000 > /dev/null

real 0m5.903s
user 0m5.884s
sys 0m0.019s

Not really picking on you Dominik. Just showing that there are other
problems in this challenge besides the one I initially posted.

My code above is what I used in creating the quiz. I'll post it when
I get some time.

James Edward Gray II

Dominik Bathon · Jul 18, 2005

I need a little revenge, since you slaughtered my initial quiz time... = =20

I have wondered about this, since I first read this week's quiz: Did your=
=20
solution really need half an hour? What is it doing that long?

Neo:~/Documents/Ruby/Ruby Quiz/solutions/Dominik Bathon$ time ./sample = =20
999_999 1_000_000 > /dev/null

real 0m28.151s
user 0m27.846s
sys 0m0.298s
Neo:~/Documents/Ruby/Ruby Quiz/solutions/Dominik Bathon$ cd ../James\ =20
Edward\ Gray\ II/
Neo:~/Documents/Ruby/Ruby Quiz/solutions/James Edward Gray II$ time ./=20
sample 999_999 1_000_000 > /dev/null

real 0m5.903s
user 0m5.884s
sys 0m0.019s

Not really picking on you Dominik. Just showing that there are other =20
problems in this challenge besides the one I initially posted.

Yes, I (and most of the others) didn't handle that special case. As Joost=
=20
Diepenmaat said in his reply to Bill Kelly's solution:

As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).

I've extended my solution a bit to see how many values are rejected until=
=20
999_999 unique values in 0...1_000_000 are found: about 12_000_000

Olaf Klischat · Jul 18, 2005

Stefan Mahlitz said:
I really do
not need the value of any stored key.
Right.

Why use a Hash?

Efficiency. Array#include? runs in linear time (i.e. the running time
is proportional to the size of the array because all elements have to
be tested sequentially), while Hash#[] runs in constant time (i.e. the
running time is independent of the size of the hash). So the
array-based solution will scale very badly for large numbers.

[SOLUTION] [QUIZ] Sampling (#39)	1	Jul 18, 2005
[SUMMARY] Sampling (#39)	0	Jul 21, 2005
Minimum Total Difficulty	0	Nov 15, 2023
Register Question	0	Oct 21, 2024
I would like to use awk to calculate the total number of records processed	1	Aug 25, 2022
[QUIZ] Befunge (#184)	24	Nov 22, 2008
[QUIZ] Counting Toothpicks (#111)	47	Jan 26, 2007
[QUIZ] Symbolify (#169)	74	Jul 11, 2008

[QUIZ] Sampling (#39)

Paolo Capriotti

Kero

ruby quiz

Mauricio Fernández

Wybo Dekker

Paolo Capriotti

Dominik Bathon

Joost Diepenmaat

Joost Diepenmaat

Reto Schuettel

Bill Kelly

Stefan Mahlitz

Joost Diepenmaat

Joost Diepenmaat

Matthew D Moss

Adam Sanderson

Ezra Zygmuntowicz

James Edward Gray II

Dominik Bathon

Olaf Klischat

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads