[QUIZ] Sampling (#39)

J

James Edward Gray II

I'll just wait until the last and final specification comes out...

The quiz does say both unique and sorted. I've posted here about my
mistake of forgetting to mention random and updated the Ruby Quiz
site to reflect this.

The task now includes everything I intended. Consider it finalized. :)

James Edward Gray II
 
J

Joost Diepenmaat

Dual 2.0 Ghx G5. 2 Gigs of RAM.


IO is certainty a part of this quiz. The benchmark library is your
friend... ;)

One of the interesting findings for me was that

puts array_of_strings

is *very* much faster than

puts array_of_integers

Joost.
 
J

Joost Diepenmaat

One of the interesting findings for me was that

puts array_of_strings

is *very* much faster than

puts array_of_integers

Joost.

And by the way:
time ./sample2 5_000_000 1_000_000_000 >big.txt
real 2m5.395s
user 1m39.892s
sys 0m11.576s
head big.txt 141
441
854
876
968
1123
1276
1317
1557
1587

tail big.txt
999998078
999998164
999998558
999998626
999998650
999998814
999998912
999999127
999999160
999999385

Slightly more than 2 minutes wallclock time. That means it's just fast
enough to keep me from getting bored and killing the process :) I
managed to shave off about 8 minutes run time by trying a couple of
different strategies with the IO and int -> string conversions. The
"sample algorithm" itself is the second-dumbest I could think of - very
easy to prove correct, though :)

system info:

Single Intel(R) Celeron(R) CPU 2.40GHz w / 1GB of memory running Linux
2.6.11

Joost.
 
Y

Yohanes Santoso

7 lines (no funky, 'unnatural' lines), 30-ish seconds.

ysantoso@jenny:~/tmp$ wc -l ./sample.rb
7 ./sample.rb

ysantoso@jenny:~/tmp$ /usr/bin/time ruby1.8 ./sample.rb 5_000_000 1_000_000_000 > big_sample.txt
29.00user 1.10system 0:32.32elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+474minor)pagefaults 0swaps

ysantoso@jenny:~/tmp$ ls -la ./big_sample.txt
-rw-r--r-- 1 ysantoso ysantoso 49444445 2005-07-16 03:05 ./big_sample.txt

ysantoso@jenny:~/tmp$ wc -l ./big_sample.txt
5000000 ./big_sample.txt

ysantoso@jenny:~/tmp$ head big_sample.txt
31
275
451
647
909
1090
1354
1527
1784
1964

ysantoso@jenny:~/tmp$ tail big_sample.txt
999998095
999998357
999998521
999998724
999998875
999999105
999999237
999999537
999999600
999999828

ysantoso@jenny:~/tmp$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) XP 2100+
stepping : 1
cpu MHz : 1729.356
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse pni syscall mmxext 3dnowext 3dnow
bogomips : 3416.06


YS.
 
W

Wybo Dekker

The quiz does say both unique and sorted. I've posted here about my mistake
of forgetting to mention random and updated the Ruby Quiz site to reflect
this.

The task now includes everything I intended. Consider it finalized. :)

sorry - I missed the original posting.

My results:

Averages over 83 runs, CPU 2.4GHz / 1GB, algorithm (without IO) 4 lines:
calculation 18.59 s
printing 15.55 s
total 34.14 s
filesize 49444596 +/- 852 (standard deviation)

I wonder what other people find for filesize and standard deviation?
 
M

Mauricio Fernández

Heres my final solutions time. again on a 1.6Ghz G5. I got the code
down to 7 lines and I don't think I can get it any faster.

ezra:~/Sites/______raill_stuff ez$ time ./sample.rb 5_000_000
1_000_000_000 > big_sample.txt

real 0m31.023s
user 0m29.406s
sys 0m0.857s

Let the cheating begin:

batsman@tux-chan:/tmp$ time ruby ./sample2.rb 5_000_000 1000_000_000 > big_sample.txt

real 0m2.637s
user 0m2.370s
sys 0m0.270s
batsman@tux-chan:/tmp$ wc big_sample.txt
5000000 5000000 49444494 big_sample.txt
batsman@tux-chan:/tmp$ tail big_sample.txt
999998676
999998939
999998953
999999212
999999272
999999344
999999665
999999724
999999908
999999982
batsman@tux-chan:/tmp$ head big_sample.txt
184
271
649
900
930
1266
1277
1370
1620
1842
 
O

Olaf Klischat

Ruby Quiz said:
This week's Ruby quiz is a classic sampling problem that I've seen many
interesting solutions for in the past.

The challenge is to implement a program called "sample" that takes exactly two
integer inputs. The first of those should be the number of members you would
like included in the sample. The second is the upper boundary (exclusive) of
the range of integers members can be selected from. The lower boundary is zero
(inclusive).

Your program should output exactly the requested number of members from the
defined range to STDOUT, one number per line. Each member must be unique and
they should appear in sorted order.
(and truly random)

Well, how about requiring O(1) memory complexity for the solution?
Could be too unchallenging otherwise :)
 
O

Olaf Klischat

Ezra Zygmuntowicz said:
ezra:~/Sites ez$ time ./sample.rb 5_000_000 1_000_000_000 >
big_sample.txt

real 0m43.838s
user 0m35.820s
sys 0m1.360s
ezra:~/Sites ez$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444445 Jul 15 10:46 big_sample.txt
ezra:~/Sites ez$ head big_sample.txt
168
285
566
604
912
1183
1335
1473
1728
1919
ezra:~/Sites ez$ tail big_sample.txt
999998155
999998313
999998484
999998680
999998825
999999151
999999330
999999465
999999621
999999877
ezra:~/Sites ez$

Umm... I'm not sure, but that looks a bit too "equidistant" to be
truly random, doesn't it?

The sample being truly random means that the sample should be a truly
"drawing without putting back" (e.g. lottery) sample, so each possible
sample occurs with equal probability. So a sample like

0
1
2
..
..
..
4999999

should occur with the same probability as any other more "likely" one.

That rules out certain more "obvious" (non-)implementations.
 
J

Jacob Fugal

=20
Umm... I'm not sure, but that looks a bit too "equidistant" to be
truly random, doesn't it?
=20
The sample being truly random means that the sample should be a truly
"drawing without putting back" (e.g. lottery) sample, so each possible
sample occurs with equal probability. So a sample like
=20
0
1
2
..
..
..
4999999
=20
should occur with the same probability as any other more "likely" one.

See my other posts in this thread about the actual probabilities. In
short, since it's fully random, each possible sampling is as likely as
any other possible sampling, but the number of samplings including at
least ten numbers in the 99999xxxx range and at least ten numbers in
the (00000)xxxx range is *much* higher than the number of samplings
without numbers in those ranges. So the probability of getting a
sampling that looks evenly spread out is much more likely than getting
a sampling that's clustered.

Jacob Fugal
 
J

James Edward Gray II

Well, how about requiring O(1) memory complexity for the solution?
Could be too unchallenging otherwise :)

Feel free to make your own challenge as tricky as you like it.

James Edward Gray II
 
O

Olaf Klischat

Jacob Fugal said:
See my other posts in this thread about the actual probabilities. In
short, since it's fully random, each possible sampling is as likely as
any other possible sampling, but the number of samplings including at
least ten numbers in the 99999xxxx range and at least ten numbers in
the (00000)xxxx range is *much* higher than the number of samplings
without numbers in those ranges. So the probability of getting a
sampling that looks evenly spread out is much more likely than getting
a sampling that's clustered.

Of course. I didn't mean to say that a really "clustered" sample like
0...499999999 (or any other specific sample) has any significant
probability.

But if you look at Ezra's output:

0
168 200


999999800
999999877
1000000000

See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I caught this because I had the same idea first :)

[1]

Unless I'm mistaken, in the 5e6-from-1e9 sampling, the probability
that a sampling contains exactly one number from a given 200-numbers
interval is 200.0*(1/200)*(199.0/200)**199 = 0.3688. The probability
that this happens for 20 such 200-numbers intervals is
0.3688**20 = 2.1e-09.
 
E

Ezra Zygmuntowicz

--Apple-Mail-4--653025233
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=US-ASCII;
delsp=yes;
format=flowed

999999621
999999800

999999877
1000000000

See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I caught this because I had the same idea first :)

[1]

Unless I'm mistaken, in the 5e6-from-1e9 sampling, the probability
that a sampling contains exactly one number from a given 200-numbers
interval is 200.0*(1/200)*(199.0/200)**199 = 0.3688. The probability
that this happens for 20 such 200-numbers intervals is
0.3688**20 = 2.1e-09.
Yep you caught me Olaf ;-) It looks like quite a few people stumbled
upon this 'pseudo' random sampling technique. I'm working on a true
implementation of this but so far it take __much__ longer than my
naive impl. Looking at the rules though.. I'm not certain that my
fast technique is illegal. I guess we will wait until tomorrow and
let the quiz master decide ;-)

-Ezra Zygmuntowicz
WebMaster
Yakima Herald-Republic Newspaper
(e-mail address removed)
509-577-7732


--Apple-Mail-4--653025233--
 
J

James Edward Gray II

Looking at the rules though.. I'm not certain that my fast
technique is illegal. I guess we will wait until tomorrow and let
the quiz master decide ;-)

I consider myself more the impartial announcer. I leave it to you to
decide exactly what the quiz means to you... ;)

James Edward Gray II
 
B

Bill Kelly

From: "Olaf Klischat said:
See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I used the same approach. I wasn't sure if it was legal
or not. But it was sure easy... :)


Regards,

Bill
 
J

Joost Diepenmaat

From: "Olaf Klischat said:
See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I used the same approach. I wasn't sure if it was legal
or not. But it was sure easy... :)

I'm sure it was :)

I'm now down to 1.24 seconds without "cheating":

time ./sample2 5_000_000 1_000_000_000 >big.txt
real 1m24.165s
user 1m17.554s
sys 0m5.252s

head big.txt
30
64
65
760
1136
1418
1536
1796
2190
2306

tail big.txt
999998474
999998602
999998676
999998802
999998909
999999036
999999048
999999080
999999196
999999467

wc -l big.txt
5000000 big.txt

But I'm pretty much out of ideas on how to make it faster.

Joost.
 
J

Joost Diepenmaat

From: "Olaf Klischat said:
See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I used the same approach. I wasn't sure if it was legal
or not. But it was sure easy... :)

I'm sure it was :)

I'm now down to 1.24 seconds without "cheating":


erm, make that 1 MINUTE 24 seconds. I'm not *that* good :)

J.
 
R

Randy Kramer

From: "Olaf Klischat said:
See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I used the same approach. I wasn't sure if it was legal
or not. But it was sure easy... :)

This is off-point for this particular post, and I apologize for not
reading/remembering the original challenge, but when someone commented about
it seeming odd that there were so many numbers prefixed, iirc, with something
like 999999.., I got to thinking--the odd thing about this challenge is
sorting the numbers. I don't think many of us are used to looking at a
sorted list of random numbers and trying to judge whether they are random (if
that's possible).

Two examples:

If you took 5 random numbers from in the interval 0..10, you might see
something like this: 9, 3, 2, 6,4--"looks" fairly random. Now sort that
list: 2, 3, 4, 6, 9--"looks" a lot less random.

For even more fun, do it for a series of coin flips (with H for heads, T for
tails, (and E for edge ;-))--you might get a sequence (for 6 flips) like H,
T, T, H, T, H. Now sort that list: H, H, H, T, T, T--doesn't look very
random at all.

regards,
Randy Kramer

PS: Just for the record, if someone implemented a simplified approach which
took a random number on each interval of 200 numbers in the original range, I
would have to say that was the wrong approach (or, maybe more accurately, I
would hope that I had specified the problem well enough so that such was not
an acceptable approach).
 
B

Brad Wilson

------=_Part_5859_517123.1121560729718
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

=20
PS: Just for the record, if someone implemented a simplified approach=20
which
took a random number on each interval of 200 numbers in the original=20
range, I
would have to say that was the wrong approach (or, maybe more accurately,= =20
I
would hope that I had specified the problem well enough so that such was= =20
not
an acceptable approach).


I feel the exact opposite, as it turns out.

Often times, I find it more interesting when people find the more creative=
=20
solution. Clearly here, the more creative solution is orders of magnitude=
=20
faster than the "original sought for answer". Sure, this is a quiz, and you=
=20
want people to do something specific, but I also look at these kinds of=20
things as a way to flex my mental muscles. Sometimes, the "best" solution=
=20
_is_ the fastest solution, and I for one don't think you should fault a=20
solution that was within the confines of the problem but didn't exactly=20
mimic the answer the question poser had in mind.

- Brad

------=_Part_5859_517123.1121560729718--
 
B

Bill Kelly

From: "Joost Diepenmaat said:
From: "Olaf Klischat" <[email protected]>

See? One number per 200-numbers interval. Every time[1]. This hints at
a wrong implementation.

I used the same approach. I wasn't sure if it was legal
or not. But it was sure easy... :)

I'm sure it was :)

I'm now down to 1.24 seconds without "cheating":

erm, make that 1 MINUTE 24 seconds. I'm not *that* good :)

I *think* I've got a non-cheating one (I'll leave that to
y'all to decide later) that runs in:

$ time ruby sample-c.rb 5_000_000 1_000_000_000 > big_sample-c.txt
real 0m55.169s
user 0m53.860s
sys 0m1.210s

on the 2.4GHz Celeron system.


$ wc big_sample-c.txt
5000000 5000000 49445562 big_sample-c.txt

$ head big_sample-c.txt
13
41
870
1225
1281
1434
1649
1921
1991
3047

$ tail big_sample-c.txt
999997887
999998139
999998335
999998632
999998893
999998947
999999169
999999219
999999271
999999587


Regards,

Bill
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,175
Messages
2,570,946
Members
47,497
Latest member
PilarLumpk

Latest Threads

Top