A Friday Python Programming Pearl: random sampling

M

Mark Dickinson

For a lazy Friday evening, here's a Python algorithm that seemed so
cute that I just had to share it with everyone. I'm sure it's well
known to many here, but it was new to me. Skip directly to the
'sample2' function to see the algorithm and avoid the commentary...

Suppose that you want to select a number of elements, k, say, from a
population, without replacement. E.g., selecting 3 elements from
range(30) might give you:

[13, 3, 27]

Order matters, so the above is considered distinct from [3, 13, 27].
And you want to be sure that each possible selection has equal
probability of occurring (to within the limits of the underlying
PRNG).

One solution is to select elements from the population one-by-one,
keep track of the indices of already-selected elements in a set, and
if you end up selecting something that's already in your set, simply
try again. Something like this (code stolen and adapted from
Random.sample in Python's standard library 'random' module):

from random import randrange

def sample1(population, k):
n = len(population)
result = [None] * k
selected = set()
for i in range(k):
j = randrange(n)
# retry until we get something that's not already selected
while j in selected:
j = randrange(n)
selected.add(j)
result = population[j]
return result


N.B. The above is Python 3 code; for Python 2, replace range with
xrange.

All that's required of 'population' here is that it implements __len__
and __getitem__. The method works well for k significantly smaller
than n, but as k approaches n the number of reselections required
increases. So for larger k, Random.sample uses a different method:
roughly, make a copy of 'population', do a partial in-place shuffle of
that copy that randomizes the first k elements, and return those.
This second method isn't so great when k is small and n is huge, since
it ends up being O(n) from the list copy, but it works out that the
two methods complement each other nicely.

Looking at the above code, I was idly wondering whether there was a
way to alter 'sample1' to avoid the need for resampling, thus giving a
single algorithm that works reasonably efficiently regardless of the
population size and requested sample size. And it turns out that
there is. The code below is similar to 'sample1' above, except that
instead of using a set to keep track of indices of already-selected
members of the population, it uses a dict; for an index i
(corresponding to a member of the population), d gives the position
that population will occupy in the resulting sample.


from random import randrange

def sample2(population, k):
n = len(population)

d = {}
for i in reversed(range(k)):
j = randrange(i, n)
if j in d:
d = d[j]
d[j] = i

result = [None] * k
for j, i in d.items():
result = population[j]
return result


Note that no resampling is required, and that there's no copying of
the population list. The really clever bit is the 'if j in d: ...'
block. If you stare at the algorithm for long enough (and it does
take some staring), you can convince yourself that after the first
'for' loop, d can be any of the n*(n-1)*...*(n-k+1)
mappings-with-no-repeated-elements from some set of k elements of
range(n) to range(k), and that each one of these mappings is equally
likely to occur. In a sense, this d is the inverse of the desired
sample, which would be a map with no repetitions from range(k) to
range(n). So inverting d, and replacing d's keys by the corresponding
population elements, gives the random sample.

N.B. I don't claim any originality for the algorithm; only for the
implementation: the algorithm is based on an algorithm attributed to
Robert Floyd, and appearing in Jon Bentley's 'Programming Pearls' book
(though that algorithm produces a set, so doesn't worry about the
ordering of the sample). But I was struck by its beauty and
simplicity, and thought it deserved to be better known.

Happy Friday!
 
B

Bryan

Mark said:
N.B.  I don't claim any originality for the algorithm; only for the
implementation: the algorithm is based on an algorithm attributed to
Robert Floyd, and appearing in Jon Bentley's 'Programming Pearls' book

Actually it is the sequel, /More Programming Pearls/.
(though that algorithm produces a set, so doesn't worry about the
ordering of the sample).

Bentley presents a version of the Floyd algorithm that provides random
order, but it requires a set data type with some idea of order, as in
"insert j in s after t". As Mark Dickinson's version uses a normal
dict(), which Bentley had already introduced under the name "associate
array", I'd say Mark's version is an improvement.
 
M

Mark Dickinson

Actually it is the sequel, /More Programming Pearls/.

Thanks for the correction. I confess that I've yet to read either
book; I'll have to try to track them down.
Bentley presents a version of the Floyd algorithm that provides random
order, but it requires a set data type with some idea of order, as in
"insert j in s after t".

Ah, nice. The dict values, of course, exactly provide the necessary
idea of order, so I guess this amounts to pretty much the same thing.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,228
Members
46,817
Latest member
AdalbertoT

Latest Threads

Top