M
Mark Dickinson
For a lazy Friday evening, here's a Python algorithm that seemed so
cute that I just had to share it with everyone. I'm sure it's well
known to many here, but it was new to me. Skip directly to the
'sample2' function to see the algorithm and avoid the commentary...
Suppose that you want to select a number of elements, k, say, from a
population, without replacement. E.g., selecting 3 elements from
range(30) might give you:
[13, 3, 27]
Order matters, so the above is considered distinct from [3, 13, 27].
And you want to be sure that each possible selection has equal
probability of occurring (to within the limits of the underlying
PRNG).
One solution is to select elements from the population one-by-one,
keep track of the indices of already-selected elements in a set, and
if you end up selecting something that's already in your set, simply
try again. Something like this (code stolen and adapted from
Random.sample in Python's standard library 'random' module):
from random import randrange
def sample1(population, k):
n = len(population)
result = [None] * k
selected = set()
for i in range(k):
j = randrange(n)
# retry until we get something that's not already selected
while j in selected:
j = randrange(n)
selected.add(j)
result = population[j]
return result
N.B. The above is Python 3 code; for Python 2, replace range with
xrange.
All that's required of 'population' here is that it implements __len__
and __getitem__. The method works well for k significantly smaller
than n, but as k approaches n the number of reselections required
increases. So for larger k, Random.sample uses a different method:
roughly, make a copy of 'population', do a partial in-place shuffle of
that copy that randomizes the first k elements, and return those.
This second method isn't so great when k is small and n is huge, since
it ends up being O(n) from the list copy, but it works out that the
two methods complement each other nicely.
Looking at the above code, I was idly wondering whether there was a
way to alter 'sample1' to avoid the need for resampling, thus giving a
single algorithm that works reasonably efficiently regardless of the
population size and requested sample size. And it turns out that
there is. The code below is similar to 'sample1' above, except that
instead of using a set to keep track of indices of already-selected
members of the population, it uses a dict; for an index i
(corresponding to a member of the population), d gives the position
that population will occupy in the resulting sample.
from random import randrange
def sample2(population, k):
n = len(population)
d = {}
for i in reversed(range(k)):
j = randrange(i, n)
if j in d:
d = d[j]
d[j] = i
result = [None] * k
for j, i in d.items():
result = population[j]
return result
Note that no resampling is required, and that there's no copying of
the population list. The really clever bit is the 'if j in d: ...'
block. If you stare at the algorithm for long enough (and it does
take some staring), you can convince yourself that after the first
'for' loop, d can be any of the n*(n-1)*...*(n-k+1)
mappings-with-no-repeated-elements from some set of k elements of
range(n) to range(k), and that each one of these mappings is equally
likely to occur. In a sense, this d is the inverse of the desired
sample, which would be a map with no repetitions from range(k) to
range(n). So inverting d, and replacing d's keys by the corresponding
population elements, gives the random sample.
N.B. I don't claim any originality for the algorithm; only for the
implementation: the algorithm is based on an algorithm attributed to
Robert Floyd, and appearing in Jon Bentley's 'Programming Pearls' book
(though that algorithm produces a set, so doesn't worry about the
ordering of the sample). But I was struck by its beauty and
simplicity, and thought it deserved to be better known.
Happy Friday!
cute that I just had to share it with everyone. I'm sure it's well
known to many here, but it was new to me. Skip directly to the
'sample2' function to see the algorithm and avoid the commentary...
Suppose that you want to select a number of elements, k, say, from a
population, without replacement. E.g., selecting 3 elements from
range(30) might give you:
[13, 3, 27]
Order matters, so the above is considered distinct from [3, 13, 27].
And you want to be sure that each possible selection has equal
probability of occurring (to within the limits of the underlying
PRNG).
One solution is to select elements from the population one-by-one,
keep track of the indices of already-selected elements in a set, and
if you end up selecting something that's already in your set, simply
try again. Something like this (code stolen and adapted from
Random.sample in Python's standard library 'random' module):
from random import randrange
def sample1(population, k):
n = len(population)
result = [None] * k
selected = set()
for i in range(k):
j = randrange(n)
# retry until we get something that's not already selected
while j in selected:
j = randrange(n)
selected.add(j)
result = population[j]
return result
N.B. The above is Python 3 code; for Python 2, replace range with
xrange.
All that's required of 'population' here is that it implements __len__
and __getitem__. The method works well for k significantly smaller
than n, but as k approaches n the number of reselections required
increases. So for larger k, Random.sample uses a different method:
roughly, make a copy of 'population', do a partial in-place shuffle of
that copy that randomizes the first k elements, and return those.
This second method isn't so great when k is small and n is huge, since
it ends up being O(n) from the list copy, but it works out that the
two methods complement each other nicely.
Looking at the above code, I was idly wondering whether there was a
way to alter 'sample1' to avoid the need for resampling, thus giving a
single algorithm that works reasonably efficiently regardless of the
population size and requested sample size. And it turns out that
there is. The code below is similar to 'sample1' above, except that
instead of using a set to keep track of indices of already-selected
members of the population, it uses a dict; for an index i
(corresponding to a member of the population), d gives the position
that population will occupy in the resulting sample.
from random import randrange
def sample2(population, k):
n = len(population)
d = {}
for i in reversed(range(k)):
j = randrange(i, n)
if j in d:
d = d[j]
d[j] = i
result = [None] * k
for j, i in d.items():
result = population[j]
return result
Note that no resampling is required, and that there's no copying of
the population list. The really clever bit is the 'if j in d: ...'
block. If you stare at the algorithm for long enough (and it does
take some staring), you can convince yourself that after the first
'for' loop, d can be any of the n*(n-1)*...*(n-k+1)
mappings-with-no-repeated-elements from some set of k elements of
range(n) to range(k), and that each one of these mappings is equally
likely to occur. In a sense, this d is the inverse of the desired
sample, which would be a map with no repetitions from range(k) to
range(n). So inverting d, and replacing d's keys by the corresponding
population elements, gives the random sample.
N.B. I don't claim any originality for the algorithm; only for the
implementation: the algorithm is based on an algorithm attributed to
Robert Floyd, and appearing in Jon Bentley's 'Programming Pearls' book
(though that algorithm produces a set, so doesn't worry about the
ordering of the sample). But I was struck by its beauty and
simplicity, and thought it deserved to be better known.
Happy Friday!