Knuth's description of RS requires distinct values.
Each item enters the reservoir iff its associated random
value is greater than the smallest already present (and
bumps the smallest out). If equality occurs an ambiguity
arises, and it's not clear how to resolve the ambiguity
without introducing a bias.
I figured it must be something like that, based on the requirement
you stated.
I don't have DDJ here to compare the algorithm you
mention with RS ;-)
I really should dig out the issue in question to provide a complete
description, but basically what it does is:
- Assume the domain being sampled contains at least as many
items (N) as the sample size (M). N is not known ahead of time.
- Populate the reservoir with the first M items from the domain
- Each remaining item is tested as a candidate for inclusion by
generating a random integer value between 0 and I-1, where I is the
total number of items read thus far. If the random value is < M, the
item in that slot in the reservoir is replaced by the current item.
(I hope that's right...)
Items read from the domain early on have a high probability of
entering the reservoir (it's 1 for the first M items), but they
undergo more tests and so are more likely to be bumped than items
that are added later.
The final item has exactly M in N (ie M/N) probability of being
added to the reservoir, since it undergoes only its initial test
for candidacy, and when it does I == N.
For all other items, the cumulative probability that they'll get
into the reservoir, and stay there, also works out to M/N.
For example:
Domain series {a,b,c,d}, reservoir size M=1.
1. a gets into the reservoir with probability 1.
2. a survives b's test with probability 1/2.
3. a survives c's test with probability 2/3 (c has a 1/3 chance).
4. a survives d's test with probability 3/4.
a's cumulative chance: (1)(1/2)(2/3)(3/4) = 6/24 = 1/4.
This algorithm just requires that you have an unbiased generator
that can provide intergers in the range [0,N-1].