Roy Smith
I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).
From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:
top = [-1]              # assume all x are >= 0
for x in input():       # input() stands in for the real data stream
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:    # keep only the K largest seen so far
        top.sort()
        top.pop(0)
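Something like a bounded min-heap is presumably what would give the
N log K bound. A rough sketch using the standard heapq module (the
stream and K names here are just placeholders for the real input and
cutoff):

import heapq

def top_k(stream, K):
    heap = []                          # min-heap of the K largest seen so far
    for x in stream:
        if len(heap) < K:
            heapq.heappush(heap, x)    # still filling up: keep everything
        elif x > heap[0]:              # heap[0] is the smallest kept value
            heapq.heapreplace(heap, x) # pop the smallest, push x: O(log K)
    return sorted(heap, reverse=True)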
I can see pathological cases (say, monotonically increasing input, where
every new value triggers a sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values) this should
be pretty close to N.
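A rough way to see why: for random input, the i-th value only beats the
current K-th largest with probability about K/i, so the expected number
of sorts is roughly K ln(N/K), which is tiny compared to N. A quick
empirical check of that (the values of N and K here are just
illustrative):

import math
import random

K = 100
N = 1_000_000
top = [-1]
sorts = 0
for x in (random.random() for _ in range(N)):  # stand-in random stream
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)
        sorts += 1
print(sorts, "sorts; expected roughly", round(K * math.log(N / K)))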
Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?