Steven Bethard
I have a list of dictionaries. Each dictionary holds counts of various
'words', e.g.:
py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]
I need to select dicts with the constraint that the number of each
'word' totalled over all selected dicts doesn't exceed a given
MAX_VALUE. Right now, I do this by:
py> def select(dicts, n):
.... result = []
.... counts = {}
.... def doadd(d):
.... for label, count in d.iteritems():
.... if counts.get(label, 0) + count > n:
.... return False
.... return True
.... for d in dicts:
.... if doadd(d):
.... result.append(d)
.... for label, count in d.iteritems():
.... counts[label] = counts.get(label, 0) + count
.... return result
....
Basically, I use a greedy approach -- adding a dict each time if I can.
This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select function
produces such a suboptimal solution:
py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]
The better solution would have been to select the second two dicts --
then all 'words' would have count 12.
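For a problem this small, an exhaustive search over all subsets will find that better selection. This isn't part of the original post -- just a sketch, with a hypothetical name `select_best`, of what "optimal" means here: maximize the summed counts subject to the per-word cap. It's exponential in the number of dicts, so it's only a sanity check, not a replacement for the greedy version.

```python
from itertools import combinations

def select_best(dicts, n):
    """Try every subset; keep the one whose per-word totals all stay
    <= n and whose overall summed counts are largest.  Exponential in
    len(dicts) -- only viable for small inputs."""
    best, best_total = [], -1
    for r in range(len(dicts) + 1):
        for subset in combinations(dicts, r):
            # Accumulate per-word totals for this candidate subset.
            counts = {}
            for d in subset:
                for label, count in d.items():
                    counts[label] = counts.get(label, 0) + count
            # Feasible if no word exceeds the cap.
            if all(v <= n for v in counts.values()):
                total = sum(counts.values())
                if total > best_total:
                    best, best_total = list(subset), total
    return best

countdicts = [dict(a=9, b=9, c=9), dict(a=8, b=7), dict(a=4, b=5, c=12)]
print(select_best(countdicts, 12))  # picks the second two dicts
```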
I don't need an optimal solution; in fact, the solution I have right now
is tolerable. But if anyone knows a simple way to improve my
performance, I'd appreciate it. Thanks!
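One cheap improvement that keeps the greedy core intact is a randomized restart: run the same greedy pass over several shufflings of the input and keep the selection with the largest summed counts. The sketch below (the name `select_shuffled` and the `tries` parameter are made up for illustration) would escape the bad ordering in the example above, since any ordering that doesn't start with the first dict yields the better answer.

```python
import random

def select_shuffled(dicts, n, tries=100):
    """Randomized-restart greedy: run the greedy pass over `tries`
    random orderings of the input and return the selection with the
    largest summed counts."""
    def greedy(order):
        counts, result = {}, []
        for d in order:
            # Add d only if no word's running total would exceed n.
            if all(counts.get(k, 0) + v <= n for k, v in d.items()):
                result.append(d)
                for k, v in d.items():
                    counts[k] = counts.get(k, 0) + v
        return result, sum(counts.values())

    # Start from the original ordering, then try random shuffles.
    best, best_total = greedy(dicts)
    order = list(dicts)
    for _ in range(tries):
        random.shuffle(order)
        result, total = greedy(order)
        if total > best_total:
            best, best_total = result, total
    return best
```

This costs `tries` times the original greedy pass, which is usually negligible, and it can never do worse than the single-pass version since the original ordering is tried first.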
STeVe
P.S. No, I'm not just making these problems up! I'm sick, but not that
sick. It has to do with resampling a set of data... If you're
really interested, I can provide the full details, but they'll only make
the problem even _more_ complicated. =)