selecting dictionaries to maximize counts

S

Steven Bethard

I have a list of dictionaries. Each dictionary holds counts of various
'words', e.g.:

py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]

I need to select dicts with the constraint that the number of each
'word' totalled over all selected dicts doesn't exceed a given
MAX_VALUE. Right now, I do this by:

py> def select(dicts, n):
.... result = []
.... counts = {}
.... def doadd(d):
.... for label, count in d.iteritems():
.... if counts.get(label, 0) + count > n:
.... return False
.... return True
.... for d in dicts:
.... if doadd(d):
.... result.append(d)
.... for label, count in d.iteritems():
.... counts[label] = counts.get(label, 0) + count
.... return result
....

Basically, I use a greedy approach -- adding a dict each time if I can.
This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select function
produces such a suboptimal solution:

py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts --
then all 'words' would have count 12.

I don't need an optimal solution; in fact, the solution I have right now
is tolerable. But if anyone knows a simple way to improve my
performance, I'd appreciate it. Thanks!

STeVe

P.S. No, I'm not just making these problems up! I'm sick, but not that
sick. ;) It has to do with resampling a set of data... If you're
really interested, I can provide the full details, but they'll only make
the problem even _more_ complicated. =)
 
J

John Machin

Steven said:
I have a list of dictionaries. Each dictionary holds counts of various
'words', e.g.:
Basically, I use a greedy approach -- adding a dict each time if I can.
This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select function
produces such a suboptimal solution:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts --
then all 'words' would have count 12.

I don't need an optimal solution; in fact, the solution I have right now
is tolerable. But if anyone knows a simple way to improve my
performance, I'd appreciate it. Thanks!

This is more than a bit OT but you've sucked me in :)

You should specify a function that assigns a score to valid solutions;
it's not apparent from the wording above and the example exactly what
that function might be. Thinking through the alternatives might help
you solve the problem yourself :)
 
S

Steven Bethard

John said:
Steven said:
I have a list of dictionaries. Each dictionary holds counts of
various 'words'...
...
Basically, I use a greedy approach -- adding a dict each time if I
can.
...
This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select
function produces such a suboptimal solution:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts --
then all 'words' would have count 12.

I don't need an optimal solution; in fact, the solution I have right
now is tolerable. But if anyone knows a simple way to improve my
performance, I'd appreciate it. Thanks!

You should specify a function that assigns a score to valid solutions;
it's not apparent from the wording above and the example exactly what
that function might be. Thinking through the alternatives might help
you solve the problem yourself :)

Yeah, I guess that's a good idea. Here's a scoring function that should
give the right range of scores (where 1.0 is perfect and 0.0 is failure):

py> from __future__ import division
py> def score(origdicts, selecteddicts, n):
.... origwords = set(word for d in origdicts for word in d)
.... counts = {}
.... for d in selecteddicts:
.... for word, count in d.iteritems():
.... counts[word] = counts.get(word, 0) + count
.... if counts[word] > n:
.... return 0.0
.... return sum(counts.itervalues())/len(origwords)/n
....
py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]
py> score(countdicts, countdicts, 12)
0.0
py> score(countdicts, select(countdicts, 12), 12)
0.75
py> score(countdicts, [dict(a=8, b=7), dict(a=4, b=5, c=12)], 12)
1.0

It's pretty easy to build a configuration that can't be solved with a
perfect score:

py> def subsets(lst):
.... if not lst:
.... yield []
.... else:
.... first = lst[:1]
.... for subset in subsets(lst[1:]):
.... yield subset
.... yield first + subset
....
py> countdicts = [
.... dict(a=1, b=2, c=3),
.... dict(a=4, b=5, c=6),
.... dict(a=7, b=8, c=9)]
py> for dicts in subsets(countdicts):
.... print score(countdicts, dicts, 9), dicts
....
0.0 []
0.222222222222 [{'a': 1, 'c': 3, 'b': 2}]
0.555555555556 [{'a': 4, 'c': 6, 'b': 5}]
0.777777777778 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}]
0.888888888889 [{'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c':
9, 'b': 8}]

Hmmm... I can't do an exhaustive search like this one here -- I have
~8000 dicts, so 2**N is definitely too large of a space to search. Now
that I have a scoring routine maybe I could do an A* search though...

Thanks for the input!

STeVe
 
B

Brian Beck

Steven said:
I have a list of dictionaries. Each dictionary holds counts of various
'words', e.g.:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]

I need to select dicts with the constraint that the number of each
'word' totalled over all selected dicts doesn't exceed a given
MAX_VALUE. Right now, I do this by:

Not that you can't still improve performance of course, but this is an
NP-complete problem if you didn't know, so don't bang your head too hard...
 
S

Steven Bethard

Brian said:
Steven said:
I have a list of dictionaries. Each dictionary holds counts of
various 'words', e.g.:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]

I need to select dicts with the constraint that the number of each
'word' totalled over all selected dicts doesn't exceed a given
MAX_VALUE. Right now, I do this by:

Not that you can't still improve performance of course, but this is an
NP-complete problem if you didn't know, so don't bang your head too hard...

Yeah, it seemed a lot like one of those box-packing type problems, so if
it's NP-complete, it wouldn't surprise me. That's one of the reasons I
mentioned trying A* in one of my other responses. However, in
retrospect, I don't think I can apply A* because I don't really have a
"goal state". I could call a selection where all "words" had MAX_VALUE
counts a "goal state", but I'm not guaranteed that such a state always
exists (and if it doesn't A* just degenerates into an exhaustive search).

Anyway, do you know what name this problem is usually discussed under?
If I knew what to google for, I could probably find at least a few
simple heuristics to try...

Thanks again,

STeVe
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top