How to test if one dict is subset of another?

J

Jay Tee

Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()


where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

Any suggestions? Constraint : the answer has to work for both python
2.2 and 2.3 (and preferably all pythons after that).

JT
 
D

Diez B. Roggisch

Jay said:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()


where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

This would still need to run over all items in jobs. No gain.
j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

If you're jobs dictionary is immutable regarding the key-set (not from
it's implementation, but from its usage), the thing you can do to
enhance performance is to create an index. Take a predicate like

def p(j):
return j.get('user') == 'jeff'

and build a list

jeffs_jobs = [j for j in jobs if p(j)]

Then you can test only over these. Alternatively, if you have quite a
few of such predicate/action-pairs, try and loop once over all jobs,
applynig the predicates and actions accordingly.

Diez
 
P

Peter Otten

Jay said:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()


where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

Any suggestions? Constraint : the answer has to work for both python
2.2 and 2.3 (and preferably all pythons after that).

Use a RDBMS (a database), they tend to be good at this kind of operations.

Peter
 
J

Jay Tee

Use a RDBMS (a database), they tend to be good at this kind of operations.

yeah, one of the options is metakit ... sqlite and buzhug both looked
promising but the constraint of pythons 2.2 and 2.3 ruled that out.
disadvantage of metakit is that it's not pure python, meaning possible
integration problems. the system has to be deployed at 200+ sites
worldwide on a mix of RHEL 3 and 4 based systems, with some Debian
clusters thrown in, and running real production ...

hence my desire to find a pure-python solution if at all possible.
it's looking grim.

JT
 
P

Peter Otten

Jay said:
yeah, one of the options is metakit ... sqlite and buzhug both looked
promising but the constraint of pythons 2.2 and 2.3 ruled that out.
disadvantage of metakit is that it's not pure python, meaning possible
integration problems. the system has to be deployed at 200+ sites
worldwide on a mix of RHEL 3 and 4 based systems, with some Debian
clusters thrown in, and running real production ...

hence my desire to find a pure-python solution if at all possible.
it's looking grim.

The following may speed up things a bit if you have enough memory and your
data is queried more often than changed.

import sys

def generate_id():
for i in xrange(sys.maxint):
yield i
raise ImplementationRestriction

get_id = generate_id().next

class Record(dict):
def __init__(self, *args, **kw):
dict.__init__(self, *args, **kw)
assert not hasattr(self, "_id")
self._id = get_id()
def __setitem__(self, key, value):
raise ImmutableException
def __hash__(self):
return self._id
def __str__(self):
items = self.items()
items.sort()
return ", ".join(["%s: %s" % p for p in items])

records = dict.fromkeys([
Record(user="jack", start=42, state="running"),
Record(user="jack", start=47, state="running"),
Record(user="jack", start=46, state="queued"),
Record(user="jane", start=42, state="running"),
Record(user="jane", start=7),
])

def fill_cache(records):
cache = {}
for record in records:
for p in record.iteritems():
cache.setdefault(p, {})[record] = None
return cache

_cache = fill_cache(records)

def select(*data, **kw):
[data] = data
result = data
for p in kw.iteritems():
try:
c = _cache[p]
except KeyError:
c = {}
if not c:
return {}
result = dict.fromkeys([k for k in result if k in c])
if not result:
break
return result

if __name__ == "__main__":
for filter in [
dict(user="jack"),
dict(state="running"),
dict(user="jack", state="running"),
dict(state="undefined"),
dict(user="jane", state="queued")
]:
print "--- %s ---" % Record(filter)
result = select(records, **filter)
if result:
for record in result:
print record
else:
print "#no matching records"
print

The above code runs with Python 2.3; don't know about 2.2 as I don't have it
on my machine. Of course with sets it would look better...

Peter
 
H

Hendrik van Rooyen

Jay Tee said:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

When you are gathering the data and building the lists - why do you not
simultaneously build a dict of running jobs keyed by the Jeffs?

- Hendrik
 
P

Paul Rubin

Jay Tee said:
for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

Sounds like you need some backing data structures, like indexes
in a database, e.g. (untested, uses the cool new defaultdicts of 2.5):

index = defaultdict(lambda: defaultdict(set))
for j in jobs:
for k in j's dict:
index[k][j.get(k)].add(j)

Now for
if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()

you'd just write:

for j in (index['user']['jeff'] & index['state']['running']):
do_something()
 
J

Jay Tee

> do_something()

you'd just write:

for j in (index['user']['jeff'] & index['state']['running']):
do_something()

Hi,

it looks very cool, except that one of the constraints mentioned is
that the solution has to work properly on pythons 2.2 and 2.3.
production systems, certified OS releases, that sort of stuff ...
can't just upgrade a few thousand machines worldwide for my little
program ;-)

Thanks ... I took the previous poster's suggestion (had also been
suggested earlier) and built cached versions of the job lists, indexed
on common queries, while building the master job list.

JT
 
P

Paul Rubin

Jay Tee said:
it looks very cool, except that one of the constraints mentioned is
that the solution has to work properly on pythons 2.2 and 2.3.

That thing doesn't really deeply depend on defaultdict, it's just
convenient. You can add a few more lines of code in the preprocessing
step to get around it.
Thanks ... I took the previous poster's suggestion (had also been
suggested earlier) and built cached versions of the job lists, indexed
on common queries, while building the master job list.

You can also cache queries with the scheme I suggested, and it lets
you make the cached lists for new queries quickly.
 
J

Jay Tee

Hi

your post had the following construct:
for j in (index['user']['jeff'] & index['state']['running']):
do_something()

but

Python 2.3.4 (#1, Oct 11 2006, 06:18:43)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
l1= [3, 4, 7, 2]
l2 = [2, 3]
l2 = [2, 3, 99]
l1 & l2
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for &: 'list' and 'list'

what am I missing?

Thanks

JT
 
P

Paul Rubin

Jay Tee said:
l1= [3, 4, 7, 2]
l2 = [2, 3]
l2 = [2, 3, 99]
l1 & l2
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for &: 'list' and 'list'

what am I missing?

They are sets, not lists.

from sets import Set as set # use in 2.3 and earlier

l1= set([3, 4, 7, 2])
l2 = set([2, 3])
l2 = set([2, 3, 99])
print l1 & l2
 
S

Steve Holden

Vishal said:
Is there an inbuilt library in Python which you can use to convert time in
seconds to hh:mm:ss format?
Thanks,
Vishal
Please don't ask a question by editing a reply to an existing thread:
your question is now filed on many people's computers under "How to test
if one dict us a subset of another".

Look at the strftime() function in the time module.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007
 
S

Steve Holden

Vishal said:
Is there an inbuilt library in Python which you can use to convert time in
seconds to hh:mm:ss format?
Thanks,
Vishal
Please don't ask a question by editing a reply to an existing thread:
your question is now filed on many people's computers under "How to test
if one dict us a subset of another".

Look at the strftime() function in the time module.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007
 
J

Jay Tee

They are sets, not lists.

from sets import Set as set # use in 2.3 and earlier

l1= set([3, 4, 7, 2])
l2 = set([2, 3])
l2 = set([2, 3, 99])
print l1 & l2

Thanks Paul, but:

bosui:~> python
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-20)] on linux2
Type "help", "copyright", "credits" or "license" for more information.Traceback (most recent call last):
File "<stdin>", line 1, in ?
ImportError: No module named sets
 
P

Paul Rubin

Jay Tee said:
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
ImportError: No module named sets

Hmm, well I think the sets module in 2.3 is written in Python, so you
could drop it into your application for use in 2.2. Better would be
to use the C version from 2.4 if you can. Or you could fake it with
dicts. Sets are really just dicts under the skin. Instead of
set([1,2,3]) you'd use {1:True, 2:True, 3:True}. To intersect
two of them (untested):

def intersection(d1,d2):
if len(d2) < len(d1):
# swap them so d1 is the smaller one
d1,d2 = d2,d1
r = {}
for k in d1.iterkeys():
if k in d2:
r[k] = True
return r

The d1,d2 swap is just an optimization, since access is constant-time
you want to scan the smaller dictionary to minimize the number of
lookups.
 
J

Jay Tee

Hi,

thanks! the code lift from 2.3 to 2.2 worked (thank Guido et al for
BACKWARDS COMPATIBILITY ;-)) ... unfortunately I was in a hurry to get
the release out since a colleague's cluster was croaking under the
load of the old, non-indexed version. Your solution is nicer looking
than mine, and leads to a less complex implementation. Rolling it
into the next release is going to have to wait a few weeks, I hope I
can remember it that long. Thanks!!

JT

Jay Tee said:
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
ImportError: No module named sets

Hmm, well I think the sets module in 2.3 is written in Python, so you
could drop it into your application for use in 2.2. Better would be
to use the C version from 2.4 if you can. Or you could fake it with
dicts. Sets are really just dicts under the skin. Instead of
set([1,2,3]) you'd use {1:True, 2:True, 3:True}. To intersect
two of them (untested):

def intersection(d1,d2):
if len(d2) < len(d1):
# swap them so d1 is the smaller one
d1,d2 = d2,d1
r = {}
for k in d1.iterkeys():
if k in d2:
r[k] = True
return r

The d1,d2 swap is just an optimization, since access is constant-time
you want to scan the smaller dictionary to minimize the number of
lookups.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,997
Messages
2,570,241
Members
46,833
Latest member
BettyeMacf

Latest Threads

Top