How to test if one dict is subset of another?

Jay Tee · Feb 19, 2007

Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()

where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

Any suggestions? Constraint : the answer has to work for both python
2.2 and 2.3 (and preferably all pythons after that).

JT

Diez B. Roggisch · Feb 19, 2007

Jay said:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()

where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

This would still need to run over all items in jobs. No gain.

j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

If you're jobs dictionary is immutable regarding the key-set (not from
it's implementation, but from its usage), the thing you can do to
enhance performance is to create an index. Take a predicate like

def p(j):
return j.get('user') == 'jeff'

and build a list

jeffs_jobs = [j for j in jobs if p(j)]

Then you can test only over these. Alternatively, if you have quite a
few of such predicate/action-pairs, try and loop once over all jobs,
applynig the predicates and actions accordingly.

Diez

Peter Otten · Feb 19, 2007

Jay said:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()

where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

Any suggestions? Constraint : the answer has to work for both python
2.2 and 2.3 (and preferably all pythons after that).

Use a RDBMS (a database), they tend to be good at this kind of operations.

Peter

Jay Tee · Feb 19, 2007

Use a RDBMS (a database), they tend to be good at this kind of operations.

yeah, one of the options is metakit ... sqlite and buzhug both looked
promising but the constraint of pythons 2.2 and 2.3 ruled that out.
disadvantage of metakit is that it's not pure python, meaning possible
integration problems. the system has to be deployed at 200+ sites
worldwide on a mix of RHEL 3 and 4 based systems, with some Debian
clusters thrown in, and running real production ...

hence my desire to find a pure-python solution if at all possible.
it's looking grim.

JT

Peter Otten · Feb 19, 2007

Jay said:
yeah, one of the options is metakit ... sqlite and buzhug both looked
promising but the constraint of pythons 2.2 and 2.3 ruled that out.
disadvantage of metakit is that it's not pure python, meaning possible
integration problems. the system has to be deployed at 200+ sites
worldwide on a mix of RHEL 3 and 4 based systems, with some Debian
clusters thrown in, and running real production ...

hence my desire to find a pure-python solution if at all possible.
it's looking grim.

The following may speed up things a bit if you have enough memory and your
data is queried more often than changed.

import sys

def generate_id():
for i in xrange(sys.maxint):
yield i
raise ImplementationRestriction

get_id = generate_id().next

class Record(dict):
def __init__(self, *args, **kw):
dict.__init__(self, *args, **kw)
assert not hasattr(self, "_id")
self._id = get_id()
def __setitem__(self, key, value):
raise ImmutableException
def __hash__(self):
return self._id
def __str__(self):
items = self.items()
items.sort()
return ", ".join(["%s: %s" % p for p in items])

records = dict.fromkeys([
Record(user="jack", start=42, state="running"),
Record(user="jack", start=47, state="running"),
Record(user="jack", start=46, state="queued"),
Record(user="jane", start=42, state="running"),
Record(user="jane", start=7),
])

def fill_cache(records):
cache = {}
for record in records:
for p in record.iteritems():
cache.setdefault(p, {})[record] = None
return cache

_cache = fill_cache(records)

def select(*data, **kw):
[data] = data
result = data
for p in kw.iteritems():
try:
c = _cache[p]
except KeyError:
c = {}
if not c:
return {}
result = dict.fromkeys([k for k in result if k in c])
if not result:
break
return result

if __name__ == "__main__":
for filter in [
dict(user="jack"),
dict(state="running"),
dict(user="jack", state="running"),
dict(state="undefined"),
dict(user="jane", state="queued")
]:
print "--- %s ---" % Record(filter)
result = select(records, **filter)
if result:
for record in result:
print record
else:
print "#no matching records"
print

The above code runs with Python 2.3; don't know about 2.2 as I don't have it
on my machine. Of course with sets it would look better...

Peter

Hendrik van Rooyen · Feb 20, 2007

Jay Tee said:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

When you are gathering the data and building the lists - why do you not
simultaneously build a dict of running jobs keyed by the Jeffs?

- Hendrik

Paul Rubin · Feb 20, 2007

Jay Tee said:
for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

Sounds like you need some backing data structures, like indexes
in a database, e.g. (untested, uses the cool new defaultdicts of 2.5):

index = defaultdict(lambda: defaultdict(set))
for j in jobs:
for k in j's dict:
index[k][j.get(k)].add(j)

Now for

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()

you'd just write:

for j in (index['user']['jeff'] & index['state']['running']):
do_something()

Jay Tee · Feb 20, 2007

> do_something()

you'd just write:

for j in (index['user']['jeff'] & index['state']['running']):
do_something()

Hi,

it looks very cool, except that one of the constraints mentioned is
that the solution has to work properly on pythons 2.2 and 2.3.
production systems, certified OS releases, that sort of stuff ...
can't just upgrade a few thousand machines worldwide for my little
program ;-)

Thanks ... I took the previous poster's suggestion (had also been
suggested earlier) and built cached versions of the job lists, indexed
on common queries, while building the master job list.

JT

Paul Rubin · Feb 20, 2007

Jay Tee said:
it looks very cool, except that one of the constraints mentioned is
that the solution has to work properly on pythons 2.2 and 2.3.

That thing doesn't really deeply depend on defaultdict, it's just
convenient. You can add a few more lines of code in the preprocessing
step to get around it.

Thanks ... I took the previous poster's suggestion (had also been
suggested earlier) and built cached versions of the job lists, indexed
on common queries, while building the master job list.

You can also cache queries with the scheme I suggested, and it lets
you make the cached lists for new queries quickly.

Jay Tee · Feb 20, 2007

Hi

your post had the following construct:

for j in (index['user']['jeff'] & index['state']['running']):
do_something()

but

Python 2.3.4 (#1, Oct 11 2006, 06:18:43)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

l1= [3, 4, 7, 2]
l2 = [2, 3]
l2 = [2, 3, 99]
l1 & l2

Click to expand...

Click to expand...

Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for &: 'list' and 'list'

what am I missing?

Thanks

JT

Paul Rubin · Feb 20, 2007

Jay Tee said:
l1= [3, 4, 7, 2]
l2 = [2, 3]
l2 = [2, 3, 99]
l1 & l2

Click to expand...

Click to expand...

Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for &: 'list' and 'list'

what am I missing?

They are sets, not lists.

from sets import Set as set # use in 2.3 and earlier

l1= set([3, 4, 7, 2])
l2 = set([2, 3])
l2 = set([2, 3, 99])
print l1 & l2

Steve Holden · Feb 20, 2007

Vishal said:
Is there an inbuilt library in Python which you can use to convert time in
seconds to hh:mm:ss format?
Thanks,
Vishal

Please don't ask a question by editing a reply to an existing thread:
your question is now filed on many people's computers under "How to test
if one dict us a subset of another".

Look at the strftime() function in the time module.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Steve Holden · Feb 20, 2007

Vishal said:
Is there an inbuilt library in Python which you can use to convert time in
seconds to hh:mm:ss format?
Thanks,
Vishal

Please don't ask a question by editing a reply to an existing thread:
your question is now filed on many people's computers under "How to test
if one dict us a subset of another".

Look at the strftime() function in the time module.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Jay Tee · Feb 20, 2007

They are sets, not lists.

from sets import Set as set # use in 2.3 and earlier

l1= set([3, 4, 7, 2])
l2 = set([2, 3])
l2 = set([2, 3, 99])
print l1 & l2

Thanks Paul, but:

bosui:~> python
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-20)] on linux2
Type "help", "copyright", "credits" or "license" for more information.Traceback (most recent call last):
File "<stdin>", line 1, in ?
ImportError: No module named sets

Paul Rubin · Feb 20, 2007

Jay Tee said:
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
ImportError: No module named sets

Hmm, well I think the sets module in 2.3 is written in Python, so you
could drop it into your application for use in 2.2. Better would be
to use the C version from 2.4 if you can. Or you could fake it with
dicts. Sets are really just dicts under the skin. Instead of
set([1,2,3]) you'd use {1:True, 2:True, 3:True}. To intersect
two of them (untested):

def intersection(d1,d2):
if len(d2) < len(d1):
# swap them so d1 is the smaller one
d1,d2 = d2,d1
r = {}
for k in d1.iterkeys():
if k in d2:
r[k] = True
return r

The d1,d2 swap is just an optimization, since access is constant-time
you want to scan the smaller dictionary to minimize the number of
lookups.

Jay Tee · Feb 20, 2007

Hi,

thanks! the code lift from 2.3 to 2.2 worked (thank Guido et al for
BACKWARDS COMPATIBILITY ;-)) ... unfortunately I was in a hurry to get
the release out since a colleague's cluster was croaking under the
load of the old, non-indexed version. Your solution is nicer looking
than mine, and leads to a less complex implementation. Rolling it
into the next release is going to have to wait a few weeks, I hope I
can remember it that long. Thanks!!

JT

Jay Tee said:
Jay Tee said:

Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
ImportError: No module named sets

Click to expand...

Hmm, well I think the sets module in 2.3 is written in Python, so you
could drop it into your application for use in 2.2. Better would be
to use the C version from 2.4 if you can. Or you could fake it with
dicts. Sets are really just dicts under the skin. Instead of
set([1,2,3]) you'd use {1:True, 2:True, 3:True}. To intersect
two of them (untested):

def intersection(d1,d2):
if len(d2) < len(d1):
# swap them so d1 is the smaller one
d1,d2 = d2,d1
r = {}
for k in d1.iterkeys():
if k in d2:
r[k] = True
return r

The d1,d2 swap is just an optimization, since access is constant-time
you want to scan the smaller dictionary to minimize the number of
lookups.

one enum is a subset of another	10	Jan 31, 2012
How do i get numberOfItemsHired to only accept 1-500 if it is outside those values error message should be displayed	10	Jul 5, 2024
How to start, if at all ?	2	Apr 18, 2022
how to detect that one XML is a subset of another one?	6	Apr 21, 2010
How to get education and coding job coming from abroad starting new in the US? Advice of courses or places to look?	2	May 18, 2023
The Easiest way to check if one array is a subset of another	2	Aug 4, 2009
What code do I add / overwrite so that the ebDriver' object has no attribute 'find_element_by_css_selector error is gone ?	0	Sep 19, 2022
How to test if a given socket descriptor is valid?	6	Nov 4, 2011

How to test if one dict is subset of another?

Jay Tee

Diez B. Roggisch

Peter Otten

Jay Tee

Peter Otten

Hendrik van Rooyen

Paul Rubin

Jay Tee

Paul Rubin

Jay Tee

Paul Rubin

Steve Holden

Steve Holden

Jay Tee

Paul Rubin

Jay Tee

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads