Intersection of lists/sets -- with a catch

James Stroud · Oct 18, 2005

Hello All,

I find myself in this situation from time to time: I want to compare two lists
of arbitrary objects and (1) find those unique to the first list, (2) find
those unique to the second list, (3) find those that overlap. But here is the
catch: comparison is not straight-forward. For example, I will want to
compare 2 objects based on a set of common attributes. These two objects need
not be members of the same class, etc. A function might help to illustrate:

def test_elements(element1, element2):
"""
Returns bool.
"""
# any evaluation can follow
return (element1.att_a == element2.att_a) and \
(element1.att_b == element2.att_b)

So my very naieve solution usually bears resemblance to the following code
(made inelegant by the irritating necessity to keep track of a counter):

for some_element in some_list:
i = 0
while i < len(another_list):
another_element = another_list
overlap = test_elements(some_element, another_element)
if overlap:
store_overlap(some_element, another_element)
# speed-up, same as i+=1
another_list.pop(i)
break
else:
i += 1
if not overlap:
store_unique_some(some_element)

After this loop, "identical" elements (according to the evaluation criteria)
from the two lists are stored somewhere, elements unique to 'some_list' are
stored somewhere else, and 'another_list' has only elements originally unique
to itself according to the evaluation criteria remaining. Of course this code
assumes that both original lists contained only unique elements within
themselves and were previously sorted based on the evaluation criteria (to
speed-up the loop).

So, one improvement I can think of is to use an iterator for the inner loop:

for some_element in some_list:
for another_element in another_list:
overlap = test_elements(some_element, another_element)
if overlap:
store_overlap(some_element,another_element)
# big problem here -- iterator gets lost
another_list.remove(another_element)
break
if not overlap:
unique_some.append(some_element)

But the problem is obvious, the iterator becomes out of sync with
'another_list'. An example can illustrate:

py> alist = [1,2,3,4]
py> for el in alist:
.... print el
.... if el == 3:
.... alist.remove(el)
....
1
2
3
py> alist
[1, 2, 4]

So, a question would be how one might "correct" the iterator in this
situation. Inspecting dir(iter) provides no clues as to the identity of an
iterator's internal counter.

Its probably obvious to everyone that this type of task seems perfect for
sets. However, it does not seem that sets can be used in the following way,
using a hypothetical "comparator" function. The "comparator" would be
analagous to a function passed to the list.sort() method. Such a device would
crush the previous code to the following very straight-forward statements:

some_set = Set(some_list, comparator=test_elements)
another_set = Set(another_list, comparator=test_elements)
overlaps = some_set.intersection(another_set)
unique_some = some_set.difference(another_set)
unique_another = another_set.difference(some_set)

I am under the personal opinion that such a modification to the set type would
make it vastly more flexible, if it does not already have this ability.

Any thoughts on how I might accomplish either technique or any thoughts on how
to make my code more straightforward would be greatly appreciated.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Carl Banks · Oct 18, 2005

James said:
Hello All,

I find myself in this situation from time to time: I want to compare two lists
of arbitrary objects and (1) find those unique to the first list, (2) find
those unique to the second list, (3) find those that overlap. But here is the
catch: comparison is not straight-forward. For example, I will want to
compare 2 objects based on a set of common attributes. These two objects need
not be members of the same class, etc. A function might help to illustrate:

def test_elements(element1, element2):
"""
Returns bool.
"""
# any evaluation can follow
return (element1.att_a == element2.att_a) and \
(element1.att_b == element2.att_b)

[snip]

Its probably obvious to everyone that this type of task seems perfect for
sets. However, it does not seem that sets can be used in the following way,
using a hypothetical "comparator" function. The "comparator" would be
analagous to a function passed to the list.sort() method. Such a device would
crush the previous code to the following very straight-forward statements:

some_set = Set(some_list, comparator=test_elements)
another_set = Set(another_list, comparator=test_elements)
overlaps = some_set.intersection(another_set)
unique_some = some_set.difference(another_set)
unique_another = another_set.difference(some_set)

I am under the personal opinion that such a modification to the set type would
make it vastly more flexible, if it does not already have this ability.

Any thoughts on how I might accomplish either technique or any thoughts on how
to make my code more straightforward would be greatly appreciated.

Howabout something like this (untested):

class CmpProxy(object):
def __init__(self,obj):
self.obj = obj
def __eq__(self,other):
return (self.obj.att_a == other.obj.att_b
and self.obj.att_b == other.obj.att_b)
def __hash__(self):
return hash((self.obj.att_a,self.obj.att_b))

set_a = set(CmpProxy(x) for x in list_a)
set_b = set(CmpProxy(y) for y in list_b)
overlaps = [ z.obj for z in set_a.intersection(set.b) ]

Carl Banks

George Sakkis · Oct 19, 2005

Carl Banks said:
Howabout something like this (untested):

class CmpProxy(object):
def __init__(self,obj):
self.obj = obj
def __eq__(self,other):
return (self.obj.att_a == other.obj.att_b
and self.obj.att_b == other.obj.att_b)
def __hash__(self):
return hash((self.obj.att_a,self.obj.att_b))

set_a = set(CmpProxy(x) for x in list_a)
set_b = set(CmpProxy(y) for y in list_b)
overlaps = [ z.obj for z in set_a.intersection(set.b) ]

Or more generally:

class Comparable(object):
def __init__(self, obj, key):
self.obj,self._key = obj,key
def __hash__(self):
return hash(self._key(self.obj))
def __eq__(self, other):
return self._key(self.obj) == self._key(other.obj)

if __name__ == '__main__':
some_list = ["hello", "world"]
another_list = ["HeLlO", "word"]
key = str.lower
some_set = set(Comparable(x,key) for x in some_list)
another_set = set(Comparable(x,key) for x in another_list)
print "overlaps:", [x.obj for x in some_set.intersection(another_set)]
print "unique_some:", [x.obj for x in some_set.difference(another_set)]
print "unique_another:", [x.obj for x in another_set.difference(some_set)]

George

Numpy Array of Sets	7	May 24, 2014
fastest way to find the intersection of n lists of sets	11	Apr 29, 2007
Multiple disjoint sample sets?	3	Jan 11, 2013
Rich comparison methods don't work in sets?	13	Jun 19, 2009
random.sample with large weighted sample-sets?	0	Feb 16, 2014
efficient intersection of lists with rounding	9	Dec 2, 2004
Help with sets	24	Oct 5, 2010
Help building a dictionary of lists	1	Nov 12, 2012

Intersection of lists/sets -- with a catch

James Stroud

Carl Banks

George Sakkis

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads