matching objects by a tuple field criterion

  • Thread starter bullockbefriending bard
  • Start date
B

bullockbefriending bard

i have a large collection of python objects, each of which contains an
integer 6-tuple as part of its data payload. what i need to be able to
do is select only those objects which meet a simple tuple element
wildcard matching criterion. e.g. given the following python objects:

object A includes tuple (1,2,3,4,5,6)
object B includes tuple (1,4,4,4,11,1)
object C includes tuple (1,3,9,1,1,1)

all tuples are unique. for what it's worth, the values in each field
are independent of the other fields and range from 1 to 14. although
'large', my collection is sparse cf. the 14^6 possible tuples.

i want to search on *one only* tuple field/value. if my search
criterion is (*,*,*,4,*,*), then i want to find object A and object B.
if (1,*,*,*,*,*), i want to find objects A, B, and C, etc. i will only
ever specify an integer match for one tuple field.

i can think of some naive approaches, but is there an elegant way to
do this?
 
D

Diez B. Roggisch

bullockbefriending said:
i have a large collection of python objects, each of which contains an
integer 6-tuple as part of its data payload. what i need to be able to
do is select only those objects which meet a simple tuple element
wildcard matching criterion. e.g. given the following python objects:

object A includes tuple (1,2,3,4,5,6)
object B includes tuple (1,4,4,4,11,1)
object C includes tuple (1,3,9,1,1,1)

all tuples are unique. for what it's worth, the values in each field
are independent of the other fields and range from 1 to 14. although
'large', my collection is sparse cf. the 14^6 possible tuples.

i want to search on *one only* tuple field/value. if my search
criterion is (*,*,*,4,*,*), then i want to find object A and object B.
if (1,*,*,*,*,*), i want to find objects A, B, and C, etc. i will only
ever specify an integer match for one tuple field.

i can think of some naive approaches, but is there an elegant way to
do this?

Depends on what you find elegant. Are the criteria runtime-specified,
and is anything other than the * allowerd, e.g. [1,3,8]? IMHO the best
thing is to create a filter-function that you then use to... filter :)

Like this:

def create_filter_predicate(criteria):
"""
criteria is an iterable containing either
an '*' or a list of comma-separated
integers
"""
sub_preds = []
for i, sub_crit in enumerate(criteria):
if sub_crit == "*":
continue
matching_set = set(int(n) for n in sub_crit.split(","))
sub_preds.append((i, matching_set))
def predicate(o):
t = o.my_tuple
for i, ms in sub_preds:
if not t in ms:
return False
return True
return predicate

Diez
 
S

Steven D'Aprano

i have a large collection of python objects, each of which contains an
integer 6-tuple as part of its data payload. what i need to be able to
do is select only those objects which meet a simple tuple element
wildcard matching criterion. e.g. given the following python objects:

object A includes tuple (1,2,3,4,5,6)
object B includes tuple (1,4,4,4,11,1)
object C includes tuple (1,3,9,1,1,1)

all tuples are unique. for what it's worth, the values in each field
are independent of the other fields and range from 1 to 14. although
'large', my collection is sparse cf. the 14^6 possible tuples.

i want to search on *one only* tuple field/value. if my search
criterion is (*,*,*,4,*,*), then i want to find object A and object B.
if (1,*,*,*,*,*), i want to find objects A, B, and C, etc. i will only
ever specify an integer match for one tuple field.

i can think of some naive approaches, but is there an elegant way to
do this?

Instead of passing a wild-card tuple like (*,*,*,4,*,*) simply pass the
integer you want to match and the position you want to match it in.

As a generator:


def matcher(list_of_objects, what, where):
for obj in list_of_objects:
if obj.data[where] == what:
yield obj


As a generator expression:

(obj for obj in list_of_objects if obj.data[what] == where)
 
D

Diez B. Roggisch

Diez said:
bullockbefriending said:
i have a large collection of python objects, each of which contains an
integer 6-tuple as part of its data payload. what i need to be able to
do is select only those objects which meet a simple tuple element
wildcard matching criterion. e.g. given the following python objects:

object A includes tuple (1,2,3,4,5,6)
object B includes tuple (1,4,4,4,11,1)
object C includes tuple (1,3,9,1,1,1)

all tuples are unique. for what it's worth, the values in each field
are independent of the other fields and range from 1 to 14. although
'large', my collection is sparse cf. the 14^6 possible tuples.

i want to search on *one only* tuple field/value. if my search
criterion is (*,*,*,4,*,*), then i want to find object A and object B.
if (1,*,*,*,*,*), i want to find objects A, B, and C, etc. i will only
ever specify an integer match for one tuple field.

i can think of some naive approaches, but is there an elegant way to
do this?

Depends on what you find elegant. Are the criteria runtime-specified,
and is anything other than the * allowerd, e.g. [1,3,8]? IMHO the best
thing is to create a filter-function that you then use to... filter :)

Like this:

def create_filter_predicate(criteria):
"""
criteria is an iterable containing either
an '*' or a list of comma-separated
integers
"""
sub_preds = []
for i, sub_crit in enumerate(criteria):
if sub_crit == "*":
continue
matching_set = set(int(n) for n in sub_crit.split(","))
sub_preds.append((i, matching_set))
def predicate(o):
t = o.my_tuple
for i, ms in sub_preds:
if not t in ms:
return False
return True
return predicate


Obviously the docs are faulty - the criteria looks like this:

['*', '1,2,3']

But I think you can get the gist of it - regardless on how the actual
criteria are entered.

Diez
 
B

bullockbefriending bard

Instead of passing a wild-card tuple like (*,*,*,4,*,*) simply pass the
integer you want to match and the position you want to match it in.

for sure. that was more for expository purpose rather than how i was
planning to go about it.

As a generator expression:

(obj for obj in list_of_objects if obj.data[what] == where)

above or equivalent list comprehension was what i had in mind as far
as linear search goes. and scanning the list like this will most
likely be 'good enough' performance-wise. however, partly just out of
curiosity, i was wondering if there is some kind of data structure
which might let me find all the matches a bit faster.
 
D

Diez B. Roggisch

bullockbefriending said:
Instead of passing a wild-card tuple like (*,*,*,4,*,*) simply pass the
integer you want to match and the position you want to match it in.

for sure. that was more for expository purpose rather than how i was
planning to go about it.

As a generator expression:

(obj for obj in list_of_objects if obj.data[what] == where)

above or equivalent list comprehension was what i had in mind as far
as linear search goes. and scanning the list like this will most
likely be 'good enough' performance-wise. however, partly just out of
curiosity, i was wondering if there is some kind of data structure
which might let me find all the matches a bit faster.

You can of course create a tree from the tuples, where the first level
of nodes corresponds to the first attribute of the tuple and so forth.

There are certainly cases where the speedup is tremendous - think of a
single integer in the first criteria - but then the overall performance
depends on the real-live queries. If lot's of wildcards are used, you
might end up slower if the tree-walk takes more time than the
C-implemented list-iteration will cost you.

Diez
 
J

John Machin

i have a large collection of python objects, each of which contains an
integer 6-tuple as part of its data payload. what i need to be able to
do is select only those objects which meet a simple tuple element
wildcard matching criterion. e.g. given the following python objects:

object A includes tuple (1,2,3,4,5,6)
object B includes tuple (1,4,4,4,11,1)
object C includes tuple (1,3,9,1,1,1)

all tuples are unique. for what it's worth, the values in each field
are independent of the other fields and range from 1 to 14. although
'large', my collection is sparse cf. the 14^6 possible tuples.

i want to search on *one only* tuple field/value. if my search
criterion is (*,*,*,4,*,*), then i want to find object A and object B.
if (1,*,*,*,*,*), i want to find objects A, B, and C, etc. i will only
ever specify an integer match for one tuple field.

i can think of some naive approaches, but is there an elegant way to
do this?

You are going to have to tell us how many is in a "large" collection,
how often you will do this query, how much memory you have to spare,
how fast you want the answer back, whether you want the answer as a
generator, or in a list/tuple/set/whatever -- or you are going to have
to do a fair bit of prototyping and benchmarking.

Steven has given you one end of the spectrum: a naive serial scan. Now
I'll give you another end: a naive pre-build all possible answers
method:

[quite untested]
# build 14x6 array of empty sets
answer = [[set() for what in range(14)] for slot in range(6)]
# populate the sucker
for obj in obj_list:
for slot in range(6):
answer[slot][obj.data[slot]-1].add(obj)

Later, the answer to 'Which objects have the value 11 in slot 3?' is
answer[3][11-1]

HTH,
John
 
B

bullockbefriending bard

quite so, i rephrased docstring to be:

"""criteria is an iterable containing either '*' instances or strings
of comma-separated integers. e.g. ['*','1,2,3', '11,12']"""

thanks very much for the idea! upon further reflection, this seems to
be a more elegant solution for my case than the ad-hoc generator or
list comprehension approach. this is because i *do* have to filter
data based on multiple single field criteria and i know all of these
criteria at the same time - so it makes a lot of sense to do as you
have done and then only do one filter operation to pull out all the
objects i am interested in.
 
B

bullockbefriending bard

There are certainly cases where the speedup is tremendous - think of a
single integer in the first criteria - but then the overall performance
depends on the real-live queries. If lot's of wildcards are used, you
might end up slower if the tree-walk takes more time than the
C-implemented list-iteration will cost you.

hopefully, won't need to do this then. i'll test the filter approach
tonight. generally speaking, going to be mostly wildcards most of the
time.
 
J

John Machin

quite so, i rephrased docstring to be:

"""criteria is an iterable containing either '*' instances or strings
of comma-separated integers. e.g. ['*','1,2,3', '11,12']"""

thanks very much for the idea! upon further reflection, this seems to
be a more elegant solution for my case than the ad-hoc generator or
list comprehension approach. this is because i *do* have to filter
data based on multiple single field criteria and i know all of these
criteria at the same time - so it makes a lot of sense to do as you
have done and then only do one filter operation to pull out all the
objects i am interested in.

"""i want to search on *one only* tuple field/value."""
"""i will only ever specify an integer match for one tuple field."""
"""the cheque's in the mail"""
"""of course i'll still love you in the morning"""

So augment my pre-built approach by intersection (or union? your
wording is somewhat vague) of the selected answer sets.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,816
Latest member
SapanaCarpetStudio

Latest Threads

Top