relation class

A

Aaron Brady

Hi all,

I think Python should have a relation class in the standard library.
Fat chance. I want to write a recipe for it, but I don't know how. I
want your advice on some of the trade-offs, what it should look like,
what the pitfalls are, different strengths and weaknesses, etc.

Fundamentally, a relation is a set of tuples. A simple generator
expression or itertools filter can fulfill the requirements of a
query. But the fun stops there unless you want faster-than-linear
lookup times. I do.

For matching equality, you need per-column hash tables
(dictionaries). If I have a "WHERE lastname= 'Newton'" clause, I can
get all the matching records in constant time. For matching
inequality, you need a per-column sorted list (balanced tree). If I
have a "WHERE sales> 400" clause, I can get one matching record in log-
n time, and all of them in (log-n)+k time.

That's not the fun part either. What is the fun part, is the exact
nuance of query structure in a select statement. I want to return an
iterator from the following function calls.

recordset= Parts.select( [ "part" ], model== 'foopad' )
recordset= Sales.select( [ "model" ], sales>400 and sales<600 )
recordset= (Parts+Sales).select( [ "part" ], sales>400 and sales<600 )

The third of these is a join statement. It selects every part which
was in at least one model between 400 and 600 of which were sold. It
might need something more explicit, especially for the different types
of joins: 'Parts.join( Sales ).select' and 'Parts.innerjoin
( Sales ).select', or 'relation.innerjoin( Parts, Sales ).select'.

Unfortunately, even in this form, it won't work. While the statements
are valid expressions, the second argument is evaluated too soon. If
I put a lambda in front of it, I lose the ability to beat linear-time
lookups, and I might as well just use 'ifilter'. I want live Python
objects in the tuples and queries, so I can't just convert everything
to a string. What are my options? This is Python, so they can't all
be bad.

P.S. Recurring topic!
 
A

Aaron Brady

Perhaps I'm not understanding "relation" correctly, but are you not
aware ofhttp://docs.python.org/library/sqlite3.html?

Cheers,
Chris

Yes, I am aware of it. I should have mentioned. It is much the same
as what I am picturing, especially with the special ':memory:' name
for creating in RAM.

It only supports numbers and strings, which makes it appropriate for
persistence but not in-memory objects. It is also very advanced
compared to what I am picturing.

Its strategy for the syntax of the select statement is to use a
string, followed by the arguments for interpolation. How could you
use it to map, say, tree nodes back and forth to their parents?

There are many data structures which are in undirected relations to
others.
 
A

Aaron Brady

snip
It only supports numbers and strings.
snip

My point is that in undirected relations, it's redundant to maintain
reciprocal membership. Each participant is a member of the other.
'child.parent.child is child'. It's not a difficult redundancy to
maintain, but it isn't true to the concept.

Here is another sample:

Tree= Relation( ( "parent", "child", "direction" ) )
nodes= [ object( ) for _ in range( 10 ) ]
Tree( ( nodes[ 0 ], nodes[ 1 ], "left" ) )
Tree( ( nodes[ 0 ], nodes[ 2 ], "right" ) )

Here is the query (*non-functional)*:

recordset= Tree.select( [ "child" ],
"parent is nodes[0] and direction=='left'" )

Here's an alternative*:

recordset= Tree.select( [ "child" ],
lambda record: record.parent is nodes[0] and
record.direction=='left' )

I want some more concise ways of retrieving the same data. Those two
don't even work as is.*

recordset= Tree.select( operator.and_(
operator.is_( 'parent', nodes[0] ),
operator.eq( 'direction', 'left' ) ) )

This is evaluated too soon. The comparisons need to be
'functools.partial' functions*.

recordset= Tree.select(
partial(
operator.and,
partial(
operator.is_,
'parent',
nodes[0]
),
partial(
operator.eq,
'direction',
'left'
)
)
)

This is close, but the partial functions just compare strings. They
would have to be replaced with specialized functions that can
recognize designated strings as field names, and replace them with
actual fields.

Furthermore, it doesn't address the awkwardness of prefix notation,
which is required for function calls that aren't native operators.

I think the first is the closest. 'ast' makes it particularly
convenient, though maybe not trivial, to parse the selection. Where
it sees an 'ast.Name' node, it can replace it with an actual field
name. Here it is again*:

recordset= Tree.select( [ "child" ],
"parent is nodes[0] and direction=='left'" )

The only problem is passing variables in to the expression. By the
time 'nodes[0]' would get evaluated, it would be too late to know what
it is, without dirty frame stackery. 'select' would need the context
of its caller, which isn't available.

I think some kind of markers would have to replace any identifiers it
would use:

recordset= Tree.select( [ "child" ],
"parent is %var and direction=='left'", nodes[0] )

It's counterintuitive, but it's better than entirely prefix notation.
This way the function gets all the actual objects it needs, and can
still take shortcuts in its search.

I don't really see the argument for only returning a subset of fields,
instead of the entire record where matched, although I may not have
experience enough to know the downside, such as in the case of large
joins. It could be optional, defaulting to 'select *'.

recordset= Tree.select(
"parent is %var and direction=='left'", nodes[0] )

Since select returns an iterator, it's probably wise to include a
'selectone' function, like the 'Cursor.fetchone' function in
'sqllite3'. Of course, one could just add a '.next()' call to the
front of the query.

record= next( Tree.select(
"parent is %var and direction=='left'", nodes[0] ) )

or

record= Tree.selectone(
"parent is %var and direction=='left'", nodes[0] )

Does anyone see a better way? Would anyone have use for it? Do you
agree that it is more conceptually true to data structures like
trees? Is the optimization worth a specialized syntax? Even if
selection used a linear-time search, would encapsulating relations be
an improvement over redundant and cyclic attributes? Are the
equivalents in existing packages just as light-weight and functional?
That is, is it competitive?

Furthermore, is anyone developing something which uses a quoted
expression in a custom evaluation? I'd be interested in seeing those
results too. I understand that languages like LISP and Logix (
http://www.livelogix.com/logix/ ) are suited for it, but I want
something that is just a single class. Is it outside the scope of
Python's capabilities to do?
 
A

Aaron Brady

snip
It only supports numbers and strings.

snip

My point is that in undirected relations, it's redundant to maintain
reciprocal membership. snip
Here is another sample:

Tree= Relation( ( "parent", "child", "direction" ) )
nodes= [ object( ) for _ in range( 10 ) ]
Tree( ( nodes[ 0 ], nodes[ 1 ], "left" ) )
Tree( ( nodes[ 0 ], nodes[ 2 ], "right" ) ) snip
'select' would need the context
of its caller, which isn't available.
snip

Or, pass 'locals()'. Actually, I discovered an interesting catch to
'locals()'. When you don't use a variable that's in an outer scope in
a function, it doesn't appear in 'locals()'. However, if you just use
it in a blank statement, it magically appears.
.... x= []
.... def g( ):
.... print( locals( ) )
.... g( )
........ x= []
.... def g( ):
.... x # empty use of 'x'
.... print( locals( ) )
.... g( )
....{'x': []}

Here is what the docs have to say about *that*:

"Free variables are returned by locals() when it is called in a
function block."

Since 'x' doesn't appear in the function, 'g' in this case, it isn't
included in 'g's free variables. 'eval' actually therefore doesn't
return exactly the result of evaluating a string where it's used.
.... x= []
.... def g( ):
.... print( eval( 'x' ) )
.... g( )
....Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in f
File "<stdin>", line 4, in g
File "<string>", line 1, in <module>
NameError: name 'x' is not defined

The 'eval' function docs do assert otherwise though:

"...the expression is executed in the environment where eval() is
called."

'eval' doesn't obviously keep its contract as shown.
I think some kind of markers would have to replace any identifiers it
would use:

recordset= Tree.select( [ "child" ],
    "parent is %var and direction=='left'", nodes[0] )
snip

The 'sqlite3' module just uses a question mark. Despite the fact that
variables are separated from exactly where they are used, it has the
advantage, as S. D'Aprano pointed out nearby, of being able to use
queries independently of them, as well as of being legal Python.

If anyone is still reading, there is a further nifty-cool construct
that the 'Relation' class can offer. Given the earlier definition of
'Tree' as an instance of 'Relation', a new class can combine property
descriptors with the select statement.

class TreeNode:
relation= Tree
left= Relation.getter(
"child", "parent is ? and direction=='left'" )

nodes= [ TreeNode( ) for _ in range( 10 ) ]
record= nodes[ 0 ].left

Getting the 'left' member of an instance of this class is just
syntactic sugar for a long-hand query. It only returns one field of
an arbitrary element of the recordset. It only allows one '?' in the
definition, since getter functions only take one argument: self!
Here's the definition of 'getter':

@staticmethod
def getter( field, where ):
def _getter( self ):
return getattr( next(
self.relation.select( '*', where, self ) ), field )
return property( fget= _getter )

It could be an attribute of the module, not specifically of the
Relation class... which might even be advisable. You could even
combine the 'relation' member into the call to 'getter', so 'getter'
wouldn't have to retrieve it as 'self.relation'. That has the
disadvantage of increasing repetition of the relation in the class
statement, but enables a class to participate in multiple relations.
Come to think of it, you could write it as 'Tree.getter', and
compromise. Then 'getter' becomes:

def getter( self, field, where ):
def _getter( ob ):
return getattr( next(
self.select( '*', where, ob ) ), field )
return property( fget= _getter )

and 'TreeNode' becomes:

class TreeNode:
left= Tree.getter(
"child", "parent is ? and direction=='left'" )

You might want 'getter', 'getiter', and 'getone' methods, which return
a set of, an iterator over, and just the field of an arbitrary one of
the matching records respectively.

Proof-of-concept isn't even complete; but nonetheless, pretty cool,
eh?
 
A

Aaron Brady

My point is that in undirected relations, it's redundant to maintain
reciprocal membership. snip
Here is another sample:
Tree= Relation( ( "parent", "child", "direction" ) )
nodes= [ object( ) for _ in range( 10 ) ]
Tree( ( nodes[ 0 ], nodes[ 1 ], "left" ) )
Tree( ( nodes[ 0 ], nodes[ 2 ], "right" ) ) snip
'select' would need the context
of its caller, which isn't available.

snip

Or, pass 'locals()'. snip
I think some kind of markers would have to replace any identifiers it
would use:
recordset= Tree.select( [ "child" ],
    "parent is %var and direction=='left'", nodes[0] )
snip
snip
class TreeNode:
    relation= Tree
    left= Relation.getter(
        "child", "parent is ? and direction=='left'" ) snip
It only allows one '?' in the
definition, since getter functions only take one argument: self!

This isn't true. The query could have multiple parameter arguments.
Since 'self' isn't defined when calling the property creation
function, the question mark will need a special flag, or an argument
will have to be a sentinel object.

left= Tree.getter(
"child", "parent is ?self and direction==?", "left" )

Or:

left= Tree.getter(
"child", "parent is ? and direction==?", self_mark, "left" )

On doing my homework, I find parameters can be named as well:

left= Tree.getter(
"child", "parent is :self and direction==:dir",
{ 'dir': "left" } )

'self' would be reserved.

snip
You might want 'getter', 'getiter', and 'getone' methods, which return
a set of, an iterator over, and just the field of an arbitrary one of
the matching records respectively. snip

It isn't clear that a bare iterator is the proper tool. Taking
'columns' to be fields, and 'rows' to be records, we have four cases
of the return from a select statement.

- One row, one column. 'getter' and 'setter' can take a single-object
value. 'deller' merely deletes the record.
- One row, multiple columns. 'getter' and 'setter' return and accept
dictionaries which specify the values of the columns in the given
record. 'deller' deletes the record. It's possible to create nested
properties which correspond to the unspecified fields, but you may
lose identifier fidelity. Is a dictionary prohibitively inconvenient?
- Multiple rows, one column. The return from 'SELECT' should have
iterator semantics, as well as set operations. 'getter' returns the
object, but no 'setter' or 'deller' descriptors are meaningful. The
'add' method on the iterator/set should insert a new record that
fulfills the rest of the statement along with the value added; there
is one degree of freedom in the query. 'discard', 'remove', 'pop',
etc., should all do analogous things. The methods may have to be
implemented from scratch, due to the fact that they are short-hand for
operations on the relation with larger tuples, as well as
corresponding entries in the hashes and trees.
- Multiple rows, multiple columns. The return from 'SELECT' should be
an iterator/set. Arguments to the set operations should be
dictionaries as above.

The 'children' descriptor of the 'Tree' relation corresponds to the
last case:

class TreeNode:
children= Tree.getter(
"child", "parent is ?self" )

A call to 'list( nodeA.children )' would return (simulated,
formatted):
[
{ 'child': nodeB, 'direction': 'left' },
{ 'child': nodeC, 'direction': 'right' }
]

'nodeA.children' would be a non-hashing set, which set's members would
be dictionaries. The dictionaries would either have to be read-only,
or specialized to pass changes to them on to the structure. Two
sample operations on the 'children' iterator/sets:

nodeA.children.discard( { 'child': nodeB, 'direction': 'left' } )

To merely discard the left node or the 'nodeB' node would be better
suited to a different property or query ('del nodeA.left'). I don't
expect that the set operations would be used much in the 'multi-
column, multi-row' case; they could be susceptible to abuse, and may
be better omitted.

nodeA.children.clear( )

This removes both children from 'nodeA'. It may be eliminated from
the relation by now, though the get descriptor 'nodeA.children' will
still return an empty iterator/set object. Should it be the same one,
or volatile, like bound method objects?

The mention of 'hashes and trees' reminds me that not all types are
comparable in Python 3. This means each disjoint set of comparable
types will need its own separate tree, and the object's hash entry may
need to indicate privately what tree it's in. As you can see, columns
are freed from their static typing as SQL mandates they have. A
statement like, "SELECT * FROM Table WHERE X>100 OR X>objectA" would
perform two tree queries, one on the tree the elements of which can be
compared to 100, and the other on that which can be compared to
'objectA'. This is an optimization.

I am still operating on the premise that selection ('WHERE') clauses
must be simplified expressions, to permit the relation to run a fast
query. Perhaps it could be combined with special filters, such as a
function call.

prop= SomeRel.getter(
"fieldA", "fieldA> 600 and ?( ?self )", funcA )

Here, fieldA is compared first in a binary search, then 'funcA' would
be called on each remaining record. What are your thoughts? Should
the optimization break left-to-right precedence? That is, should
'fieldA' be filtered first, or 'funcA' called first if their order is
reversed?

prop= SomeRel.getter(
"fieldA", "?( ?self ) and fieldA> 600", funcA )

One obstacle is that '?' can't be filtered generically. The
alternatives are to use designated identifiers, and consequently
invalidate some field names. '?' could just be passed as a parameter
where needed [1]. Or, only inject an argument when the syntax takes
an identifier, that is, outside quotes.

[1] http://www.sqlite.org/c3ref/bind_parameter_name.html
http://www.sqlite.org/c3ref/bind_blob.html


Another obstacle is that where dictionaries and other mutable types
are present in a relation, the clauses 'WHERE X=={...}' and 'WHERE X
is {...}' would produce different results. (The notation is shorthand
for "select( 'X==?', {...} )".) The potential for objects to change
their values throws a wrench into their participation in hashes for
equality and identity, and trees for inequality. Not all objects can
participate in the optimization, and searches are back to good old
linear. It's not clear how to detect this on the fly: references to
members of the relation can exist outside its control. Several
strategies are merely to 'void the warranty' when the program mutates
participants; offer a 'update( ob= None )' method which must be called
upon mutations; offer additional arguments to 'add' methods, including
'__call__', which indicate whether the objects can be hashed and
'treed'; offer a 'unmutable' 'object decorator', e.g. "Tree( mutable
( nodeA ), immutable( nodeB ), 'left' )"; or follow the lead of,
oddly, the '__hash__' method and the 'bisect' module, which leave it
to the programmer to define an appropriate hash codes "...correctly
identified as unhashable...", and make no warranties about changing
contents once present in a container. That is, any item can
potentially participate in 'bisect', and the programmer can break
sorted order by mutating the object, and shouldn't use an inequality
'select' if s/he does. (In other words: The Usual(tm)!)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,818
Latest member
Brigette36

Latest Threads

Top