python bijection

L

Lie Ryan

Where a list will do, use a list- duh. But when you need a graph, you
shouldn't have to homebrew an implementation any more than you
should have to homebrew an odict or named tuple, both of which
are substantially easier to get right than a graph is.

That's what I was trying to say, though I can't find a good way to. The
comparison with the Zen was to meant that if List is sufficient (if
Simple is sufficient) don't use Tree (don't design a Complex). I wasn't
implying to reduce all problem into a list at whatever costs, sorry if
my wording appears to imply so...
 
G

geremy condra

If you want a better example, consider various database schemas that
have one-to-one, one-to-many, many-to-one, and many-to-many
relationships.  I daresay these are very common.  All of these can be
represented by a non-directed graph.

Another common use of directed graphs is for dependency
relationships.  In practice, a lot of times running things in order of
dependency is done by assigning everything a scalar priotity, and
executing in order of priority.  This can work ok, but it's fragile.
If there's a graph type in Python maybe people will be encouraged to
handle dependencies explicitly.

I actually considered using dependencies as an example on the
"graphine for pythonistas"[1] article, but decided to do the maze
run instead. In any event, the uses of graphs in general computing
are well enough established that I don't really think that's where
the majority of the difficulty in coming up with something for the
standard library will be- deciding how it should look and behave,
especially in terms of scope and target audience, that's going to
be the difficult part.

Geremy Condra

[1]: http://gitorious.org/graphine/pages/GraphineForPythonistas
 
S

Steven D'Aprano

I think this could be an interpretation of the Zen:

Simple is better than complex.
Complex is better than complicated.

can be read as:
List is better than Tree

Because O(N) searches are better than O(log N) searches. Not.

How about "The right tool for the right job"?

Tree is better than Graph

not having Tree and Graph package in the standard library force most
people to find List-based solution.

If you have to be *forced* to use a list-based solution, that's a good
sign that a list is *not* the right tool for the job.
 
R

Raymond Hettinger

[Me]
[Francis Carr]
Solutions such as bidict just automate the two-dict approach.

They do so at the expense of implementing a new API to support it and
at the expense with having non-obvious behaviors (i.e. how it handles
collapsing two records into one, which exceptions can be raised, how
the bijection invariants are maintained, which of the two symmetric
accessors is the primary, etc). This is not a cost-free choice of
simply "automating something".

IMO, "just automating" something that is already clear is not
necessarily a net win. Each new class has a learning curve and it is
sometimes simpler to use the basic dictionary API instead of inventing
a new one. I would *much* rather debug code written by someone using
two dictionaries than code using any of the APIs discussed so far --
all of those hide important design decisions behind a layer of
abstraction. The API must be easy to learn and remember (including
all of it quirks); otherwise, it is a net mental tax on the programmer
and code reviewers.

Also, I've presented examples of usability problems with APIs that do
not require the user to specify names for the two directions. It is
too easy to make mistakes that are hard to see. Which is correct,
phonelist['raymond'] or phonelist[raymonds_phonenumber]? There is no
way to tell without looking back at the bijection specification. The
API must be self-documenting. An API that is error-prone is worse
than having no bijection class at all.

Further, the API needs to be consistent with the rest of the language
(no abusing syntax with the likes of phonelist[:number]).

Unfortunately, Mark Lemburg has thrown gas on this fire, so the
conversation will likely continue for a while. If so, it would be
helpful if the conversation centered around real-world examples of
code that would be improved with a bijection class. In my experience,
there are examples where bijections arise in programs but it doesn't
happen often enough to warrant a new class for it. In cases where it
does arise, people should try-out the proposed APIs to see if in-fact
the code is made more clear or whether simple work is just being
hidden behind a layer of abstraction. For grins, insert an error into
the code (conflating the primary key with the secondary key) and see
if the error is self-evident or whether it is hidden by the new API.
Also, test the API for flexibility (how well can it adapt to
injections and surjections, can it handle use cases with default
values, is there a meaningful interpretation of dict.fromkeys() in a
bijection, can a user specify how to handle violations of the
bijection invariants by raising exceptions, supplying default values,
collapsing records, etc.)

Even if a proposed API passes those smell tests, demonstrates the
required flexibility, is easy to learn and use, is not error-prone,
and actually improves real-world use cases, it is my opinion that a
recipe for it will not garner a fan club and that it will have limited
uptake. ISTM that it is more fun to write classes like this than it
is to actually use them.

Raymond


P.S. It also makes sense to look at other mature languages to see
whether they ever find a need to include a bijection class in their
standard library or builtin collections. If Smalltalk, Java, Forth,
Go, Io, Haskell and C++ couldn't be sold on it, perhaps the buyer
should beware. If some language is found that did include a bijection
class, then do a Google code search to see how it fared in the real-
world. My bet is that it either wasn't used much or that it actually
make code worse by making errors harder to spot.
 
R

Raymond Hettinger

  ...sqlite3 provides another way...
In many many cases, using a dB (even a lightweight such as sqlite3) is
swatting the fly with a sledgehammer :)

I'm sure it seems that way, but look at the generic description of the
problem: "I have a list of n-ary tuples with named fields and would
like to be able to retrieve a tuple using any of m-fields as a lookup
key (where 2 <= m <= n). Also, I would like to enforce the constraint
that those m-fields are all unique keys."

That pretty much sounds like a typical database problem featuring a
single table with multiple indexes. One can wish-away the complexity
of a database but you will end-up re-inventing an in-memory, one-table
version of database.

===============
phonelist
===============
name idnumber
---- --------
ray 1234
josh 5423
carl 8674
tery 5409
greg 3402
mark 2108

tasks
-----
list names
list idnumbers
find idnumber=2108
find name=greg
del name=tery
update idnumber=4321 where name=ray
list sorted by name
list sorted by idnumber reversed
is name=john in phonelist
Now, extend the table to make it a trijection (three unique keys),
perhaps a social security number.
Now, extend the table to add a field that isn't unique (not a key
field), perhaps a person's favorite language.
Oh wait, you can't do that with a bijection API.
Now, forget the in-memory part and make it persistent on disk.
Now, relate the entries to another dictionary of tuples (i.e. another
table with foreign-key).
Too bad a DB wasn't used to solve a DB problem.


Raymond
 
L

Lie Ryan

If you have to be *forced* to use a list-based solution, that's a good
sign that a list is *not* the right tool for the job.

Sorry for putting too much emphasis on the "forced", that wasn't my
intention. I was mentioning that often simple problems appeared more
complex because the person was thinking in a higher level data structure
than is necessary. "forced" in that sentence means to let people to
think a little bit harder with list/dict before deciding that they are
unsuitable and moving to a tree or a graph.
 
G

geremy condra

Sorry for putting too much emphasis on the "forced", that wasn't my
intention. I was mentioning that often simple problems appeared more complex
because the person was thinking in a higher level data structure than is
necessary. "forced" in that sentence means to let people to think a little
bit harder with list/dict before deciding that they are unsuitable and
moving to a tree or a graph.

In any event, I think we can agree that for some tasks (I would say
many) a graph or tree is the most suitable data structure. To me,
that says that we need to be having a discussion about whether to
include those tools in the standard library, and how best to do so if
that's the decision. Would you agree?

Geremy Condra
 
R

Raymond Hettinger

[geremy condra]
I actually considered using dependencies as an example on the
"graphine for pythonistas"[1] article, but decided to do the maze
run instead. In any event, the uses of graphs in general computing
are well enough established that I don't really think that's where
the majority of the difficulty in coming up with something for the
standard library will be- deciding how it should look and behave,
especially in terms of scope and target audience, that's going to
be the difficult part.

Right you are :)


Do you have many users?


Raymond
 
G

geremy condra

[geremy condra]
I actually considered using dependencies as an example on the
"graphine for pythonistas"[1] article, but decided to do the maze
run instead. In any event, the uses of graphs in general computing
are well enough established that I don't really think that's where
the majority of the difficulty in coming up with something for the
standard library will be- deciding how it should look and behave,
especially in terms of scope and target audience, that's going to
be the difficult part.

Right you are :)


Do you have many users?

I literally have no idea, but I suspect not, seeing as how its a
pure python3 project. I get emails about it often enough that
I suspect we have about a hundred or so, though, which I'm
not unhappy with for a project that focuses on so basic a
data structure. Having said that, we always welcome more-
and I'd love to have your input on how to make its api more
pythonic.

Geremy Condra
 
R

Raymond Hettinger

[geremy condra]
I actually considered using dependencies as an example on the
"graphine for pythonistas"[1] article, but decided to do the maze
run instead. In any event, the uses of graphs in general computing
are well enough established that I don't really think that's where
the majority of the difficulty in coming up with something for the
standard library will be- deciding how it should look and behave,
especially in terms of scope and target audience, that's going to
be the difficult part.

I literally have no idea, but I suspect not, seeing as how its a
pure python3 project.

Google's code search can provide a clue.
It is also helpful because it let you see
typically use cases and whether the API fits well
or whether it is awkward to express common use cases.
All of those observations will help you tweak the API.

Also, it's useful to make similar studies of competing
packages either written in Python or in some other langauge.


Raymond
 
G

geremy condra

[geremy condra]
I actually considered using dependencies as an example on the
"graphine for pythonistas"[1] article, but decided to do the maze
run instead. In any event, the uses of graphs in general computing
are well enough established that I don't really think that's where
the majority of the difficulty in coming up with something for the
standard library will be- deciding how it should look and behave,
especially in terms of scope and target audience, that's going to
be the difficult part.

I literally have no idea, but I suspect not, seeing as how its a
pure python3 project.

Google's code search can provide a clue.
It is also helpful because it let you see
typically use cases and whether the API fits well
or whether it is awkward to express common use cases.
All of those observations will help you tweak the API.

Unfortunately, judging from the queries I get most of
the people interested in the package are using it to
study graphs, rather than to write applications using
them. That's one of the reasons why I haven't pushed
the idea to date- graphine was developed to be useful
to other developers, but seems to be mainly used by
academics, while networkx seems to be targeted
primarily at academics, and is somewhat widely used.
I think the ideal situation would be to take projects
like graphine and python-graph and develop a
hybrid system specifically for the standard library
that borrows the strengths of each, but that would
involve a lot of developer time for people already
heavily invested in their own projects. Having
said that, if enough people from the python
community got behind it, I think it would happen.
Also, it's useful to make similar studies of competing
packages either written in Python or in some other langauge.

I couldn't agree more.

Geremy Condra
 
G

geremy condra

I was as well, being one of them :)

Misunderstanding: 2, Me: 0... moving on ;)

How interested are you in a C port of graphine? I haven't had
any specific requests for it, but if its something you need I
can shuffle it towards the top of the to do pile.

Geremy Condra
 
G

geremy condra

There are two main reasons for a C implementation:

 1. performance

 2. memory footprint

These are important when using graphs with lots of nodes or
when using lots of graphs or operations on graphs in tight
loops.

However, to get such a new type into the core, you need to start
a PEP process and hash out the API some more, esp. with respect
to integration with the other core types, such as dictionaries
and lists of tuples for both creation and as alternative
representation.

I'm happy to start the PEP process, but will need some
guidance, as I've never written a PEP before.
Some other comments (in no particular order):

 * the "name" default of using id(node) is not ideal, since
  id(node) will return the address of the object and that
  can be subject to reuse within the lifetime of the process;
  using a global (thread-safe) counter would be safer

Good idea. I'm going to start a branch on the repo to
handle the changes you mention.
 * Graph.__init__ should be able to take a list or set
  of nodes and edges as initializer

The format of this will need to be thought all the way
through before being implemented. To date, we haven't
come up with anything completely satisfactory, but
AFAIK everybody involved is still open to suggestions
on this.
 * Graph.__setitem__ could be mapped to .add_node() for
  convenience

This may be a question of playing around with it. ATM I'm
not sold on the idea, but I'll implement it and see how it
turns out in practice.
 * Graph.__length__ could be mapped to .size() for
  convenience

We decided not to do this due to the ambiguity between
whether .size() or .order() was the intended operation,
and looking back I'm not sure that was entirely unjustified.
Do you see there being any confusion on that score?
 * Graph.__iter__ could be mapped to an iterator using
  the fastest traversal method for the graph nodes
  (ie. order does not matter, it's only important that
  all nodes are found as fast as possible)

Again, it seems ambiguous as to whether nodes or
edges are the intended target here, and while the
API can obviously dictate that, it seems a bit like
a case of "in the face of ambiguity, refuse the
temptation to guess" to me.
 * Graph could also benefit from some bulk update methods,
  e.g. to update weights on all edges or nodes by passing
  in a dictionary mapping names to attribute values

Sounds good. Again, the format will need some careful
thought, but you're right that this would improve it
greatly.
 * Graph could benefit from some bulk query methods,
  such as .get_node_attributes() and .add_edges().

I'm not sure how the first is different from the existing
..data, what did you have in mind?
 * Node.__getitem__ could be mapped to .data.__getitem__
  (following the approach taken by ElementTree); same
  for Edge.__getitem__
 * Node.__setitem__ could be mapped to .data.__setitem__
  (following the approach taken by ElementTree); same
  for Edge.__setitem__

..data is just a property that returns a dictionary of non
private members right now, so if you're wanting to just say
node.this = 'stuff', you can already do that. Or am I
misreading your intent?
 * I'd add a DirectedEdge alias for Edge and a new
  UndirectedEdge as subclass which has is_directed=False;
  makes code more readable

Sounds good, and thanks for the advice!

Geremy Condra
 
S

Steven D'Aprano

Again, it seems ambiguous as to whether nodes or edges are the intended
target here, and while the API can obviously dictate that, it seems a
bit like a case of "in the face of ambiguity, refuse the temptation to
guess" to me.

Consider dicts. They have methods that iterates over keys, values, and
(key, value) pairs, but if you just iterate over the dict itself, you get
the keys. This was a deliberate design decision.

It's not a matter of guessing in the face of ambiguity, but giving an
easy way to get the most common case. Assuming that iteration over nodes
is more common than iteration over edges, then iter(graph) should yield
nodes in some unspecified order. If you need the nodes in a particular
order, then you will call some method that guarantees the order, and if
you want the edges, you call a method that gives edges.

If you're not sure whether wanting nodes or edges will be more common,
then you should wait until you, or the community, has more real-world
experience with the module. Python had dicts for something approaching a
decade before they became directly iterable.
 
G

geremy condra

Consider dicts. They have methods that iterates over keys, values, and
(key, value) pairs, but if you just iterate over the dict itself, you get
the keys. This was a deliberate design decision.

It's not a matter of guessing in the face of ambiguity, but giving an
easy way to get the most common case. Assuming that iteration over nodes
is more common than iteration over edges, then iter(graph) should yield
nodes in some unspecified order. If you need the nodes in a particular
order, then you will call some method that guarantees the order, and if
you want the edges, you call a method that gives edges.

If you're not sure whether wanting nodes or edges will be more common,
then you should wait until you, or the community, has more real-world
experience with the module. Python had dicts for something approaching a
decade before they became directly iterable.

I don't have a problem with adding this if there's a strong desire for it,
but at the moment I'm leaning towards a wait-and-see approach, for
all the reasons you described.

Geremy Condra
 
G

geremy condra

Some pointers to get you started:

PEPs in general:
http://www.python.org/dev/peps/pep-0001/
http://www.python.org/dev/peps/pep-0009/

Adding modules to the standard lib:
http://www.python.org/dev/peps/pep-0002/

(though, in reality, you'd probably only be patching the
collections.py module, I guess)

Thanks, I'll go over these a little later.
Good idea. I'm going to start a branch on the repo to
handle the changes you mention.


The format of this will need to be thought all the way
through before being implemented. To date, we haven't
come up with anything completely satisfactory, but
AFAIK everybody involved is still open to suggestions
on this.

I wasn't thinking of anything clever :) ...

g = Graph(
     [Node("a"), Node("b"), Node("c")],
     [Edge(Node("a"), Node("b"), "ab"),
      Edge(Node("a"), Node("c"), "ac"),
      Edge(Node("b"), Node("c"), "bc"),
     ])

The main motivation here is to get lists, sets and dicts
play nice together.

Generally, we've tried to discourage people from instantiating
nodes and edges directly, in favor of having them controlled
through the graph. Maybe something along the lines of:

g = Graph(nodes=['a', 'b', 'c'], edges=[('a', 'b'), ('a', 'c'), ('b', 'c')])

?
Thinking about it some more, I agree, it's not all that useful.


There is an ambiguity here, indeed. My thinking was that
the edges define the graph and can be mapped to a dictionary,
e.g.

d = {"ab": ("a", "b"),
    "ac": ("a", "c"),
    "bc": ("b", "c")}

len(d) == 3
len(g) == 3

A graph without edges is also what you typically call an empty
graph, ie.

if not g:
   print 'empty'

http://mathworld.wolfram.com/EmptyGraph.html

Still, perhaps it's better to just not go down this route.

I'd rather avoid it if possible, since there are many
potential interpretations of this in different contexts.
Again, I'm open to to the idea, but not convinced.
Right, but sometimes "practicalty beats purity" ;-) We had
the same situation for dictionaries and then decided that
iteration over keys would be more natural than iterating over
items (key, value) or values.

It's also important to note that:

       for n in g: print n

and

       n in g

match up in terms of semantics.

Since n in g already uses the "node in graph" approach,
the for-loop should use the same logic.

Beware, that doesn't just match nodes.

g = Graph()
g.add_node('a')
g.add_node('b')
g.add_edge('a', 'b', 'ab')
'ab' in g # returns true
This is only an optimization, but could lead to some great
performance improvements by avoiding function call overhead.
Agreed.


The second one was a typo. It should have been
.get_edge_attributes().

In both cases I was thinking of being able to quickly
extract a number of node/edge attributes, e.g.

{"ab": {"weight": 3, "color": "blue"},
 "ac": {"weight": 2, "color": "green"},
 "bc": {"weight": 1, "color": "red"}}

Again, the idea is to reduce call overhead and later
on be able to move these lookups to C.

Entirely doable, but I'm not sure I see the use case.
Would you mind providing a more realistic example?
Right, but with attributes you are restricted to Python
identifiers, e.g. keywords (e.g. "from", "pass") are not allowed.
The above approach bypasses this restriction.

Hmm. Sounds like a plausible use case to me, although I'm
not sure its one that should be encouraged. The bigger
question in my mind is whether all attribute lookups should
have to pay the extra lookup cost to support a somewhat
narrow need. I'll definitely have to talk with the other devs
about this one, and run a little bit of timing besides.

Geremy Condra
 
G

geremy condra

Some pointers to get you started:

PEPs in general:
http://www.python.org/dev/peps/pep-0001/
http://www.python.org/dev/peps/pep-0009/

Adding modules to the standard lib:
http://www.python.org/dev/peps/pep-0002/

(though, in reality, you'd probably only be patching the
collections.py module, I guess)

Thanks, I'll go over these a little later.
Some other comments (in no particular order):

 * the "name" default of using id(node) is not ideal, since
  id(node) will return the address of the object and that
  can be subject to reuse within the lifetime of the process;
  using a global (thread-safe) counter would be safer

Good idea. I'm going to start a branch on the repo to
handle the changes you mention.

 * Graph.__init__ should be able to take a list or set
  of nodes and edges as initializer

The format of this will need to be thought all the way
through before being implemented. To date, we haven't
come up with anything completely satisfactory, but
AFAIK everybody involved is still open to suggestions
on this.

I wasn't thinking of anything clever :) ...

g = Graph(
     [Node("a"), Node("b"), Node("c")],
     [Edge(Node("a"), Node("b"), "ab"),
      Edge(Node("a"), Node("c"), "ac"),
      Edge(Node("b"), Node("c"), "bc"),
     ])

The main motivation here is to get lists, sets and dicts
play nice together.

Generally, we've tried to discourage people from instantiating
nodes and edges directly, in favor of having them controlled
through the graph. Maybe something along the lines of:

g = Graph(nodes=['a', 'b', 'c'], edges=[('a', 'b'), ('a', 'c'), ('b', 'c')])

?
Thinking about it some more, I agree, it's not all that useful.


There is an ambiguity here, indeed. My thinking was that
the edges define the graph and can be mapped to a dictionary,
e.g.

d = {"ab": ("a", "b"),
    "ac": ("a", "c"),
    "bc": ("b", "c")}

len(d) == 3
len(g) == 3

A graph without edges is also what you typically call an empty
graph, ie.

if not g:
   print 'empty'

http://mathworld.wolfram.com/EmptyGraph.html

Still, perhaps it's better to just not go down this route.

I'd rather avoid it if possible, since there are many
potential interpretations of this in different contexts.
Again, I'm open to to the idea, but not convinced.
Right, but sometimes "practicalty beats purity" ;-) We had
the same situation for dictionaries and then decided that
iteration over keys would be more natural than iterating over
items (key, value) or values.

It's also important to note that:

       for n in g: print n

and

       n in g

match up in terms of semantics.

Since n in g already uses the "node in graph" approach,
the for-loop should use the same logic.

Beware, that doesn't just match nodes.

g = Graph()
g.add_node('a')
g.add_node('b')
g.add_edge('a', 'b', 'ab')
'ab' in g # returns true
This is only an optimization, but could lead to some great
performance improvements by avoiding function call overhead.
Agreed.


The second one was a typo. It should have been
.get_edge_attributes().

In both cases I was thinking of being able to quickly
extract a number of node/edge attributes, e.g.

{"ab": {"weight": 3, "color": "blue"},
 "ac": {"weight": 2, "color": "green"},
 "bc": {"weight": 1, "color": "red"}}

Again, the idea is to reduce call overhead and later
on be able to move these lookups to C.

Entirely doable, but I'm not sure I see the use case.
Would you mind providing a more realistic example?
Right, but with attributes you are restricted to Python
identifiers, e.g. keywords (e.g. "from", "pass") are not allowed.
The above approach bypasses this restriction.

Hmm. Sounds like a plausible use case to me, although I'm
not sure its one that should be encouraged. The bigger
question in my mind is whether all attribute lookups should
have to pay the extra lookup cost to support a somewhat
narrow need. I'll definitely have to talk with the other devs
about this one, and run a little bit of timing besides.

Geremy Condra

Just as a note, I've made some of the changes we've been
talking about in a new branch over at graphine.org.

Geremy Condra
 
R

Robin Becker

geremy condra wrote:
............
I don't have a problem with adding this if there's a strong desire for it,
but at the moment I'm leaning towards a wait-and-see approach, for
all the reasons you described.

Geremy Condra

I don't want to sound pessimistic, but graph and digraph theory has a lot of
history, especially in computer science. There are already very many
implementations eg

http://code.google.com/p/igraph
http://www.boost.org/doc/libs/release/libs/graph
http://ernst-schroeder.uni.lu/Digraph/doc/
http://code.google.com/p/python-graph
http://compbio.washington.edu/~zach/py_graph/doc/html/public/py_graph-module.html

and many others......some of the above already seem to be very useful.

Is there reason to suppose that any one representation of graphs or digraphs is
so good we need to add it to python?

Even for fairly common algorithms eg Dijkstra's shortest path there doesn't seem
to be complete agreement on how to implement them; for the details of how
nodes/edges/paths should be stored and efficiently manipulated there is huge
variety.

Wait seems like a good policy.
 
S

Steven D'Aprano

On Tue, 08 Dec 2009 03:06:29 -0500, geremy condra wrote:

[snip 215 lines of quoted-quoted-quoted-quoted-quoted text]

In the future, would you mind trimming the unneeded quoting from your
post? There's no need to duplicate the *entire* conversation in *every*
post, and it is awfully AOL-like of you to have 215 lines of text, with
up to five levels of quotation, followed by two lines of new content.

Thank you.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,183
Messages
2,570,967
Members
47,517
Latest member
Andres38A1

Latest Threads

Top