Database specialized in storing directed graphs?

C

Carl Banks

I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A. It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A. I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that. I would want instead to somehow cache the outside connectedness
relationship between different nodes of A. But then what happens if
something changes elsewhere. How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.


Carl Banks
 
C

Chris Rebert

I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A. It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A. I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that. I would want instead to somehow cache the outside connectedness
relationship between different nodes of A. But then what happens if
something changes elsewhere. How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.

By sacrificing a goat at the altar of the Almighty Google, I was able
to locate a project I came upon a long while ago but couldn't remember
the name of that's vaguely like what you want, in that it's a "graph
database": Neo4j - http://neo4j.org/ (and yes, it's in Java; sigh)
Not sure it's exactly what you're looking for, but anyway....

Cheers,
Chris
 
M

M.-A. Lemburg

I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A. It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A. I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that. I would want instead to somehow cache the outside connectedness
relationship between different nodes of A. But then what happens if
something changes elsewhere. How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.

Aaron Watters is an expert on this and has implemented kjbuckets
for doing this in memory:

http://gadfly.sourceforge.net/kjbuckets.html

Gadfly uses the library to implement relational queries (and works
on disk):

http://gadfly.sourceforge.net/

The package is now maintained by Richard Jones.

You might be able to reuse some parts of Gadfly for your
purposes.

Also have a look at Pygr:

http://bioinfo.mbi.ucla.edu/pygr

which is a Python library to build a graph interface on top of
a relational database.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 28 2008)________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
 
G

George Sakkis

I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain.  The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them.  Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one.  But I'm hoping for
specialzed behavior useful for graphs.

If you're looking for FOSS, the Boost graph library [1] or its
parallel extension [2] is probably your best bet; it also comes with
Python bindings but they are not maintained any more. For commercial
solutions, Star-P [3] seems an interesting platform, with bindings to
Matlab and Python. Freebase [4] is apparently built on a special graph
database but unfortunately only the stored data are available, not the
DB source code.

George

[1] http://www.boost.org/doc/libs/1_36_0/libs/graph/doc/index.html
[2] http://www.osl.iu.edu/research/pbgl/
[3] http://www.interactivesupercomputing.com/success/sparsematrices.php
[4] http://www.freebase.com/help/faq#q7
 
B

bearophileHUGS

Sorry Carl Banks for the answering delay, there are problems in Google
Groups.
This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

It doesn't sound a problem for modern PCs.
I think you can keep the whole graph topology in RAM, and the node
data on disk (for example in a file or in DB). The topology is
represented by the arcs (you have said nothing regarding arc data, so
I assume it's absent) and the nodes (in RAM you just need a 32 bit
unsigned integer to represent the index of the node that is stored on
disk. If memory becomes tight you can use just 3 bytes (2 ^ 24 = 16
millions different nodes) for the nodes, but generally the memory
required for the nodes is very little compared to the memory necessary
to store the arcs).

You haven't said how many arcs there are, total or average for node,
and if such arcs are directed or undirected.

Anyway, using my Graph class (that stores each arc twice), this takes
about 1 minute and 1.3 GB of RAM (1 million nodes, 10 arcs for node):

from graph import Graph
from random import randrange
g = Graph()
N = 1000000
g.addNodes(xrange(N))
for i in xrange(N * 10):
g.addArc(randrange(N), randrange(N))


You have said "could easily have millions of nodes", and every arc may
have tens of arcs or more.
("arbitrarily large" is an unsolvable problem because there are always
limits in your program, there's no program that's able to work on an
"arbitrarily large" dataset), so Python data structures become too
much costly for the RAM. With a lower level language like D/C/C++ you
can manage a bigger graph. You can use Boost Graph, but a homemade
graph structure may suffice too for your purposes.

For the problem you have explained I think a very simple graph
representation may suffice: an array of integer pairs (starting_index,
len) (starting_index can be a pointer too), where len is the number of
outbound arcs of the node n, plus an array of indexes/pointers that
list the outbound arcs. If memory gets tight you can split this second
array in two, and use an array of bytes for the lengths (if you have
more than 256 outbound arcs you may need a short). Note that if you
use indexes then a Python array.array (or Numpy) suffices.

In this situation if nnodes = 10_000_000 and narcs/node = 40 (total
nodes = 40 * 10_000_000):
nnodes * narcs * 4 + nnodes * (4 + 4) = 1_680_000_000 bytes, that is
often available on modern PCs.

On a 64 bit machine the indexes take the same memory, but pointers
twice that.

In "memory saving" mode:
nnodes * narcs * 3 + nnodes * (3 + 1) = 1_240_000_000 bytes.

A more handy compromise is:
nnodes * narcs * 3 + nnodes * (4 + 4) = 1_280_000_000 bytes.
But then what happens if something changes elsewhere.

Regarding the data structure, if you use arrays like I have explained,
if your updates aren't much frequent then you can allocate small extra
arrays to store more arcs coming out a node (but to do this you
probably may enjoy using pointers instead of indexes). When you have a
lot of updated you can save all to disk, and then regenerate the whole
data structure.

Bye,
bearophile
 
F

flyingfrog

Don't really know if this will be useful but i'd try pytables:
http://www.pytables.org/moin
it deals very well with every kind of hierarchical data sets, doesn't
matter the size.
It will let you load only significant data, and you'll be able to
query your data.
It's built on top of HDF5 libraries but exposes a very friendly
pythonic interface.
Surely you'll still have to implement all of the graph logic yourself
but this could be a good starting point.
Hope it helps

Marco
 
A

Aaron Brady

I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain.  The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them.  Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one.  But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A.  It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A.  I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that.  I would want instead to somehow cache the outside connectedness
relationship between different nodes of A.  But then what happens if
something changes elsewhere.  How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.

Carl Banks

Are you just looking for a persistent graph? The usual options are
'shelve', 'sqllite', 'mmap', and 'ctypes.Structure'. Or did you need
a special structure for the 'outside connectedness' properties?

Storing that is the usual time-space tradeoff. (Actually, I think
that term refers to something slightly distinct. This would be more
properly termed read-time/write-time trade off, roughly.)

Assuming you pick an RDB, you'd have a table of vertices and a table
of edges. Did you want 'set A' to be a table and persist between runs
of the program? If not, just keep a set of 2-tuples of nodes that
satisfy that property. I think the set S you're after is: all x,y
such that x is a vertex in A, y is a vertex in A, and there exists a P
such that P is a path, P starts at x, P ends at y, and vertex v in P
implies that v is x, v is y, or v is not in A. I don't know if you
can cache any information about A that gives you any shortcuts in
calculating S, or what the running time or running space of the
fastest/smallest algorithm is. At worst, it's O( |V| * |E| ) on each
change to V or E. Unverified.

You might be able to store the shortest path in a mapping from 2-
tuples to paths. Or better yet, map each node in A to the set of all
nodes reachable from it only entering A once. Then, you might save
time on either adding a node/vertex to A or G, or removing one from A
or G, or both. Maybe mapping each node in G to that relative set.
You didn't say whether the graph was directed and/or cyclic.

A good place to start would be, if you add a node to G, can S
decrease? No, because the original path P still exists. If you
remove one from A, still in G, S can stay the same size, might
decrease. If you remove a node from G, not in A, same, etc.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,239
Members
46,828
Latest member
LauraCastr

Latest Threads

Top