Copy-on-write tree data structure

S

Stephan Tobies

[X-post and follow up]

Hi everyone,

I am looking for a good data structure that could be used to represent
families of trees with shared sub-trees and copy-on-write semantics.

On a very abstract level, I would like to have an API like this:

Let Node be a suitably defined data structure. Then the following
functions shall be supported:

void setNthChild(Node root, int n, Node subtree);

// A call to this function should take the tree rooted at root and set
// its nth child to be the tree rooted at subtree.
// Later changes to subtree must not change the tree rooted at root.

Node getNthChild(Node root, int n);

// This function shall return the nth child of root (if it exists).
// Later modifications of subtree rooted at the returned node must not
// change the tree rooted at root.

Of course, this could be implemented by creating copies both of the tree
rooted at subtree (for setNthChild) and the tree rooted at the returned
node (for getNthChild), but the costs for this will be prohibitive in my
application.

Hence, I would like to implement this using a suitable copy-on-write
strategy that makes copies of shared subtrees only when a shared subtree
is modifies.

Does anyone know a good data structure that would be suitable to be used
in this scenario?

Thanks in advace!

Best regards

Stephan Tobies
 
H

Howard

That sounds like an awful lot of truble to attempt to save some memory. I
think that you might find that your "cost" may become just as great with any
copy-on-write scheme, especially in complexity and time, but perhaps even in
memory requirements.

Think about how you might accomplish it. Suppose you had some structure
(list?) in which you stored the relationships between all nodes which may
later need to be copied and the trees to which they originally referred.
When you make a change to a node, you will have to copy the entire tree
structure below that node into a whole new tree. But that's not all. You
probably have to copy the whole tree to which that node belongs, since
changing a child node in a tree amounts to changing every node directly up
the tree that contains it, and other items in your list could refer to those
ancestor nodes.

So now you've got multiple copies of entire trees lying around, only some of
which may ever get changed. In the original scheme, you had more copies,
but they would often be of much smaller trees, since they only need to copy
the subtree that gets returned.

Is this homework??? I'm puzzled by your requirements. Have you done any
analysis of why your cost is too high to make copies as needed? And why do
you have to return "immutable" copies? What do you do with them? Do they
ever get disposed, or do they hang around until the app quits?

You asked for a "data structure". Sounds like you need a whole design. Are
you asking for the structure of the items in the list I mentioned? Well,
that really depends on the structure of your nodes, and how you identify
them. (By ID, by pointer address?) I think that if you come up with a
working design, the structure will present itself.


-Howard
 
U

user

Howard said:
That sounds like an awful lot of truble to attempt to save some memory. I
think that you might find that your "cost" may become just as great with any
copy-on-write scheme, especially in complexity and time, but perhaps even in
memory requirements.

I don't think that the overhead is actually so large, when this is done
properly. And the overhead of repeatedly copying the trees when building
them bottom up could be a real killer.

Assuming we want to build a full binary tree of depth n starting from
the leaves bottom up, each leave node is copied n-1 times, etc. This
means that the memory consumed by a tree of depth n built in this manner
w/o copy-on-write will be

1 * 2^0 + 2 * 2^2 + 3 * 2^3 + ... + n * 2^n

This is considerably larger that a tree build with copy on write, where
each node is allocated only once and hence it will consume 2^(n+1)-1 memory.

It might not matter asymptotically, but for medium-large n, it will make
a hell of a difference.

Think about how you might accomplish it. Suppose you had some structure
(list?) in which you stored the relationships between all nodes which may
later need to be copied and the trees to which they originally referred.
When you make a change to a node, you will have to copy the entire tree
structure below that node into a whole new tree. But that's not all. You
probably have to copy the whole tree to which that node belongs, since
changing a child node in a tree amounts to changing every node directly up
the tree that contains it, and other items in your list could refer to those
ancestor nodes.

Nope, I don't think that it is so hard to accomplish. A reference
counter at each node plus an extra level of indirection when giving out
a pointer to 'the world' will already do. I have found a neat solution
that will solve my problem very well, I think.
So now you've got multiple copies of entire trees lying around, only some of
which may ever get changed. In the original scheme, you had more copies,
but they would often be of much smaller trees, since they only need to copy
the subtree that gets returned.

The copy-on-write scheme will never make more copies than the
copy-directly scheme, so if the overhead to implement the copy-on-write
is not too large, it will alsways pay off.
Is this homework??? I'm puzzled by your requirements. Have you done any
analysis of why your cost is too high to make copies as needed? And why do
you have to return "immutable" copies? What do you do with them? Do they
ever get disposed, or do they hang around until the app quits?

I wish it was ;-) - no this is a real problem for the implementation of
the structure value handling for the run time system of a programming
language. There is a standard that specifies how external systems can
build values to be be passed to the run time system. This standard
requires the immutability of the values.
You asked for a "data structure". Sounds like you need a whole design. Are
you asking for the structure of the items in the list I mentioned? Well,
that really depends on the structure of your nodes, and how you identify
them. (By ID, by pointer address?) I think that if you come up with a
working design, the structure will present itself.

Yes, I think I have found a solution, which is really neat. If it works,
I might post it here.

Thanks for your inout!

Stephan
 
U

user

I don't think that the overhead is actually so large, when this is done
properly. And the overhead of repeatedly copying the trees when building
them bottom up could be a real killer.

Assuming we want to build a full binary tree of depth n starting from
the leaves bottom up, each leave node is copied n-1 times, etc. This
means that the memory consumed by a tree of depth n built in this manner
w/o copy-on-write will be

1 * 2^0 + 2 * 2^2 + 3 * 2^3 + ... + n * 2^n

BTW, I have done the math: in closed for this is (n-1) * 2^(n+1) + 2, so
it is even asymptotically worse than the 2^(n+1) - it is O(n * 2^n).

BR

Stephan
 
B

Brian Inglis

I am looking for a good data structure that could be used to represent
families of trees with shared sub-trees and copy-on-write semantics.

On a very abstract level, I would like to have an API like this:

Let Node be a suitably defined data structure. Then the following
functions shall be supported:

void setNthChild(Node root, int n, Node subtree);

// A call to this function should take the tree rooted at root and set
// its nth child to be the tree rooted at subtree.
// Later changes to subtree must not change the tree rooted at root.

Node getNthChild(Node root, int n);

// This function shall return the nth child of root (if it exists).
// Later modifications of subtree rooted at the returned node must not
// change the tree rooted at root.

Of course, this could be implemented by creating copies both of the tree
rooted at subtree (for setNthChild) and the tree rooted at the returned
node (for getNthChild), but the costs for this will be prohibitive in my
application.

Hence, I would like to implement this using a suitable copy-on-write
strategy that makes copies of shared subtrees only when a shared subtree
is modifies.

Does anyone know a good data structure that would be suitable to be used
in this scenario?

Not entirely clear on this, or if it's on topic (but the mod let it
thru) so here's a possible approach, and maybe someone can improve on
it, or come up with a better one.

Assuming this is an M-ary tree structure like a B-tree, you need to
detect when a subtree is being shared and how many times, you need to
detect if a subtree is shared during modification, and then you need
to copy the shared part of the subtree before modifying it.

You can detect when a subtree is being shared and how many times by
adding a reference count into each subtree node and incrementing the
ref count when it is added into a different tree node with your set()
operation.

When a tree is being modified, you can detect whether it is shared by
checking the ref counts of the nodes passed thru, and keeping a
pointer to the last shared subtree node seen, as that is the only part
of the tree you will need to copy.

You can copy the shared subtree to the current tree by traversing all
the nodes from the previously saved pointer, inserting those nodes
into a newly created subtree, possibly creating more shared subtrees
below the node to be modified if desired, decrementing the ref count
in the original subtree node, zeroing the ref count in the new subtree
node, then modifying the current tree to contain the new subtree node.
That leaves unmodified trees still pointing to the unmodified subtree.

A number of housekeeping details have been handwaved over, and the
distinction between subtree nodes and pointers may not be totally
clear at all points, but neither are all the details of exactly where
nodes are stored in trees. Hopefully the approach is clear enough to
criticize for other reasons. ;^>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,142
Messages
2,570,818
Members
47,362
Latest member
eitamoro

Latest Threads

Top