Re: Copy-on-write tree data structure

From: Stephan Tobies (Stephan.Tobies_at_nokia.com)
Date: 03/09/04


Date: 09 Mar 2004 19:19:18 GMT

Brian Inglis wrote:
> On 01 Mar 2004 23:20:41 GMT in comp.lang.c.moderated, Stephan Tobies
> <Stephan.Tobies@nokia.com> wrote:
>
>
>>I am looking for a good data structure that could be used to represent
>>families of trees with shared sub-trees and copy-on-write semantics.
>>
>>On a very abstract level, I would like to have an API like this:
>>
>>Let Node be a suitably defined data structure. Then the following
>>functions shall be supported:
>>
>>void setNthChild(Node root, int n, Node subtree);
>>
>>// A call to this function should take the tree rooted at root and set
>>// its nth child to be the tree rooted at subtree.
>>// Later changes to subtree must not change the tree rooted at root.
>>
>>Node getNthChild(Node root, int n);
>>
>>// This function shall return the nth child of root (if it exists).
>>// Later modifications of subtree rooted at the returned node must not
>>// change the tree rooted at root.
>>
>>Of course, this could be implemented by creating copies both of the tree
>>rooted at subtree (for setNthChild) and the tree rooted at the returned
>>node (for getNthChild), but the costs for this will be prohibitive in my
>>application.
>>
>>Hence, I would like to implement this using a suitable copy-on-write
>>strategy that makes copies of shared subtrees only when a shared subtree
>>is modifies.
>>
>>Does anyone know a good data structure that would be suitable to be used
>>in this scenario?
>
>
> Not entirely clear on this, or if it's on topic (but the mod let it
> thru) so here's a possible approach, and maybe someone can improve on
> it, or come up with a better one.

Well, I didn't know a better place to post this, and it is about data
structures and I promise, I will implement in C ;-)

>
> Assuming this is an M-ary tree structure like a B-tree, you need to
> detect when a subtree is being shared and how many times, you need to
> detect if a subtree is shared during modification, and then you need
> to copy the shared part of the subtree before modifying it.
>
> You can detect when a subtree is being shared and how many times by
> adding a reference count into each subtree node and incrementing the
> ref count when it is added into a different tree node with your set()
> operation.
>
> When a tree is being modified, you can detect whether it is shared by
> checking the ref counts of the nodes passed thru, and keeping a
> pointer to the last shared subtree node seen, as that is the only part
> of the tree you will need to copy.

Yes, that was the kind of solution that I had figured out myself. Good
to see that somebody else would use the same approach so I cannot be too
far off.

> You can copy the shared subtree to the current tree by traversing all
> the nodes from the previously saved pointer, inserting those nodes
> into a newly created subtree, possibly creating more shared subtrees
> below the node to be modified if desired, decrementing the ref count
> in the original subtree node, zeroing the ref count in the new subtree
> node, then modifying the current tree to contain the new subtree node.
> That leaves unmodified trees still pointing to the unmodified subtree.

Actually, with some care, one does not have to copy the entire tree.
Once could make a copy of exactly the changing node and then use
pointers to the unchanged sub-trees below it. Of course, one will have
to increment the reference count of those nodes that now have additional
pointers to them.

This would reduce the necessary copying even further.

>
> A number of housekeeping details have been handwaved over, and the
> distinction between subtree nodes and pointers may not be totally
> clear at all points, but neither are all the details of exactly where
> nodes are stored in trees. Hopefully the approach is clear enough to
> criticize for other reasons. ;^>
>

One problem remains with this: the pointers being used 'inside' the tree
(which are probably directly pointers to the node data structure) and
the pointers that are given 'to the environement' need to have a
different level of indirection.

Of course, inside the trees we want to have direct pointers to the node
structures because we want to be able to navigate in the tree as fast as
possible.

On the other hand, pointers handed to the user need to have an extra
level of indirection added because we need to be able to change the
actual tree being pointed to without changing the pointer held by the user.

For example, when the user adds a new child to a tree that he holds a
pointer to, it should be transparent to the user if the tree can be
changed in place (because no other references are held anywhere) or if
it needs to be copied. In my opinion, this is best achieved with this
extra level of indirection, which allows to exachange the tree being
pointed to without the user noticing.

Now, I have the problem that I want to build this copy-on-write scheme
into an already exiting library, and the external interface _must_ not
change (it does not need to be binary compatible). The pointers that
users currently can access don't allow introspection in the value
structure at the moment, so changing what the pointers point to can
actually be done without problems - I could easily change the pointed-to
structure to add the extra indirection.

But the same type is used within the library and with introspection into
the pointed-to structure. Also, the API functions are also called from
within the library. That makes simply re-defining the type somewhat
difficult. Additionally, the API is huge and I don't want to go over all
functions and deal with the extra indirection...

Any ideas anyone?

BR

Stephan

-- 
comp.lang.c.moderated - moderation address: clcm@plethora.net


Relevant Pages

  • Re: Java
    ... >> exactly how you mean pointers in this more general sense either. ... if in ML you represent a simple data structure like a binary ... > shared between tree or within the same tree, ... ourselves to purely functional programming ...
    (comp.programming)
  • Copy-on-write tree data structure
    ... Let Node be a suitably defined data structure. ... // A call to this function should take the tree rooted at root and set ... // its nth child to be the tree rooted at subtree. ...
    (comp.lang.c)
  • Copy-on-write tree data structure
    ... Let Node be a suitably defined data structure. ... // A call to this function should take the tree rooted at root and set ... // its nth child to be the tree rooted at subtree. ...
    (comp.lang.cpp)
  • Re: Copy-on-write tree data structure
    ... >Let Node be a suitably defined data structure. ... this could be implemented by creating copies both of the tree ... >rooted at subtree and the tree rooted at the returned ... >strategy that makes copies of shared subtrees only when a shared subtree ...
    (comp.lang.c)
  • Re: Java
    ... > exactly how you mean pointers in this more general sense either. ... if in ML you represent a simple data structure like a binary ... shared between tree or within the same tree, ... concept of either a pointer or a prettified-pointer of some kind. ...
    (comp.programming)

Loading