Data Structure Issue

Alexander Adam

Hello folks,

I have a few questions about some basic data structure stuff. Sorry if
these questions sound too easy, but after googling a lot
I didn't find a real answer to all of them, so I'd appreciate any
help.

I need to keep a very efficient tree data structure (parent, previous,
next, first child, last child) in memory, which can be filled
with billions of records. A struct placed into the tree (I am using
the tree implementation of kp) looks like this:

struct Node {
unsigned char type;
unsigned char code;
??? data;
??? unsigned char* content;
}

Now the first question is this -- the *data* property can be a
single byte, a full unsigned int, or even more.
What would be the best way to always keep the smallest size in memory?
(E.g. when there is only a byte value, allocate only one byte in
memory for this property.)

The next question is whether it makes sense to allocate and deallocate the
content property within the constructor/destructor of the struct. I
fear that a constructor and destructor will add another couple of
bytes to each record in memory.

The next issue I have is this -- the implementation I am using is
returning an iterator object, a function looks like this:

iterator parent(...) {}

Now what is this return value all about? I mean, it's not a pointer, so
it should only be valid until it goes out of scope within the
caller. Is that assumption correct? What I need is to keep the
iterator for the object's lifetime and assign it to my objects, so what's the
best way to keep it valid all the time, assuming the std::list or tree
(whatever, they're compatible) gets a lot of inserts and deletes?

Also there's a point I don't really understand. Say if I have
something like

std::list<Node> myList;

then what's the real difference in accessing an item either by

Node& node = *myList.begin()

or by

Node node = *myList.begin() ?

I've always thought that the first one would just get a reference to
the original struct the iterator is pointing to and the second one
creates a copy of it on the heap? If so, the first method should be
more efficient. Is that correct, too?

Basically, I need the most performant tree-like data structure
for keeping a struct like the one mentioned before. What would be the best way
to go, assuming that I have a lot of inserts, deletes and traversals
and that I need parent, previous sibling, next sibling and first
child / last child access directly at hand, without any time loss?

Hopefully someone will be so kind as to answer these questions in
detail. I'd very much appreciate gathering as much experience as
possible in good C++ development.

Thanks & warm regards
Alexander
 
Alexander Adam

Hi!

Sorry, there's one thing I've forgotten to ask -- how can it be
avoided to have an additional allocation/deallocation when adding a
new record to my std::list or whatever container I am using? I mean,
doing something like

Node myNode;
....
myList.push_back(myNode);

will create a myNode variable on the heap, then make a copy of it
for the list, and finally let it go out of scope. This seems quite
inefficient to me, so is there a better way around it other than
creating your own pointer to Node?

Thanks!
Alexander
 
Erik Wikström

Hello folks,

I got a few question on some basic data structure stuff. Sorry if
those questions might sound too easy or such but after googling a lot
I didn't find a real answer to all those questions so appreciate any
help.

I need to keep a very efficient data-tree structure (parent, previous,
next, first child, last child) within the memory which can be filled
up with billions of records. A struct place into the tree (I am using
the tree implementation of kp) is looking like this:

struct Node {
unsigned char type;
unsigned char code;
??? data;
??? unsigned char* content;
}

Type should probably be an enum, perhaps code also.
Now the first question is this -- the *data* property can be of a
single byte property or taking a full unsigned int or even more..
what'd be the best way to always keep the smallest size in memory?
(E.g. when having only a byte value then allocate only one byte in
memory for this property).

If all the nodes in the tree will have the same size of data then you
can parameterise Node to that type:

template<typename T>
struct Node {
// ...
T data;
// ...
};

By the way, what's the difference between data and content?
The next question is if it makes sense to allocate and deallocate the
content property within the constructor/destructor of the struct but I
fear that a constructor and destructor will add another couple of
bytes to each record in the memory.

Adding a constructor/destructor might add a few extra bytes to the code,
but should not affect the runtime size of a Node-object.
The next issue I have is this -- the implementation I am using is
returning an iterator object, a function looks like this:

iterator parent(...) {}

Now what is this return value all about? I mean, it's not a pointer, so
it should only be valid until it goes out of scope within the
caller. Is that assumption correct? What I need is to keep the
iterator during lifetime and assign it to my objects so what's the
best way to keep it valid all time, assuming the std::list or tree
(whatever, they're compatible) gets inserts and delete statements a
lot?

I guess that the iterator is a standard iterator to either an object in
a list or a map (or such) in which case the iterator will be valid so
long as the Node is still in the list. For standard containers which are
node-based the validity of iterators is not affected by any operations
on the container unless the operation affects the node the iterator
refers to.
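
The guarantee described above can be checked with a small sketch. It assumes the iterator really does come from a std::list; the Record type and the function name are made up for the illustration and are not part of the tree implementation being discussed.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical record type, reduced to one field for the illustration.
struct Record {
    int id;
};

// Returns the id seen through an iterator that was saved before the list
// was modified by unrelated insertions and erasures.
inline int idAfterUnrelatedChanges()
{
    std::list<Record> records;
    records.push_back(Record{1});
    records.push_back(Record{2});
    records.push_back(Record{3});

    // Keep an iterator to the middle element (id == 2).
    std::list<Record>::iterator kept = std::next(records.begin());

    records.push_front(Record{0});   // insert elsewhere
    records.push_back(Record{4});    // insert elsewhere
    records.erase(records.begin());  // erase a different element (id == 0)
    records.pop_back();              // erase a different element (id == 4)

    // 'kept' still designates the same element; it would only become
    // invalid if that element itself were erased.
    return kept->id;
}
```

The same experiment with a std::vector would not be safe, since an insertion there may reallocate the storage.
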
Also there's a point I don't really understand. Say if I have
something like

std::list<Node> myList;

then what's the real difference in accessing an item either by

Node& node = *myList.begin()

or by

Node node = *myList.begin() ?

I've always thought that the first one would just get the reference to
the original struct the iterator is pointing to and the second one
creates a copy of it in the heap? If so, the first method should be
more efficient is that correct, too?
Yes.

Basically, I need the most performant tree-like data structure
for keeping a struct like the one mentioned before. What would be the best way
to go, assuming that I have a lot of inserts, deletes and traversals
and that I need parent, previous sibling, next sibling and first
child / last child access directly at hand, without any time loss?

I'd probably do something like this:

template<typename T>
struct Node {
NodeType type; // enum
NodeCode code; // enum
Node* parent;
Node* previous;
Node* next;
Node* left; // left child
Node* right; // right child
T data;
unsigned char* content; // ???
};

and store them in a std::list<Node*>, this way inserts and deletes are
very cheap. However you still have to create the objects using new and
delete them when you are done. Also don't forget to fix any pointers in
other nodes when inserting and deleting. You might also want to use a
specialised allocator for the nodes.

For some usages this scheme might not be possible to use, but I don't
know how you are going to use the tree so I can't give anything but
general advice. It might also be possible to get rid of the std::list,
and just have the Nodes.
 
Erik Wikström

Hi!

Sorry, there's one thing I've forgotten to ask -- how can it be
avoided to have an additional allocation/deallocation when adding a
new record to my std::list or whatever container I am using? I mean,
doing something like

Node myNode;
...
myList.push_back(myNode);

will lead to create a myNode var on the heap, then making a copy of it
for the list and finally let it go out of scope. This seems quite too
inefficient to me though so is there a better way around except
creating your own pointer to Node?

Actually this first creates a Node object on the stack, then a copy will
be made (probably on the heap) by the list. Then the object on the
stack will go out of scope and be destroyed. To avoid this, insert just
pointers:

Node* myNode = new Node();

myList.push_back(myNode);

This way only a pointer will be copied. There are some drawbacks with
this scheme though, and you might want to use some kind of smart pointer.
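
As a rough sketch of the smart pointer variant, assuming a C++11 compiler: std::unique_ptr is only one possible choice, and the function name is hypothetical. The list then owns the nodes and deletes them automatically, while push_back only moves a pointer.

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <utility>

// Reduced version of the Node from the question; only two fields are kept.
struct Node {
    unsigned char type;
    unsigned char code;
};

// Returns true if the list ends up holding the very same Node object that
// was allocated, i.e. the Node itself was never copied.
inline bool storesSameObject()
{
    std::list<std::unique_ptr<Node>> nodes;

    std::unique_ptr<Node> node(new Node{1, 2});
    Node* raw = node.get();  // remember the address for the comparison

    // Only the pointer is moved into the list; no Node copy is made.
    nodes.push_back(std::move(node));

    // The Node is deleted automatically when 'nodes' is destroyed.
    return nodes.back().get() == raw && nodes.back()->code == 2;
}
```
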
 
James Kanze

I got a few question on some basic data structure stuff. Sorry if
those questions might sound too easy or such but after googling a lot
I didn't find a real answer to all those questions so appreciate any
help.
I need to keep a very efficient data-tree structure (parent, previous,
next, first child, last child) within the memory which can be filled
up with billions of records.

So space efficiency is more important than runtime efficiency.
A struct place into the tree (I am using
the tree implementation of kp) is looking like this:
struct Node {
unsigned char type;
unsigned char code;
??? data;
??? unsigned char* content;
}
Now the first question is this -- the *data* property can be of a
single byte property or taking a full unsigned int or even more..
what'd be the best way to always keep the smallest size in memory?
(E.g. when having only a byte value then allocate only one byte in
memory for this property).

It's hard to say. A struct cannot contain anything with a
variable size, so you may have to use an additional dynamic
allocation for it. (Don't forget, however, that dynamic
allocation has its own overhead---8 additional bytes, on my
machine. And of course, the struct must then contain at least a
pointer.)

One solution might be to use a union, e.g.:

union DataType
{
char chData ;
unsigned int uiData ;
???* otherData ;
} ;

You'll need to be able to discriminate from outside of the
union, but if e.g. type or code are enough to tell you what the
type of data is, then this would seem to be an appropriate
solution. The otherData field, of course, would have to be
dynamically allocated. (And with this, I'd suggest a set of
constructors and a destructor. And depending on how often the
Nodes are copied, perhaps reference counting for the otherData,
although deep copy will usually be simpler and sufficient.)
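
A minimal sketch of the discriminated union, assuming that the type field is enough to tell which member is active. The enumerator names and helper functions are invented for the illustration, and the dynamically allocated otherData member is left out. Note that the union is always as large as its largest member, so a byte value does not shrink the node to one byte.

```cpp
#include <cassert>

// Payloads sharing the same storage (member names follow the union above).
union DataType {
    char         chData;
    unsigned int uiData;
};

// Hypothetical discriminator values stored in Node::type.
enum NodeType { ByteValue = 0, UIntValue = 1 };

struct Node {
    unsigned char type;  // holds a NodeType value
    unsigned char code;
    DataType      data;
};

inline Node makeByteNode(char value)
{
    Node n;
    n.type = ByteValue;
    n.code = 0;
    n.data.chData = value;
    return n;
}

inline Node makeUIntNode(unsigned int value)
{
    Node n;
    n.type = UIntValue;
    n.code = 0;
    n.data.uiData = value;
    return n;
}

// Reads the member selected by the discriminator, never the other one.
inline unsigned int valueOf(const Node& n)
{
    if (n.type == ByteValue)
        return static_cast<unsigned int>(n.data.chData);
    return n.data.uiData;
}
```
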
The next question is if it makes sense to allocate and deallocate the
content property within the constructor/destructor of the struct but I
fear that a constructor and destructor will add another couple of
bytes to each record in the memory.

Constructor and destructor typically won't add anything to the
object's footprint. Virtual functions will. But don't forget
the overhead for dynamic allocation. If at all possible, make
"content" a value member, and avoid allocating it as a separate
object at all.
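
This can be checked with sizeof. The sketch below uses three made-up structs; the results are what mainstream compilers produce, not something the standard guarantees in these exact terms.

```cpp
#include <cassert>

// Two data members and nothing else.
struct Plain {
    unsigned char type;
    unsigned char code;
};

// Same data plus a non-virtual constructor and destructor.
struct WithCtorDtor {
    unsigned char type;
    unsigned char code;
    WithCtorDtor() : type(0), code(0) {}
    ~WithCtorDtor() {}
};

// Same data plus a virtual destructor, which typically adds a hidden
// pointer to a table of virtual functions in every object.
struct WithVirtual {
    unsigned char type;
    unsigned char code;
    virtual ~WithVirtual() {}
};

// True if the constructor and destructor left the object size unchanged.
inline bool ctorDtorAddNoSize()
{
    return sizeof(WithCtorDtor) == sizeof(Plain);
}

// True if the virtual function made each object larger.
inline bool virtualAddsSize()
{
    return sizeof(WithVirtual) > sizeof(Plain);
}
```
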
The next issue I have is this -- the implementation I am using is
returning an iterator object, a function looks like this:
iterator parent(...) {}
Now what is this return value all about?

Return by value.
I mean, it's not a pointer, so it should only be valid until it
goes out of scope within the caller. Is that
assumption correct?

Whatever you're actually returning, yes. The semantics of
return by value is that a copy is made. Depending on what you
do with the return value, that copy is then copied to the final
destination. The standard explicitly allows the compiler to
assume that the copy constructor only copies, and the destructor
only destructs, and to elide any number of these copies; most
compilers do.
What I need is to keep the
iterator during lifetime and assign it to my objects so what's the
best way to keep it valid all time, assuming the std::list or tree
(whatever, they're compatible) gets inserts and delete statements a
lot?

That's largely up to the container. std::list guarantees that
its iterators are valid as long as the element they designate
isn't removed from the list. Other containers make weaker
guarantees---an insertion may invalidate an iterator into an
std::vector, for example. For non-standard containers, it's up
to the author of the container to decide what he wants to
guarantee, and what not, and to document it.
Also there's a point I don't really understand. Say if I have
something like
std::list<Node> myList;
then what's the real difference in accessing an item either by
Node& node = *myList.begin()
Node node = *myList.begin() ?
I've always thought that the first one would just get the reference to
the original struct the iterator is pointing to and the second one
creates a copy of it in the heap?

On the stack. (More correctly, wherever the compiler puts its
local variables. But the way scope works requires the semantics
of a stack, and all of the widespread implementations do
maintain a very simple stack, separate from the heap, and put
the local variables there.)

Other than that, you're correct. Note, however, that if you
make content part of the struct, and the data hasn't been
dynamically allocated, the copy will be very rapid. Not
anything to worry about, until the profiler says otherwise. I'd
base the choice on the desired semantics; if I just want a
value, the copy is best, but if I want to modify the value in
the list, I must use the reference.
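
The difference in semantics can be sketched as follows, using a reduced Node with two fields; the function names are made up for the illustration.

```cpp
#include <cassert>
#include <list>

// Reduced version of the Node from the question.
struct Node {
    unsigned char type;
    unsigned char code;
};

// Modifies a copy: the element stored in the list keeps its old value.
inline unsigned char codeAfterEditingCopy()
{
    std::list<Node> myList;
    myList.push_back(Node{1, 10});

    Node node = *myList.begin();  // independent copy, a local variable
    node.code = 99;               // changes only the copy

    return myList.front().code;
}

// Modifies through a reference: the element stored in the list changes.
inline unsigned char codeAfterEditingReference()
{
    std::list<Node> myList;
    myList.push_back(Node{1, 10});

    Node& node = *myList.begin();  // another name for the list element
    node.code = 99;                // changes the element itself

    return myList.front().code;
}
```
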
If so, the first method should be
more efficient is that correct, too?

Maybe. It depends on how much copying the object costs.
Typically, unless the object is very big, there's no real
difference. Accessing through a reference is typically slightly
slower, so whatever you might gain by not copying, you lose in
the accesses. Unless the object is extremely expensive to copy,
you're wasting your time worrying about it. If it does turn out
to be a problem, it's easy enough to fix once the profiler says
you have to.
Basically, I need the most performant tree-like data structure
for keeping a struct like the one mentioned before. What would be the best way
to go, assuming that I have a lot of inserts, deletes and traversals
and that I need parent, previous sibling, next sibling and first
child / last child access directly at hand, without any time loss?

Lots of pointers directly in the data structure:).

In my pre-standard DLList class, I used something like:

struct BasicNode
{
BasicNode* next ;
BasicNode* prec ;

BasicNode()
: next( this )
, prec( this )
{
}
} ;

template< typename T >
struct Node : public BasicNode
{
T data ;
} ;

All you really need to do is add a lot of additional pointers to
the BasicNode, and define the instantiation type for the Node
template to contain the necessary data (probably using the
union above).

If this really is a one-off situation, you can forgo the
template. You might even forgo the base class, and put the
pointers directly in the Node, although I like the separation of
concerns here (but note that it does involve some downcasting).
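
Extended to the five links from the question, this might look roughly like the sketch below. It is an illustration under assumptions, not the DLList code: it uses null pointers instead of the self-pointing sentinel above, names the previous link prev, and appendChild, unlink and countChildren are hypothetical helpers.

```cpp
#include <cassert>

// Link part only; knows nothing about the payload.
struct BasicNode {
    BasicNode* parent;
    BasicNode* prev;        // previous sibling
    BasicNode* next;        // next sibling
    BasicNode* firstChild;
    BasicNode* lastChild;

    BasicNode()
        : parent(nullptr), prev(nullptr), next(nullptr),
          firstChild(nullptr), lastChild(nullptr)
    {
    }
};

// Payload part, layered on top of the links.
template <typename T>
struct Node : public BasicNode {
    T data;
    explicit Node(const T& d) : data(d) {}
};

// Appends 'child' as the last child of 'parent' in constant time.
inline void appendChild(BasicNode& parent, BasicNode& child)
{
    child.parent = &parent;
    child.prev = parent.lastChild;
    child.next = nullptr;
    if (parent.lastChild)
        parent.lastChild->next = &child;
    else
        parent.firstChild = &child;
    parent.lastChild = &child;
}

// Detaches 'node' from its parent and siblings in constant time,
// fixing the pointers held by the neighbouring nodes.
inline void unlink(BasicNode& node)
{
    if (node.prev)
        node.prev->next = node.next;
    else if (node.parent)
        node.parent->firstChild = node.next;

    if (node.next)
        node.next->prev = node.prev;
    else if (node.parent)
        node.parent->lastChild = node.prev;

    node.parent = node.prev = node.next = nullptr;
}

// Counts the children by walking the sibling chain.
inline int countChildren(const BasicNode& parent)
{
    int n = 0;
    for (const BasicNode* c = parent.firstChild; c != nullptr; c = c->next)
        ++n;
    return n;
}
```

Getting at the payload from a BasicNode pointer needs the downcast mentioned above, e.g. static_cast<Node<int>*>(root.firstChild)->data.
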
 
Jon Harrop

Alexander said:
I need to keep a very efficient data-tree structure (parent, previous,
next, first child, last child) within the memory which can be filled
up with billions of records.

That's a graph. If you want memory efficiency, remove all of the non-tree
pointers.
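
One possible reading of this suggestion is a first-child / next-sibling layout, which keeps two links per node instead of five. The sketch below only illustrates that reading; the struct and function names are invented, and the price is that parent, previous sibling and last child are no longer available in constant time.

```cpp
#include <cassert>
#include <cstddef>

// The five links asked for in the question.
struct FullLinks {
    FullLinks* parent;
    FullLinks* prev;
    FullLinks* next;
    FullLinks* firstChild;
    FullLinks* lastChild;
};

// Only the links needed to reach every node from the root.
struct TreeLinks {
    TreeLinks* firstChild;
    TreeLinks* nextSibling;

    TreeLinks() : firstChild(nullptr), nextSibling(nullptr) {}
};

// Inserts 'child' at the front of the child chain in constant time.
inline void prependChild(TreeLinks& parent, TreeLinks& child)
{
    child.nextSibling = parent.firstChild;
    parent.firstChild = &child;
}

// Counts the children by walking the sibling chain.
inline int countChildren(const TreeLinks& parent)
{
    int n = 0;
    for (const TreeLinks* c = parent.firstChild; c != nullptr; c = c->nextSibling)
        ++n;
    return n;
}

// Number of pointers saved per node by the reduced layout.
inline std::size_t pointersSavedPerNode()
{
    return (sizeof(FullLinks) - sizeof(TreeLinks)) / sizeof(void*);
}
```

With billions of records, three pointers per node add up to a lot of memory, which has to be weighed against the slower navigation.
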
 
