Serialization in C

D

Debajyoti Sarma

How to do serialization in C ? Question arises from serialization of
binary tree.
 
M

Malcolm McLean

How to do serialization in C ? Question arises from serialization of
binary tree.
Very hard to do well. Really serialization is so fundamental it should
be built into the language.

A tree can be serialised as a set of nested parentheses, if the
requirement is to save as text. Look up Newick tree format.

However there's no easy was to write a generalised serialise function
for a general tree containing arbitrary data.
 
B

bart.c

Malcolm said:
Very hard to do well. Really serialization is so fundamental it should
be built into the language.

So fundamental that I don't think I've ever had to do it. I'm not even sure
what it means (flattening?).
A tree can be serialised as a set of nested parentheses, if the
requirement is to save as text. Look up Newick tree format.

However there's no easy was to write a generalised serialise function
for a general tree containing arbitrary data.

Sounds like building it in (to C) would be tricky then. Especially when the
language doesn't yet have trees built-in.
 
D

Debajyoti Sarma

ok.
Can't we store number of structure variables in file?
I thing that will do .... by putting 2 separate tree traversal in the
file....we can reconstruct the tree.
 
B

Ben Bacarisse

Vincenzo Mercuri said:
Debajyoti Sarma ha scritto:

a quick search on google gave me 3 main results:

http://tpl.sourceforge.net/
but I can't say how they work or if they fully
support storage of nested structures.

Depends on what you mean by "nested". TPL does not handle any data
organisation that uses pointers so linked lists and tree have to be done
by hand. In effect, TPL will do the leaves, but that's about it.
 
M

Malcolm McLean

So fundamental that I don't think I've ever had to do it. I'm not even sure
what it means (flattening?).
It means a data structure has to be stored as an ordered sequence of
bytes. So "flattening" is a perfectly good term.

Some people never need to do this, beyond writing out structs to a
file. However if you do need it, it is absolutely crucial.

Some examples:
a binary file format requires that floating point numbers be embedded
in IEEE format, whilst the floating point unit on the host processor
isn't IEEE.
a parallel system requires structures to be passed between processors
in the form of messages.
a tree representing evolutionary relationships neeeds to be saved to
disk.
 
V

Vincenzo Mercuri

Ben Bacarisse ha scritto:
Depends on what you mean by "nested". TPL does not handle any data
organisation that uses pointers so linked lists and tree have to be done
by hand. In effect, TPL will do the leaves, but that's about it.

Yes with "nested" I meant nodes with sub-nodes, etc..
I've only taken a hasty look at the homepages,
I can't suggest them actually since I don't know them
 
E

Eric Sosman

ok.
Can't we store number of structure variables in file?
I thing that will do .... by putting 2 separate tree traversal in the
file....we can reconstruct the tree.

You can certainly write data to files, and read it back again.
It is up to you to choose a file format suited to your needs. You
mentioned (in some context that's been snipped away) that you're
interested in writing a binary tree to a file, which means you need
to find a way to embed a non-linear data structure in the file's
linear sequence. There are lots of ways to do this; see TAOCP 2.3.3
"Other Representations of Trees" for some ideas.

C will allow you to write a pointer to a file and read it back
again, but that's not useful unless you can guarantee that the thing
it points to is at the same memory location as in the original -- a
very difficult thing to arrange if you wrote the file in one program
and are trying to read it back in another, or even in another execution
of the same program. So if you choose one of the tree representations
that involves links, you'll probably want to replace the pointers by
something else for storage in the file, and then reconstitute actual
pointers from the "something else's" when you read it back again.
One way (there are many) would be to give each node an ID number,
possibly related to its position in the file: 1,2,3,... for the
first, second, third,... nodes, say. Then instead of writing the
pointer, you'd write the ID number of the pointed-to node, or 0 for
a NULL. When reading, you'd translate the ID numbers back to
pointers again for in-memory storage.
 
W

Willem

Eric Sosman wrote:
) C will allow you to write a pointer to a file and read it back
) again, but that's not useful unless you can guarantee that the thing
) it points to is at the same memory location as in the original -- a
) very difficult thing to arrange if you wrote the file in one program
) and are trying to read it back in another, or even in another execution
) of the same program. So if you choose one of the tree representations
) that involves links, you'll probably want to replace the pointers by
) something else for storage in the file, and then reconstitute actual
) pointers from the "something else's" when you read it back again.
) One way (there are many) would be to give each node an ID number,
) possibly related to its position in the file: 1,2,3,... for the
) first, second, third,... nodes, say. Then instead of writing the
) pointer, you'd write the ID number of the pointed-to node, or 0 for
) a NULL. When reading, you'd translate the ID numbers back to
) pointers again for in-memory storage.

Another trick is to allocate one large block of memory for the whole tree,
and use indexes instead of pointers for the links.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
E

Ersek, Laszlo

However there's no easy was to write a generalised serialise function
for a general tree containing arbitrary data.

A coarse idea (not easy, indeed):

1) I'll suppose the original structure types all have an initial member
that leads to a static const struct of function pointers. That is, I'll
assume all structure types already support "do something with me"
operations that can be called back from a tree traversal.

struct base_ops;

struct base
{
/* whatever */
const struct base_ops *ops;
};

struct type1
{
struct base base;
/* whatever */
};

struct type2
{
struct base base;
/* whatever */
};


static struct base_ops base_ops
{
int (*print_nicely)(FILE *stream, const struct base *base);
};


Suppose we have a tree node like this:

struct rbt_node
{
void *data;
struct rbt_node *parent, *left, *right;
enum rbt_color color;
};

I'll assume that we already have mechanisms in place that allow us to
traverse a tree and print each linked-to object nicely.


2) Extend the base_ops struct with

int (*serialize)(FILE *stream, const struct base *base);


The generic walker function might be declared like this:

int
walk(const struct rbt_node *root,
int (*visitor)(const void *node_data, void *workspace));

and

int
my_visitor(const void *node_data, void *workspace)
{
const struct base *base;
FILE *f;

base = node_data;
f = workspace;

return (*base->ops->serialize)(f, base);
}


3) The "virtual" serialize() function should dump each base-derived
structure. The output should probably start with a fixed type identifier
(so that factory (deserialization) functions can work later), then the
members. A single serialize() function could invoke further serialization
functions.


4) Each single object should be dumped exactly once. Members that are
allocated within a structure ("composition") should be dumped in-place,
members that are linked-to (separate malloc(), with ownership
responsibility, "aggregation") should be recursively followed.
"Navigation" pointers shouldn't be followed at all, but they have to be
serialized nonetheless. Function pointers should be serialized as
"function identifier strings".


5) The serialized form of each object should contain the unique identifier
of that object. Navigation pointers should be serialized as the unique
identifiers of the linked-to objects. On most platforms, converting the
pointers and addresses to (void *) and printing them with %p should work.


6) Deserialization would happen in two phases. The first phase would
consist of a loop:

- read type identifier token, advance stream
- call type-specific factory function that parses the object and advances
the stream.

Constructed objects are added to (linked to by) a random-access
dictionary, addressible by the serialized unique object identifier (eg.
the originally valid %p strings).

Once all objects are deserialized, the second phase would occur (on top of
the now complete dictionary), revisiting each object exactly once,
updating all "navigation links" (pointer members without ownership
responsibility) via dictionary lookups.

Finally, the dictionary is thrown away.


Sorry if this is a mess...

lacos
 
E

Ersek, Laszlo

static struct base_ops base_ops
{
int (*print_nicely)(FILE *stream, const struct base *base);
};

Ahem, this is not C, but you know what I meant.

struct base_ops
{
int (*print_nicely)(FILE *stream, const struct base *base);
};

static const struct base_ops type1_ops = { &type1_print_nicely },
type2_ops = { &type2_print_nicely };

Sorry for the short circuit, I posted in a rush.

lacos
 
P

Paul N

How to do serialization in C ? Question arises from serialization of
binary tree.

The C++ FAQ at http://www.parashift.com/c++-faq-lite/ has a whole
chapter devoted to serialization. Obviously some of the information
won't be directly relevant, but it's probably a good start. And many
of C++'s features can be simulated in C without too much difficulty.
 
G

Gene

How to do serialization in C ? Question arises from serialization of
binary tree.

C is a canvas where you can paint any of the methods that a larger
language may have built in. Below is how you might roughly emulate
Ada.Serial_IO, for example (lots of error checking omitted). Whether
you choose XML, lisp S-expressions, line-oriented, or whatever
representation, the principle is the same.

To get a general purpose facility, it's probably simplest to include a
small integer type tag in each data structure. The tags index a table
of vtbl pointers, and the vtbls have the custom serializer and
deserializer for each type. The various schemes you see to embed vtbl
pointers in the structs themselves at a common location all seem to
have portability issues (not?).

#include <stdio.h>
#include <stdlib.h>

typedef int DATA;

typedef struct node_s {
struct node_s *lft, *rgt;
DATA val;
} NODE;

#define NULL_TAG 0
#define DATA_TAG 1

void serialize(FILE *f, NODE *tree)
{
static char null_tag[1] = { NULL_TAG };
static char data_tag[1] = { DATA_TAG };
if (tree) {
fwrite(data_tag, 1, 1, f);
fwrite(&tree->val, sizeof tree->val, 1, f);
serialize(f, tree->lft);
serialize(f, tree->rgt);
}
else
fwrite(null_tag, 1, 1, f);
}

NODE *deserialize(FILE *f)
{
char tag;
NODE *tree;

fread(&tag, 1, 1, f);
switch (tag) {
case NULL_TAG:
return 0;
case DATA_TAG:
tree = malloc(sizeof *tree);
fread(&tree->val, sizeof tree->val, 1, f);
tree->lft = deserialize(f);
tree->rgt = deserialize(f);
return tree;
}
}

void print(NODE *tree)
{
if (tree) {
printf("(");
print(tree->lft);
printf(" [%d] ", tree->val);
print(tree->rgt);
printf(")");
}
else
printf("-");
}

NODE *new(DATA val, NODE *lft, NODE *rgt)
{
NODE *r = malloc(sizeof *r);
r->val = val;
r->lft = lft;
r->rgt = rgt;
return r;
}

int main(void)
{
FILE *f;
NODE *tree =
new(1,
new(2,
new(3,
new(4,
new(5, 0, 0),
new(6, 0, 0)),
new(7,
new(8, 0, 0),
0)),
0),
new(9,
new(10, 0, 0),
0));

printf("before: ");
print(tree);
printf("\n");

f = fopen("tree.ser", "wb");
serialize(f, tree);
fclose(f);

f = fopen("tree.ser", "rb");
tree = deserialize(f);
fclose(f);

printf("after: ");
print(tree);
printf("\n");

return 0;
}

$ ./a.out
before: (((((- [5] -) [4] (- [6] -)) [3] ((- [8] -) [7] -)) [2] -) [1]
((- [10] -) [9] -))
after: (((((- [5] -) [4] (- [6] -)) [3] ((- [8] -) [7] -)) [2] -) [1]
((- [10] -) [9] -))
$

Cheers.
 
G

Gene

To get a general purpose facility, it's probably simplest to include a
small integer type tag in each data structure.  The tags index a table
of vtbl pointers, and the vtbls have the custom serializer and
deserializer for each type.  The various schemes you see to embed vtbl
pointers in the structs themselves at a common location all seem to
have portability issues (not?).
Sorry. Of course type tags have the same issues as vtbl pointers. Is
there a portable way to reference common content in different struct
types, or is a gi-normous union (with its inherent dead memory and
compilation issues) the only way?
 
M

Marc

Après mûre réflexion, Debajyoti Sarma a écrit :
How to do serialization in C ? Question arises from serialization of
binary tree.

Many ways, but rather boring when you have to manage a large number of
types. I often used for this purpose the XDR standard for external data
encoding because a tool named rpcgen automates the generation of the
serialization/deserialization code from the definition of types in a
C-like language (the C types definition are also generated
automatically from this definition).

XDR support complex data types linked via pointers provided that there
is no loop. Moreover, it is easy in many cases to support reference
loops by using intermediate integer indexes.
XDR is a very stable standard that emerged about 20 years ago. For
example, common protocols like NFS and NIS/NIS+ are based on it ...

Marc
 
E

Ersek, Laszlo

Sorry. Of course type tags have the same issues as vtbl pointers. Is
there a portable way to reference common content in different struct
types, or is a gi-normous union (with its inherent dead memory and
compilation issues) the only way?

C99 6.7.2.1 Structure and union specifiers, p13

----v----
[...] A pointer to a structure object, suitably converted, points to its
initial member (or if that member is a bit-field, then to the unit in
which it resides), and vice versa. [...]
----^----

About the same is written under C89 6.5.2.1.

lacos
 
G

Gene

Sorry.  Of course type tags have the same issues as vtbl pointers. Is
there a portable way to reference common content in different struct
types, or is a gi-normous union (with its inherent dead memory and
compilation issues) the only way?

C99 6.7.2.1 Structure and union specifiers, p13

----v----
[...] A pointer to a structure object, suitably converted, points to its
initial member (or if that member is a bit-field, then to the unit in
which it resides), and vice versa. [...]
----^----

About the same is written under C89 6.5.2.1.

lacos- Hide quoted text -

- Show quoted text -

The problem is the "suitably converted" business. If you don't know
the exact type of an object (which is after all the reason for wanting
to read the type tag or vptr), there is no way to convert the pointer
to that type! If we have:

typedef struct base_s {
int tag;
} BASE:

typedef struct point_s {
int tag;
double x, y, z,
} POINT;

typedef struct name_s {
int tag;
char name[10];
} NAME;

We would like to be able to pass round BASE* p; and say

if (p->tag == POINT_TAG) ... handle a point ...
else if (p->tag == NAME) ... handle a name.

as kind of hand-rolled polymorphism.

But this method is non-portable because we'd have to say ((POINT*)p)-
tag to be sure we're getting the tag of a point, but that's
nonsense.

How do get this effect?

Gene
 
E

Ersek, Laszlo

typedef struct base_s {
int tag;
} BASE:

typedef struct point_s {
int tag;
double x, y, z,
} POINT;

typedef struct name_s {
int tag;
char name[10];
} NAME;

We would like to be able to pass round BASE* p; and say

if (p->tag == POINT_TAG) ... handle a point ...
else if (p->tag == NAME) ... handle a name.


typedef struct base_s {
int tag;
} BASE;

typedef struct point_s {
BASE b;
double x, y, z,
} POINT;

typedef struct name_s {
BASE b;
char name[10];
} NAME;


if (POINT_TAG == p->tag) handle_point((POINT *)p);


Or, in your original example, forget BASE completely, and pass around
pointers-to-int ((int*)).

if (POINT_TAG == *p) handle_point((POINT *)p);

lacos
 
G

Gene

typedef struct base_s {
 int tag;
} BASE:
typedef struct point_s {
 int tag;
 double x, y, z,
} POINT;
typedef struct name_s {
 int tag;
 char name[10];
} NAME;
We would like to be able to pass round BASE* p; and say
if (p->tag == POINT_TAG) ... handle a point ...
else if (p->tag == NAME) ... handle a name.

typedef struct base_s {
  int tag;

} BASE;

typedef struct point_s {
  BASE b;
  double x, y, z,

} POINT;

typedef struct name_s {
  BASE b;
  char name[10];

} NAME;

if (POINT_TAG == p->tag) handle_point((POINT *)p);

Or, in your original example, forget BASE completely, and pass around
pointers-to-int ((int*)).

if (POINT_TAG == *p) handle_point((POINT *)p);

lacos- Hide quoted text -

- Show quoted text -

Alas, no. Other articles in this group indicate, and they appear
correct, that this is not compliant because the p in p->tag is not
"suitably converted." If p does not point to a BASE (in my example),
p->tag has no meaning under the Standard. I've seen an example
similar to my BASE / POINT where a certain version of GCC actually
demonstrated bad behavior because a record containing a tag and a
double was aligned differently that one containing only a tag. I.e. in
this environment, casts between struct types could generate pointer
adjustment code. I recall it was a cross-compiler for an embedded
processor, so this is probably not a common occurrence. I have never
seen it in e.g. an x86 setting.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,954
Messages
2,570,116
Members
46,704
Latest member
BernadineF

Latest Threads

Top