Object persistence in C

jacob navia

I am writing software to make a general storage
facility of any kind of objects to/from disk.

The intermediate format used is XML, using the schema
(modified a bit) of Microsoft: xmlns="x-schema:xop-schema.xml"

Operation:
----------
The software generates several C functions that implement the
writing of the XML. To make things more concrete suppose
the following setup:

typedef struct tagG {
int tab[10];
} Tab;
typedef struct tagstruct {
char a;
short b;
int c;
unsigned d;
long e;
long long f;
long double g;
double h;
char * str;
Tab tab;
struct tagstruct *Next;
} structure;

The "wizard" software generates the following functions:
----------------------------------------------
//@ Serialization function for structure structure
int structureSerialize(structure *data,FILE *out)
{
int i;
unsigned char *p;
if (data == NULL)
return 0;
if (!initialized) {
InitXmlWriter(out);
initialized=1;
}
fprintf(out,"<Object id=\"ID%x\" typename=\"structure\">\n",(int)data);
fprintf(out,"\t<byte name=\"a\">%d</byte>\n",data->a);
fprintf(out,"\t<int name=\"b\">%d</int>\n",data->b);
fprintf(out,"\t<int name=\"c\">%d</int>\n",data->c);
fprintf(out,"\t<unsignedInt name=\"d\">%u</unsignedInt>\n",data->d);
fprintf(out,"\t<int name=\"e\">%ld</int>\n",data->e);
fprintf(out,"\t<long name=\"f\">%lld</long>\n",data->f);
// Type long double not supported natively.
// Using hexadecimal encoding
p = (unsigned char *)&data->g;
fprintf(out,"\t<bin.hex name=\"g\">");
for(i=0; i<12;i++) {
fprintf(out,"%x",*(p++) & 0xff);
}
fprintf(out,"</bin.hex>\n");
fprintf(out,"\t<double name=\"h\">%.15g</double>\n",data->h);
// Assume char * points to strings
fprintf(out,
"\t<string name=\"str\" xml:space=\"preserve\">%s</string>\n",
data->str);
fprintf(out,"\t<IDREF name=\"tab\">ID%x</IDREF>\n",&data->tab);
fprintf(out,"\t<IDREF name=\"Next\">ID%x</IDREF>\n",data->Next);
fprintf(out,"</Object>\n");
structureSerialize(data->Next,out); // follow the Next pointer
TabSerialize(&data->tab,out); // Follow embedded structures
return 1;
}
-----------------------------------------------------------------
This function, when called will generate the following xml:
----------------------------------------------------
<Object id="ID12ff00" typename="structure">
<byte name="a">-56</byte>
<int name="b">3876</int>
<int name="c">-254</int>
<unsignedInt name="d">598877</unsignedInt>
<int name="e">777899</int>
<bin.hex name="g">000000080ff7f00</bin.hex>
<double name="h">687.988877</double>
<string name="str" xml:space="preserve">A string</string>
<IDREF name="tab">ID12ff40</IDREF>
<IDREF name="Next">ID0</IDREF>
</Object>
---------------------------------------------------------

Design principles:
------------------

1) The software will follow pointers and should be able to cope with
complicated and messy graphs, even if they contain loops.
To do this it records the address of each object stored.
(Not shown in the example above)
2) Since the address of each object is unique, the implementation
contains no embedded objects, just references (pointers) to
other objects. All objects are stored under the ObjectStore
tag (not shown).

3) Open issues are what to do with:
A) Unions. In my opinion there is no way to know which of the
members of the union is valid, so unions will not be followed
and just stored in binary form.
B) Function pointers. There is no easy way to know what is
the name of the function stored in a function pointer.
Storing the pointer may be useful if the program is loaded
at the same address.
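
Design principle 1 (recording the address of each object stored) could be sketched roughly as below. This is only an illustration, not the wizard's actual code: `Registry`, `registry_lookup_or_add` and the fixed capacity are hypothetical names and choices. Each address is recorded the first time it is seen and given a small sequential ID, so a loop in the graph is detected on the second visit, and the ID written to the file does not depend on the numeric value of the pointer.

```c
#include <assert.h>
#include <stddef.h>

#define REGISTRY_MAX 1024   /* arbitrary limit for this sketch */

/* Hypothetical registry of the objects already written. */
typedef struct {
    const void *seen[REGISTRY_MAX];
    size_t count;
} Registry;

/*
 * Look up ptr and return its ID (1-based; 0 is reserved for NULL).
 * *is_new is set to 1 when the object had not been seen before, in
 * which case the caller should serialize it; otherwise only a
 * reference to the ID is needed. Returns -1 if the registry is full.
 */
long registry_lookup_or_add(Registry *r, const void *ptr, int *is_new)
{
    size_t i;

    assert(r != NULL && is_new != NULL);
    *is_new = 0;
    if (ptr == NULL)
        return 0;
    /* Linear scan; a hash table would scale better for big graphs. */
    for (i = 0; i < r->count; i++)
        if (r->seen[i] == ptr)
            return (long)i + 1;
    if (r->count == REGISTRY_MAX)
        return -1;
    r->seen[r->count++] = ptr;
    *is_new = 1;
    return (long)r->count;
}
```

A generated serializer would call this before writing an object and skip the body when `is_new` comes back 0.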

I have followed the literature on this a bit, and I have never
seen a C implementation, only C++ ones, where the problems are
much bigger than in C since they have to cope with multiple
inheritance hierarchies, templates, and so on. Happily, in C everything
is much simpler.

Questions:

Are any of you aware of an implementation of this in C?

What would you propose for unions and function pointers?

Are there any other standards for datatypes in XML besides
the one mentioned above?

Thanks in advance for your time

jacob
 
Jonathan Bartlett

jacob said:
I am writing software to make a general storage
facility of any kind of objects to/from disk.

You might try comp.programming, since they deal in a lot of the
algorithmic questions.

Jon
 
Eric Sosman

jacob said:
I am writing software to make a general storage
facility of any kind of objects to/from disk.
[...]
The "wizard" software generates the following functions:
----------------------------------------------
//@ Serialization function for structure structure
int structureSerialize(structure *data,FILE *out)
{
int i;
unsigned char *p;
if (data == NULL)
return 0;
if (!initialized) {
InitXmlWriter(out);
initialized=1;
}

Is `initialized' a static variable somewhere? If so,
it seems you can have only one XmlWriter stream active at
a time, or maybe even at all.

A possible alternative would be to wrap the FILE* in
a struct of its own along with whatever state variables
are needed, so you can do

XmlWriter *outxml = NewXmlWriter(out);

... and then pass an XmlWriter* to all the wizard-generated
("charmed?") functions.
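
A minimal sketch of that wrapper idea, under stated assumptions: only the names `XmlWriter` and `NewXmlWriter` come from the post above. The struct members, `XmlWriterBegin`, `FreeXmlWriter` and the XML prolog written here are made up for illustration, since the thread does not show what `InitXmlWriter` actually emits.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical writer state: one instance per output stream, no globals. */
typedef struct {
    FILE *out;
    int header_written;   /* takes the place of the global `initialized' flag */
} XmlWriter;

XmlWriter *NewXmlWriter(FILE *out)
{
    XmlWriter *w;

    assert(out != NULL);
    w = malloc(sizeof *w);
    if (w != NULL) {
        w->out = out;
        w->header_written = 0;
    }
    return w;   /* NULL if allocation failed */
}

/* Emit the prolog once per writer; returns 0 on an output error. */
int XmlWriterBegin(XmlWriter *w)
{
    assert(w != NULL);
    if (!w->header_written) {
        fputs("<?xml version=\"1.0\"?>\n", w->out);
        w->header_written = 1;
    }
    return !ferror(w->out);
}

void FreeXmlWriter(XmlWriter *w)
{
    free(w);   /* the caller still owns and closes the FILE* */
}
```

With the state held per writer, two streams can be active at the same time, each with its own `header_written` flag.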
fprintf(out,"<Object id=\"ID%x\" typename=\"structure\">\n",(int)data);

Non-portable (as I expect you know), since the conversion
from pointer to int is implementation-defined and perhaps
meaningless. Even if the conversion does something simple
like "just copy the bits," the generated object IDs might
not be unique (if int is narrower than pointer, say, or if
dynamic memory management re-uses a free()d object's memory).
fprintf(out,"\t<byte name=\"a\">%d</byte>\n",data->a);
fprintf(out,"\t<int name=\"b\">%d</int>\n",data->b);
fprintf(out,"\t<int name=\"c\">%d</int>\n",data->c);
...

Bleah. Have you considered a table-driven solution?
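
For readers unfamiliar with the term, a table-driven version might look something like the sketch below. Everything in it (`FieldDesc`, `FieldKind`, `WriteFields`, the cut-down `Small` struct) is hypothetical and only covers a few integer types. The idea is that the generator emits one data table per struct, and a single generic routine walks it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { F_CHAR, F_SHORT, F_INT, F_UNSIGNED } FieldKind;

/* One row per member: XML tag, member name, kind, byte offset. */
typedef struct {
    const char *tag;
    const char *name;
    FieldKind kind;
    size_t offset;
} FieldDesc;

/* Cut-down stand-in for the `structure' type of the original post. */
typedef struct {
    char a;
    short b;
    int c;
    unsigned d;
} Small;

static const FieldDesc small_fields[] = {
    { "byte",        "a", F_CHAR,     offsetof(Small, a) },
    { "int",         "b", F_SHORT,    offsetof(Small, b) },
    { "int",         "c", F_INT,      offsetof(Small, c) },
    { "unsignedInt", "d", F_UNSIGNED, offsetof(Small, d) },
};

/* Write the n members described by f; returns 0 on an output error. */
int WriteFields(FILE *out, const void *obj, const FieldDesc *f, size_t n)
{
    const unsigned char *base = obj;
    size_t i;

    assert(out != NULL && obj != NULL && f != NULL);
    for (i = 0; i < n; i++) {
        const void *p = base + f[i].offset;
        fprintf(out, "\t<%s name=\"%s\">", f[i].tag, f[i].name);
        switch (f[i].kind) {
        case F_CHAR:     fprintf(out, "%d", *(const char *)p);     break;
        case F_SHORT:    fprintf(out, "%d", *(const short *)p);    break;
        case F_INT:      fprintf(out, "%d", *(const int *)p);      break;
        case F_UNSIGNED: fprintf(out, "%u", *(const unsigned *)p); break;
        }
        fprintf(out, "</%s>\n", f[i].tag);
    }
    return !ferror(out);
}
```

The trade-off is less generated code per struct against one more level of indirection when reading it.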
// Type long double not supported natively.
// Using hexadecimal encoding
p = (unsigned char *)&data->g;
fprintf(out,"\t<bin.hex name=\"g\">");
for(i=0; i<12;i++) {
fprintf(out,"%x",*(p++) & 0xff);
}

Non-portable, of course.
structureSerialize(data->Next,out); // follow the Next pointer
TabSerialize(&data->tab,out); // Follow embedded structures

I'd have expected these to be done in the opposite order
(but I haven't read the M'soft specs). Either way, though,
using recursion to chase what might be a long linked list is
not a wonderful idea.
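
As a rough illustration of the non-recursive alternative, a plain list can be walked with a loop. The `Node` type and `SerializeList` are hypothetical stand-ins, and the sketch assumes the list is not circular (coping with loops would need a table of visited addresses).

```c
#include <assert.h>
#include <stdio.h>

typedef struct Node {
    int value;
    struct Node *Next;
} Node;

/*
 * Write every node reachable through Next using a loop, so the depth
 * of the call stack does not grow with the length of the list.
 * Returns the number of nodes written, or -1 on an output error.
 */
long SerializeList(const Node *head, FILE *out)
{
    long count = 0;
    const Node *p;

    assert(out != NULL);
    for (p = head; p != NULL; p = p->Next) {
        fprintf(out, "<Object typename=\"Node\">\n"
                     "\t<int name=\"value\">%d</int>\n"
                     "</Object>\n", p->value);
        count++;
    }
    return ferror(out) ? -1 : count;
}
```

For general graphs the same effect can be had with an explicit work list of pointers still to be written, which replaces the implicit call stack.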
return 1;

If `1' means "success," maybe this should be written
as `return !ferror(out);' or some such.
3) Open issues are what to do with:
A) Unions. In my opinion there is no way to know which of the
members of the union is valid, so unions will not be followed
and just stored in binary form.

Hence non-portable.
B) Function pointers. There is no easy way to know what is
the name of the function stored in a function pointer.
Storing the pointer may be useful if the program is loaded
at the same address.

... and hasn't been recompiled or even relinked, and
hasn't been loaded with a newer version of a shared library,
and isn't running under a debugger and ...

There's also the problem that C doesn't define the
conversion of a function pointer to any numeric datum; the
only way to get a portable representation would be to deal
with the pointer's constituent bytes. The byte stream would
be interpretable by but meaningless to a recipient other than
the same program (if lucky), hence non-portable.

If you have a table of "pointable" functions you can
translate the pointer to a name easily enough -- and such
a table would seem necessary on the receiving end, to get
from name back to function pointer again. If you get hold
of a function pointer whose target is not in your table,
I think you should announce a serialization failure.
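
A small sketch of such a table, with made-up names throughout (`Callback`, `on_open`, `on_close`, `FunctionName`, `FunctionByName`). The sender maps a pointer to a name, the receiver maps the name back, and an unknown pointer or name yields NULL so the caller can report a serialization failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*Callback)(void);

/* Two stand-in "pointable" functions. */
static int open_count;
static void on_open(void)  { open_count++; }
static void on_close(void) { open_count--; }

static const struct { const char *name; Callback fn; } fn_table[] = {
    { "on_open",  on_open  },
    { "on_close", on_close },
};
#define FN_COUNT (sizeof fn_table / sizeof fn_table[0])

/* Sender side: pointer to name, or NULL if the target is not in the table. */
const char *FunctionName(Callback fn)
{
    size_t i;

    for (i = 0; i < FN_COUNT; i++)
        if (fn_table[i].fn == fn)
            return fn_table[i].name;
    return NULL;
}

/* Receiver side: name to pointer, or NULL if the name is unknown. */
Callback FunctionByName(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < FN_COUNT; i++)
        if (strcmp(fn_table[i].name, name) == 0)
            return fn_table[i].fn;
    return NULL;
}
```

Only the name goes into the file, so the result survives recompiling or relinking as long as both ends agree on the table.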
Questions:

Are any of you aware of an implementation of this in C?

What would you propose for unions and function pointers?

If you can't support them usefully, don't support them
at all. Opinion only; YMMV.
Are there any other standards for datatypes in XML besides
the one mentioned above?

I don't know. Probably. My counter-question: Since you're
committed to a non-portable representation anyhow (c.f. the
treatment of `long double'), why fool around with XML? What
advantage does it offer if the portably-packaged content isn't
itself portable?
 
Bilgehan.Balban

jacob said:
3) Open issues are what to do with:
A) Unions. In my opinion there is no way to know which of the
members of the union is valid, so unions will not be followed
and just stored in binary form.
B) Function pointers. There is no easy way to know what is
the name of the function stored in a function pointer.
Storing the pointer may be useful if the program is loaded
at the same address.
What would you propose for unions and function pointers?

jacob

You could provide serialisation functions that take a list of union
types and/or already assigned function pointers for that particular
structure, in order of appearance in the structure. You could easily
do this by overloading the function, using your C compiler with
overloading extensions ;->

This is the way I would have done it. Probably you have already thought
of better solutions.

Bahadir
 
jacob navia

Thanks for your answer. I reply below:

Eric said:
jacob said:
I am writing software to make a general storage
facility of any kind of objects to/from disk.
[...]
The "wizard" software generates the following functions:
----------------------------------------------
//@ Serialization function for structure structure
int structureSerialize(structure *data,FILE *out)
{
int i;
unsigned char *p;
if (data == NULL)
return 0;
if (!initialized) {
InitXmlWriter(out);
initialized=1;
}


Is `initialized' a static variable somewhere? If so,
it seems you can have only one XmlWriter stream active at
a time, or maybe even at all.

In this first implementation, yes. I will improve that later by creating
an output stream type that will contain the static
data.
A possible alternative would be to wrap the FILE* in
a struct of its own along with whatever state variables
are needed, so you can do

XmlWriter *outxml = NewXmlWriter(out);

Exactly. Thanks for pointing this out.
... and then pass an XmlWriter* to all the wizard-generated
("charmed?") functions.




Non-portable (as I expect you know), since the conversion
from pointer to int is implementation-defined and perhaps
meaningless. Even if the conversion does something simple
like "just copy the bits," the generated object IDs might
not be unique (if int is narrower than pointer, say, or if
dynamic memory management re-uses a free()d object's memory).

You are right. Will change that to (intptr_t) and include
Bleah. Have you considered a table-driven solution?

Note that you are seeing the code generated by the "wizard", not
the code of the wizard itself. This is straightforward to generate
and easy to follow.
Non-portable, of course.

True. I have to investigate writing ratios of high-precision
integers: since integers are supported with 64-bit precision,
maybe I can express a long double as a/b where a and b are 64-bit
quantities.
I'd have expected these to be done in the opposite order
(but I haven't read the M'soft specs). Either way, though,
using recursion to chase what might be a long linked list is
not a wonderful idea.

You have a point here. But I do not see an easy way out other than
recursion.
If `1' means "success," maybe this should be written
as `return !ferror(out);' or some such.

Yes, good suggestion.
Hence non-portable.

I do not see what I could do other than that.
... and hasn't been recompiled or even relinked, and
hasn't been loaded with a newer version of a shared library,
and isn't running under a debugger and ...

There's also the problem that C doesn't define the
conversion of a function pointer to any numeric datum; the
only way to get a portable representation would be to deal
with the pointer's constituent bytes. The byte stream would
be interpretable by but meaningless to a recipient other than
the same program (if lucky), hence non-portable.

I say "may" be useful. Probably I should bail out with an error, the
same as when I find a union.
If you have a table of "pointable" functions you can
translate the pointer to a name easily enough -- and such
a table would seem necessary on the receiving end, to get
from name back to function pointer again. If you get hold
of a function pointer whose target is not in your table,
I think you should announce a serialization failure.




If you can't support them usefully, don't support them
at all. Opinion only; YMMV.

I think I will do that. Better warn the user of unsupported
features.
I don't know. Probably. My counter-question: Since you're
committed to a non-portable representation anyhow (c.f. the
treatment of `long double'), why fool around with XML? What
advantage does it offer if the portably-packaged content isn't
itself portable?

Well, besides the long double problem, other types are 100% portable.
Other ways to encode the long double in a portable way would be
to split it in mantissa, sign and exponent, and store them in portable
types: mantissa in a 64 bit unsigned integer (supported natively),
sign and exponent (without the bias) as normal integers.
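
The split described above could look roughly like this. It is a sketch under stated assumptions, not a finished design: `LongDoubleParts`, `SplitLongDouble` and `JoinLongDouble` are hypothetical names, the value must be finite, negative zero is not distinguished from zero, and the round trip is only exact when `long double` has at most 64 mantissa bits (as the x87 80-bit format does). It uses plain arithmetic; `frexpl` and `ldexpl` from `<math.h>` would be the more usual way to do the scaling.

```c
#include <assert.h>
#include <stdint.h>

#define TWO_POW_64 18446744073709551616.0L   /* 2^64, exactly representable */

typedef struct {
    int negative;        /* 1 if the value is below zero */
    int exponent;        /* value = mantissa * 2^(exponent - 64) */
    uint64_t mantissa;   /* top bit set unless the value is zero */
} LongDoubleParts;

LongDoubleParts SplitLongDouble(long double x)
{
    LongDoubleParts p = { 0, 0, 0 };

    assert(x - x == 0.0L);          /* rejects NaN and infinities */
    if (x < 0) {
        p.negative = 1;
        x = -x;
    }
    if (x == 0)
        return p;
    /* Scale x into [0.5, 1); scaling by 2 is exact in binary floating point. */
    while (x >= 1.0L) { x /= 2; p.exponent++; }
    while (x < 0.5L)  { x *= 2; p.exponent--; }
    p.mantissa = (uint64_t)(x * TWO_POW_64);   /* now in [2^63, 2^64) */
    return p;
}

long double JoinLongDouble(LongDoubleParts p)
{
    long double x = (long double)p.mantissa / TWO_POW_64;
    int e = p.exponent;

    while (e > 0) { x *= 2; e--; }
    while (e < 0) { x /= 2; e++; }
    return p.negative ? -x : x;
}
```

The three members map onto an unsigned 64-bit integer and two ordinary integers in the XML, which is the encoding suggested in the paragraph above.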

Thanks for your input.

jacob
 
Eric Sosman

jacob said:
Thanks for your answer. I reply below:

Well, besides the long double problem, other types are 100% portable.

Well, "100% portable to the implementations where they're
portable." ;-) An `int', for example, is only portable if
its value is in the range -32767 <= i <= 32767, a `char'
(considered as a number) is only portable if 0 <= c <= 127,
and other types have similar "value bands" of portability.

And then there's floating-point: You're converting to
text with "%.15g", but you really don't know how many decimal
digits you need to guarantee that the receiver can read back
exactly the same value the sender serialized. If you can use
C99 features, consider flavors of "%a" instead.

There's also the nasty issue of infinities and NaNs, which
(1) are not supported on all implementations and (2) can have
implementation-defined text formats (see 7.19.6.1/8).
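
To make the "%a" suggestion concrete, here is a minimal sketch with hypothetical helper names (`FormatDoubleHex`, `ParseDoubleHex`). It assumes a C99 library and finite values; infinities and NaNs would still need separate handling, as noted above. With the default precision, "%a" prints an exact hexadecimal representation when FLT_RADIX is a power of 2, and `strtod` accepts that form.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write x into buf as a C99 hexadecimal floating constant.
 * Returns 1 on success, 0 if buf is too small or formatting failed. */
int FormatDoubleHex(double x, char *buf, size_t size)
{
    int n;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "%a", x);
    return n > 0 && (size_t)n < size;
}

/* Read the text back; C99 strtod understands the hexadecimal form. */
double ParseDoubleHex(const char *text)
{
    assert(text != NULL);
    return strtod(text, NULL);
}
```

The text is less readable for humans than a decimal number, which may or may not matter for an XML file meant mostly for machines.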
Other ways to encode the long double in a portable way would be
to split it in mantissa, sign and exponent, and store them in portable
types: mantissa in a 64 bit unsigned integer (supported natively),
sign and exponent (without the bias) as normal integers.

I still don't understand why `long double' should be any
more troublesome than `double' or `float'. It's supported on
all conforming C implementations (albeit with different ranges
and precisions, but that doesn't seem to bother you for any
of the other types). Why is `long double' special?
 
jacob navia

Eric said:
I still don't understand why `long double' should be any
more troublesome than `double' or `float'. It's supported on
all conforming C implementations (albeit with different ranges
and precisions, but that doesn't seem to bother you for any
of the other types). Why is `long double' special?

Because the XML reader should support natively double/float/64 bit
ints and 32 bit ints. Long double isn't in that list.

This is from the specs I have read at the microsoft site that
described the xop-schema that extends the XML datatype schema.
 
jacob navia

Eric said:
[snip]
And then there's floating-point: You're converting to
text with "%.15g", but you really don't know how many decimal
digits you need to guarantee that the receiver can read back
exactly the same value the sender serialized. If you can use
C99 features, consider flavors of "%a" instead.

Using the IEEE 754 representation DBL_DIG is 15. That's why I used
that. Isn't that correct? What value would you use?

And of course, if the reading machine has 16 bits ints, some values
can't be read back as such, or if it doesn't support
floating point, etc etc.
 
Keith Thompson

jacob navia said:
Eric Sosman wrote:
[snip]
And then there's floating-point: You're converting to
text with "%.15g", but you really don't know how many decimal
digits you need to guarantee that the receiver can read back
exactly the same value the sender serialized. If you can use
C99 features, consider flavors of "%a" instead.

Using the IEEE 754 representation DBL_DIG is 15. That's why I used
that. Isn't that correct? What value would you use?

I'm jumping into the middle of this without having read some of the
previous discussion, but ...

If you're assuming IEEE 754 representation, you're not writing
completely portable C code. That's not necessarily a horrible
thing, but you should at least document your assumptions.

As for what value you should use, why not just use DBL_DIG? (Or do
you need DBL_DIG+1 to guarantee you can retrieve the original value?
Perhaps a floating-point expert can clarify.)
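
On the DBL_DIG question, a quick experiment is possible. DBL_DIG is the number of decimal digits guaranteed to survive a text to double to text round trip. Going the other way (double to text to double), an IEEE 754 64-bit double needs 17 significant digits, and C11 exposes that count as DBL_DECIMAL_DIG. The sketch below assumes IEEE 754 doubles; `RoundTrips` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if x survives being printed with `digits' significant
 * decimal digits and read back with strtod, 0 otherwise. */
int RoundTrips(double x, int digits)
{
    char buf[64];

    assert(digits > 0 && digits <= 40);
    snprintf(buf, sizeof buf, "%.*g", digits, x);
    return strtod(buf, NULL) == x;
}
```

For example, with IEEE 754 doubles `RoundTrips(1.0 / 3.0, 15)` returns 0 while `RoundTrips(1.0 / 3.0, 17)` returns 1, so "%.15g" can lose information and "%.17g" is the safer decimal format.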
 
Antoine Leca

[ Jumping in the middle of a discussion which is happening in two forums is
probably a bad idea, but... ]

En jacob navia va escriure:
Because the XML reader should support natively double/float/64 bit
ints and 32 bit ints. Long double isn't in that list.

You are writing for Win32/64, ain't you? Then long double has the same
representation as double on these platforms, so the XML reader will not make
any problem.

Of course Virginia, it is not portable to make such an assumption. But it is
exactly as non-portable as it would be to assume that long long int is _not_
a 128-bit wide type (which is not handled either).


Antoine
 
jacob navia

Antoine said:
[ Jumping in the middle of a discussion which is happening in two forums is
probably a bad idea, but... ]

En jacob navia va escriure:
Because the XML reader should support natively double/float/64 bit
ints and 32 bit ints. Long double isn't in that list.


You are writing for Win32/64, ain't you? Then long double has the same
representation as double on these platforms, so the XML reader will not make
any problem.

This is only true if you use Microsoft's compilers. Using lcc-win32,
or gcc will give you TRUE long doubles with 80 bits precision as
the machine allows.

Microsoft used to support true long doubles up to MSC 5.1 if
I remember correctly. Then, they dropped it for mysterious
reasons. The machine supports long doubles natively.

jacob
 
Keith Thompson

jacob navia said:
Antoine said:
[ Jumping in the middle of a discussion which is happening in two forums is
probably a bad idea, but... ]
En jacob navia va escriure:
Because the XML reader should support natively double/float/64 bit
ints and 32 bit ints. Long double isn't in that list.
You are writing for Win32/64, ain't you? Then long double has the same
representation as double on these platforms, so the XML reader will not make
any problem.

This is only true if you use Microsoft's compilers. Using lcc-win32,
or gcc will give you TRUE long doubles with 80 bits precision as
the machine allows.

As far as the language is concerned, a long double type that's larger
than double is no more or less "true" than one that's the same size as
double.

It's common for two or more of the predefined integer types to be the
same size. It's probably not as common for the predefined
floating-point types, but it's equally valid.
 
