There isn't a great deal to read...
Why make things complex? JSON is an ideal candidate for representing
structure and array types. It is after all designed as an object notation.
yep, JSON makes sense.
one possible downside, though, is that it doesn't normally identify
object types, so some mechanism may be needed to indicate what sort of
struct is being serialized, and/or the element type of an array, ...
it is either that, or deal with the data in a dynamically-typed manner,
rather than mapping it directly onto raw C structs.
JSON is generally better IMO for serializing dynamically-typed data
than for doing data-binding against structs.
another minor problem with JSON is that, in its pure form, it has no
good way to deal with cyclic data (where a referenced object may refer
back to a prior object), but an extended form could allow this.
in my case, one mechanism I have serializes a wider range of data into
a binary format, but it only deals with dynamically type-tagged data
(which basically means the data was allocated via my GC API with a
type-name supplied), and it requires that the typedefs have any relevant
annotations (it depends on data gathered by a tool that parses the
headers in order to work correctly).
it may try to use a special serialization handler if one is registered,
but will otherwise fall back to using key/value serialization of the
structs.
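conceptually, the dispatch looks something like the following sketch
(the names here are invented for illustration; the real code is more
involved):

#include <string.h>

//hypothetical per-type handler registry: use a registered handler for
//the type-tag if one exists, otherwise fall back to generic key/value
//serialization of the struct members
typedef int (*SerHandlerFn)(void *obj, void *dst);

typedef struct {
    const char   *tyname;   //type-tag, e.g. "foo_t"
    SerHandlerFn  fn;       //special-case serializer
} SerHandler;

static SerHandler ser_handlers[64];
static int ser_nhandlers;

static SerHandlerFn ser_lookup_handler(const char *tyname)
{
    int i;
    for(i = 0; i < ser_nhandlers; i++)
        if(!strcmp(ser_handlers[i].tyname, tyname))
            return ser_handlers[i].fn;
    return 0;
}

int ser_struct_keyvalue(void *obj, const char *tyname, void *dst); //generic path

int ser_serialize_obj(void *obj, const char *tyname, void *dst)
{
    SerHandlerFn fn = ser_lookup_handler(tyname);
    if(fn)
        return fn(obj, dst);                        //registered special case
    return ser_struct_keyvalue(obj, tyname, dst);   //fall back to key/value
}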
the basic format (for each member) is:
<typeindex:VLI> <size:VLI> <data:byte[size]>
where 'VLI' is a special encoding for variable-length-integers.
in my case:
00-7F         0-127
80-BF XX      128-16383
C0-DF XX XX   16384-2097151
....
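a rough sketch of how this prefix scheme could be encoded and decoded in
C (the helper names are made up for illustration; the real code may
differ):

//encode 'v' per the prefix scheme above, returning bytes written
int vli_encode(unsigned long v, unsigned char *buf)
{
    if(v < 0x80) {                          //00-7F: 0-127
        buf[0] = (unsigned char)v;
        return 1;
    } else if(v < 0x4000) {                 //80-BF XX: 128-16383
        buf[0] = (unsigned char)(0x80 | (v >> 8));
        buf[1] = (unsigned char)(v & 0xFF);
        return 2;
    } else if(v < 0x200000) {               //C0-DF XX XX: 16384-2097151
        buf[0] = (unsigned char)(0xC0 | (v >> 16));
        buf[1] = (unsigned char)((v >> 8) & 0xFF);
        buf[2] = (unsigned char)(v & 0xFF);
        return 3;
    }
    return 0;                               //longer forms elided ("....")
}

//decode a VLI from 'buf' into '*rv', returning bytes consumed
int vli_decode(const unsigned char *buf, unsigned long *rv)
{
    if(buf[0] < 0x80) {
        *rv = buf[0];
        return 1;
    } else if(buf[0] < 0xC0) {
        *rv = ((unsigned long)(buf[0] & 0x3F) << 8) | buf[1];
        return 2;
    } else if(buf[0] < 0xE0) {
        *rv = ((unsigned long)(buf[0] & 0x1F) << 16) |
              ((unsigned long)buf[1] << 8) | buf[2];
        return 3;
    }
    return 0;                               //longer forms elided
}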
a slight VLI variant (SVLI) encodes signed values by folding the sign
into the LSB, so the values follow the pattern:
0, -1, 1, -2, 2, ...
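as a sketch, the fold/unfold for that could look like (again,
hypothetical names):

//fold the sign into the LSB: 0->0, -1->1, 1->2, -2->3, 2->4, ...
unsigned long svli_fold(long v)
{
    return (v < 0) ? (((unsigned long)(-v) << 1) - 1)
                   : ((unsigned long)v << 1);
}

//inverse: recover the signed value from the folded form
long svli_unfold(unsigned long u)
{
    return (u & 1) ? -(long)((u + 1) >> 1) : (long)(u >> 1);
}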
the format is basically just a flat array of serialized members, and
these members are linked via index (allowing both backwards and
forwards references). it has a limited form of value-identity preservation.
the typeindex values are basically indices into this array, pointing at
the elements which hold the type-name (which, as a simplifying
assumption, is assumed to be an ASCII string).
index 0 is reserved for NULL, and is not encoded. index 1 is the first
encoded index, and serves as the "root member" (basically, the "thing
that the program asked the serializer to serialize").
as a later addition, if the typeindex is 0, then the member is a comment
(and does not appear in the member array). a comment member immediately
at the start of the file is used to indicate the "type" of the file
(basically, it is a "magic string"), which is then followed by the root
member.
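for illustration, the per-member decode might look vaguely like this
(hypothetical names, reusing the vli_decode sketch from earlier):

//one member record: <typeindex:VLI> <size:VLI> <data:byte[size]>
typedef struct {
    unsigned long typeindex;    //index of the member holding the type-name
    unsigned long size;         //payload size in bytes
    const unsigned char *data;  //points into the serialized buffer
} SerMember;

//decode a single member, returning a pointer just past it; the caller
//appends members to a flat array, where index 0 is NULL (never encoded)
//and index 1 is the root member
const unsigned char *member_decode(const unsigned char *cs, SerMember *m)
{
    cs += vli_decode(cs, &m->typeindex);    //typeindex 0 => comment member
    cs += vli_decode(cs, &m->size);
    m->data = cs;
    return cs + m->size;
}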
a partial drawback was that the format doesn't have any good way to
indicate "how" the data is encoded, making it potentially more subject
to versioning issues (consider, for example, if a structure-layout
changes, ...). I have previously tried to develop self-describing
serialization formats, but trying to make a format fully
self-describing tends to make working with it unreasonably complex. (the
basic idea here would be that not only would the format identify the
types in use, but it would also encode information to describe all of
the data encodings used by the format, down to the level of some number
of "atomic" types, ...).
however, my Script-VM's bytecode serialization format is based on the
above mechanism (the bytecode is actually just the result of serializing
the output from compiling a source-module).
some amount of stuff also uses an S-Expression based notation:
999 //integer number
3.14159 //real number
"text" //string
name //symbol (identifier, used to identify something)
:name //keyword (special type of literal identifier)
name: //field or item name
....
( values ) //list of items (composed of "cons cells")
#( values ) //array of items (dynamically typed)
{ key: value ... } //object (dynamically-typed)
#A<sig> ( ... ) //array (statically-typed)
#X<name> { key: value ... } //struct
#L<name> { key: value ... } //instance of a class
....
#idx# //object index (declaration or reference)
#z //null
#u //undefined
#t //true
#f //false
....
so, for example, a struct like:
typedef dytname("foo_t") as_variant //(magic annotations, 1)
struct Foo_s Foo;
struct Foo_s {
    Foo *next;
    char *name;
    int x;
    float y;
    double z[16];
};
1: these annotations are no-op macros in a normal C compiler (and mostly
expand to special attributes used by the header-processing tool).
"dytname()" basically gives the type-name that will be used when
allocating instances of this struct-type (it is used to key the
type-name back to the struct).
"as_variant" is basically a hint for how it should be handled by my
scripting language. this modifier asserts that the type should be
treated as a self-defined VM type (potentially opaque), rather than be
treated as a boxed-struct or as a boxed "pointer to a struct" (more
literally mapping the struct and/or struct pointer to the scripting
language, causing script code to see it more like how C would see it).
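so, for a plain C compile, the annotations can simply expand to nothing,
something like (just a sketch; the real definitions presumably do more
when the header-processing tool is involved):

//fallback definitions so annotated headers still compile normally
#define dytname(name)
#define as_variant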
with the structs being created like:
Foo *obj;
obj = gctalloc("foo_t", sizeof(Foo));     //allocate object with type-tag
obj->name = dystrdup("foo_instance_13");  //make a new tagged string
....
might be serialized as:
#0# = #X<foo_t> { next: #1# name: "foo_instance_13" x: 99 y: 4.9
                  z: #A<d> ( 2.0 3.0 ... ) }
#1# = #X<foo_t> { ... }
this format works, but it isn't exactly pretty...
I also have a network protocol I call "BSXRP", which works very
similarly to the above (same data model, ...), except that it uses
Huffman coding, predictive context modeling of the data, and "clever"
ways of VLC-coding which data-values are sent. compression compares
favorably with S-Expressions + Deflate, typically coming in at around
25% of that size; Deflate by itself reduces the S-Expressions to around
10% of their original size, so IOW the binary protocol is around 2.5% of
the size of the textual serialization. (basically, if similar repeating
structures are sent, prior structures may be used as "templates" for
sending later structures, allowing them to be encoded in fewer bits,
essentially working sort of like a dynamically-built schema.)
as before, it has special cases to allow encoding cyclic data, but the
protocol does not generally preserve "value-identity" (value-identity or
data-identity is its own hairy set of issues, and in my case I leave
matters of identity to higher-level protocols).
some tweaks to the format also allow it to give modest compression
improvements over Deflate when used for delivering lots of short
plaintext or binary data messages (it essentially includes a
Deflate64-like compressor as a sub-mode, but addresses some of Deflate's
"weak areas").
a minor drawback though is that the context models can eat up a lot of
memory (the memory costs are considerably higher than those of Deflate).
(it was originally written partly as a "proof of concept", but is,
technically, pretty much overkill).
Other languages have better support for manipulating JSON objects, but
at least one of them (PHP) uses a C library under the hood.
yeah...
I use variants of both JSON and S-Expressions, but mostly for
dynamically typed data.
not depending on the use of type-tags and data mined from headers would
require a rather different implementation strategy.
most of my code is C, but I make fairly extensive use of dynamic-type
tagging.
so, yeah, all this isn't really a "general purpose" set of solutions for
the data-serialization process.
I suspect, though, that there may not actually be any entirely
"general purpose" solution to this problem...
and, as-is, using my implementation would probably require dragging
around roughly 400 kloc of code, and it is very likely that many people
would object to needing a special memory manager and code-processing
tools in order to use these facilities...
or, IOW:
if you allocate the data with "malloc()" or via a raw "mmap()" or
similar, a lot of my code will have no idea what it is looking at (yes,
a lame limitation, I know).
granted, the scripting language can partly work around it:
if you don't use the "as_variant" modifier, the struct will map literally,
and ironically, this allows script-code to still use "malloc()" for
these types.
however, the data serialization code generally isn't this clever...
or such...