directly serializing structs

Cagdas Ozgenc

Greetings,

When directly serializing C++ structures to a file with the standard
library functions giving the address of the data and length of
structure using the sizeof operator, do I risk portability because of
different compilers packing structures into different sizes or
components of this structure to different address boundaries (for
example placing in multiples of 4 on a 32bit system)? Once the file is
serialized, does the same code compiled by another compiler or even
the same compiler but a different version carry the risk of not
reading the contents properly?

Thank you
 
John Harrison

Cagdas said:
Greetings,

When directly serializing C++ structures to a file with the standard
library functions giving the address of the data and length of
structure using the sizeof operator, do I risk portability because of
different compilers packing structures into different sizes or
components of this structure to different address boundaries (for
example placing in multiples of 4 on a 32bit system)?

Yes. You also risk portability problems because different compilers (or
platforms) have different formats for individual data items. Byte
ordering for integers often varies across platforms, and floating point
formats can even vary between compilers on the same platform.

Once the file is
serialized, does the same code compiled by another compiler or even
the same compiler but a different version carry the risk of not
reading the contents properly?

Yes, see above.

If this is a concern then consider using text, it's much more portable.
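
A minimal sketch of the text approach, assuming a hypothetical `Record` type with two integer members (neither the type nor the function names come from this thread). Each member is written as decimal text and parsed back in the same order, so padding and byte order never reach the file.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical record type, for illustration only.
struct Record {
    int  id;
    long count;
};

// Write each member as decimal text, separated by a space.
std::ostream& write_text(std::ostream& out, const Record& r) {
    return out << r.id << ' ' << r.count << '\n';
}

// Read the members back in the same order they were written.
std::istream& read_text(std::istream& in, Record& r) {
    return in >> r.id >> r.count;
}

// Round trip through a string stream, standing in for a file.
Record round_trip(const Record& r) {
    std::ostringstream out;
    write_text(out, r);
    std::istringstream in(out.str());
    Record copy{};
    read_text(in, copy);
    assert(!in.fail());  // the text we just wrote should parse
    return copy;
}
```

Floating point members would need extra care with output precision, which this sketch leaves out.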
 
Ian Collins

Cagdas said:
Greetings,

When directly serializing C++ structures to a file with the standard
library functions giving the address of the data and length of
structure using the sizeof operator, do I risk portability because of
different compilers packing structures into different sizes or
components of this structure to different address boundaries (for
example placing in multiples of 4 on a 32bit system)? Once the file is
serialized, does the same code compiled by another compiler or even
the same compiler but a different version carry the risk of not
reading the contents properly?

Yes, it does. Internal layout is implementation defined.
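
One way to see this on a given toolchain is to ask the compiler directly. The sketch below uses a hypothetical `Sample` struct (not from the thread) and reports how many bytes of `sizeof` are padding rather than data.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical struct: one byte followed by a 32-bit integer.
struct Sample {
    char          tag;
    std::uint32_t value;
};

// Bytes occupied by the members themselves.
constexpr std::size_t member_bytes = sizeof(char) + sizeof(std::uint32_t);

// Whatever is left over is padding chosen by the implementation.
constexpr std::size_t padding_bytes = sizeof(Sample) - member_bytes;

// Print the layout this particular compiler and flag set produced.
void print_layout() {
    std::printf("sizeof(Sample)          = %zu\n", sizeof(Sample));
    std::printf("offsetof(Sample, value) = %zu\n", offsetof(Sample, value));
    std::printf("padding bytes           = %zu\n", padding_bytes);
}
```

Many mainstream compilers insert three padding bytes after `tag` here, but nothing in the language requires that, and packing options can change it. A struct-sized write puts those padding bytes into the file as well.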
 
JohnQ

Cagdas Ozgenc said:
Greetings,

When directly serializing C++ structures to a file with the standard
library functions giving the address of the data and length of
structure using the sizeof operator, do I risk portability because of
different compilers packing structures into different sizes or
components of this structure to different address boundaries (for
example placing in multiples of 4 on a 32bit system)? Once the file is
serialized, does the same code compiled by another compiler or even
the same compiler but a different version carry the risk of not
reading the contents properly?

Does your software/application require that much portability? Why struggle
with "write once, compile anywhere" if you're only targeting one platform or
even only one machine, for instance?

I wouldn't call what you described above "serializing" though. To me,
"serializing" has the connotation that you indeed are looking into structs
and the sizes of data members, their padding etc. or using ASN.1/BER for
over the wire transmission, for another example, rather than just doing a
struct-sized write. The recommended practice of streaming everything at
every boundary (disk, wire) seems unnatural and tedious to me also. I guess
a layer at the boundaries that does the streaming on the non-primary
platform and doesn't do anything on the primary platform isn't that bad to
implement.

I can think of 3 issues that prevent the "blast struct all over" concept
from working: endianness, padding/alignment, datatype sizes. The first one is
the party spoiler. Guaranteed width integers help with the last issue.
Byte-aligning data (no padding) is probably available on most compilers (?).
Endianness though, well there's not much you can do about that to make the
concept work. Luckily, the users of big endian machines are mostly
categorizably different from little endian machine users, so you can just
pick your target users and tailor your software to them. Or else do the
conversions:

struct on Intel going over wire to a Sparc -> no change to struct
struct coming into Sparc from Intel -> convert struct endianness
struct on Sparc going to disk -> convert struct endianness
struct on Sparc going to Intel -> convert struct endianness
struct coming into Intel from Sparc -> no change to struct
struct on Intel going to disk -> no change to struct

(The above scenario assumes platform-independent files are desired. If not,
fewer conversions required).
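
The endian conversions listed above amount to reversing the bytes of every multi-byte member. A rough sketch, assuming a hypothetical `Sample` struct with one 16-bit and one 32-bit member; the helper names are invented for illustration.

```cpp
#include <cstdint>

// Reverse the two bytes of a 16-bit value.
constexpr std::uint16_t swap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Reverse the four bytes of a 32-bit value.
constexpr std::uint32_t swap32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

// Hypothetical struct, for illustration only.
struct Sample {
    std::uint16_t id;
    std::uint32_t size;
};

// Convert a whole struct by converting each member in turn.
// Applying it twice gives back the original value.
constexpr Sample convert_endianness(Sample s) {
    return Sample{swap16(s.id), swap32(s.size)};
}
```

Only the non-primary platform would call `convert_endianness`; the primary platform reads and writes the members unchanged. Padding and member sizes still have to match on both sides for this to work.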
(Yes, before anyone quips, I do know that "network byte order" is big
endian. There are also more Windows machines than Unix).

(Issue 4: size of a byte).

John
 
Dave Rahardja

When directly serializing C++ structures to a file with the standard
library functions giving the address of the data and length of
structure using the sizeof operator, do I risk portability because of
different compilers packing structures into different sizes or
components of this structure to different address boundaries (for
example placing in multiples of 4 on a 32bit system)? Once the file is
serialized, does the same code compiled by another compiler or even
the same compiler but a different version carry the risk of not
reading the contents properly?

Serialization is a complex issue that has so far eluded a truly general
solution, primarily because the needs of each developer vary so much. There
are several "classes" of techniques, though. They are described quite well in
the C++ FAQ Lite pages:
http://www.parashift.com/c++-faq-lite/serialization.html

When I want to do general-purpose, cross-platform, binary-compatible
exchanges, I generally:

1. Pack data structures to the byte (using #pragmas, most times)
2. Use fixed-width integer types
3. Choose an endian representation and provide conversion facilities
4. Use IEEE representation for floating point numbers, else use fixed point
notation
5. Serialize PODs and structs only, not class hierarchies

An adaptation library with conditional compilation switches can be made for
items 1-4 that allows you to encapsulate the compiler- or platform-specific
behaviors.

-dr
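
Items 1, 2 and 4 of the list above could be sketched as follows. `PackedRecord` and `copy_out` are hypothetical names, and the conditional compilation switch is an assumption about which compilers accept `#pragma pack`, which is an extension rather than standard C++. Item 3 (byte order) is not handled here; the bytes come out in host order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Item 1: pack to the byte, behind a conditional compilation switch.
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
#pragma pack(push, 1)
#else
#error "No known packing directive for this compiler"
#endif

// Item 2: fixed-width integer types. Hypothetical record.
struct PackedRecord {
    std::uint8_t  kind;
    std::uint32_t length;
    std::uint16_t flags;
    float         ratio;
};

#pragma pack(pop)

// Fail the build if the layout is not the one the file format expects.
static_assert(sizeof(PackedRecord) == 11, "unexpected padding");
static_assert(offsetof(PackedRecord, length) == 1, "unexpected offset");
static_assert(offsetof(PackedRecord, flags) == 5, "unexpected offset");
static_assert(offsetof(PackedRecord, ratio) == 7, "unexpected offset");

// Item 4: insist on a 4-byte IEEE float.
static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "float is not a 4-byte IEEE type");

// Copy the packed bytes into a caller-supplied buffer, ready to be written.
void copy_out(const PackedRecord& r, unsigned char* dst, std::size_t dst_size) {
    assert(dst_size >= sizeof r);
    std::memcpy(dst, &r, sizeof r);
}
```

The `static_assert` lines turn a silent layout change (new compiler, new flags) into a compile error instead of a corrupt file.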
 
JohnQ

Dave Rahardja said:
Serialization is a complex issue that has so far eluded a truly general
solution, primarily because the needs of each developer vary so much. There
are several "classes" of techniques, though. They are described quite well in
the C++ FAQ Lite pages:
http://www.parashift.com/c++-faq-lite/serialization.html

When I want to do general-purpose, cross-platform, binary-compatible
exchanges, I generally:

1. Pack data structures to the byte (using #pragmas, most times)

I think "no padding" may indeed be a feature that a new language could
exploit.

2. Use fixed-width integer types
3. Choose an endian representation and provide conversion facilities

That's the key one. If there were one gift that the hardware vendors could
give, it would be to standardize endianness. IMO. OK, it's little endian from
now on. Let's move on! LOL! (Oh wait, can I have a standard definition of
"byte" also?).

4. Use IEEE representation for floating point numbers, else use fixed point
notation
5. Serialize PODs and structs only, not class hierarchies

By "class hierarchies", I think you mean "derived structs". If there were
more guarantee (or I was so assured) that struct B derived from struct A
would be exactly like a struct containing the data members of A followed
immediately by data members of B, I'd be eventually OK with those
compositions.

An adaptation library with conditional compilation switches can be made for
items 1-4 that allows you to encapsulate the compiler- or platform-specific
behaviors.

Grouping those hides the "severity" of 3.

Even with your 1-5, all bets are still off because sizeof(char) could be
different somewhere else (right?).

John
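
For what it's worth, `sizeof(char)` is 1 by definition in C++. What can differ between platforms is the number of bits in that byte, exposed as `CHAR_BIT` in `<climits>`. A minimal sketch of how code that assumes 8-bit bytes might state that assumption so the build fails loudly elsewhere (the helper name is hypothetical):

```cpp
#include <climits>

// Always true: the size of every type is measured in units of char.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// The real assumption: a byte holds exactly 8 bits on this platform.
static_assert(CHAR_BIT == 8, "this file format assumes 8-bit bytes");

// Hypothetical helper: number of bits in one byte on this platform.
constexpr int bits_per_byte() { return CHAR_BIT; }
```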
 
Ian Collins

JohnQ said:
I think "no padding" may indeed be a feature that a new language could
exploit.

Not if the hardware doesn't support it, or even supports it with a
significant performance hit.
 
James Kanze

When directly serializing C++ structures to a file with the standard
library functions giving the address of the data and length of
structure using the sizeof operator, do I risk portability because of
different compilers packing structures into different sizes or
components of this structure to different address boundaries (for
example placing in multiples of 4 on a 32bit system)? Once the file is
serialized, does the same code compiled by another compiler or even
the same compiler but a different version carry the risk of not
reading the contents properly?

Very much so. Even changing the compile flags can cause
problems. About the only time this works is for temporary
files, which are read and written by the same binary image.
 
James Kanze

Does your software/application require that much portability? Why struggle
with "write once, compile anywhere" if you're only targeting one platform or
even only one machine, for instance?

And one version of one compiler with one set of compiler options.

I guess he's a professional.

[...]
I can think of 3 issues that prevent the "blast struct all
over" concept from working: endianness, padding/alignment,
datatype sizes.

Representation in general. For floating point, it's a real
problem, even today. For integers, there is also at least one
machine on the market which uses 36 bit ones' complement
integers, but it's not very widespread, and many people can
afford to ignore it.

Just be aware of the restriction, and document it, so that some
maintenance programmer in the future doesn't get bitten. And
whatever you do, document all external formats, so a maintenance
programmer has a chance of implementing them on some future
hardware.

The first one is the party spoiler.

I'd say that the different representations are even worse.
(Note too that "endianness" isn't a good word, since it suggests
two possible arrangements. At least three are widespread.)
 
JohnQ

On Jun 23, 11:58 am, "JohnQ" wrote:

(Note too that "endianness" isn't a good word, since it suggests
two possible arrangements. At least three are widespread.)

But that one is called "middle ENDIAN" right? If so, that makes "endianness"
seem OK.

John
 
JohnQ

Ian Collins said:
Not if the hardware doesn't support it, or even supports it with a
significant performance hit.

That would be a nice table to see: CPUs and the supported compiler/language
properties. Writing code that will run on all platforms is a waste of effort
when it is known that the software will never be deployed on those other
platforms. Layering on top of C++ to abstract away what needn't be bothered
with on a daily coding basis is the way to go. Just because C++ is "close to
the hardware" doesn't mean you have to program at that low level all of the
time.

John
 
Dave Rahardja

On Jun 23, 11:58 am, "JohnQ" wrote:

(Note too that "endianness" isn't a good word, since it suggests
two possible arrangements. At least three are widespread.)

But that one is called "middle ENDIAN" right? If so, that makes "endianness"
seem OK.

John

I think the final takeaway of this thread may be this: define your
serialization schema down to the bit level, in a separate document from your
internal program design. Then, provide an interface that allows you to
serialize and unserialize your internal data structures. Then, provide a
compiler/platform specific library to perform the conversions. Then, replace
or conditional-compile the conversion library as needed as your program gets
ported from one compiler/platform to another.
 
JohnQ

Dave Rahardja said:
I think the final takeaway of this thread may be this: define your
serialization schema down to the bit level, in a separate document from your
internal program design. Then, provide an interface that allows you to
serialize and unserialize your internal data structures. Then, provide a
compiler/platform specific library to perform the conversions. Then, replace
or conditional-compile the conversion library as needed as your program gets
ported from one compiler/platform to another.

If hardware and language vendors/developers could get their acts together,
think how much simpler it would be to develop software. "High level
languages" that don't abstract away the hardware, aren't!

John
 
James Kanze

I've never heard it called anything:). It just is. (There are
also word addressed machines, where it makes no sense to speak
of "endian".)

I think the final takeaway of this thread may be this: define your
serialization schema down to the bit level, in a separate document from your
internal program design. Then, provide an interface that allows you to
serialize and unserialize your internal data structures. Then, provide a
compiler/platform specific library to perform the conversions. Then, replace
or conditional-compile the conversion library as needed as your program gets
ported from one compiler/platform to another.

That's true to a point. In practice, there's no need for any
compiler/platform specific code (except, perhaps, for
performance reasons when dealing with floating point).
 
James Kanze

If hardware and language vendors/developers could get their
acts together, think how much simpler it would be to develop
software.

I don't have any problems in that regard today.

"High level languages" that don't abstract away the hardware,
aren't!

But C++ does. And that's precisely what you're complaining
about; the fact that within a C++ program, there is no
endianness, so when you want to serialize, you have to introduce
it. C++ has abstracted away the hardware, and you don't know
how ints are represented on your machine. The external format,
however, has its requirements, since you transmit bytes, and not
ints.

With the exception of floating point, it's child's play, and
never represents more than two or three lines of code (in
applications which generally consist of hundreds of thousands of
lines, if not millions).
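
The two or three lines in question might look like the following sketch for a 32-bit unsigned value. The big endian external format and the function names are assumptions made for illustration, not something specified in this thread. The shifts operate on values rather than on memory, so the host's own byte order never enters into it and no platform specific code is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value to the buffer, most significant byte first.
void put_u32(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFFu));
}

// Rebuild a 32-bit value from the four bytes starting at pos.
std::uint32_t get_u32(const std::vector<unsigned char>& in, std::size_t pos) {
    assert(pos + 4 <= in.size());
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i)
        v = (v << 8) | in[pos + i];
    return v;
}
```

The same source compiles unchanged on big endian and little endian hosts and produces the same bytes on both.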
 
JohnQ

"I've never heard it called anything:). It just is. (There are
also word addressed machines, where it makes no sense to speak
of "endian".)"

Well if saying "endian" suggests to would-be/will-be hardware designers that
there are only two, that would be a good thing. Even a better thing if they
choose to deprecate the less ubiquitous perversions.

John
 
JohnQ

Does your software/application require that much portability? Why struggle
with "write once, compile anywhere" if you're only targeting one platform or
even only one machine, for instance?

"And one version of one compiler with one set of compiler options.

I guess he's a professional."

Seems like masochism rather than professionalism.

[...]

"(Note too that "endianness" isn't a good word, since it suggests
two possible arrangements. At least three are widespread.)"

Perhaps you'd like to update http://en.wikipedia.org/wiki/Endianness. (Yes,
they list 3 endian arrangements).

John
 
