binary file parsing

James Kanze

It can be polished with some utility function templates:
// Name the cast properly, to self-document the code
template<typename T>
inline char* as_buffer(T& d)
{
return static_cast<char*>(static_cast<void*>(&d));
}

I think that there are cases where It still won't work (at least
in some cases, with g++)---even though the standard says it
should. (I think that there is an option in g++ which will make
it work. But since the overall approach is broken anyway,
there's no point.)
// Read binary data from input stream
template <typename T>
inline void read_n(T& d, std::istream& s, std::streamsize const& n)
{
if (!s)
throw std::runtime_error("input stream not readable");
s.read(as_buffer(d), n);
// good idea to check state of s
// one may need to handle endianness and swap bytes
}
Example, read 4 bytes from file stream ifs
int n = 0;
read_n(n, ifs, sizeof(n));

Of course, you won't necessarily read the same value that was
written, unless it was written by the same binary, running on
the same machine.
It should be also possible to specialize read_n function so
it's not necessary to give number of bytes explicitly.

Or just use sizeof in the function.

The only real problem is that the code only works in very
special cases.
 
joshuamaurice

Do you mean the cast is broken or the template?
Could you explain what's broken in as_buffer?



Of course. I've indicated byte order may be an issue.




I'm curious, because I've found it quite portable across various
implementations of C++ on Windows and Unix.

Now, casting hackery is bit of a black art according to the guarantees
we have by the standard.

#include <iostream>
int main()
{ int x = 4;
std::cout.write(reinterpret_cast<char*>(&x), sizeof(x));
}

The above is not clearly guaranteed to work by the standard. There are
many passages alluding to and strongly suggesting that you can read
and write to anything through a char*, but nothing that actually says
the above program does what you want it to.

#include <iostream>
int main()
{ int x = 4;
std::cout.write(static_cast<char*>(static_cast<void*>(&x)), sizeof(x));
}

The above program becomes even more iffy. I think that, by the letter and
intent of the standard, this has undefined behavior. When casting to and
from a void*, you may only cast back to exactly the same type (maybe
with CV qualifier differences). In practice, I believe that
static_cast<char*>(static_cast<void*>(&x))
and
reinterpret_cast<char*>(&x)
are pretty much equivalent, though I don't think there's a passage in
the standard to support that.

Moreover, you should prefer
reinterpret_cast<char*>(&x)
over
static_cast<char*>(static_cast<void*>(&x))
They both have the same effect in practice, but the reinterpret_cast
says more clearly what you're doing. Going through void* is just an
obfuscation, especially when the static_casts are not in the same line
of code.

Other reasons why that code might not work as intended.
Endianness was mentioned.
The int might not be in two's complement. It might be stored as one's
complement or some other more bizarre format.
Your compiler isn't standards compliant. :)

However, I also don't quite follow what James Kanze is saying.

template<typename T>
inline char* as_buffer(T& d)
{ return static_cast<char*>(static_cast<void*>(&d));
}

James Kanze, what versions of g++ in what cases is it "where It still
won't work (at least
 
James Kanze

Do you mean the cast is broken or the template?
Could you explain what's broken in as_buffer?

The fact that it just copies memory, instead of formatting.
Of course. I've indicated byte order may be an issue.

Amongst other things. Size, representation and padding are also
issues.
I'm curious, because I've found it quite portable across
various implementations of C++ on Windows and Unix.

You've just not tested it.
 
James Kanze

[...]
However, I also don't quite follow what James Kanze is saying.
template<typename T>
inline char* as_buffer(T& d)
{ return static_cast<char*>(static_cast<void*>(&d));

}
James Kanze, what versions of g++ in what cases is it "where It still
won't work (at least

I don't know all of the exact details, but roughly speaking,
when g++ sees two pointers to different types, *even* if one of
those types is a character type, it assumes no aliasing. So,
for example, if you are writing through a char*, and reading
through an int*, the optimizer will assume that the write
doesn't change the value referenced by the int*, and may reorder
the read and the write.

As I said, I think that there's an option somewhere to turn this
off, and it really shouldn't affect you unless you are using
higher levels of optimization (I think).
 
joshuamaurice

    [...]
However, I also don't quite follow what James Kanze is saying.
template<typename T>
inline char* as_buffer(T& d)
{   return static_cast<char*>(static_cast<void*>(&d));
}
James Kanze, what versions of g++ in what cases is it "where It still
won't work (at least
in some cases, with g++)---even though the standard says
it should." I'm also very curious as my company's code
does lots of hackery like this.

I don't know all of the exact details, but roughly speaking,
when g++ sees two pointers to different types, *even* if one of
those types is a character type, it assumes no aliasing.  So,
for example, if you are writing through a char*, and reading
through an int*, the optimizer will assume that the write
doesn't change the value referenced by the int*, and may reorder
the read and the write.

As I said, I think that there's an option somewhere to turn this
off, and it really shouldn't affect you unless you are using
higher levels of optimization (I think).

What version(s)? Are you referring to strict aliasing? I've written
tests on this for gcc 3.4.3, and it appears to correctly assume char*
can alias anything, and it optimizes assuming a short* does not alias
a int*. Is the option you refer to -fno-strict-aliasing?
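For reference, a minimal sketch of the pattern in question: write through a char*, then read the same storage as an int. The function name copy_through_char is hypothetical. Since a character type is allowed to alias any object, a conforming compiler should not reorder the final read ahead of the byte writes. This is a sketch of the kind of test meant here, not the actual test code.

```cpp
#include <cstddef>

// Hypothetical test function: copy the object representation of src
// into dst one byte at a time through char pointers, then read dst
// back as an int.
int copy_through_char(int src)
{
    int dst = 0;
    const char* from = reinterpret_cast<const char*>(&src);
    char* to = reinterpret_cast<char*>(&dst);
    for (std::size_t i = 0; i < sizeof(int); ++i)
        to[i] = from[i];
    // If the optimizer wrongly assumed that char* cannot alias int,
    // it could return the initial 0 here instead of the copied value.
    return dst;
}
```

Comparing the result against the argument at several optimization levels is one way to check a given compiler.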
 
James Kanze

On May 25, 8:59 pm, (e-mail address removed) wrote:
James Kanze wrote:
On May 24, 11:04 pm, Mateusz Loskot wrote:
[...]
However, I also don't quite follow what James Kanze is saying.
template<typename T>
inline char* as_buffer(T& d)
{ return static_cast<char*>(static_cast<void*>(&d));
}
James Kanze, what versions of g++ in what cases is it
"where It still won't work (at least in some cases, with
g++)---even though the standard says it should." I'm also
very curious as my company's code does lots of hackery
like this.
I don't know all of the exact details, but roughly speaking,
when g++ sees two pointers to different types, *even* if one
of those types is a character type, it assumes no aliasing.
So, for example, if you are writing through a char*, and
reading through an int*, the optimizer will assume that the
write doesn't change the value referenced by the int*, and
may reorder the read and the write.
As I said, I think that there's an option somewhere to turn
this off, and it really shouldn't affect you unless you are
using higher levels of optimization (I think).
What version(s)? Are you referring to strict aliasing? I've
written tests on this for gcc 3.4.3, and it appears to
correctly assume char* can alias anything, and it optimizes
assuming a short* does not alias a int*. Is the option you
refer to -fno-strict-aliasing?

I think so. As I say, I don't remember all of the details; I
just remember people having problems with it.

In general, it's a difficult issue. Possible aliasing reduces
optimization potential significantly---that's why C99 introduced
restrict. So a compiler rightfully tries to assume as much as
possible. On the other hand, something like:

union S
{
double d ;
int i ;
} ;

int f( double* d, int* i )
{
int result = *i ;
*d = 3.1415 ;
return result ;
}

int
main()
{
S s ;
s.i = 42 ;
f( &s.d, &s.i ) ;
}

is clearly legal, but if f is in a different translation unit
than the union, there's no way the compiler can know that it
can't invert the read and write in f. (I think that there's
even a defect report about this in C. Because such things
pretty much negate what little guarantees there are concerning
aliasing.)
 
joshuamaurice

On the other hand, something like:

    union S
    {
        double d ;
        int i ;
    } ;

    int f( double* d, int* i )
    {
        int result = *i ;
        *d = 3.1415 ;
        return result ;
    }

    int
    main()
    {
        S s ;
        s.i = 42 ;
        f( &s.d, &s.i ) ;
    }

is clearly legal, but if f is in a different translation unit
than the union, there's no way the compiler can know that it
can't invert the read and write in f.  (I think that there's
even a defect report about this in C.  Because such things
pretty much negate what little guarantees there are concerning
aliasing.)

Odd that there is a defect report for the C language when I think your
example technically has undefined behavior in C. A defect report for
gcc maybe? It's my understanding that type punning with unions is
basically supported as an extension to the C89 standard by nearly all
C compilers and not by the C89 standard itself, and I don't know about
C99.

As for C++03, I think it does allow for type punning through a union,
but the wording in the standard is not the most clear.

You're right that this is quite unexpected and kills a lot of
guarantees we have. For example, take:

char getFirstByte(int arg)
{ union { int i; char c; };
i = arg;
return c;
}

can be rewritten equivalently as:

char getFirstByte(int arg)
{ union { int i; char c; };
int * ip = & i;
*ip = arg;
char * cp = & c;
char rv = *cp;
return rv;
}

I expect analogous code transformations when compiling that function's
analogue for user defined types. To implement type punning with a
union, the compiler assumes that for all unions in scope, the types of
members of each union can alias the other types of that union? Good to
know. Thanks.
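For comparison, a minimal sketch of the same operation without a union, reading the first byte through a char pointer, which is the aliasing case already discussed above. The name getFirstByteViaCast is hypothetical.

```cpp
// Hypothetical union-free variant of getFirstByte.
char getFirstByteViaCast(int arg)
{
    // A char pointer may be used to inspect the bytes of any object.
    const char* p = reinterpret_cast<const char*>(&arg);
    // Lowest-addressed byte; whether that is the least or the most
    // significant byte depends on the machine's byte order.
    return p[0];
}
```

The result is still implementation-dependent, but only because of byte order, not because of how union members are accessed.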
 
James Kanze

Odd that there is a defect report for the C language when I
think your example technically has undefined behavior in C.

No it's not. As the standard is currently written, it's
perfectly legal code, with well defined behavior.
A defect report for gcc maybe? It's my understanding that type
punning with unions is basically supported as an extension to
the C89 standard by nearly all C compilers and not by the C89
standard itself, and I don't know about C99.

Type punning with unions is undefined behavior, according to all
versions of both the C and the C++ standard. Any read access
may only refer to the last element written. But there's no type
punning in my example: the code writes S::i, then reads it, and
only then writes S::d. Unless the compiler reorders the reads
and writes in f, assuming no aliasing.
As for C++03, I think it does allow for type punning through a
union, but the wording in the standard is not the most clear.

The wording definitely could be clearer, but I'm pretty sure
that the intent is to do what C does.
You're right that this is quite unexpected and kills a lot of
guarantees we have. For example, take:
char getFirstByte(int arg)
{ union { int i; char c; };
i = arg;
return c;
}

That's undefined behavior. Take a very close look at my
example, and the order of the various accesses.

Note that if I pass f an S*, instead of the int* and the
double*, there's no problem:

int
f( S* s )
{
int result = s->i ;
s->d = 3.1415 ;
return result ;
}

In this case, the aliasing is clearly visible. (Although I've
seen compilers which get this wrong as well. Early versions of
Microsoft C, for example.)
can be rewritten equivalently as:
char getFirstByte(int arg)
{ union { int i; char c; };
int * ip = & i;
*ip = arg;
char * cp = & c;
char rv = *cp;
return rv;
}
I expect analogous code transformations when compiling that
function's analogue for user defined types. To implement type
punning with a union, the compiler assumes that for all unions
in scope, the types of members of each union can alias the
other types of that union? Good to know. Thanks.

But of course, the C and the C++ standards don't approve of
type punning with a union. The "standard sanctioned" way of
type punning is reinterpret_cast. But here, too, unless the
reinterpret_cast is visible in the function where the punning
takes place, I don't think it reasonable to expect it to work.
Regardless of the actual wording in the standard.
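One technique not mentioned in this thread, but commonly used to sidestep the aliasing question altogether, is to copy the bytes with std::memcpy instead of accessing one object through a pointer to another type. A minimal sketch, assuming C++11 and a 64-bit double; the name double_bits is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the object representation of a double
// as a 64-bit unsigned integer. No int-typed lvalue ever refers to
// the double object, so no aliasing assumption is involved.
std::uint64_t double_bits(double d)
{
    static_assert(sizeof(std::uint64_t) == sizeof(double),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

What the returned bits mean still depends on the floating-point format in use.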
 
joshuamaurice

No it's not.  As the standard is currently written, it's
perfectly legal code, with well defined behavior.


Type punning with unions is undefined behavior, according to all
versions of both the C and the C++ standard.  Any read access
may only refer to the last element written.  But there's no type
punning in my example: the code writes S::i, then reads it, and
only then writes S::d.  Unless the compiler reorders the reads
and writes in f, assuming no aliasing.

I apologize. I missed your point. You are indeed correct. That is a
defect in the compiler and the C standard as well. Remind me to use
unions with great caution now, and thoroughly test the results.
 
Giovanni Deretta

I don't know all of the exact details, but roughly speaking,
when g++ sees two pointers to different types, *even* if one of
those types is a character type, it assumes no aliasing.  So,
for example, if you are writing through a char*, and reading
through an int*, the optimizer will assume that the write
doesn't change the value referenced by the int*, and may reorder
the read and the write.

GCC definitely allows char pointers to alias anything, as required by
the standard. The documentation for -fstrict-aliasing explicitly says
that a character type may alias anything.

Anyways, future versions of GCC might have a more conservative TBAA:
every store (except for char) will be considered to change the dynamic
type of the stored location. This would still allow many type-based
optimizations, but would not break some legal usages of unions and
placement new that are sometimes miscompiled by gcc.

See http://gcc.gnu.org/wiki/MemoryModel for details.
As I said, I think that there's an option somewhere to turn this
off, and it really shouldn't affect you unless you are using
higher levels of optimization (I think).

It is enabled by default at -O2.

HTH,
 
James Kanze

The C standards committee does not agree with you.
They ruled that your example breaks the aliasing rules and has
UB.
See http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_236.htm

Yes, but the text they cite to defend that position doesn't say
what they try to make it say. For their argument to hold,
§6.5/7 (in the C standard) would have to say something about
complete types. As it stands, the code above fulfills the
requirements of §6.5/7, since the object s.i is only accessed
through an lvalue expression with type int, and the object s.d
is only accessed through an lvalue expression with type double.

Of course, what they are arguing may very well be what they
meant to say. If so, then the wording in §6.5/7 is defective.
 
James Kanze

The term "perfectly" sounded strange to me too, though I was
trying to figure out if I understand it well or not, so I
pulled out my voice.
Perhaps, James considers legality of this code as follows:
"However, it is an extremely common idiom and is well-supported
by all major compilers"

No. The reason I raised this is because the above example does
NOT work with g++ (at least in some versions, with some
optimization options).

There are practical reasons why it doesn't work, and the authors
of the standard may not have meant that it should work, but as
currently worded, the standard guarantees that it does work (but
most compilers don't).
 
James Kanze

On May 26, 4:15 am, James Kanze wrote:

[...]
I apologize. I missed your point. You are indeed correct. That
is a defect in the compiler and the C standard as well.

I think it is a defect in the standard, and that the compiler is
just implementing what was intended (which is not what is
written).
Remind me to use unions with great caution now, and thoroughly
test the results.

In general, use any aliasing with great caution. If the
aliasing is visible to the compiler (in the function body), I
would expect, from a quality of implementation point of view,
the code to do what is expected. If the aliasing isn't visible
(and most compilers don't look beyond the limits of a single
function), then all bets are off. Regardless of what the
standard actually says.
 
Bart van Ingen Schenau

James said:
Yes, but the text they cite to defend that position doesn't say
what they try to make it say. For their argument to hold,
§6.5/7 (in the C standard) would have to say something about
complete types. As it stands, the code above fulfills the
requirements of §6.5/7, since the object s.i is only accessed
through an lvalue expression with type int, and the object s.d
is only accessed through an lvalue expression with type double.

Of course, what they are arguing may very well be what they
meant to say. If so, then the wording in §6.5/7 is defective.


On re-reading the issue and the committee argumentation, I have to agree
with you.

Bart v Ingen Schenau
 
