Reading little-endian data from a file in a portable manner

K

Keith Thompson

Malcolm McLean said:
So you're saying we have to say

x = -(int) ux;

to negate an unsigned integer?

To negate an unsigned integer, you simply apply the unary "-" operator
to it. The result of negating an unsigned integer is, of course, an
unsigned integer.

I think what you're trying to do is, given an unsigned int value, store
the (mathematical) negative of that value in a signed int. The obvious
way to do that is, as you say:

unsigned int ux = /* ... */;
int x = -(int)ux;

which works for values of ux in the range 0 to INT_MAX.
Which means that INT_MIN will break.

Yes. Let's assume 16-bit int and the common 2's-complement
representation, so:

INT_MIN is -32768
INT_MAX is +32767
UINT_MAX is +65536

If ux == +32768, you want to store the value -32768 in x.

I can't think of a good way to do that without either treating it as
a special case or depending on implementation-defined or undefined
behavior. (int)ux either gives you an implementation-defined result
or raises an implementation-defined signal.

(In practice, a simple "int x = -ux;" is likely to work because of
the implementation-defined behavior that most implementations happen
to have, but it's specifically not guaranteed by the standard.)
I think you've found a bug in the standard.

I don't think I'd call it a bug; it's just something that's difficult
to do, and I'm not sure there's any reasonable way to correct it.
The presence of an extra negative value for most signed integer
representations can be a problem, but it's imposed by the underlying
behavior of most machines and the need for C to cater to a wide
variety of hardware.

Can you think of a way to fix the standard to make this easier?
 
B

Ben Bacarisse

Malcolm McLean said:
So you're saying we have to say

x = -(int) ux;

to negate an unsigned integer?

That does not work either (as Keith has explained). At least it is no
better in this context that taking your chances with the implementation
defined conversion from unsigned to int (i.e. doing what I originally
suggested -- fixing the shift by using unsigned arithmetic but then just
returning the result as in int).
Which means that INT_MIN will break.

I think you've found a bug in the standard.

I don't think so, but do say what you think is the bug.

I believe you were over-thinking the problem. I posted an alternative
some while back that seemed to go unnoticed:

/* get the two bytes into unsigned int ux and then... */
if (ux >= 0x8000u) {
int x = 0xffffu - ux; /* 0 <= x <= 32767 */
return -x - 1;
}
else return ux;

The data being read is taken to be 2's complement but the code will work
on a machine that is not! It works[1] on machines with any of the
permitted number representations because it deals with arithmetic
values.

It breaks when INT_MIN == -INT_MAX but one could test for this (by
reporting an error when ux == 0x8000u).

[1] If anyone is still reading the thread, I'd like to know if I am wrong
about this.
 
T

Tim Rentsch

Ben Bacarisse said:
Malcolm McLean said:
So you're saying we have to say

x = -(int) ux;

to negate an unsigned integer?

That does not work either (as Keith has explained). At least it is no
better in this context that taking your chances with the implementation
defined conversion from unsigned to int (i.e. doing what I originally
suggested -- fixing the shift by using unsigned arithmetic but then just
returning the result as in int).
Which means that INT_MIN will break.

I think you've found a bug in the standard.

I don't think so, but do say what you think is the bug.

I believe you were over-thinking the problem. I posted an alternative
some while back that seemed to go unnoticed:

/* get the two bytes into unsigned int ux and then... */
if (ux >= 0x8000u) {
int x = 0xffffu - ux; /* 0 <= x <= 32767 */
return -x - 1;
}
else return ux;

The data being read is taken to be 2's complement but the code will work
on a machine that is not! It works[1] on machines with any of the
permitted number representations because it deals with arithmetic
values. [snip 1 request to confirm correctness]

I believe you're right, by why not just this:

/* get the two bytes into unsigned int ux and then... */
return (int)(ux & 32767) - (long)(ux & 32768);
 
T

Thad Smith

Ben said:
I believe you were over-thinking the problem. I posted an alternative
some while back that seemed to go unnoticed:

/* get the two bytes into unsigned int ux and then... */
if (ux >= 0x8000u) {
int x = 0xffffu - ux; /* 0 <= x <= 32767 */
return -x - 1;
}
else return ux;

The data being read is taken to be 2's complement but the code will work
on a machine that is not! It works[1] on machines with any of the
permitted number representations because it deals with arithmetic
values.

It breaks when INT_MIN == -INT_MAX but one could test for this (by
reporting an error when ux == 0x8000u).

[1] If anyone is still reading the thread, I'd like to know if I am wrong
about this.

It certainly looks numerically correct to me, regardless of the representation.
The initialization of x is to the mathematical value for 0xffffu minus ux. The
return value is numerically correct, except for the error case: INT_MIN = -32767
and encoded value = -32768.

The statement that "it breaks when INT_MIN == -INT_MAX" is true only when
INT_MIN=-32767. It doesn't break when INT_MIN == -INT_MAX and INT_MAX is larger.
 
M

Malcolm McLean

I don't think so, but do say what you think is the bug.
ux is INT_MIN *-1 or smaller. We want to set int x to -ux.

The natural code

x = -ux;

is no good.

x = -(int) ux;

is no good, because ux could be -INT_MIN;

if(ux == -INT_MIN)

is no good because INT_MIN is of integer type.

if(ux > INT_MAX)
x = INT_MIN;
else
x = - (int) ux;
is a fix, but not really acceptable.



However my compiler at least doesn't obey your rule

#include <stdio.h>

int main(void)
{
int x;
unsigned short ux = 1;

x = -ux;

printf("%d\n", x);

return 0;
}

Outputs -1, not 0xFFFF, which it would if the type of -ux was
unsigned.
 
M

Malcolm McLean

Malcolm McLean wrote:

But in reality this is never an issue in properly-written code because
you would choose your types according to the values you need them to
represent.
But it is an issue for the perfectly straightforwards problem of
reading a binary integer for a file.

Of course the natural code

x = fgetc(fp) | (fgetc(fp) << 8);
if(x & 0x8000)
x |= -1 << 16;

is perfectly adequate most of the time. But it is allowed to break
because the bitshifting overflows the value of an int.
So we have to resort to storing the bit pattern in a temporary
unsigned.

Then the unsigned has to be converted back into a signed integer,
which creartes more problems, because x = -ux is illegal.

At every step there are subtle problems, which is totally
unacceptable. Of course the problems are there for a reason, removing
them from the standard is quite another challenge.
 
B

Ben Bacarisse

Malcolm McLean said:
ux is INT_MIN *-1 or smaller. We want to set int x to -ux.

The natural code

x = -ux;

is no good.

x = -(int) ux;

is no good, because ux could be -INT_MIN;

if(ux == -INT_MIN)

is no good because INT_MIN is of integer type.

if(ux > INT_MAX)
x = INT_MIN;
else
x = - (int) ux;
is a fix, but not really acceptable.

I was wondering what part of the standard you'd change to make it fit
with your view of how arithmetic should be done.
However my compiler at least doesn't obey your rule

#include <stdio.h>

int main(void)
{
int x;
unsigned short ux = 1;

x = -ux;

printf("%d\n", x);

return 0;
}

Outputs -1, not 0xFFFF, which it would if the type of -ux was
unsigned.

That does not show what you want it to show. For one thing, you've
changed the type of ux so unless you are using a very old machine, -ux
will be of type int. Secondly, even if you correct this (or compile it
for a machine where unsigned short does not promote to int) you'd still,
most likely, get -1. You'd get -1 because that is what the
implementation defined conversion from an unsigned expression like -ux
to int will usually give you.

If your compiler is good at diagnostics, you can do this:

unsigned int ui = 1;
if (-ui < 0) abort();

and you might get told that the comparison can never be true.
Alternatively, just test some example that you think the compiler get
wrong:

unsigned int ui = 1;
if (-ui > 0) puts("Ah! -ui is indeed unsigned.");

If your compiler is getting it wrong, please name it. I'd count it as a
major bug and one that people should be aware of.

By the way, your use of 0xFFFF clouds the issue. You can't ever get
0xFFFF as the output form a %d format.
 
B

Ben Bacarisse

Tim Rentsch said:
Ben Bacarisse <[email protected]> writes:
I believe you're right, by why not just this:

/* get the two bytes into unsigned int ux and then... */
return (int)(ux & 32767) - (long)(ux & 32768);

I'd prefer not to rely on a longer type being available, especially
given the context. This was for getting a 16-bit signed value from a
machine independent format. My code extends naturally to other sizes
without relying on a signed type longer than the on being read.
 
T

Tim Rentsch

Ben Bacarisse said:
I'd prefer not to rely on a longer type being available, especially
given the context. This was for getting a 16-bit signed value from a
machine independent format. My code extends naturally to other sizes
without relying on a signed type longer than the on being read.

Easily done --

return (int)(ux & 32767) - (int)(ux/2 & 16384) - (int)(ux/2 & 16384);
 
B

Ben Bacarisse

Malcolm McLean said:
But it is an issue for the perfectly straightforwards problem of
reading a binary integer for a file.

Of course the natural code

x = fgetc(fp) | (fgetc(fp) << 8);
if(x & 0x8000)
x |= -1 << 16;

That is only natural to you because you are focusing on the
representation not the value. Even if the shifts did not have the
problems that you are now aware of your "natural code" tries to
construct a 2's complement representation rather than trying to
construct the value you want. The code I posted concentrates on the
value of desired result and is, to my mind, more natural.
is perfectly adequate most of the time. But it is allowed to break
because the bitshifting overflows the value of an int.
So we have to resort to storing the bit pattern in a temporary
unsigned.

There is no need to resort to a temporary unsigned. That was the way to
fix your code but that does not mean that it the best (or even the
right) way to tackle the problem. See below for an example.
Then the unsigned has to be converted back into a signed integer,
which creartes more problems, because x = -ux is illegal.

It's not illegal -- it is a perfectly valid C expression! Its
behaviour, for some values of ux, is implementation defined but that is
a long way from illegal.
At every step there are subtle problems, which is totally
unacceptable.

There seemed to be lots of steps because you headed off in the wrong
direction. My corrections prompted you to try to fix the plan but the
real solution is to have another plan. For example, starting from
scratch one might write:

int fget16_2(FILE *fp)
{
int a = fgetc(fp);
int b = fgetc(fp);
if (a == EOF || b == EOF) error("EOF");

if (b >= 128)
return 32767 - b * 256 - a;
else return b * 256 + a;
}

No unsigned arithmetic, no extra temporaries (unsigned or otherwise),
and I would contend that the aren't any subtle problems either but to
some extent that is a matter of definition.

[Note that what looks like a rather odd expression is just the
simplified version of a textbook definition of 2'c complement:

-((b - 128) * 256 + a) - 1

The 2's complement is one less than the 1's complement and since when
b >= 128, (b-128)*256 + 1 is positive, the 1's complement is just
negation. I could have written ~((b - 128) * 256 + a) - 1 but that is
harder to simplify. The reference to 2's complement comes from the fact
that this has been picked as the external representation -- the code
does not rely on C's representation of its types.]
Of course the problems are there for a reason, removing
them from the standard is quite another challenge.

I'm not convinced there is any need to fix anything.

[My apologies if all this has been said already. I got delayed mid post
but I'd typed too much by then to abandon the reply before finishing.]
 
M

Malcolm McLean

int fget16_2(FILE *fp)
{
     int a = fgetc(fp);
     int b = fgetc(fp);
     if (a == EOF || b == EOF) error("EOF");

     if (b >= 128)
          return 32767 - b * 256 - a;
     else return b * 256 + a;

}

No unsigned arithmetic, no extra temporaries (unsigned or otherwise),
and I would contend that the aren't any subtle problems either but to
some extent that is a matter of definition.
but if b << 8 is illegal, won't b * 256 also overflow?
 
B

Ben Bacarisse

Tim Rentsch said:
Easily done --

return (int)(ux & 32767) - (int)(ux/2 & 16384) - (int)(ux/2 & 16384);

Ha! I think my if clearer!

Anyway, as I've said elsewhere, by concentrating on the problems of
Malcolm's code we've got sucked into his plan. I think it is probably
best to do it from the two (int) bytes directly:

if (b >= 128)
return 32767 - b * 256 - a;
else return b * 256 + a;

Obviously there are non-conditional versions of this too, but unless you
have a particularly neat one up your sleeve, I'd want to keep the two
cases separate for clarity.
 
B

Ben Bacarisse

Malcolm McLean said:
but if b << 8 is illegal, won't b * 256 also overflow?

Oh rats, I should have left it un-simplified. I started with:

return -((b - 128) * 256 + a) - 1;

as the "natural" value and multiplied it out, mistakenly. This may
count as one of your totally unacceptable subtleties, but I'd just say
it's a common mistake. One does need to take care when re-organising an
arithmetic expression in C.
 
B

Ben Bacarisse

Ben Bacarisse said:
Anyway, as I've said elsewhere, by concentrating on the problems of
Malcolm's code we've got sucked into his plan. I think it is probably
best to do it from the two (int) bytes directly:

Correction thanks so Malcolm:

| if (b >= 128)
| return -((b - 128) * 256 + a) - 1;
| else return b * 256 + a;

<snip>
 
T

Tim Rentsch

Ben Bacarisse said:
Ha! I think my if clearer!

That doesn't surprise me, but I think that's largely a function of
how you think about the representation. More below...
Anyway, as I've said elsewhere, by concentrating on the problems of
Malcolm's code we've got sucked into his plan. [snip alternative]

I don't exactly agree here. That might or might not be a good way
of approaching the original problem, but the question of how to
turn an unsigned value holding a two's complement representation
into the corresponding signed value is interesting and useful in
and of itself.
Obviously there are non-conditional versions of this too, but unless you
have a particularly neat one up your sleeve, I'd want to keep the two
cases separate for clarity.

Often I have the opposite reaction. Dividing into cases using 'if'
(or ?:, which is similar) typically makes code executionally simpler
but frequently makes it harder to understand whether it works. In
the particular case here, I have to work through the different cases
(especially the "negative" case with all of its subtractions) with
some amount of care, and I'm still not as sure as I would like to be
that it works in all cases. And based on your comments it seems like
that might be true for you also. The two non-conditional versions,
on the other hand, can be seen to be correct immediately from the
definition of two's complement representation in 6.2.6.2 p2.
 
T

Tim Rentsch

Ben Bacarisse said:
Correction thanks so Malcolm:

| if (b >= 128)
| return -((b - 128) * 256 + a) - 1;
| else return b * 256 + a;

<snip>

If you'd just stuck to using unsigned's you wouldn't
have had these problems. :)
 
B

Ben Bacarisse

Snipped to leave just the alternates under consideration:

Tim Rentsch said:
[...] I'd want to keep the two cases separate for clarity.

Often I have the opposite reaction. Dividing into cases using 'if'
(or ?:, which is similar) typically makes code executionally simpler
but frequently makes it harder to understand whether it works. In
the particular case here, I have to work through the different cases
(especially the "negative" case with all of its subtractions) with
some amount of care, and I'm still not as sure as I would like to be
that it works in all cases. And based on your comments it seems like
that might be true for you also. The two non-conditional versions,
on the other hand, can be seen to be correct immediately from the
definition of two's complement representation in 6.2.6.2 p2.

Well, I can see your point and I often find the same. Part of the
reason I don't mind splitting the cases is that one of them needs no
checking. If both branches had equal complexity then the argument for
one form would be even stronger (note "even stronger"!).

Looking at yours for a few moments makes me quite happy with it, but I
can't compare that to how long it would take to get sure of mine because
writing something is so different to reading it. What's more, having
seen the form with the cast to long, the double subtraction need less
time to comprehend.

I'm surprised you are not yet sure of it. My argument (for the
"negative" case) would be that it clearly implements a linear equation
with a multiplier of -1 (a change of +1 in ux makes the result 1
smaller). Given that, and the fact that it covers the right range of
input values, all you need to do is test one boundary case (ux == 0xffff
or ux == 0 are both as easy to verify). Both forms need a quick check
to be sure nothing overflows and I'll grant that this is a little
quicker for your form.
 
T

Tim Rentsch

Ben Bacarisse said:
Snipped to leave just the alternates under consideration:

Tim Rentsch said:
Ben Bacarisse said:
Ben Bacarisse <[email protected]> writes:
if (ux >= 0x8000u) {
int x = 0xffffu - ux; /* 0 <= x <= 32767 */
return -x - 1;
}
else return ux;
/* get the two bytes into unsigned int ux and then... */
return (int)(ux & 32767) - (long)(ux & 32768);
return (int)(ux & 32767) - (int)(ux/2 & 16384) - (int)(ux/2 & 16384);
[...] I'd want to keep the two cases separate for clarity.

Often I have the opposite reaction. Dividing into cases using 'if'
(or ?:, which is similar) typically makes code executionally simpler
but frequently makes it harder to understand whether it works. In
the particular case here, I have to work through the different cases
(especially the "negative" case with all of its subtractions) with
some amount of care, and I'm still not as sure as I would like to be
that it works in all cases. And based on your comments it seems like
that might be true for you also. The two non-conditional versions,
on the other hand, can be seen to be correct immediately from the
definition of two's complement representation in 6.2.6.2 p2.

Well, I can see your point and I often find the same. Part of the
reason I don't mind splitting the cases is that one of them needs no
checking.

Granted, one case is easy, but that's not the end of it,
because there is still the check to be sure which cases
get to which branches.
If both branches had equal complexity then the argument for
one form would be even stronger (note "even stronger"!).

Looking at yours for a few moments makes me quite happy with it, but I
can't compare that to how long it would take to get sure of mine because
writing something is so different to reading it. What's more, having
seen the form with the cast to long, the double subtraction need less
time to comprehend.

Definitely. The double subtraction merits an explanatory
comment (for which simply giving the one-subtraction form
might suffice).

I'm surprised you are not yet sure of it.

My exact comment is that I'm not as sure as I would like to
be, and that's as much a comment about how my brain works
as anything else.
My argument (for the
"negative" case) would be that it clearly implements a linear equation
with a multiplier of -1 (a change of +1 in ux makes the result 1
smaller).

I would say a change of +1 in ux makes the result 1 /larger/,
but I suppose you're talking about absolute value rather than
value. In fact it was this "double negation" aspect that made it
confusing for me and prompted me to look for another solution.
Given that, and the fact that it covers the right range of
input values, all you need to do is test one boundary case (ux == 0xffff
or ux == 0 are both as easy to verify).

Yes, I can check and make sure it works. The way it's coded
makes we worry I've missed a boundary case or an off-by-one
error -- I can go through and check that it works, but it's
hard to see easily that it works correctly in all cases.
Again that's more a statement about my subjective perception,
not an absolute of objective one.

Both forms need a quick check
to be sure nothing overflows and I'll grant that this is a little
quicker for your form.

Thank you for that. To return the compliment I should
add that if someone had asked me cold to code a solution
to this conversion I probably would have written something
very much like what you did. Having the benefit of seeing
your posting prompted me to wonder what might be better.
So you also deserve some of the credit for that. :)
 
M

Marcin Grzegorczyk

<delurk>

Tim said:
Ben Bacarisse said:
Tim Rentsch said:
<snip>
/* get the two bytes into unsigned int ux and then... */
if (ux>= 0x8000u) {
int x = 0xffffu - ux; /* 0<= x<= 32767 */
return -x - 1;
}
else return ux;
<snip>
I believe you're right, by why not just this:

/* get the two bytes into unsigned int ux and then... */
return (int)(ux& 32767) - (long)(ux& 32768);

I'd prefer not to rely on a longer type being available, especially
given the context. This was for getting a 16-bit signed value from a
machine independent format. My code extends naturally to other sizes
without relying on a signed type longer than the on being read.

Easily done --

return (int)(ux& 32767) - (int)(ux/2& 16384) - (int)(ux/2& 16384);
[snip]
Obviously there are non-conditional versions of this too, but unless you
have a particularly neat one up your sleeve, I'd want to keep the two
cases separate for clarity.

Often I have the opposite reaction. Dividing into cases using 'if'
(or ?:, which is similar) typically makes code executionally simpler
but frequently makes it harder to understand whether it works. [...]

Well, here's one I've thought up (incidentally, just a few days before
the stuff above was posted):

if (ux & 0x8000) return (int)(ux - 0x8000) - 0x7FFF - 1;
else return (int)ux;

Or, if you prefer a single-expression version:

return ux & 0x8000 ? (int)(ux - 0x8000) - 0x7FFF - 1 : (int)ux;

Personally, I find this the best of both worlds:

* It's pretty easy to understand. Conceptually, converting an N-bit
value from unsigned to two's complement involves a subtraction of the
value 2**N if the sign bit is set; for N=16, that means a subtraction of
0x10000. We split the subtraction such that the actual cast from
unsigned to signed int always operates on a value in the range of signed
int. (Of course, the code assumes (ux <= 0xFFFF) in the first place,
and, like all the examples above, has undefined behaviour if (INT_MIN ==
-32767).)

* The compiler has a much better chance to optimize it. In fact, on a
typical two's complement implementation with a silent wraparound, any
half-decent (or even quarter-decent) compiler should be able to coalesce
the multiple constant subtractions; and if int is 16 bits wide, or we
make the return type explicitly int16_t (and perhaps adjust the casts to
placate picky compilers), any optimizer should notice that the
subtraction does effectively nothing, and thus both branches of the
conditional give the same result.

I'd like to know what other people think about my constructs, though. :)
 
B

Ben Bacarisse

Marcin Grzegorczyk said:
<delurk>
Welcome.
Well, here's one I've thought up (incidentally, just a few days before
the stuff above was posted):

if (ux & 0x8000) return (int)(ux - 0x8000) - 0x7FFF - 1;
else return (int)ux;

Or, if you prefer a single-expression version:

return ux & 0x8000 ? (int)(ux - 0x8000) - 0x7FFF - 1 : (int)ux;

I like it -- I like them all but, on reflection, I like my own the
least. However...

* The compiler has a much better chance to optimize it. In fact, on a
typical two's complement implementation with a silent wraparound, any
half-decent (or even quarter-decent) compiler should be able to
coalesce the multiple constant subtractions; and if int is 16 bits
wide, or we make the return type explicitly int16_t (and perhaps
adjust the casts to placate picky compilers), any optimizer should
notice that the subtraction does effectively nothing, and thus both
branches of the conditional give the same result.

I'd like to know what other people think about my constructs, though. :)

I has not thought of optimisation and it's a good point. Interestingly
gcc will optimise both mine, yours and Tim's first version to a register
move and a return. I thought mine would be the worst in this respect,
but it seems not. Tim's second version (the one that does not need a
wider signed type) was too clever for gcc.

This holds both for short (16-bit) and int (32-bit) versions.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,085
Messages
2,570,597
Members
47,218
Latest member
GracieDebo

Latest Threads

Top