... I was just brushing over that
section of the standard and am quite interested in your comments
regarding the following
citation :
7.19.2 Streams
3 A binary stream is an ordered sequence of characters that can
transparently record internal data. Data read in from a binary stream
shall compare equal to the data that were earlier written out to that
stream, under the same implementation. Such a stream may, however, have
an *implementation-defined number of null characters* appended to the
end of the stream.
Does this mean that a strictly conforming program is unable to
create or use a binary stream, because if it assumes a certain number
of null characters at the end of the stream, that number may change
from implementation to implementation? Of course fseek can always be
used .... but we cannot rely on the number of nulls at the end of the
stream?
A strictly conforming program can still create or use a binary
stream, provided it makes no assumptions about such null bytes.
This restriction comes from real-world operating systems, including
both VMS and MS-DOS. On these systems, some files come in "fixed
length record" format, typically an exact multiple of 512 bytes
long[%]. If you write 510 bytes to such a binary file, then close
it, then open it and read it back, you will see that the file
contains 512 bytes. The last two, as long as the file is both
written and then read back in C, must be '\0' (in other languages
they could conceivably have other values). Write 513 bytes, and
this file will be 1024 bytes long, the last 511 of them being '\0'.
-----
[%] The fixed record size is not necessarily 512 -- 1024 and
2048 are not terribly uncommon, and 256 and 128 did occur;
and even non-powers-of-two are possible sometimes. But 512
is by far the most common number.
-----
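The arithmetic above can be sketched in a few lines of C. This is only
an illustration: padded_size() is a hypothetical helper name, and the
record size is passed in because, as the footnote says, 512 is merely
the most common value.

```c
#include <assert.h>

/*
 * Hypothetical helper: the size a file appears to have after `len`
 * bytes are written to a file system that stores fixed-length
 * records of `record_size` bytes each.
 */
static unsigned long padded_size(unsigned long len, unsigned long record_size)
{
    unsigned long records;

    assert(record_size != 0);
    /* Round up to a whole number of records. */
    records = len / record_size + (len % record_size != 0);
    return records * record_size;
}
```

With a record size of 512, padded_size(510, 512) is 512 and
padded_size(513, 512) is 1024, leaving 2 and 511 bytes of '\0'
padding respectively.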
On some of these file systems, text files are (or were once) also
exact multiples of 512 bytes long. To tell where a text file should
*appear* to end, the I/O library appends a special "end-of-file"
marker character, often a control-Z ('\032') byte. Reading any
file -- including a binary file -- "as if" it were text then returns
EOF when the marker is encountered. Contrariwise, reading a text
file in binary allows you to "read past the end". (VMS's RMS is
too smart to allow you to read a file in the wrong mode by mistake.
MS-DOS is pretty stupid and will let you do whatever you like.
It inherited this from CP/M and QDOS, from which it was cloned.
Current DOS file system formats as used on Windows do specify file
sizes in bytes, but the now-pointless control-Z protocol is often
retained anyway.)
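To make the marker convention concrete, here is a rough sketch of how
a text-mode reader might decide where a file "appears" to end. The
function name apparent_text_length() and the EOF_MARKER macro are
made up for this illustration, and the marker value assumes the ASCII
control-Z used by CP/M and MS-DOS; a real I/O library does this
internally and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EOF_MARKER 0x1A /* control-Z in ASCII; assumed marker value */

/*
 * Hypothetical helper: given the raw bytes of a file as seen in
 * binary mode, return how many of them a text-mode reader would
 * deliver before reporting EOF.
 */
static size_t apparent_text_length(const unsigned char *buf, size_t n)
{
    const unsigned char *p;

    assert(buf != NULL);
    p = memchr(buf, EOF_MARKER, n);
    return p != NULL ? (size_t)(p - buf) : n;
}
```

Everything after the first marker, including any padding up to the
end of the last record, is invisible in text mode but still readable
in binary mode.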
As a quick (if not terribly efficient) illustration of how one can
deal with '\0' padding in binary files, imagine that your C code
to write a block of binary data to a binary file is coded as:
void write_a_block(unsigned char *data, size_t len) {
    size_t written;

    written = fwrite(data, 1, len, output_file);
    if (written != len)
        ... handle output error ...
}
If you fopen some file for "wb", call write_a_block several times,
then fclose the file, the file may have '\0' bytes appended, so
there is no way to read the file back without possibly getting
those extra '\0' bytes. But suppose we change write_a_block() to
read:
void write_a_block(unsigned char *data, size_t len) {
    size_t written;
    unsigned long l;
    unsigned char b[4];

    if (len == 0)
        ... handle "asked to write 0 bytes" error ...
    if (len > 0xffffffffU)
        ... handle block-too-big error ...
    l = len; /* we use l in case size_t is < 32 bits long */
    b[0] = l >> 24; /* mask not necessary: we know l <= 0xffffffff */
    b[1] = (l >> 16) & 0xff;
    b[2] = (l >> 8) & 0xff;
    b[3] = l & 0xff;
    written = fwrite(b, 1, 4, output_file);
    if (written != 4)
        ... handle output error ...
    written = fwrite(data, 1, len, output_file);
    if (written != len)
        ... handle output error ...
}
Now each block in the file is prefixed with a (nonzero) big-endian
count of the number of bytes in the block. To read the file back,
first read four "unsigned char"s, then assemble the desired length:
    nread = fread(b, 1, 4, input_file);
    if (nread != 4)
        ... handle input eof/error ...
    l = ((unsigned long)b[0] << 24) +
        ((unsigned long)b[1] << 16) +
        ((unsigned long)b[2] << 8) +
        (unsigned long)b[3];
    if (l > (size_t)-1)
        ... handle block-too-big error ...
    ... now fread "l" bytes, as before ...
(The "block-too-big" error can occur if the file is written on
system X, where size_t is a 32-bit unsigned value and the block
is, say, 1048576 bytes -- 1 megabyte -- long, but is to be read
back on system Y, where size_t is a 16-bit unsigned value and can
only count to 65535. This test can be omitted if you are certain
that no blocks will be "too big" this way.)
A file that has '\0' bytes appended will cause l to be 0, because
b[0] through b[3] will all be '\0', which is 0. So instead of
fread()ing "l" bytes, we can check for this. Of course, we must
also handle the condition where nread < 4 (or even is 0) if
this is the last block, so the above code has to read more like
this:
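    size_t i, nread;
    unsigned long l;
    unsigned char b[4];

    nread = fread(b, 1, 4, input_file);
    /* skip over any leading '\0' bytes in what we managed to read */
    for (i = 0; i < nread && b[i] == '\0'; i++)
        continue;
    if (i == nread) /* nothing read, or nothing but '\0' padding */
        ... handle input EOF ...
    if (nread < 4)
        ... handle input error ...
    /* build and handle l as before */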
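Putting the reading steps together, a complete reader might look
something like the sketch below. The function name read_a_block(),
its return-value convention, and the caller-supplied buffer are all
choices made for this example, not part of the code above; it also
treats a block larger than the buffer as an error rather than trying
to grow the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Sketch of a reader for the length-prefixed format described above.
 * Returns 1 if a block was read (its size is stored in *lenp),
 * 0 at end of file (including a tail of '\0' padding), -1 on error.
 */
static int read_a_block(FILE *fp, unsigned char *buf, size_t bufsize,
                        size_t *lenp)
{
    unsigned char b[4];
    unsigned long l;
    size_t i, nread;

    assert(fp != NULL && buf != NULL && lenp != NULL);

    nread = fread(b, 1, 4, fp);
    for (i = 0; i < nread && b[i] == '\0'; i++)
        continue;
    if (i == nread)
        return 0;       /* no bytes at all, or only '\0' padding */
    if (nread < 4)
        return -1;      /* truncated length prefix */

    l = ((unsigned long)b[0] << 24) +
        ((unsigned long)b[1] << 16) +
        ((unsigned long)b[2] << 8) +
        (unsigned long)b[3];
    if (l > bufsize)
        return -1;      /* block too big for the caller's buffer */

    if (fread(buf, 1, (size_t)l, fp) != l)
        return -1;      /* short read inside a block */
    *lenp = (size_t)l;
    return 1;
}
```

A caller would loop while read_a_block() returns 1. Note that a zero
return does not distinguish a clean EOF from a read error that
delivered no bytes; code that cares can check ferror() on the stream
afterwards.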
Any number of other binary file formats can be designed. To make
it possible to use such files on any hosted C system, however, the
design must account for the possibility of extra '\0' bytes added
to these files. If anyone ever decides to port your code to an
old mainframe, they will be glad you considered this. 