Writing single bits to a file

R

riva

I am developing a compression program. Is there any way to write a
data to file in the form of bits, like write bit 0 then bit 1 and then
bit 1 and so on ....
 
E

Eric Sosman

riva wrote On 10/26/07 16:00,:
I am developing a compression program. Is there any way to write a
data to file in the form of bits, like write bit 0 then bit 1 and then
bit 1 and so on ....

C's I/O streams use characters, nothing smaller.
You should accumulate bits until you have a full
character's worth, then write the character. At the
end, you may need to provide a few extra bits to fill
up the final character.
 
M

mstorkamp

Well, what if I have just 19 bytes to write that means 2 chars and 3
bits!

Ah, no. That means 19 chars and 0 bits. But what's your point? I
create a 1 byte file on my computer and it take 4,096 bytes on my hard
disk (plus room for file name, creation date, etc.)
 
W

Walter Roberson

Well, what if I have just 19 bytes to write that means 2 chars and 3
bits!

This is really an algorithm matter rather than a C matter, as you
are unlikely to find very many systems at all that allow you to
end files on arbitrary bit boundaries, no matter which language you use.


In the "alphabet" known to the compressor, include an artificial
"end of text" token. When you get to the end of the input source,
encode that token; then follow it with whatever amount of junk
bits you need to reach a character boundary. Your decompressor
will know not to examine the junk bits because it will decode the
end-of-text token and know not to go further.

In practice, you will find that it is often useful in a compression
routine to have a "flush dictionary" token with the same property
of "ignore bits until the next octet boundary". Instead of having
a specific "end of text" token, you can use the flush-dictionary
token and then end the file.

The flush-dictionary token allows you to adapt to changing conditions
in the source data; you could imagine, for example, that the
statistics useful for compressing the introduction to a document
might be quite poor for compressing the technical material in the
main document itself. Also, a flush-dictionary token has the
useful property of allowing you to append the compressed versions
of several files together, and have the result decompress to the
original files appended together -- useful for building an archive
for example.
 
E

Eric Sosman

(When you reply to a Usenet post, please quote enough
context to make your reply meaningful as a stand-alone
message. People don't always follow a discussion serially
from its start, and someone whose first encounter with a
thread is something like -- well, like what you've written
below -- will have a hard time discovering what the topic
is, and are likely to ignore it. Thus, you risk depriving
yourself of their expertise. For those just tuning in, riva
asked how to write individual bits to a stream.)

riva wrote On 10/26/07 16:42,:
Well, what if I have just 19 bytes to write that means 2 chars and 3
bits!

I think you mean "bits," not "bytes." And that's why
I wrote
[...] At the
end, you may need to provide a few extra bits to fill
up the final character.

.... in the part you didn't quote. Is there something in
my sentence that you find confusing?
 
C

cr88192

Eric Sosman said:
riva wrote On 10/26/07 16:00,:

C's I/O streams use characters, nothing smaller.
You should accumulate bits until you have a full
character's worth, then write the character. At the
end, you may need to provide a few extra bits to fill
up the final character.

yes. an example here is to maintain an integer (say, 32 bits), and then a
count of how many bits (left or right, depending on bit ordering) have been
written to this word. adding new bits consists of shifting and or'ing.

whenever the count is >=8, you can write a byte (say, with fputc or
whatever...).
after this is done, you shift the value (left or right), and subtract 8 from
the count.

if one can figure it out, it is also possible to write multi-bit values
(with a 32 bit word, up to 24 bits at a time is safe).

usually then, at the end, one will flush any bits remaining in the word to
the file.


oddly enough, bit ordering also naturally effects the word endianness, where
going from high to low (or left to right) results in big-endian ordering,
and going from low to high (or right to left) results in little-endian.

compression people have seemingly never really come into agreement over this
issue though, which is why many compressors do it one way, and others the
other...

 
D

Don Bruder

riva said:
Well, what if I have just 19 bytes to write that means 2 chars and 3
bits!

(Assuming you actually meant 19 BITS, rather than 19 BYTES...)

Then stuff the three leftover bits into a third byte, pad that byte with
5 bits of your choice (personally, I'd go with zeroes, but that's just
me) and write it.

Considering that every currently-in-use OS uses the concept of "blocks",
"chunks", or "clusters" as the smallest possible unit for disk I/O, with
each of them being made up of some multiple of 128 bytes, it's
incredibly unlikely that "saving" 5 bits is going to be meaningful in
any realistic situation.
 
J

Joe Wright

riva said:
Well, what if I have just 19 bytes to write that means 2 chars and 3
bits!

I assume you mean 19 bits. You will have to write 3 bytes (24 bits) to
get your 19 onto the stream.
 
K

Keith Thompson

Don Bruder said:
(Assuming you actually meant 19 BITS, rather than 19 BYTES...)

Then stuff the three leftover bits into a third byte, pad that byte with
5 bits of your choice (personally, I'd go with zeroes, but that's just
me) and write it.

Considering that every currently-in-use OS uses the concept of "blocks",
"chunks", or "clusters" as the smallest possible unit for disk I/O, with
each of them being made up of some multiple of 128 bytes, it's
incredibly unlikely that "saving" 5 bits is going to be meaningful in
any realistic situation.

Sure, but it's not just a matter of saving space. The size of a file
(i.e., the number of bytes that have been written to it) can be an
important piece of information.

Under many operating systems, the size of a file is *recorded* as an
exact number of bytes, even if the physical size on disk is a multiple
of some larger block size. (However, the C standard doesn't guarantee
this; binary files can be padded at the end with null characters.)
 
T

Tor Rustad

riva said:
I am developing a compression program. Is there any way to write a
data to file in the form of bits, like write bit 0 then bit 1 and then
bit 1 and so on ....

The way this can be done in C compression SW, is if you provide your own
bit level I/O functions. However, in standard C those I/O functions need
to commit no less than chunks of size CHAR_BIT to a file, which is at
least an 8 bit object.

Standard practice is to pad the last character, just like we do in
communication or encryption SW. For details, look up a data compression
book, or ask in a data compression news group.
 
E

Eric Sosman

Keith said:
[...]
Sure, but it's not just a matter of saving space. The size of a file
(i.e., the number of bytes that have been written to it) can be an
important piece of information.

That's really not much help for the O.P.'s situation:
He's working on some kind of compression program, which can
generate a stream of output bits that doesn't "chunk" neatly
into an integral number of bytes. No file system I've ever
heard of can record a file length of 10007 bits. And it
gets worse: Some of the best compressors encode their output
in fractions of bits; if a file length of 10007 bits is bad,
a length of 10006.59029663+ bits is *really* bad!

See Walter Roberson's response.
 
M

Malcolm McLean

Don Bruder said:
Considering that every currently-in-use OS uses the concept of "blocks",
"chunks", or "clusters" as the smallest possible unit for disk I/O, with
each of them being made up of some multiple of 128 bytes, it's
incredibly unlikely that "saving" 5 bits is going to be meaningful in
any realistic situation.
Create thousands of such files, then tar them. See if you can detect a
difference between ten thousand 128 byte files and ten thousand 18-byte
files.
 
M

Martin Wells

riva:
I am developing a compression program. Is there any way to write a
data to file in the form of bits, like write bit 0 then bit 1 and then
bit 1 and so on ....


Maybe something like: (Unchecked code)

typedef struct BitWriterFileInfo {
FILE *pfile;
unsigned cur_byte_val,
i_bit;
} BitWriterFileInfo;

void InitBitWriterFileInfo(BitWriterFileInfo *const p, FILE *const
pfile)
{
p->pfile = pfile;
p->cur_byte_val = 0;
p->i_bit = 0;
}


void WriteBit(BitWriterFileInfo *const pout,int const val)
{
if (val)
{
p->cur_byte_val |= 1u << p->i_bit;
}

if (CHAR_BIT == ++(p->i_bit))
{
P->i_bit = 0;
P->cur_byte_val = 0;

fprintf(pout->pfile....
}
}

You'd need a finaliser routine aswell if (bits_written % CHAR_BIT) is
anything other than zero.

Object-orientated programming would fit here very well by the way.

Martin
 
C

Charlie Gordon

Martin Wells said:
riva:



Maybe something like: (Unchecked code)

Unchecked indeed ;-)
typedef struct BitWriterFileInfo {
FILE *pfile;
unsigned cur_byte_val,
i_bit;
} BitWriterFileInfo;

void InitBitWriterFileInfo(BitWriterFileInfo *const p, FILE *const
pfile)
{
p->pfile = pfile;
p->cur_byte_val = 0;
p->i_bit = 0;
}


void WriteBit(BitWriterFileInfo *const pout,int const val)
{
if (val)
{
p->cur_byte_val |= 1u << p->i_bit;

p should be pout
}

if (CHAR_BIT == ++(p->i_bit))
{
P->i_bit = 0;
P->cur_byte_val = 0;

P should be pout too ;-)

You should write the byte to the stream before clearing it!
fprintf(pout->pfile....

putc(pout->cur_byte_val, pout->pfile);
pout->cur_byte_val = pout->i_bit = 0;
You'd need a finaliser routine aswell if (bits_written % CHAR_BIT) is
anything other than zero.
Yes.

Object-orientated programming would fit here very well by the way.

And getting rid of these extra const qualifiers on the function parameters
would improve readability, making parameter name mixup more obvious. A good
illustration of my remarks elsethread.
 
M

Martin Wells

Chqrlie:
And getting rid of these extra const qualifiers on the function parameters
would improve readability, making parameter name mixup more obvious.


To be honest I'll use const wherever I can, unless it's stupid (and by
stupid I mean makes no difference whatsoever, or that the difference
has no benefits). An example of such a "stupid" case would be casting
to a const type, reason being that it doesn't make a difference
whether an R-value is const or not because you can't modify it anyway.
If I take a pointer as an argument to a function, and if I don't
intend on changing the pointer at all, then it makes sense to me to
declare it as const. Also, I don't find that it detracts form
readibility or that it adds confusion, so I suppose we just have
different opinions on this.

Just an aside. . . in C, the whole BitWriterFileInfo thing would be
implemented as follows:

int main(void)
{
BitWriterFileInfo obj;

InitBigWriterFileInfo(&obj, stdout);

/* Use it */

FinaliseBigWriterFileInfo(&obj);

return 0;
}

, whereas in something like C++, it would be:

int main()
{
BigFileWriter obj;

/* Use it */
}

Usually, I'm mad for the ol' procedural programming and I use it 9
times out of 10, but this is definitely one of the places where I'd
opt for the object orientated.

Martin
 
C

Charlie Gordon

Martin Wells said:
Chqrlie:


To be honest I'll use const wherever I can, unless it's stupid (and by
stupid I mean makes no difference whatsoever, or that the difference
has no benefits). An example of such a "stupid" case would be casting
to a const type, reason being that it doesn't make a difference
whether an R-value is const or not because you can't modify it anyway.
If I take a pointer as an argument to a function, and if I don't
intend on changing the pointer at all, then it makes sense to me to
declare it as const. Also, I don't find that it detracts form
readibility or that it adds confusion, so I suppose we just have
different opinions on this.

Given your choice of field names, you shouldn't be too proud of your coding
conventions, stylistic or otherwise. We do disagree on what you consider
'stupid' as well. IMHO, if the need arises to cast a qualified pointer to a
different type in an expression, needlessly unqualifying it at the same time
*is* stupid.

const qualifying the parameters of a 3 line function is completely useless.
Just like casting strcpy to (void) and casting the result of malloc().
 
M

Martin Wells

Chqrlie:
Given your choice of field names, you shouldn't be too proud of your coding
conventions, stylistic or otherwise.


By "field names", do you mean the names I choose for functions and
variables? What in particular is wrong with them? Please take one of
them as an example and point out the (perceived) flaws to me.

We do disagree on what you consider
'stupid' as well. IMHO, if the need arises to cast a qualified pointer to a
different type in an expression, needlessly unqualifying it at the same time
*is* stupid.


That's not what I mean. I meant the likes of:

int i;
double x;

...

i = (int const)x;

Writint "int const" instead of "int" has no merit. (Let's assume that
the cast was present to suppress a compiler warning). But even if we
take a pointer type, we could have:

int i;

char *p = (char*)&i;
char *p = (char const *)&i; /* This is stupid */

const qualifying the parameters of a 3 line function is completely useless.
Just like casting strcpy to (void) and casting the result of malloc().


I disagree. The merit is that you'll get a compiler error if you try
to alter something, which, you didn't intend to alter in the first
place.

The be-all and end-all of it though is that I do be consistent with
const -- i.e. if I'm don't plan on changing it, then make it const.

Martin
 
C

cr88192

Charlie Gordon said:
Given your choice of field names, you shouldn't be too proud of your
coding conventions, stylistic or otherwise. We do disagree on what you
consider 'stupid' as well. IMHO, if the need arises to cast a qualified
pointer to a different type in an expression, needlessly unqualifying it
at the same time *is* stupid.

'const' is a keyword I don't really know if I have ever really used...

reason: it doesn't really do anything, beyond telling the compiler to
complain to me about stuff I should sanely know already anyways...

now, of course, there are cases where I do want the compiler to complain:
missing prototypes.

reason: this leads to very real and very messy results.
it also helps me maintain general strict modularity conventions (I can't
call any code I did not include a header for, or for detecting functions
which have ceased to exist, which is useful when reorganizing my codebase).
....

note that I make this tolerable, mostly because I tend to use a special tool
to 'autoheader' my source code (though, in my case, in a few places this
does lead to special comments telling the tool to ignore certain functions,
....).

const qualifying the parameters of a 3 line function is completely
useless. Just like casting strcpy to (void) and casting the result of
malloc().

I will disagree with the casting the result of malloc.

main reason:
it is necessary if one wants their code to compile in both C and C++
compilers, since C++ usually raises a big fuss about uncast conversions
to/from 'void *'...

 
C

CBFalconer

cr88192 said:
.... snip ...

I will disagree with the casting the result of malloc.

main reason:
it is necessary if one wants their code to compile in both C and
C++ compilers, since C++ usually raises a big fuss about uncast
conversions to/from 'void *'...

If you write code and try to routinely compile it with both C and
C++ compilers you deserve whatever happens to you. The languages
are _different_. The multiple compilers is not normally useful
anyhow, since you can always compile C in such a manner as to be
linkable to C++, but not the reverse.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,241
Members
46,831
Latest member
RusselWill

Latest Threads

Top