How to read past an end of file character

James Aguilar

Hey all,

I'm working on an encoding scheme where I am running into a problem with
reading a file off a stream. Looking at the binary encoding of the file
(using a simple hex editor), there is no problem, and the whole file is
there. However, when I try to read it from cin, at certain times, cin stops
reading. I cannot force cin to go around the bad character, nor, indeed, do
I know what the bad character is.

I am including code at the bottom, but I do not think that will be helpful.
Does anyone know how to read past an end of file character (supposing that
one comes in on the stream from a text file or some similar source but is
not -actually- the end of the file)? The location of the problem is marked
with *** below.

-JFA1

#include <iostream>
#include <vector>
#include <utility>
#include <algorithm>
#include <map>
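#include <cassert>
#include <climits> // UCHAR_MAX
#include <cstdlib> // exit, EXIT_FAILURE
#include <cstring> // strcmp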

using namespace std;

typedef unsigned char uchar;
typedef unsigned long ulong;
typedef pair<ulong, int> range;
typedef pair<range, uchar> rangewchar;

const ulong lmsbmask = 0x80000000;

int total = UCHAR_MAX;
vector<int> numEnc = vector<int>(256, 1);
map<range, uchar> rangetochar;
map<uchar, range> chartorange;
map<ulong, range> starttorange;

int recalcCount = 0;

int writeCounter;
uchar writeBuf;

int readCounter;
uchar readBuf;

void compress();
void recalculate();
ulong neededSpace(uchar c);

void decompress();
uchar readNextChar();

void flushWriteBuffer();
void writeBits(ulong lng, const int nbits);
void writeBit(bool bit);
bool readBit();

bool tripless(const rangewchar &r1, const rangewchar &r2);
bool tripgreat(const rangewchar &r1, const rangewchar &r2);
bool rless(const range &r1, const range &r2);
bool rgreat(const range &r1, const range &r2);


int main(int argc, char *argv[])
{
if (argc != 2) {
cerr << "Error: incorrect number of command line flags specified. Bailing.\n"
<< "Usage: ARC [-c|-u]\n";
exit(EXIT_FAILURE);
}

if (!strcmp(argv[1], "-c"))
compress();
else if (!strcmp(argv[1], "-u"))
decompress();

if (writeCounter != 0)
flushWriteBuffer();

return 0;
}

void compress()
{
recalculate();
while (cin.peek() != EOF) { // compare with the int EOF; a cast to char breaks the test where char is unsigned
if (recalcCount == (1 << 8)) {
recalculate();
recalcCount = 0;
}
char next;
cin.get(next);

range nextRange((chartorange.find((uchar) next))->second);
writeBits(nextRange.first, nextRange.second);

++recalcCount; ++numEnc[(uchar) next]; ++total;
}
}

void recalculate()
{
vector<rangewchar> totalinfo(UCHAR_MAX);
for (int i = 0; i < UCHAR_MAX; ++i) {
range pr;
totalinfo[i].first.second = neededSpace(i);
totalinfo[i].second = i;
}

sort(totalinfo.begin(), totalinfo.end(), &tripless);

ulong previous = 0;
for (int i = 0; i < UCHAR_MAX; ++i) {
totalinfo[i].first.first = previous;
previous = previous + (lmsbmask >> (totalinfo[i].first.second - 1));
chartorange[totalinfo[i].second] = totalinfo[i].first;
starttorange[totalinfo[i].first.first] = totalinfo[i].first;
rangetochar[totalinfo[i].first] = totalinfo[i].second;
}
}

ulong neededSpace(uchar c)
{
double requiredRange = .5, avgratio = (double) numEnc[(unsigned char) c]
/ (double) total;
int bitsNeeded = 1;

while (requiredRange > avgratio) {
requiredRange /= 2;
++bitsNeeded;
}

return bitsNeeded;
}

void decompress()
{
recalculate();
while (cin.peek() != EOF) { //****The problem seems to happen HERE****
if (recalcCount == (1 << 8)) {
recalculate();
recalcCount = 0;
}

uchar nextchar = readNextChar();
cout << nextchar;

++recalcCount; ++numEnc[nextchar]; ++total;
}
}

uchar readNextChar()
{
int nread(0);
ulong tmp(0);
while (true) {
bool nextbit(readBit());

if (nextbit)
tmp |= (lmsbmask >> nread);

map<ulong, range>::const_iterator it(starttorange.find(tmp));
if (it != starttorange.end()) { //If we find a matching start point
assert(nread <= it->second.second);
//If we've read the right number of bits
if (it->second.second == nread+1)
return rangetochar.find(it->second)->second; //Bingo
}
++nread;
}
}

const uchar clsbmask = 0x01;
const uchar cmsbmask = 0x80;

void writeBits(ulong lng, const int nbits)
{
for (int i = 0; i < nbits; ++i) {
writeBit((lng & lmsbmask) == lmsbmask);
lng <<= 1;
}
}

void writeBit(bool bit)
{
if (writeCounter == 8) {
cout.put(writeBuf);
writeCounter = 0;
writeBuf = 0;
}

writeBuf <<= 1;
if (bit)
writeBuf |= clsbmask;
++writeCounter;
}

void flushWriteBuffer()
{
while (writeCounter!=1) {
writeBit(false);
}
}

bool readBit()
{
if (readCounter == 0) {
readCounter = 8;
cin.get(reinterpret_cast<char &>(readBuf));
}

bool retBit = (readBuf & cmsbmask) == cmsbmask;
readBuf <<= 1;
--readCounter;
return retBit;
}

bool rless(const range &r1, const range &r2)
{
return (r1.second) < (r2.second);
}

bool rgreat(const range &r1, const range &r2)
{
return (r1.second) > (r2.second);
}

bool tripless(const rangewchar &r1, const rangewchar &r2)
{
return (r1.first.second) < (r2.first.second);
}

bool tripgreat(const rangewchar &r1, const rangewchar &r2)
{
return (r1.first.second) > (r2.first.second);
}
 
JH Trauntvein

Based upon your description above, I assume that your program is being
used under Windows. All Microsoft operating systems share a common descent
from that wonderful old OS called CP/M. The CP/M file system did not have
the metadata available to know the exact length of a file, so an
end-of-file character (Ctrl-Z, 0x1A, decimal 26) was used to mark where
the file's contents ended. This unfortunate decision was corrected in
early versions of DOS. The unfortunate reality is that, to this day, a
file being read in "text mode" is still treated as ending at that
character.

The only solution that I can offer is to open your stream using the
flag ios::binary. A side effect of doing this is that "\r\n" sequences
will no longer be collapsed into a single "\n", as they are when the
file is opened in text mode. The plus side is that it gives you, as a
programmer, more precise control over what is read from (or written to)
the file.
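
As a rough illustration of the ios::binary suggestion above, here is a minimal sketch for named files. The helper names writeAllBytes and readAllBytes are made up for this example and are not part of the original program. With the binary flag set on both streams, a 0x1A byte and a "\r\n" pair should come back exactly as written.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: write raw bytes to a file.
// std::ios::binary turns off text-mode translation on platforms that have it.
bool writeAllBytes(const std::string& path, const std::vector<unsigned char>& bytes)
{
    std::ofstream out(path.c_str(), std::ios::out | std::ios::binary);
    if (!out)
        return false; // could not open the file
    if (!bytes.empty())
        out.write(reinterpret_cast<const char*>(&bytes[0]),
                  static_cast<std::streamsize>(bytes.size()));
    out.flush();
    return out.good();
}

// Hypothetical helper: read a whole file as raw bytes.
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> readAllBytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    std::vector<unsigned char> bytes;
    if (in)
        bytes.assign(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
    return bytes;
}
```

This only helps when the program opens the file itself; it does not change the mode of cin or cout, which are already open when main() starts.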

Regards

Jon Trauntvein
 
James Aguilar

This is exactly what I'm looking for. I don't care about \r\n, since my
task is to read and compress arbitrary data. Thanks!

- JFA1
 
James Aguilar

Sorry, one last question: can the method you stated be used with the
standard input and output?

- JFA1
 
Lionel B

Yes - but how to set stdin/stdout to binary mode may be platform/compiler-specific (somebody correct me if there is a
standard C++ way to do this...). Using the MinGW gcc compiler on Win32, putting the following code before main() does
the job (this is in the MinGW FAQ, I think):

#include <fcntl.h> // _O_BINARY
unsigned int _CRT_fmode = _O_BINARY; // MinGW: force stdin/stdout to binary mode
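
For toolchains where the _CRT_fmode trick does not apply, a possible alternative is to switch the already-open standard streams at run time. The sketch below assumes a Windows C runtime that provides _setmode and _fileno (as the Microsoft runtime does); the function name setStdioBinary is invented for this example. On other platforms there is no text/binary distinction to switch, so the function does nothing.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <fcntl.h> // _O_BINARY
#include <io.h>    // _setmode
#endif

// Hypothetical helper: put stdin and stdout into binary mode.
// Returns false if the runtime reports a failure for either stream.
bool setStdioBinary()
{
#ifdef _WIN32
    // _setmode returns -1 on failure, otherwise the previous mode.
    if (_setmode(_fileno(stdin), _O_BINARY) == -1)
        return false;
    if (_setmode(_fileno(stdout), _O_BINARY) == -1)
        return false;
#endif
    return true; // nothing to do where streams are always binary
}
```

It would be called once at the top of main(), before anything is read from cin or written to cout.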

Then the following code will read stdin byte-for-byte into a correctly-sized buffer (you'll want to put some error
checking in this!):

using namespace std;

....

cin.seekg(0,ios::end); // set pos to end of stream
size_t len = (size_t)cin.tellg(); // get position
cin.seekg(0); // set pos back to beginning of stream
char* const buffer = new char[len];
cin.read(buffer,len);

....

Again, this works for me on Win32 with MinGW - not sure about portability.

Regards,
 
