Dos vs Unix style text files

Dave Moore

I realize this is a somewhat platform specific question, but I think it is
still of general enough interest to ask it here ... if I am wrong I guess I
will find out 8*).

As we all know, DOS uses two characters (carriage-return and line-feed), to
signal the end of a line, while UNIX uses only one (line-feed). When using
getline in C++, one can only specify a single character as the terminator
(default is '\n'), so if you read a line of text from a DOS-style text file
into a string, there is still a carriage return on the end of it. This then
causes problems, particularly if I want to later concatenate two strings
read in this way.

Perhaps Windoze-based compilers automatically set things up so that both of
the terminator characters are removed and added as needed, but I am using
g++ on cygwin, and I have to deal with this myself. So, is there a general
technique for dealing with this? I don't really want to have to check the
last character each time I read in a string with getline, and remove it if
it is a carriage-return. Actually, I don't even know how I would do that
offhand .. I guess look up ^CR in an ASCII table and check it using the
octal value? Any help would be appreciated.

TIA,

Dave Moore
 
Victor Bazarov

If you open a file that you know _may_ contain \r, just discard them
from the lines before you process your lines further.
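On the question of how to test for the carriage return: no ASCII table or octal value is needed, since C++ spells it as the escape '\r'. A minimal sketch of the discard step, assuming C++11 or later (for back() and pop_back()). The names strip_trailing_cr and join_lines are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: drop one trailing carriage return, if present.
void strip_trailing_cr(std::string& line)
{
    if (!line.empty() && line.back() == '\r')
        line.pop_back();
}

// Read every line of the given text, strip the trailing CR from each,
// and join the cleaned lines with '|' so the result is easy to inspect.
std::string join_lines(const std::string& text)
{
    std::istringstream in(text);
    std::string line, joined;
    bool first = true;
    while (std::getline(in, line)) {
        strip_trailing_cr(line);
        if (!first)
            joined += '|';
        joined += line;
        first = false;
    }
    return joined;
}
```

Because the check only removes a '\r' that is actually there, the same loop works unchanged on Unix-style input.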

V
 
Noah Roberts

Dave said:
Perhaps Windoze-based compilers automatically set things up so that both of
the terminator characters are removed and added as needed, but I am using
g++ on cygwin, and I have to deal with this myself.

The OS should be doing it. I believe there is hackery with mounting
mode in cygwin.

So, is there a general
technique for dealing with this?

Usually you open your file in text mode. With cygwin I believe that
folder or whatever has to be 'mounted' in text mode as well...or
something of that order. Read docs in cygwin about mounting.
 
E. Robert Tisdale

Dave said:
I realize this is a somewhat platform specific question,
but I think it is still of general enough interest to ask it here.

This is a perfectly valid C++ question.

Dave said:
Perhaps Windoze-based compilers automatically set things up
so that both of the terminator characters are removed and added as needed,
but I am using g++ on cygwin, and I have to deal with this myself.

No! The GNU C++ compiler on cygwin will do this for you too.

Dave said:
So, is there a general technique for dealing with this?

Open the file in text mode. This converts
the carriage-return/line-feed sequence to a line-feed on input and
the line-feed to a carriage-return/line-feed sequence on output.

Dave said:
I don't really want to have to check the last character
each time I read in a string with getline
and remove it if it is a carriage-return.

If you need to see the carriage-return/linefeed sequence
in your program, open the file in binary mode:

std::ifstream input("input_file_name", std::ios::binary);
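To illustrate what binary mode exposes, here is a small sketch that writes two CRLF-terminated lines in binary mode and reads the first one back, also in binary mode. The function name first_line_binary and the scratch file path passed to it are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Write CRLF-terminated lines in binary mode, so the bytes reach the file
// unchanged on every platform, then read the first line back in binary mode.
std::string first_line_binary(const char* path)
{
    {
        std::ofstream out(path, std::ios::binary);
        out << "one\r\ntwo\r\n";
    }   // 'out' is flushed and closed here

    std::ifstream in(path, std::ios::binary);
    std::string line;
    std::getline(in, line);   // stops at '\n'; the '\r' before it is kept
    in.close();
    std::remove(path);        // clean up the scratch file
    return line;
}
```

With no translation taking place, getline still splits on '\n', so the returned string is "one" followed by a carriage return (four characters).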
 
Mike Wahler

Standard C++ defines a single (abstract) type 'char' value which
denotes 'newline' ('\n'). It does not specify its numeric value
or a mapping to a particular character set. The implementation is
responsible for translating between an external 'end-of-line'
indicator and '\n'. (This happens for streams opened in 'text mode'
(the default).)

If a stream is opened in 'binary mode', no such translation occurs
(however, there may still be a conversion from the 'external' to
'internal' [i.e. in-memory] encoding). IOW in 'binary mode',
'newline' has no meaning.

If you're opening your streams in text mode, and your compiler
is failing to do the proper translations to/from '\n', then
it's non-compliant, broken, or not configured correctly.
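A rough sketch of what that translation means in practice: a file written and read back in text mode by the same implementation should round-trip cleanly, whatever the platform stores on disk for end-of-line. The function name round_trip_text and the scratch file path are assumptions for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Write two lines in text mode (the default) and read them back in text
// mode. Each line is wrapped in brackets so stray characters would show up.
std::string round_trip_text(const char* path)
{
    {
        std::ofstream out(path);   // no std::ios::binary: text mode
        out << "one\n" << "two\n";
    }   // 'out' is flushed and closed here

    std::ifstream in(path);        // text mode again
    std::string line, result;
    while (std::getline(in, line))
        result += "[" + line + "]";
    in.close();
    std::remove(path);             // clean up the scratch file
    return result;
}
```

The trouble only starts when the file was produced under a different end-of-line convention than the one the runtime library reading it expects.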

Everything You Ever Wanted To Know About C++ Streams:
http://www.langer.camelot.de/iostreams.html


-Mike
 
Mike Wahler

Mike Wahler said:
If you're opening your streams in text mode, and your compiler
is failing to do the proper translations to/from '\n', then
it's non-compliant, broken, or not configured correctly.

I think I spoke too soon. Rereading your message, I see
you're trying to read a 'foreign' file format. This means
you'll have to manage the translations yourself. Or alternatively
there exist utilities which can convert files between "DOS text"
and "UNIX text" formats. That might make things easier for you.
Check google.
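Command-line tools such as dos2unix do this kind of conversion. If managing it in the program is preferred, a minimal sketch might look like the following. The names dos_to_unix and convert_file are invented for this example, and reading the whole file into memory is a simplifying assumption that suits small files only.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Convert every "\r\n" pair to "\n"; a lone '\r' is left untouched.
std::string dos_to_unix(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
            continue;   // skip the CR; the LF is copied on the next pass
        out += text[i];
    }
    return out;
}

// Convert a whole file. Both streams use binary mode so that the runtime
// library does not apply any end-of-line translation of its own.
bool convert_file(const char* src, const char* dst)
{
    std::ifstream in(src, std::ios::binary);
    if (!in)
        return false;
    std::ostringstream buf;
    buf << in.rdbuf();   // slurp the file contents
    std::ofstream out(dst, std::ios::binary);
    out << dos_to_unix(buf.str());
    return static_cast<bool>(out);
}
```

Calling convert_file("dos.txt", "unix.txt") (both file names hypothetical) would then produce a copy that plain getline can read without leaving a carriage return behind.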

-Mike
 
Mike Wahler

Dave Moore said:
When using
getline in C++, one can only specify a single character as the terminator
(default is '\n'), so if you read a line of text from a DOS-style text file
into a string, there is still a carriage return on the end of it.

That is the case when using a UNIX implementation.

Dave Moore said:
Perhaps Windoze-based compilers automatically set things up so that both of
the terminator characters are removed and added as needed,

Using a Windows implementation, 'end of line' indicators
in the file are automatically translated to '\n' (to which
C++ does not assign a specific value).

Dave Moore said:
I don't really want to have to check the
last character each time I read in a string with getline, and remove it if
it is a carriage-return. Actually, I don't even know how I would do that
offhand ..

#include <fstream>
#include <iostream>
#include <istream>
#include <string>

/*
   Extracts a string from the stream 'is', using
   default terminator '\n', and stores the string
   in 'line'. If the last character of the extracted
   string is equal to 'rem', removes it. Returns a
   reference to 'is'.
*/
std::istream& get_xlate_line(std::istream& is,
                             std::string& line,
                             char rem = '\r')
{
    std::getline(is, line);

    if(!line.empty())
    {
        std::string::iterator e(line.end() - 1);
        if(*e == rem)
            line.erase(e);
    }

    return is;
}

/* extract and output strings from a file */
int main()
{
    std::ifstream ifs("filename");
    std::string line;

    while(get_xlate_line(ifs, line))
        std::cout << line << '\n';

    return 0;
}

-Mike
 
Dave Moore

Noah Roberts said:
The OS should be doing it. I believe there is hackery with mounting
mode in cygwin.

Usually you open your file in text mode. With cygwin I believe that
folder or whatever has to be 'mounted' in text mode as well...or
something of that order. Read docs in cygwin about mounting.

It seems that something a bit different is going on, but your reply led me
in the right direction. I was compiling my executable to use the cygwin
run-time environment (cygwin.dll), rather than the windows environment. I
am pretty sure I set up my cygwin installation to use unix-style text files,
so that might well explain the confusion.

Once I compiled my program to use the windows environment
(using -mno-cygwin, as specified in the cygwin FAQ), everything was groovy.
Thanks for the suggestion!

Dave Moore
 
Ron Natalie

Noah said:
The OS should be doing it. I believe there is hackery with mounting
mode in cygwin.

The OS might do it, but I rarely see that. The expansion is done in the
language runtime library. The most likely source of confusion here is that
the CYGWIN environment has the compiler thinking that no conversion is
needed, but he's giving it files from the DOS world.
 
