Read a line under MS/Unix/Mac

mazwolfe

Someone recently asked about reading lines. I had this code written
some time ago (part of a BASIC-style interpreter based on the one in
H. Schildt's Art of C) to read a file with the lines ended in any format:
Microsoft-style CR/LF pair, Unix-style NL, or Mac-style CR. It also
allows for EOF that does not follow a blank line. I thought this would
make text-file sharing a bit easier.

Here it is:
/* Load a file, normalizing newlines to *nix standard (just NL). */
int load_file(FILE *fp, char *buf, int max_size)
{
    int i = 0;
    char c;

    do {
        c = getc(fp);              /* read the file into memory */
        i++;                       /* keep track of size of file */
        if (c == '\r') {           /* read a CR */
            c = getc(fp);          /* read another character */
            if (c != '\n') {       /* whoops, not an NL (Mac style) */
                *buf++ = '\n';     /* correct, store NL */
                i++;               /* and update size */
            }   /* otherwise, c now holds the NL from the CR/NL pair */
        }       /* c now holds character to put; NL, (CR/)LF, or (new) char */
        *buf++ = c;
    } while (!feof(fp) && i < max_size);
    /* Null terminate the file, check for NL (LF) at end. */
    if (buf[-1] != '\n')           /* if file didn't end in new line */
        *buf++ = '\n', i++;        /* tack it on */
    *buf = '\0';                   /* put null past file */
    fclose(fp);
    return i;                      /* size of file loaded */
}

This allows the file to use a mix of different EOLs. Is that a bad
idea?

-- Marty (I still consider myself a newbie)
 
santosh

Someone recently asked about reading lines. I had this code written
some time ago (part of a BASIC-style interpreter based on the one in
H. Schildt's Art of C) to read a file with the lines ended in any format:
Microsoft-style CR/LF pair, Unix-style NL, or Mac-style CR. It also
allows for EOF that does not follow a blank line. I thought this would
make text-file sharing a bit easier.

I believe the C Standard library is required to present all text streams
as being composed of zero or more lines, each line being terminated by
a newline character. The actual end-of-line marker of the file is
abstracted away.
Here it is:
/* Load a file, normalizing newlines to *nix standard (just NL). */
int load_file(FILE *fp, char *buf, int max_size)
{
    int i = 0;
    char c;

    do {
        c = getc(fp);              /* read the file into memory */

Here is your first problem. getc signals end-of-file or error by
returning EOF, an int value. So you should always assign the return
value of getc to an int and convert it to a char only after making sure
that it is indeed a valid character.
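
The usual idiom looks something like this (a fragment only, a sketch of
the shape rather than a drop-in replacement for the loop below):

int ch;                            /* int, so it can also hold EOF */
while ((ch = getc(fp)) != EOF) {
    *buf++ = (char)ch;             /* known to be a real character here */
}
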
        i++;                       /* keep track of size of file */
        if (c == '\r') {           /* read a CR */
            c = getc(fp);          /* read another character */
            if (c != '\n') {       /* whoops, not an NL (Mac style) */
                *buf++ = '\n';     /* correct, store NL */
                i++;               /* and update size */
            }   /* otherwise, c now holds the NL from the CR/NL pair */
        }       /* c now holds character to put; NL, (CR/)LF, or (new) char */
        *buf++ = c;
    } while (!feof(fp) && i < max_size);
    /* Null terminate the file, check for NL (LF) at end. */
    if (buf[-1] != '\n')           /* if file didn't end in new line */
        *buf++ = '\n', i++;        /* tack it on */

You like obfuscation, don't you? I'd write the two operations above as
separate statements to avoid errors.
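
That is, something like:

if (buf[-1] != '\n') {             /* if the file didn't end in a newline */
    *buf++ = '\n';                 /* tack one on */
    i++;                           /* and count it */
}
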
    *buf = '\0';                   /* put null past file */
    fclose(fp);
    return i;                      /* size of file loaded */
}

This allows the file to use a mix of different EOLs. Is that a bad
idea?

It's taken care of for text files by the Standard library. You only need
to worry when operating on binary files.
 
Flash Gordon

santosh wrote, On 05/11/07 19:58:
I believe the C Standard library is required to present all text streams
as being composed of zero or more lines, each line being terminated by
a newline character. The actual end-of-line marker of the file is
abstracted away.

Only if it is what the implementation considers to be a text stream.
Open an old style Mac text file on a Unix machine and the Unix machine
will not see any new lines.

It's taken care of for text files by the Standard library. You only need
to worry when operating on binary files.

Sometimes you should leave it to the implementation, but sometimes you
have to cope with "text files" from a foreign system that have not been
translated, and then you have to deal with it yourself.
 
David Tiktin

On Tuesday 06 Nov 2007 1:14 am (e-mail address removed) <
(e-mail address removed)> wrote in article
<[email protected]>:

[snip code]
It's taken care of for text files by the Standard library. You
only need to worry when operating on binary files.

Not true. On many (most? all?) Unix systems, when opening a DOS EOL
file (0x0D 0x0A line endings) using "r", not "rb", the 0x0D
characters are *not* removed from the stream when reading. On DOS
systems, they are removed since DOS recognizes the 2 character
sequence as meaningful. Unix systems don't recognize the sequence as
meaningful so they leave the 0x0Ds.
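
For illustration, a minimal sketch of coping with that on the reading
side: strip the stray 0x0D from each line after fgets(). The buffer
size here is an arbitrary choice, and a CR/LF pair split across two
fgets() calls would slip through.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1024];                       /* arbitrary size */
    while (fgets(line, sizeof line, stdin) != NULL) {
        size_t len = strlen(line);
        if (len >= 2 && line[len - 2] == '\r' && line[len - 1] == '\n') {
            line[len - 2] = '\n';          /* drop the stray CR */
            line[len - 1] = '\0';
        }
        fputs(line, stdout);
    }
    return 0;
}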

Dave
 
user923005

Someone recently asked about reading lines. I had this code written
some time ago (part of a BASIC-style interpreter based on the one in H. Schildt's
Aha! Here's your problem
^^^^^^^^^^

http://www.lysator.liu.se/c/schildt.html
http://ma.rtij.nl/acllc-c++.FAQ.html#q6.4
Art of C) to read a file with the lines ended in any format:
Microsoft-style CR/LF pair, Unix-style NL, or Mac-style CR. It also
allows for EOF that does not follow a blank line. I thought this would
make text-file sharing a bit easier.

Here it is:
/* Load a file, normalizing newlines to *nix standard (just NL). */
int load_file(FILE *fp, char *buf, int max_size)
{
    int i = 0;
    char c;

c should definitely be int and not char.
    do {
        c = getc(fp);              /* read the file into memory */

what happens if getc() returned EOF right here? You have no check.
        i++;                       /* keep track of size of file */
        if (c == '\r') {           /* read a CR */
            c = getc(fp);          /* read another character */
            if (c != '\n') {       /* whoops, not an NL (Mac style) */
                *buf++ = '\n';     /* correct, store NL */
                i++;               /* and update size */
            }   /* otherwise, c now holds the NL from the CR/NL pair */
        }       /* c now holds character to put; NL, (CR/)LF, or (new) char */
        *buf++ = c;
    } while (!feof(fp) && i < max_size);
    /* Null terminate the file, check for NL (LF) at end. */
    if (buf[-1] != '\n')           /* if file didn't end in new line */
        *buf++ = '\n', i++;        /* tack it on */
    *buf = '\0';                   /* put null past file */
    fclose(fp);
    return i;                      /* size of file loaded */
}

This allows the file to use a mix of different EOLs. Is that a bad
idea?

Don't forget that Macs use '\r', Unix uses '\n' and Windows systems
use "\r\n".

There are programs called dos2unix and unix2dos that come with source
code and accomplish this (there are several variants as I recall).
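
The core of such a filter fits in a few lines. This sketch is not the
actual source of any of those programs; it drops only the CR of each
CR/LF pair and leaves lone CRs alone:

#include <stdio.h>

int main(void)
{
    int ch, pending_cr = 0;

    while ((ch = getchar()) != EOF) {
        if (pending_cr && ch != '\n')
            putchar('\r');                 /* lone CR: keep it */
        pending_cr = (ch == '\r');
        if (!pending_cr)
            putchar(ch);                   /* CRs are held back one step */
    }
    if (pending_cr)
        putchar('\r');                     /* CR right before EOF */
    return 0;
}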

I guess that the source of an FTP program is probably a lot better,
because it may handle Unix, Windows, OpenVMS, IBM mainframes, and Mac,
which are all different.
 
Eric Sosman

David Tiktin wrote On 11/05/07 15:54:
On Tuesday 06 Nov 2007 1:14 am (e-mail address removed) <
(e-mail address removed)> wrote in article
<[email protected]>:


[snip code]

It's taken care of for text files by the Standard library. You
only need to worry when operating on binary files.


Not true. On many (most? all?) Unix systems, when opening a DOS EOL
file (0x0D 0x0A line endings) using "r", not "rb", the 0x0D
characters are *not* removed from the stream when reading. On DOS
systems, they are removed since DOS recognizes the 2 character
sequence as meaningful. Unix systems don't recognize the sequence as
meaningful so they leave the 0x0Ds.

What this really means is that the transfer between
the two systems was done incorrectly. You don't have a
"DOS text file," you have a "damaged text file."

When a text file is correctly formed according to the
local conventions (whatever they are, and there are odder
things out there than DOS!), the C library "sees" line
endings as single newline characters. If you need to deal
with damaged files, your problems run deeper than just
ignoring the occasional '\r'.
 
David Tiktin

David Tiktin wrote On 11/05/07 15:54:
On Tuesday 06 Nov 2007 1:14 am (e-mail address removed) <
(e-mail address removed)> wrote in article
<[email protected]>:


[snip code]

This allows the file to use a mix of different EOLs. Is that a
bad idea?

It's taken care of for text files by the Standard library. You
only need to worry when operating on binary files.


Not true. On many (most? all?) Unix systems, when opening a DOS
EOL file (0x0D 0x0A line endings) using "r", not "rb", the 0x0D
characters are *not* removed from the stream when reading. On
DOS systems, they are removed since DOS recognizes the 2
character sequence as meaningful. Unix systems don't recognize
the sequence as meaningful so they leave the 0x0Ds.

What this really means is that the transfer between
the two systems was done incorrectly. You don't have a
"DOS text file," you have a "damaged text file."

OK, but I'm not sure what point you're trying to make. Yes, FTP has
text mode transfers for just this reason. But the "damaged text
file" case (as you call it) was exactly the case the posted code was
supposed to deal with. And it's still not true that the Standard C
Library will give you any help with this, right?
When a text file is correctly formed according to the
local conventions (whatever they are, and there are odder
things out there than DOS!), the C library "sees" line
endings as single newline characters. If you need to deal
with damaged files, your problems run deeper than just
ignoring the occasional '\r'.

I store C source files on a Linux server which are mapped to both
Windows systems (via Samba) and other Unix machines (via NFS) so I
can build on different systems from a common code base. I'm not
clear what the "local conventions" are in a case like that! Very
occasionally I have to take into account that the line endings of a
text file I'm reading may not match those expected by the tools I'm
using, by checking for an "extra" 0x0D at the end of a line. But
usually, just saving everything with Unix EOL works fine.

Dave
 
Keith Thompson

santosh said:
I believe the C Standard library is required to present all text streams
as being composed of zero or more lines, each line being terminated by
a newline character. The actual end-of-line marker of the file is
abstracted away.

Yes, if the input file is a text file.

[...]
It's taken care of for text files by the Standard library. You only need
to worry when operating on binary files.

Or when operating on text files copied from a different operating
system, which is a fairly common problem.

If possible, it's usually better to translate the file when copying it
from one system to another, but that's not always possible. Or
rather, it's probably always possible, but it's not always done.

If you assume that the input is in one of those three formats, you can
open it in binary mode and scan for line terminators.
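
As a sketch of that approach (the function name and interface here are
invented for illustration, not taken from any post):

#include <stdio.h>

/* Read one line from a binary-mode stream, treating CR, LF, or CR/LF
   as the terminator. Stores at most max-1 characters plus a '\0';
   returns the count stored, or -1 at end of file. */
long read_any_line(FILE *fp, char *buf, long max)
{
    long n = 0;
    int ch = getc(fp);

    if (ch == EOF)
        return -1;
    while (ch != EOF && ch != '\n' && ch != '\r') {
        if (n < max - 1)
            buf[n++] = (char)ch;
        ch = getc(fp);
    }
    if (ch == '\r') {                      /* CR alone, or CR/LF */
        ch = getc(fp);
        if (ch != '\n' && ch != EOF)
            ungetc(ch, fp);                /* lone CR: push the next byte back */
    }
    buf[n] = '\0';
    return n;
}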

<OT>Note that modern versions of MacOS use Unix-style text files.</OT>
 
CBFalconer

David said:
.... snip ...

Not true. On many (most? all?) Unix systems, when opening a DOS
EOL file (0x0D 0x0A line endings) using "r", not "rb", the 0x0D
characters are *not* removed from the stream when reading. On DOS
systems, they are removed since DOS recognizes the 2 character
sequence as meaningful. Unix systems don't recognize the sequence
as meaningful so they leave the 0x0Ds.

If the file is a text file, simply use the appropriate command to
copy it over to the Unix system. And don't use that command to
copy binary files. All done.

That leaves such things as i/o devices to worry about.
 
Jeffrey Stedfast

Someone recently asked about reading lines. I had this code written
some time ago (part of a BASIC-style interpreter based on the one in
H. Schildt's Art of C) to read a file with the lines ended in any format:
Microsoft-style CR/LF pair, Unix-style NL, or Mac-style CR. It also
allows for EOF that does not follow a blank line. I thought this would
make text-file sharing a bit easier.

Here it is:
/* Load a file, normalizing newlines to *nix standard (just NL). */
int load_file(FILE *fp, char *buf, int max_size)
{
    int i = 0;
    char c;

as others have already noted, 'c' should be an int
    do {
        c = getc(fp);              /* read the file into memory */

you need to check for EOF (again, I believe someone already mentioned this)
        i++;                       /* keep track of size of file */
        if (c == '\r') {           /* read a CR */
            c = getc(fp);          /* read another character */

would need to check EOF here as well.
            if (c != '\n') {       /* whoops, not an NL (Mac style) */
                *buf++ = '\n';     /* correct, store NL */

this could potentially cause an overflow problem because you will append
a second character to 'buf' before your next i < max_size check.
                i++;               /* and update size */
            }   /* otherwise, c now holds the NL from the CR/NL pair */
        }       /* c now holds character to put; NL, (CR/)LF, or (new) char */
        *buf++ = c;
    } while (!feof(fp) && i < max_size);
    /* Null terminate the file, check for NL (LF) at end. */
    if (buf[-1] != '\n')           /* if file didn't end in new line */
        *buf++ = '\n', i++;        /* tack it on */

you neglected to make sure your buffer had enough room to add the '\n'
    *buf = '\0';                   /* put null past file */
    fclose(fp);
    return i;                      /* size of file loaded */
}

This allows the file to use a mix of different EOLs. Is that a bad
idea?

-- Marty (I still consider myself a newbie)

As a suggestion, you might find it easier to use the fgets() function.
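
For instance, a sketch of a loader built on fgets() (load_file2 is a
made-up name; as above, this assumes max_size is the real buffer size,
and it handles LF and CR/LF endings but not bare-CR Mac files, since
fgets() splits only on '\n'):

#include <stdio.h>
#include <string.h>

int load_file2(FILE *fp, char *buf, int max_size)
{
    int used = 0;

    while (used < max_size - 1 &&
           fgets(buf + used, max_size - used, fp) != NULL) {
        int len = (int)strlen(buf + used);

        /* turn a CR/LF ending into a plain LF */
        if (len >= 2 && buf[used + len - 2] == '\r'
                     && buf[used + len - 1] == '\n') {
            buf[used + len - 2] = '\n';
            len--;
        }
        used += len;
    }
    buf[used] = '\0';
    return used;
}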

Jeff
 
Keith Thompson

user923005 said:
in Art of C) to read a file with the lines ended in any format:
Microsoft-style CR/LF pair, Unix-style NL, or Mac-style CR.
[...]
Don't forget that Macs use '\r', Unix uses '\n' and Windows systems
use "\r\n".

I don't think he's forgotten that.

But note that the '\n' encoding in C refers to a "new-line" character,
which isn't necessarily an ASCII LF character. On an old Mac system,
I would guess that '\n' is actually the ASCII CR character (either
that, or stdio translates CR to LF on input).

If you're dealing only with native text files, you don't have to worry
about this; read the file in text mode, and each line will appear to
be terminated with '\n', whatever value '\n' happens to have.

If you need to deal with non-native text files, the best approach is
probably to open the file in binary mode and provide code to handle
all possible non-native formats. For reading DOS/Windows text files
on a Unix-like system, you can probably get away with reading in text
mode and deleting trailing '\r' characters, but that's not a general
solution.

[...]
 
Keith Thompson

CBFalconer said:
If the file is a text file, simply use the appropriate command to
copy it over to the Unix system. And don't use that command to
copy binary files. All done.

Alas, it's not that simple. If you copy individual text files from
one system to another, you can usually invoke the copying command in a
mode that causes it to do the proper translations. But files are often
copied as parts of archives (*.tar.gz, *.tar.bz2, *.zip, etc.).

There's still usually a reasonably good way to do the translations,
but not always. And the system on which I'm typing this has numerous
text files in two different formats (I use Cygwin under Windows XP).

The problem can't always be avoided.
 
Amandil

On Nov 5, 11:44 am, (e-mail address removed) wrote:
Someone recently asked about reading lines. I had this code written

Aha! Here's your problem
^^^^^^^^^^

http://www.lysator.liu.se/c/schildt.html
http://ma.rtij.nl/acllc-c++.FAQ.html#q6.4

I did say 'based': indeed, I had to change many things, including
redundancies.
c should definitely be int and not char.

Point noted. You're right.
what happens if getc() returned EOF right here? You have no check.

I check for EOF later on; it was easier to ignore it for now. If EOF
is returned, the low byte is stored and then overwritten with the '\0',
so nothing 'bad' happens.
        i++;                       /* keep track of size of file */
        if (c == '\r') {           /* read a CR */
            c = getc(fp);          /* read another character */
            if (c != '\n') {       /* whoops, not an NL (Mac style) */
                *buf++ = '\n';     /* correct, store NL */
                i++;               /* and update size */
            }   /* otherwise, c now holds the NL from the CR/NL pair */
        }       /* c now holds character to put; NL, (CR/)LF, or (new) char */
        *buf++ = c;
    } while (!feof(fp) && i < max_size);
    /* Null terminate the file, check for NL (LF) at end. */
    if (buf[-1] != '\n')           /* if file didn't end in new line */
        *buf++ = '\n', i++;        /* tack it on */
    *buf = '\0';                   /* put null past file */
    fclose(fp);
    return i;                      /* size of file loaded */

This allows the file to use a mix of different EOLs. Is that a bad
idea?

Don't forget that Macs use '\r', Unix uses '\n' and Windows systems
use "\r\n".

That's kinda the 'whole point' of this function: to read a file with
EOLs marked in any of those three ways, convert them _internally_ to
Unix-style (because I think that's nicer; I could have used '\0256'
had I wanted), and then run the interpreter on the internal buffer.
There are programs called dos2unix and unix2dos that come with source
code and accomplish this (there are several variants as I recall).

My neighbor has a Mac, my brother has Linux, and my dad has Windows. I
want to copy BASIC programs from each of them, using a floppy disk. I
also don't want to have to modify the original files. (This scenario
is not for real, but I'm sure people have similar situations at
times.)
The program is, as mentioned, a BASIC interpreter, and as such it
heavily uses find_eol(). Any parser that uses find_eol() - say for C++
or C99 // comments - and that can meet DOS/Mac/Unix files at runtime
should be able to make use of this function.
I guess that the source of an FTP program is probably a lot better,
because it may handle Unix, Windows, OpenVMS, IBM mainframes, and Mac,
which are all different.
I am not aware of any EOL markers in use (using ASCII), besides "\xD",
"\xA", and "\xD\xA". Are there any others? Please let me know.

Thanks, and regards.

-- Marty (In pursuit of undomesticated aquatic avians).
 
Amandil

Alas, it's not that simple. If you copy individual text files from
one system to another, you can usually invoke the copying command in a
mode that causes it to do the proper translations. But files are often
copied as parts of archives (*.tar.gz, *.tar.bz2, *.zip, etc.).

There's still usually a reasonably good way to do the translations,
but not always. And the system on which I'm typing this has numerous
text files in two different formats (I use Cygwin under Windows XP).

The problem can't always be avoided.

Thanks for the support. Your point is one of the things that got me
going. I wrote many files using Textpad (http://www.textpad.com), and
hated running vi on Cygwin and getting '^M' at the end of every line.
 
Ben Bacarisse

Someone recently asked about reading lines. I had this code written
some time ago (part of a BASIC-style interpreter based on the one in
H. Schildt's Art of C)

Just one thing that I think has not been commented on yet...
    int i = 0;
    char c;

    do {
        c = getc(fp);              /* read the file into memory */
        i++;                       /* keep track of size of file */
        if (c == '\r') {           /* read a CR */
            c = getc(fp);          /* read another character */
            if (c != '\n') {       /* whoops, not an NL (Mac style) */
                *buf++ = '\n';     /* correct, store NL */
                i++;               /* and update size */
            }   /* otherwise, c now holds the NL from the CR/NL pair */
        }       /* c now holds character to put; NL, (CR/)LF, or (new) char */
        *buf++ = c;
    } while (!feof(fp) && i < max_size);

This loop can end because i == max_size. In that case, buf has been
incremented max_size times (nice and easy to reason about -- there is
an 'i++' for every 'buf++', but personally I'd put them closer
together). It now points just past the end of the buffer (if the
max_size parameter is indeed the size of the buffer).
    /* Null terminate the file, check for NL (LF) at end. */
    if (buf[-1] != '\n')           /* if file didn't end in new line */
        *buf++ = '\n', i++;        /* tack it on */
    *buf = '\0';                   /* put null past file */

*buf is beyond the buffer so this is undefined behaviour. If the last
thing in the buffer is not '\n' the error occurs on the line before.
To be safe, the caller must pass a value in max_size that is two
larger than the buffer size.
    fclose(fp);
    return i;                      /* size of file loaded */
}
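
One way to keep the count, the stores, and the bound in lock-step is to
index the buffer directly and reserve two slots up front, for the
possible added '\n' and for the '\0'. A sketch only (ch and next are
new names here), assuming max_size really is the size of buf and is at
least 2:

int ch, i = 0;
while (i < max_size - 2 && (ch = getc(fp)) != EOF) {
    if (ch == '\r') {                      /* CR or CR/LF: normalize */
        int next = getc(fp);
        if (next != '\n' && next != EOF)
            ungetc(next, fp);
        ch = '\n';
    }
    buf[i++] = (char)ch;
}
if (i == 0 || buf[i - 1] != '\n')
    buf[i++] = '\n';                       /* stays within max_size - 1 */
buf[i] = '\0';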

This allows the file to use a mix of different EOLs. Is that a bad
idea?

In simple cases you can do this but consider:

a\rb\r\nc\nd

is this: (a) Mac:
a
b
\nc\nd

(b) Windows:

a\rb
c\nd

or (c) Unix:

a\rb\r
c
d
?
 
Charlie Gordon

Keith Thompson said:
user923005 said:
in Art of C) to read a file with the lines ended in any format:
Microsoft-style CR/LF pair, Unix-style NL, or Mac-style CR.
[...]
Don't forget that Macs use '\r', Unix uses '\n' and Windows systems
use "\r\n".

I don't think he's forgotten that.

But note that the '\n' encoding in C refers to a "new-line" character,
which isn't necessarily an ASCII LF character. On an old Mac system,
I would guess that '\n' is actually the ASCII CR character (either
that, or stdio translates CR to LF on input).

So reading an untranslated MS/DOS file on a Mac would result in bogus
interpretation of 0x0D/0x0A pairs: does the Mac C runtime translate those
into a single \n? Do Unix native text files get read as a single long
unterminated line? This would happen in both text and binary mode. Is
there even a difference between text and binary mode on older Macs?
 
