reading from a text file

Keith Thompson

Chris Torek said:
In this case, this is just what you want. If you were actually
trying to interpret whole input lines -- as is often the case when
reading input from a human being who is typing commands -- it is
probably not what you want, as the loop might look more like:

while (fgets(buf, sizeof buf, fp) != NULL) {
... code to interpret a command ...
}

and you probably do not want to interpret "he", then "ll", then
"o\n" as three separate commands. In this case you would (a) need
a bigger buffer, and (b) need to double-check to see whether the
human managed to type in an overly long input line despite the
bigger buffer.

And (c) decide what the program should do if the human types in an
overly long input line. There are numerous possibilities: silently
discard the extra characters, print an error message and abort, print
an error message and continue, build up a longer string containing all
the input (probably using realloc()). In a small toy program, you can
get away with ignoring the issue. In the real world, you had better
decide how to handle it, and write and test the code to do it.
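
As a rough illustration of one of those policies (report the problem, discard the extra characters, and carry on), here is a minimal sketch. The name read_command and the LINE_* status codes are made up for this example; they come from no library and from nobody's posted code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes for this sketch. */
enum line_status { LINE_OK, LINE_TOO_LONG, LINE_EOF };

/*
 * Read one line from fp into buf (size bytes, including the '\0').
 * On LINE_OK the trailing newline, if any, has been removed.
 * On LINE_TOO_LONG buf holds the first size-1 characters and the
 * rest of the line has been read and discarded.
 */
enum line_status read_command(char *buf, size_t size, FILE *fp)
{
    char *nl;
    int c;

    assert(buf != NULL && fp != NULL && size > 1);

    if (fgets(buf, (int)size, fp) == NULL)
        return LINE_EOF;            /* end of file or read error */

    nl = strchr(buf, '\n');
    if (nl != NULL) {
        *nl = '\0';                 /* complete line: strip the newline */
        return LINE_OK;
    }

    /* No newline was stored. Look at the next character to find out why. */
    c = getc(fp);
    if (c == EOF || c == '\n')
        return LINE_OK;             /* last line, or an exact fit */

    /* The line really was too long: discard the remainder. */
    while (c != '\n' && c != EOF)
        c = getc(fp);
    return LINE_TOO_LONG;
}
```

A caller would loop on read_command(), interpret the command on LINE_OK, print a complaint on LINE_TOO_LONG, and stop on LINE_EOF.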

BTW, there are a number of implementations floating around of
functions that read an input line of arbitrary length into a
dynamically allocated buffer.
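
For a feel of what such a function looks like, here is a minimal sketch built on fgets() and realloc(). The name read_line and the starting capacity of 64 are arbitrary choices for this example, and it is not any particular published implementation. It does not try to cope with embedded null characters or with lines longer than INT_MAX.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Read one line of any length from fp into a malloc'd buffer.
 * Returns NULL at end of file (or on a read error) with nothing
 * read, or if memory runs out. The trailing newline is removed.
 * The caller must free() the result.
 */
char *read_line(FILE *fp)
{
    size_t cap = 64;    /* arbitrary starting capacity */
    size_t len = 0;
    char *line;

    assert(fp != NULL);

    line = malloc(cap);
    if (line == NULL)
        return NULL;

    /* Each fgets() call appends the next chunk at line + len. */
    while (fgets(line + len, (int)(cap - len), fp) != NULL) {
        len += strlen(line + len);
        if (len > 0 && line[len - 1] == '\n') {
            line[len - 1] = '\0';
            return line;
        }
        /* Buffer full without a newline: double it and keep reading. */
        if (len + 1 == cap) {
            char *tmp = realloc(line, cap * 2);
            if (tmp == NULL) {
                free(line);
                return NULL;
            }
            line = tmp;
            cap *= 2;
        }
    }

    if (len == 0) {     /* nothing at all was read */
        free(line);
        return NULL;
    }
    return line;        /* final line with no trailing newline */
}
```

On POSIX systems, getline() does a similar job, but it is not part of standard C.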
 
Jordan Abel

bildad said:
[...]
This is my solution, after research. Criticism welcome.

Not bad, but I do have a few comments.

#include <stdio.h>

#define MAX_LEN 120

Obviously this is arbitrary (as it must be). If you haven't already,
you should think about what happens if the input file contains lines
longer than MAX_LEN characters. Since you're using fgets(), the
answer is that it works anyway, but you should understand why. Read
the documentation for fgets() and work through what happens if an
input line is very long.
K&R2, p.164, 7.7, par.1:

char *fgets(char *line, int maxline, FILE *fp)

"at most maxline-1 characters will be read."

I changed MAX_LEN to test this but it still seemed to work fine. The only
documentation I have is K&R2 and King's C Programming. Am I looking in the
wrong place? I googled "fgets()" and "c programming fgets()" but didn't
find anything relevant (at least to me).

Right. Suppose an input line is 300 characters long. Your call to
fgets() will read 119 characters; the resulting buffer will contain a
valid string terminated by a '\0' character, but it won't contain a
newline. Your call to fputs() or printf() will print this partial
line.

Speaking of fgets(), I've never liked that it doesn't cope with
embedded nulls. The old gets() function from unix v7 did:

I have converted this code to ANSI C to attempt to recover some sense
of on-topicness - it was originally in K&R C and used aspects of a
pre-stdio I/O library. The original can be found at
http://minnie.tuhs.org/UnixTree/V6/usr/source/iolib/gets.c.html
and the applicable license at
http://www.tuhs.org/Archive/Caldera-license.pdf

int gets (char *s) {
    char *p;
    extern FILE *stdin;
    p=s;
    while ((*s = getc(stdin)) != '\n' && *s != '\0')
    /*                                ^^^^^^^^^^^^^ */
        s++;
    if (*p == '\0') return (0);
    *s = '\0';
    return (p);
}

Why is it that feature was removed, anyway?
 
Mark McIntyre

Speaking of fgets(), I've never liked that it doesn't cope with
embedded nulls.

Well, it is designed to read in a string which by definition can't
contain a null, so... Data with embedded nulls isn't text and should
probably be fread instead.
The old gets() function from unix v7 did:

Unless I'm missing something, the code you posted won't read an
embedded null either. It stops as soon as it encounters one, and
returns a string consisting of every character up to the null.
int gets (char *s) {
char *p;
extern FILE *stdin;

For what it's worth, this requires FILE to be defined, so you must
include the appropriate header. Same applies to getc below (7.1.4 p2).
p=s;
while ((*s = getc(stdin)) != '\n' && *s != '\0')

assignment of int to char - possible loss of data - what if getc
returned EOF?
/* ^^^^^^^^^^^^^*/
s++;
if (*p == '\0') return (0);
*s = '\0';
return (p);

This is undefined behaviour since int is not compatible with char*.

You may also want to consider that p is out of scope once gets()
returns, and therefore may point to junk.
Why is it that feature was removed, anyway?

Perhaps because by definition, a string can't contain a null. :)
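
For comparison, here is a minimal sketch of how the review points above (keep the result of getc() in an int so EOF stays distinguishable, and return char * rather than int) might look in ISO C. The name bounded_gets and the size parameter are additions for this sketch, not part of the original code; it keeps the original's behaviour of stopping at a newline or a null character.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read characters from fp into s until a newline, a null character,
 * end of file, or size-1 stored characters, whichever comes first.
 * Returns s, or NULL if end of file is hit before anything is read.
 */
char *bounded_gets(char *s, size_t size, FILE *fp)
{
    size_t i = 0;
    int c = EOF;    /* int, so EOF is distinct from every char value */

    assert(s != NULL && fp != NULL && size > 1);

    while (i + 1 < size
           && (c = getc(fp)) != EOF && c != '\n' && c != '\0')
        s[i++] = (char)c;
    s[i] = '\0';

    if (i == 0 && c == EOF)
        return NULL;
    return s;
}
```

Unlike the original, a line that begins with a null character gives an empty string rather than a null return, since only end of file produces NULL here.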
 
Jordan Abel

Well, it is designed to read in a string which by definition can't
contain a null, so... Data with embedded nulls isn't text and should
probably be fread instead.

But suppose a parity error on the terminal causes a zero byte to be
read from the keyboard? This isn't a perfect solution, but it's
better than continuing to read past the null and the remaining data
being lost
Unless I'm missing something, the code you posted won't read an
embedded null either. It stops as soon as it encounters one, and
returns a string consisting of every character up to the null.

as opposed to modern fgets, which keeps on going and data beyond the
null to the end of the line [or the count] is lost.
For what it's worth, this requires FILE to be defined, so you must
include the appropriate header. Same applies to getc below (7.1.4 p2).

meh - it was an int originally. and cgetc was implicitly declared.
assignment of int to char - possible loss of data - what if getc
returned EOF?

Not my code. Probably Dennis Ritchie's. or Ken Thompson's. Looking
at the other source, I suspect EOF hadn't been invented yet and 0
doubled for the purpose.
This is undefined behaviour since int is not compatible with char*.

it's compatible on a pdp-11.
You may also want to consider that p is out of scope once gets()
returns, and therefore may point to junk.

You are incorrect. it's assigned from a parameter.
Perhaps because by definition, a string can't contain a null. :)

Exactly. This code did not return a string containing a null.
fgets() on modern systems attempts to.
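
To make the "data beyond the null is lost" point concrete, here is a small sketch. The helper name stored_length is made up for this example, and it assumes the line fits in the buffer. fgets() stores the null character like any other character and keeps reading up to the newline, so the bytes after the null are in the buffer, but strlen() and friends cannot see them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one line with fgets() and return how many bytes it stored,
 * counting up to and including the newline even if null characters
 * appear earlier in the line. Returns 0 at end of file.
 * Assumes the whole line fits in buf.
 */
size_t stored_length(char *buf, size_t size, FILE *fp)
{
    char *nl;

    assert(buf != NULL && fp != NULL && size > 1);

    memset(buf, 0, size);   /* so memchr() never looks at junk */
    if (fgets(buf, (int)size, fp) == NULL)
        return 0;

    nl = memchr(buf, '\n', size);
    return nl != NULL ? (size_t)(nl - buf) + 1 : strlen(buf);
}
```

Given the six input bytes 'a', 'b', '\0', 'c', 'd', '\n', stored_length() reports 6 while strlen(buf) reports 2.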
 
Mark McIntyre

But suppose a parity error on the terminal causes a zero byte to be
read from the keyboard?

Suppose a passing asteroid causes a massive magnetic spike and
generates spurious data. When was the last time you experienced
either?
Unless I'm missing something, the code you posted won't read an
embedded null either. It stops as soon as it encounters one, and
returns a string consisting of every character up to the null.

as opposed to modern fgets, which keeps on going and data beyond the
null to the end of the line [or the count] is lost.

I'm sorry, I thought your argument was that you /wanted/ fgets to
read nulls.
it's compatible on a pdp-11.

So what?
You are incorrect. it's assigned from a parameter.

My mistake.
Exactly. This code did not return a string containing a null.
fgets() on modern systems attempts to.

if the file you're reading from was opened in text mode, it can't
strictly contain nulls. If it was opened in binary mode, you're using
the wrong function.
 
Frodo Baggins

Keith said:
bildad said:
[...]
This is my solution, after research. Criticism welcome.
Right. Suppose an input line is 300 characters long. Your call to
fgets() will read 119 characters; the resulting buffer will contain a
valid string terminated by a '\0' character, but it won't contain a
newline. Your call to fputs() or printf() will print this partial
line.

Think about what happens when you call fgets() again. You still have
the rest of the line waiting to be read, and the next fgets() gets the
next 119 characters of the line, which you then print.

On the *next* call to fgets(), you read the remainder of the long
input line, including the newline, and you then print it. You've read
and printed the entire line, but you've done it in 3 chunks.

If all you're doing with each result from fgets() is printing it, it
doesn't matter that it might take several calls to fgets() to read the
whole line. If you're doing more processing than that (as you
typically would in a real-world program), it could become a problem.
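
The three-chunk behaviour described above can be checked with a small sketch like this one. The function name count_chunks is hypothetical, and MAX_LEN is 120 as in the code under discussion. It counts how many fgets() calls it takes to get through one input line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LEN 120

/*
 * Read one input line from fp using a MAX_LEN buffer and return the
 * number of fgets() calls that were needed (0 at end of file).
 */
int count_chunks(FILE *fp)
{
    char buf[MAX_LEN];
    int chunks = 0;

    assert(fp != NULL);

    while (fgets(buf, sizeof buf, fp) != NULL) {
        chunks++;
        if (strchr(buf, '\n') != NULL)
            break;      /* this chunk completed the line */
    }
    return chunks;
}
```

For a 300-character line followed by a newline, the calls store 119, 119 and then 63 characters (the last 62 plus the newline), so the count is 3.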

hi
call fflush(stdin) after the first fgets() call.
Regards,
Frodo Baggins
 
Jordan Abel


So it's not my code anyway, and it's pre-ansi :p
if the file you're reading from was opened in text mode, it can't
strictly contain nulls. If it was opened in binary mode, you're
using the wrong function.

The standard forbids text files to contain nulls? Or it allows them
to fail to contain them? I believe the latter is true but not the
former.
 
Jordan Abel

hi
call fflush(stdin) after the first fgets() call.
Regards,
Frodo Baggins

That is incorrect. There is a small possibility it may work if
reading from a terminal, and an even smaller possibility it may work
if reading from a pipe. This should not be mistaken for it being
defined behavior or good programming practice. And in any case it
will almost certainly not work if reading from a file.
 
Mark McIntyre

So it's not my code anyway, and it's pre-ansi :p

I was assuming that when you said that you'd converted it to ISO C,
you had actually done that....
The standard forbids text files to contain nulls? Or it allows them
to fail to contain them? I believe the latter is true but not the
former.

It probably does neither. If it contains nulls, it's definitionally not
a text file since null isn't a printable character.
 
Keith Thompson

Frodo Baggins said:
hi
call fflush(stdin) after the first fgets() call.
Regards,
Frodo Baggins

And what exactly do you expect that to accomplish?

fflush() is not defined for input streams. See question 12.26 in the
C FAQ.

The text version of the C FAQ, available at
<ftp://ftp.eskimo.com/u/s/scs/C-faq/faq.gz>, is more up to date than
the HTML version and goes into more detail on this:

] 12.26a: How can I flush pending input so that a user's typeahead isn't
] read at the next prompt? Will fflush(stdin) work?
]
] A: fflush() is defined only for output streams. Since its
] definition of "flush" is to complete the writing of buffered
] characters (not to discard them), discarding unread input would
] not be an analogous meaning for fflush on input streams.
] See also question 12.26b.
]
] References: ISO Sec. 7.9.5.2; H&S Sec. 15.2.
]
] 12.26b: If fflush() won't work, what can I use to flush input?
]
] A: It depends on what you're trying to do. If you're trying to get
] rid of an unread newline or other unexpected input after calling
] scanf() (see questions 12.18a-12.19), you really need to rewrite
] or replace the call to scanf() (see question 12.20).
] Alternatively, you can consume the rest of a partially-read line
] with a simple code fragment like
]
] while((c = getchar()) != '\n' && c != EOF)
] /* discard */ ;
]
] (You may also be able to use the curses flushinp() function.)
]
] There is no standard way to discard unread characters from a
] stdio input stream, nor would such a way necessarily be
] sufficient, since unread characters can also accumulate in
] other, OS-level input buffers. If you're trying to actively
] discard typed-ahead input (perhaps in anticipation of issuing a
] critical prompt), you'll have to use a system-specific
] technique; see questions 19.1 and 19.2.
]
] References: ISO Sec. 7.9.5.2; H&S Sec. 15.2.
 
bildad

And (c) decide what the program should do if the human types in an
overly long input line. There are numerous possibilities: silently
discard the extra characters, print an error message and abort, print
an error message and continue, build up a longer string containing all
the input (probably using realloc()). In a small toy program, you can
get away with ignoring the issue. In the real world, you had better
decide how to handle it, and write and test the code to do it.

This is where I am right now. I can't say it's bullet-proof but I think it
handles two situations safely. I'm working on malloc and realloc with
little success at this point. Thanks for the suggestions. I'm trying to
implement them.

void CopyFile(FILE *fp)
{
    char buff[MAX_LEN]; /* 120 */
    //char *p;

    //p = malloc(fgets(buff, MAX_LEN + 1, fp));

    if (fgets(buff, MAX_LEN, fp)) {
        fputs(buff, stdout);
        exit(EXIT_SUCCESS);
    } else {
        fputs("Error: Program Aborting", stdout);
        exit(EXIT_FAILURE);
    }
}
BTW, there are a number of implementations floating around of
functions that read an input line of arbitrary length into a
dynamically allocated buffer.

Thank you. I'll search for them.
 
Keith Thompson

bildad said:
And (c) decide what the program should do if the human types in an
overly long input line. There are numerous possibilities: silently
discard the extra characters, print an error message and abort, print
an error message and continue, build up a longer string containing all
the input (probably using realloc()). In a small toy program, you can
get away with ignoring the issue. In the real world, you had better
decide how to handle it, and write and test the code to do it.

This is where I am right now. I can't say it's bullet-proof but I think it
handles two situations safely. I'm working on malloc and realloc with
little success at this point. Thanks for the suggestions. I'm trying to
implement them.

void CopyFile(FILE *fp)
{
    char buff[MAX_LEN]; /* 120 */
    //char *p;

    //p = malloc(fgets(buff, MAX_LEN + 1, fp));

    if (fgets(buff, MAX_LEN, fp)) {
        fputs(buff, stdout);
        exit(EXIT_SUCCESS);
    } else {
        fputs("Error: Program Aborting", stdout);
        exit(EXIT_FAILURE);
    }
}

Ok, this *definitely* isn't what you want. You attempt to read and
write the first line of the file (or the first 119 characters if the
line is longer than that) -- and then you abort the program, whether
it was successful or not.

Calling exit() from within your function probably isn't a good idea.
Calling exit() from within your function if it succeeds definitely
isn't a good idea. If the function is intended for general use, you
might want to return a result indicating whether it was successful,
and leave it to the caller to decide how to deal with errors. One
common convention is to return 0 for success, non-0 for any error
(which allows you to enhance the function to specify different kinds
of errors).
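
A minimal sketch along those lines might look like the following. The name copy_file, the extra output-stream parameter, and the particular non-zero codes are assumptions made for this example; the original CopyFile() took only the input stream and wrote to stdout.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_LEN 120

/*
 * Copy everything from in to out, one fgets() chunk at a time.
 * Returns 0 on success, 1 on a write error, 2 on a read error.
 * The caller decides what to do about a failure.
 */
int copy_file(FILE *in, FILE *out)
{
    char buff[MAX_LEN];

    assert(in != NULL && out != NULL);

    while (fgets(buff, sizeof buff, in) != NULL) {
        if (fputs(buff, out) == EOF)
            return 1;               /* write error */
    }
    /* fgets() returns NULL both at end of file and on error. */
    return ferror(in) ? 2 : 0;
}
```

The caller can then write something like: if (copy_file(fp, stdout) != 0), print a message to stderr and exit with EXIT_FAILURE from main().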
 
Dik T. Winter

> On Sun, 30 Oct 2005 17:43:07 +0000 (UTC), in comp.lang.c , Jordan Abel

>
> It probably does neither. If it contains nulls, it's definitionally not
> a text file since null isn't a printable character.

It is still a text file since text files can contain non-printable
characters. But reading a file that contains NUL characters with
fgets is not really a good idea. fgetc will give you everything you
need.
 
pete

Mark said:
If it contains nulls, it's definitionally not
a text file since null isn't a printable character.

Your reasoning is invalid.
Text files may contain more than printable characters.

isprint('\n') == 0

new-line characters aren't out of place in a text file.
 
Dave Thompson

On Sun, 30 Oct 2005 17:43:07 +0000 (UTC), Jordan Abel
The standard forbids text files to contain nulls? Or it allows them
to fail to contain them? I believe the latter is true but not the
former.

Basically the latter. What's actually in a file is out of scope of the
standard, and on some (rare) systems can in fact differ substantially
from what the program sees. What is in scope is that if you write out
data on a text stream to a (text) file and read it back (implicitly
without anything else changing the file) you are guaranteed to get the
same data (and thus the file must contain or represent it somehow) if:
you use only printing characters, HT and NL (which excludes among
others null); you don't have trailing spaces on a line; the
implementation may require that the last line be terminated by NL, and
may limit line length to no less than 254 characters including NL.
(And, as always, you don't exceed any resource limits, e.g. it's
permitted and reasonable to have a limit on file size.)

It is undefined by omission what happens if you violate these
restrictions; as for all UB the implementation may choose to make it
work, and probably will if the (OS) filesystem can easily do so.
- David.Thompson1 at worldnet.att.net
 
Joe Wright

Dave said:
On Sun, 30 Oct 2005 17:43:07 +0000 (UTC), Jordan Abel



Basically the latter. What's actually in a file is out of scope of the
standard, and on some (rare) systems can in fact differ substantially
from what the program sees. What is in scope is that if you write out
data on a text stream to a (text) file and read it back (implicitly
without anything else changing the file) you are guaranteed to get the
same data (and thus the file must contain or represent it somehow) if:
you use only printing characters, HT and NL (which excludes among
others null); you don't have trailing spaces on a line; the
implementation may require that the last line be terminated by NL, and
may limit line length to no less than 254 characters including NL.
(And, as always, you don't exceed any resource limits, e.g. it's
permitted and reasonable to have a limit on file size.)

It is undefined by omission what happens if you violate these
restrictions; as for all UB the implementation may choose to make it
work, and probably will if the (OS) filesystem can easily do so.
- David.Thompson1 at worldnet.att.net

I think it's a little simpler. Let's stick to Unix for illustration. All
files in Unix are binary. The bytes written on the file are exactly the
ones in memory. There is no difference in the representation. The
Standard allows "rb", "rt" and "wb", "wt" modes for fopen() but they have
no difference with "r" and "w" in Unix. They have an effect on files written
by or for other operating systems (Windows, Apple, other), not Unix.

In the old days Unix and C were meant for each other and used ASCII as
the tie that binds. A text file for Unix is one that consists of ASCII
characters. ASCII characters (bytes) have values 0..127 inclusive.

A text file consists of lines of characters. A line consists of 0 or
more characters ending in a new line (NL) character.

Whether the last line must end in NL is implementation defined. The C
Standard doesn't care and neither do I if I am reading. I always write
the NL at the end of the last line of my text files. The char NL has
ASCII value 10.

Note that char '\0' (NUL) is valid in a text file. It has no special
meaning. Anyone writing NUL to a text file should be shot.

The C 'string' is a memory thing, an array of char terminated with 0 or
'\0'. There are no strings in a text file, even if there is a NUL.

Yeah, simpler. Sorry for that.
 
Richard Bos

Joe Wright said:
I think it's a little simpler. Let's stick to Unix for illustration. All
files in Unix are binary. The bytes written on the file are exactly the
ones in memory. There is no difference in the representation. The
Standard allows "rb", "rt" and "wb", "wt" modes for fopen() but they have
no difference with "r" and "w" in Unix.

Only "rb" and "wb"; text is the default, and not specified. "w+", "r+",
"rb+", "r+b", "wb+", "w+b", "a", "ab", "a+", "ab+", "a+b" and "0"[1] are
also allowed.
Of course, you could argue that the Standard "allows" the use of "rt",
since it means undefined behaviour, and undefined behaviour does not
mandate a crash. By that norm, fflush()ing input files or writing
through null pointers is also "allowed", so I don't think it counts.

Richard

[1] Ok, so maybe not that one. And there are no "w" or "r" blood groups.
 
Antonio Contreras

Richard said:
googler said:


Others have already answered your question, but nobody appears to have
pointed out yet that...


...in C, main returns int, not void. This is a common error, and those who
commit it often find it hard to believe that it's wrong. Nevertheless, no C
compiler is required to accept a main function that returns void unless it
specifically documents that acceptance - which few, if any, do.

MCC18 (Microchip C Compiler for the 18XXX family) actually requires
main to be defined as void main (void). And it makes sense, because
there is no OS to which you could return a value or from which to take
parameters. Many compilers for microcontrollers also have this
requirement, or simply ignore the return statement from main for given
reasons.
 
Richard Heathfield

Antonio Contreras said:
MCC18 (Microchip C Compiler for the 18XXX family) actually requires
main to be defined as void main (void). And it makes sense, because
there is no OS to which you could return a value or from which to take
parameters. Many compilers for microcontrollers also have this
requirement, or simply ignore the return statement from main for given
reasons.

I should have excluded freestanding implementations from my rather sweeping
statement.
 
