Designing fgetline - a perspective

R

Richard Harter

The following is an exercise in thinking out a design.
Comments and thoughts are welcome.

Writing a routine to read lines from a file is one of
those little design tasks that seems to create disagreement and
confusion. The routine has various names; in this article we
will call it fgetline. Here is a take on how to do it.

What we want is a routine that keeps reading and returning lines
from a file. A simple prototype for this function is

char * fgetline(FILE *);

There are some gotchas with this prototype - well, not really
gotchas, rather little bits of awkwardness and inefficiency. One

is that we are throwing away information that (usually) has to be

recomputed, namely the length of the line and, as we shall see,
status information. If we want to add the information to our
prototype there are several that to do it - it can be
passed back through the calling sequence or it can be returned by

the function. (We could also pass it to a global - ugh, don't do

it.) My vote is to have it all returned by the function which
means we create a struct (record) that looks like:

struct fgetline_info {
char *line;
size_t len;
int status;
};

There are various choices that could be made in the types -
season to taste. (A pointer to the struct could be passed in as
an argument if need be.)


A second issue is that there is no limit to how large the line
can be. A plausible way to provide a limit that is to pass one
in as an argument. If we do, our prototype looks like this:

struct fgetline_info fgetline(FILE * fptr, size_t maxsize);

This version has a little problem - where did the storage for the

line come from? One way to deal with that is for fgetline to
allocate storage for the line and for the calling routine to free

it when it is done with the line.

Okay, so how do we do this in fgetline? Well, there is a very
simple way to do it, one that has been reinvented time and time
again. We start out allocating a standard amount of space
(numbers like 32, 64, and 80 bytes are typical) for a line.
We read from the file until we either hit an EOL (end of line
marker), a failure to read, or we have read as many characters as

were allocated. If we haven't hit the EOL we reallocate the
array, commonly by doubling the size, and doing another read.

This works, but it is inefficient - we have to call malloc and
free for every line. One way to get around this is to use a
technique I call highwater buffering. In the file containing the

code for fgetline we have two file-scope variables:

static char * buffer = 0;
static size_t size = 0;

In fgetline we have some initialization code that allocates
buffer space when the buffer size is zero. Thereafter we use our

little doubling trick. The advantage of this scheme is that we
have at most a small number of calls to malloc (the number of
doublings needed to get to the largest line) and no calls to
free.

The very real disadvantage of this scheme is that it produces a
dirty copy of the line. By a dirty copy, I mean one that can be
scribbled on elsewhere before we are done with it. This will
happen whenever fgetline is called anywhere else with any FILE
argument before our next read. This is double plus ungood.

Is there a way to get the efficiency of the highwater scheme
without being dirty? Yes; the trick is that the calling program
provides the initial buffer. If we add the buffer information to

the prototype we get:

fgetline_info fgetline(FILE * fptr,
char * buffer,
size_t size,
size_t maxsize);

There are three kinds of copies of the line that we might get
from fgetline - clean, transient, and dirty. A clean copy is one

that has no encumbrances - it lasts until it is specifically
deleted. A transient copy is one that lasts until the next
call to fgetline to get another line from the current file.
(There may be another file being read elsewhere.)

What kind of copy should fgetline provide, clean or transient?
There are arguments for each choice. Clean copies are more
expensive but safer. Transient copies are (usually) cheaper. In

questions like this there is much to be said for the design
principle that:

When there is a significant choice as to the kind of
output being delivered the library routine should let
the user make the choice rather than dictating the
choice unless offering the choice unduly complicates
usage.

So how do we provide a choice? That is simple; if the buffer
pointer in the calling sequence is NULL, fgetline must return a
clean copy; otherwise fgetline MAY return a transient copy. I
say "may" because it might have to increase the buffer size.
If it does the calling routine will have to check whether there
was a size increase and, if so, will have to free the returned
buffer. (The fgetline implementation can't realloc the passed
in buffer; it will have to do a malloc for the first resizing.)

It's probably best to have a value in the status field that
signifies the size has been increased; call that value
fg_increased and a normal read fg_normal. Then the usage for
transient copies might look like this (first cut):

...
struct fgetline_info fg;
char buffer[80];
int done = 0;
....
for(;!done;) {
...
fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
if (fg.status == fg_normal || fg.status == fg_increased) {
/* do stuff */
if (fg.status == fg_increased) free(fg.line);
} else done = 1;
}

If we want clean copies the corresponding loop body would be

fg = fgetline(fptr,0,0,FG_MAXSIZE);
if (fg.status == fg_normal) {
/* do stuff */
free(fg.line);
} else done = 1;

What sort of return values should be available in the status
field? These are the successful ones that occur to me:

end_of file The last read (if any) found an EOL
marker. The current read finds an
immediate EOF.

no_increase Normal read - either buffer was null or
else it was not increased.

increase Normal read - the buffer size has been
increased.

abn_no_increase Abnormal read - an EOF was found without
an EOL. Buffer was not increased.

abn_increase Abnormal read - an EOF was found without
an EOL. Buffer was increased.

In addition there are numerous kinds of errors that can occur.
Calling sequence arguments include:

no_file The file pointer is null.
bad_buffer One and only one of buffere and size s
zero.
bad_size Size is 0 or is greater than maxsize
bad_maxsize Maxsize is 0

Then there are the memory allocation failures:

bad_allocate Malloc or realloc failure
big_line Line length is greater than maxsize.

When there is a memory allocation failure the line and len fields

hold what has been successfuly read.

There are various conventions one could use for assigning code
values for the possible return values. My view is that there
should be a bit that is set only if there was an error, a bit
that is set only if there was a memory allocation error, a bit
that is set only if there was a buffer increase, and a bit that
is set only if an EOL occurred.

The point of doing this is that we can have simple tests to check

whether an error occurred, whether some space has to be freed,
and whether the file read is complete.

As a final note, there is one minor decision left unsettled,
namely should the returned line include an EOL marker before the
string terminating 0. My take is that this matter is too trivial

to warrant adding a flag to the calling sequence, and that it
will be slightly less confusing if there is one present even if
it has to be manufactured. Perhaps someone has a convincing
argument one way or the other.


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
 
R

Richard Heathfield

Richard Harter said:

What we want is a routine that keeps reading and returning lines
from a file. A simple prototype for this function is

char * fgetline(FILE *);

There are some gotchas with this prototype

Check out http://www.cpax.org.uk/prg/writings/fgetdata.php in which I dealt
with these problems a few years back. Curiously, I too chose the name
fgetline (and fgetword, for token-based rather than line-based input).

You, um, wanna try a different name? Two fgetlines could get confusing.
 
T

Thad Smith

Richard said:
The following is an exercise in thinking out a design.
Comments and thoughts are welcome.

I enjoyed reading your design process.

Is there a way to get the efficiency of the highwater scheme
without being dirty? Yes; the trick is that the calling program
provides the initial buffer. If we add the buffer information to

the prototype we get:

fgetline_info fgetline(FILE * fptr,
char * buffer,
size_t size,
size_t maxsize);

It's probably best to have a value in the status field that
signifies the size has been increased; call that value
fg_increased and a normal read fg_normal. Then the usage for
transient copies might look like this (first cut):

...
struct fgetline_info fg;
char buffer[80];
int done = 0;
....
for(;!done;) {
...
fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
if (fg.status == fg_normal || fg.status == fg_increased) {
/* do stuff */
if (fg.status == fg_increased) free(fg.line);
} else done = 1;
}

That is too messy for my taste. I can see giving the flexibility of
using either a supplied buffer or allocating one, but to have the
function choose based on the size of the line makes the application too
error prone, IMO. If accepting a supplied buffer, I would only use
that, giving an error indication if it were too short. The
simplification in code is a good payoff for me.

What sort of return values should be available in the status
field? These are the successful ones that occur to me:

end_of file The last read (if any) found an EOL
marker. The current read finds an
immediate EOF.

no_increase Normal read - either buffer was null or
else it was not increased.

increase Normal read - the buffer size has been
increased.

abn_no_increase Abnormal read - an EOF was found without
an EOL. Buffer was not increased.

abn_increase Abnormal read - an EOF was found without
an EOL. Buffer was increased.

In addition there are numerous kinds of errors that can occur.
Calling sequence arguments include:

no_file The file pointer is null.
bad_buffer One and only one of buffere and size s
zero.
bad_size Size is 0 or is greater than maxsize
bad_maxsize Maxsize is 0

Then there are the memory allocation failures:

bad_allocate Malloc or realloc failure
big_line Line length is greater than maxsize.

Add I/O error.
As a final note, there is one minor decision left unsettled,
namely should the returned line include an EOL marker before the
string terminating 0. My take is that this matter is too trivial

to warrant adding a flag to the calling sequence, and that it
will be slightly less confusing if there is one present even if
it has to be manufactured.

I prefer a flag. Otherwise you need to usurp some character for an EOF
marker. The constant EOF is intentionally outside (unsigned char)c for
all characters c (excepting implementations with sizeof int = 1).
 
R

Richard Harter

Richard Harter said:



Check out http://www.cpax.org.uk/prg/writings/fgetdata.php in which I dealt
with these problems a few years back. Curiously, I too chose the name
fgetline (and fgetword, for token-based rather than line-based input).

Thanks, I've read it; it's quite good; however it doesn't go into
some of the issues I was concerned with.
You, um, wanna try a different name? Two fgetlines could get confusing.

But that's the logical name - getline gets a line from stdin,
fgetline gets a line from an arbitrary file.



Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
 
R

Richard Heathfield

Richard Harter said:
On Fri, 12 Oct 2007 02:18:30 +0000, Richard Heathfield


Thanks, I've read it; it's quite good; however it doesn't go into
some of the issues I was concerned with.

I'm sure, but I wasn't particularly fussed about that - I just wanted to
draw your attention to the name clash.
But that's the logical name - getline gets a line from stdin,
fgetline gets a line from an arbitrary file.

Er, I agree that it's the logical name (obviously!). Oh, well - if you want
a name clash, a name clash you can have, I guess.
 
R

Richard Harter

I enjoyed reading your design process.

Thank you.
Is there a way to get the efficiency of the highwater scheme
without being dirty? Yes; the trick is that the calling program
provides the initial buffer. If we add the buffer information to

the prototype we get:

fgetline_info fgetline(FILE * fptr,
char * buffer,
size_t size,
size_t maxsize);

It's probably best to have a value in the status field that
signifies the size has been increased; call that value
fg_increased and a normal read fg_normal. Then the usage for
transient copies might look like this (first cut):

...
struct fgetline_info fg;
char buffer[80];
int done = 0;
....
for(;!done;) {
...
fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
if (fg.status == fg_normal || fg.status == fg_increased) {
/* do stuff */
if (fg.status == fg_increased) free(fg.line);
} else done = 1;
}

That is too messy for my taste. I can see giving the flexibility of
using either a supplied buffer or allocating one, but to have the
function choose based on the size of the line makes the application too
error prone, IMO. If accepting a supplied buffer, I would only use
that, giving an error indication if it were too short. The
simplification in code is a good payoff for me.

Hmmm. My take is that I want the function to give me a complete
line if at all possible but to do it in a way that minimizes
calls to malloc/realloc/free. Passing a supplied buffer does this
provided that we still do the expanding buffer trick. Treating a
too small buffer as an error would move a bunch of mess to the
calling program.

That said, your point about being too error prone is well taken.
It may be better to actually have two routines - one without an
input buffer and one with. Call the one with no user supplied
buffer fgetline and the one with the user supplied buffer
fgetline_hp (hp for high performance :)).
Add I/O error.


I prefer a flag. Otherwise you need to usurp some character for an EOF
marker. The constant EOF is intentionally outside (unsigned char)c for
all characters c (excepting implementations with sizeof int = 1).

I see that I was unclear. Here is the issue: When we return a
line do we include \n or not. For example, suppose the line is
"abc". Do we return abc\n\0 or abc\0? The "manufactured" comes
into play when the last line of a file isn't properly terminated
with a \n. If we are returning the EOL (not EOF) character then
we have to manufacture one that is not actually present in the
file. Upon reflection, it may be best to return abc\0.



Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
 
R

Richard Harter

Richard Harter said:
[snip]
But that's the logical name - getline gets a line from stdin,
fgetline gets a line from an arbitrary file.

Er, I agree that it's the logical name (obviously!). Oh, well - if you want
a name clash, a name clash you can have, I guess.

Hey, I'm easy. Do you have any good suggestions? getln, fgetln,
rdline, and readline are already in use.


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
 
L

Logan Shaw

Richard said:
Is there a way to get the efficiency of the highwater scheme
without being dirty? Yes; the trick is that the calling program
provides the initial buffer. If we add the buffer information to
the prototype we get:

fgetline_info fgetline(FILE * fptr,
char * buffer,
size_t size,
size_t maxsize);

There are three kinds of copies of the line that we might get
from fgetline - clean, transient, and dirty. A clean copy is one
that has no encumbrances - it lasts until it is specifically
deleted. A transient copy is one that lasts until the next
call to fgetline to get another line from the current file.
(There may be another file being read elsewhere.)

What kind of copy should fgetline provide, clean or transient?
There are arguments for each choice. Clean copies are more
expensive but safer. Transient copies are (usually) cheaper. In
questions like this there is much to be said for the design
principle that:

When there is a significant choice as to the kind of
output being delivered the library routine should let
the user make the choice rather than dictating the
choice unless offering the choice unduly complicates
usage.

So how do we provide a choice? That is simple; if the buffer
pointer in the calling sequence is NULL, fgetline must return a
clean copy; otherwise fgetline MAY return a transient copy.

I agree 100% so far, but this is where I diverge. IMHO, this is
all a lot simpler if you let the caller make the choice up front
before fgetline() is ever called. How do you do this when the
length of the line is not known in advance? You let the user
manage the storage for the "meta info" about the buffer (the
address of the buffer, the size of it, etc.), and you track
everything about the buffer either in local variables or in
the storage that the user has supplied. If the user wants that
meta info to be static, they declare it static. If they want
to have a clean copy on every call, they make a new structure
to track a buffer every time.Also notice that free_buffer() leaves the buffer metadata in a
consistent state so that you can throw away the dynamic memory as
soon as you know you don't need it yet still call freadline() on it
again if you feel like it.

It looks like this (extraneous fields omitted for clarity):

struct LineBuffer {
char *buffer;
size_t buffer_size;
};

And the code looks like this:

void init_buffer(LineBuffer *lb) {
lb->buffer = NULL;
lb->buffer_size = 0;
}

int increase_buffer(LineBuffer *lb, size_t newsize) {
char *newbuffer = realloc(buffer, newsize);
if (newbuffer == NULL) return 0;

lb->buffer = newbuffer;
lb->buffer_size = newsize;
return 1;
}

void free_buffer(LineBuffer *lb) {
free(lb->buffer);
init_buffer(lb); /* leave in a consistent state */
}

int freadline(FILE *f, LineBuffer *lb) {
int pos = 0;

while (.... whatever ....) {
if (pos >= lb->buffer_size &&
! increase_buffer(lb, 2 * lb->buffer_size + 1)) {
return 0;
}

lb->buffer[pos++] = .... this character ....;
}
}

Now the client code can do whatever it wants. It doesn't have to worry
about conditionally freeing the old storage. Because the client is
managing the metadata about the buffer and because it is passing that
metadata to freadline(), the freadline() function can free() it if
necessary.

On a side note, since init_buffer() allocates no dynamic storage and
merely leaves the metadata in a consistent state, no error handling
is necessary, and only calls to freadline() have to worry about an
error, thus reducing the amount of error-handling code needed.

- Logan
 
A

Alf P. Steinbach

* Richard Harter:
The following is an exercise in thinking out a design.
Comments and thoughts are welcome.

Writing a routine to read lines from a file is one of
those little design tasks that seems to create disagreement and
confusion. The routine has various names; in this article we
will call it fgetline. Here is a take on how to do it.

What we want is a routine that keeps reading and returning lines
from a file. A simple prototype for this function is

char * fgetline(FILE *);

There are some gotchas with this prototype - well, not really
gotchas, rather little bits of awkwardness and inefficiency. One

is that we are throwing away information that (usually) has to be

recomputed, namely the length of the line and, as we shall see,
status information. If we want to add the information to our
prototype there are several that to do it - it can be
passed back through the calling sequence or it can be returned by

the function. (We could also pass it to a global - ugh, don't do

it.) My vote is to have it all returned by the function which
means we create a struct (record) that looks like:

struct fgetline_info {
char *line;
size_t len;
int status;
};

This looks like a C++ std::string except it has mixed in that "status"
field which has nothing to do with strings.

There are various choices that could be made in the types -
season to taste. (A pointer to the struct could be passed in as
an argument if need be.)


A second issue is that there is no limit to how large the line
can be. A plausible way to provide a limit that is to pass one
in as an argument. If we do, our prototype looks like this:

struct fgetline_info fgetline(FILE * fptr, size_t maxsize);

This version has a little problem - where did the storage for the

line come from? One way to deal with that is for fgetline to
allocate storage for the line and for the calling routine to free

it when it is done with the line.

Well, the C++ function is

std::eek:stream& std::getline( std::eek:stream&, std::string& );

As far as I can see this solves all of the problems mentioned.

What it doesn't solve is how to just ignore a line.

For that, streams have the ignore() member function.

OH! You have cross-posted to [comp.lang.c]! Well, darn, it shouldn't
be too difficult to translate the C++ solution to C.

Of course, then the programmer has to do what the compiler does for C++.

And I think that's very relevant: the functionality you want is
essentially to have all this stuff automated and enforced by the
compiler, like in C++ and more high level languages.

Cheers, & hth.,

- Alf
 
R

Richard Heathfield

Richard Harter said:
Richard Harter said:
[snip]
But that's the logical name - getline gets a line from stdin,
fgetline gets a line from an arbitrary file.

Er, I agree that it's the logical name (obviously!). Oh, well - if you
want a name clash, a name clash you can have, I guess.

Hey, I'm easy. Do you have any good suggestions? getln, fgetln,
rdline, and readline are already in use.

Ah. Er, um... Hmmm. The trouble is that fgetline *is* the logical name! :)

Okay, how about... oh. fgets has kinda gone, too, hasn't it?

Um... sgetline? (s for stream (the source) or for string (the target) at
your discretion).

Just a thought. Obviously, fgetline isn't trademarked or anything (well,
not as far as I know, anyway), so you're free to use that if you choose.
I'm not being 'precious' about it, honestly - but between you and me the
name clash is particularly bad because we can't even say "fgetline - you
know, Richard H's version"!
 
R

Richard Harter

Richard Harter wrote:
So how do we provide a choice? That is simple; if the buffer
pointer in the calling sequence is NULL, fgetline must return a
clean copy; otherwise fgetline MAY return a transient copy.

I agree 100% so far, but this is where I diverge. IMHO, this is
all a lot simpler if you let the caller make the choice up front
before fgetline() is ever called. How do you do this when the
length of the line is not known in advance? You let the user
manage the storage for the "meta info" about the buffer (the
address of the buffer, the size of it, etc.), and you track
everything about the buffer either in local variables or in
the storage that the user has supplied. If the user wants that
meta info to be static, they declare it static. If they want
to have a clean copy on every call, they make a new structure
to track a buffer every time.Also notice that free_buffer() leaves the buffer metadata in a
consistent state so that you can throw away the dynamic memory as
soon as you know you don't need it yet still call freadline() on it
again if you feel like it.

It looks like this (extraneous fields omitted for clarity):

struct LineBuffer {
char *buffer;
size_t buffer_size;
};

And the code looks like this:

void init_buffer(LineBuffer *lb) {
lb->buffer = NULL;
lb->buffer_size = 0;
}

int increase_buffer(LineBuffer *lb, size_t newsize) {
char *newbuffer = realloc(buffer, newsize);
if (newbuffer == NULL) return 0;

lb->buffer = newbuffer;
lb->buffer_size = newsize;
return 1;
}

void free_buffer(LineBuffer *lb) {
free(lb->buffer);
init_buffer(lb); /* leave in a consistent state */
}

int freadline(FILE *f, LineBuffer *lb) {
int pos = 0;

while (.... whatever ....) {
if (pos >= lb->buffer_size &&
! increase_buffer(lb, 2 * lb->buffer_size + 1)) {
return 0;
}

lb->buffer[pos++] = .... this character ....;
}
}

Now the client code can do whatever it wants. It doesn't have to worry
about conditionally freeing the old storage. Because the client is
managing the metadata about the buffer and because it is passing that
metadata to freadline(), the freadline() function can free() it if
necessary.

On a side note, since init_buffer() allocates no dynamic storage and
merely leaves the metadata in a consistent state, no error handling
is necessary, and only calls to freadline() have to worry about an
error, thus reducing the amount of error-handling code needed.

Many thanks for the comments. You have, I opine, the right idea
(using metadata) but your actual code doesn't do the job. One
problem is that your init_buffer initializes the buffer to NULL.
This means that the client can't supply a buffer either from the
stack or from static storage; it will have to come from the heap
which is what we want to avoid. We can get around that by having
a pair of pointer/length variables, e.g.

struct fg_meta {
char * buf0;
size_t len0;
char * buffer;
size_t length;
};

void init_fgmeta(fg_meta * fg; char *buf, size_t len) {
if (!buf) len = 0;
if (!len) buf = 0;
fg->buf0 = buf; fg->buffer = buf;
fg->len0 = len; fg->length = len;
};

with the obvious adjustment to free_fgmeta/free_buffer.

More importantly we don't want the user (client if you like) to
have to manage the meta data, though they can have the option to
do so if they like. The implication is that at initialization
time the user *tells* the package how it wants to do things,
presumably by passing a flag to the initialization routine. Does
the user want a clean copy each time? Then the user takes
responsibility for freeing the storage. Will the user accept
transient copies? Then they don't have to anything about freeing
storage.

The way I would do this is to define a structure that holds all
this information; call the pointer to it a handle. Then we have
three routines, one to initialize the structure, one to do the
actual reads (and whatever back room storage management is
needed) and one to close the read. The latter takes care of any
cleanup that needs to be done.

That's the alternative to passing bits and pieces through calling
sequences, and it probably is the road that should be taken.





Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
 
R

Richard Harter

Richard Harter said:
Richard Harter said:
[snip]

But that's the logical name - getline gets a line from stdin,
fgetline gets a line from an arbitrary file.

Er, I agree that it's the logical name (obviously!). Oh, well - if you
want a name clash, a name clash you can have, I guess.

Hey, I'm easy. Do you have any good suggestions? getln, fgetln,
rdline, and readline are already in use.

Ah. Er, um... Hmmm. The trouble is that fgetline *is* the logical name! :)

Okay, how about... oh. fgets has kinda gone, too, hasn't it?

Um... sgetline? (s for stream (the source) or for string (the target) at
your discretion).

Just a thought. Obviously, fgetline isn't trademarked or anything (well,
not as far as I know, anyway), so you're free to use that if you choose.
I'm not being 'precious' about it, honestly - but between you and me the
name clash is particularly bad because we can't even say "fgetline - you
know, Richard H's version"!

I did a bit of googling; freadline is pretty much free, so
freadline it is.


--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
 
M

Malcolm McLean

Richard Harter said:
The following is an exercise in thinking out a design.
Comments and thoughts are welcome.
Firstly let me say I think this is a massively valuable exercise. Any
competent C programmer should be able to write an fgetline() without bugs,
the cghallenge is in developing an itnerface for it.
My vote is to have it all returned by the function which
means we create a struct (record) that looks like:

struct fgetline_info {
char *line;
size_t len;
int status;
};
Sorry hopeless.
Caller wants
while(line = fgetline() )
{
processline();
}

He can't be bothered with fiddling rounf with the internals of structures.
A second issue is that there is no limit to how large the line
can be. A plausible way to provide a limit that is to pass one
in as an argument. If we do, our prototype looks like this:

struct fgetline_info fgetline(FILE * fptr, size_t maxsize);

This version has a little problem - where did the storage for the
line come from?
You're passing in a maximum size, so caller might as well make the line a
buffer in his scope, as in fgets().
In fgetline we have some initialization code that allocates
buffer space when the buffer size is zero. Thereafter we use our
little doubling trick.
Is this relevant? Apart for the problem of malicious, massive lines designed
to tie up resources, I don't think most people care whether the function can
be implemented efficiently or not.[ clen, transient, dirty copies ]
When there is a significant choice as to the kind of
output being delivered the library routine should let
the user make the choice rather than dictating the
choice unless offering the choice unduly complicates
usage.
And the choice does unduly complicate usage. Caller doesn't want to be
bothered with all this.
What sort of return values should be available in the status
field? These are the successful ones that occur to me:

end_of file The last read (if any) found an EOL
marker. The current read finds an
immediate EOF.

no_increase Normal read - either buffer was null or
else it was not increased.

increase Normal read - the buffer size has been
increased.

abn_no_increase Abnormal read - an EOF was found without
an EOL. Buffer was not increased.

abn_increase Abnormal read - an EOF was found without
an EOL. Buffer was increased.

In addition there are numerous kinds of errors that can occur.
Calling sequence arguments include:

no_file The file pointer is null.
bad_buffer One and only one of buffere and size s
zero.
bad_size Size is 0 or is greater than maxsize
bad_maxsize Maxsize is 0

Then there are the memory allocation failures:

bad_allocate Malloc or realloc failure
big_line Line length is greater than maxsize.

When there is a memory allocation failure the line and len fields
hold what has been successfuly read.
This is a real problem. Too many return values. You can't get away from
allocation failures unless you also redo your entire allocation strategy.
You can't prevent IO errors, though in practise they are unlikely. Premature
EOF is neither unlikely nor something you can avoid.
As a final note, there is one minor decision left unsettled,
namely should the returned line include an EOL marker before the
string terminating 0.
Why not just extend fgets().
You're passing a maximum size, which 99% of the time will be less than 1024
bytes, so caller can easily declare it in his scope. He can malloc a massive
buffer if expecting a massive line. The only thing we lose is that if there
is huge but legitimate line of unknown length, caller has to allocate more
memory.

The problem of line length is easily solved, pass in a length pointer /
return it. That leaves only the problem of what to do on overlong lines. The
ideal thing is to take as much error handling code out of caller as
possible.
So caller might want to exit, he might want to print a message to stderr, he
might want the function to refuse to return any data, or he might want to do
what fgets() does and essentially ignore the problem.

So

/*
fgetsx() - fgets extension.
Params; buff - buffer to hold line
maxlen - length of buffer
flags - how to handle overlong lines and IO errors
fp - pointer to an open file
Returns: number of characters placed in buff, or errorcode
*/
int fgetsx(char *buff, size_t maxlen, int flags, FILE *fp);

So now we've just got to decide the flags.

FGETS_EXIT - call exit on error
FGETS_CODE - return an error code on error
FGETS_STDERR - print a message to stderr on error
FGETS_IO - report IO errors
FGETS_OVERFLOW - report overflow errors
FGETS_EOF - report premature EOF as an error

FGETS_ALL - report all errors

Return value -1 = premature EOF
-2 - IO error
-3 - buffer overflow error

Typical call

while(len = fgetsx(buff, 1024, FGETS_EXIT| FGETS_STDERR | FGETS_ALL, fp))
{
/* here we know that buff is a valid normal line. Else program quits with
an error message */
}

while(ret = fgets(buff, 1024, FGETS_CODE | FGETS_ALL, fp) )
{
if(ret < 0)
{
switch(ret)
{
case EOF: /* premature EOF, handle */
case FGETS_IOERROR: /* io error, handle */
case FGETS_OVERFLOW: /* line too long, handle */
}
}
}
 
R

Richard Harter

Firstly let me say I think this is a massively valuable exercise. Any
competent C programmer should be able to write an fgetline() without bugs,
the cghallenge is in developing an itnerface for it.
Thanks.

Sorry hopeless.
Caller wants
while(line = fgetline() )
{
processline();
}

He can't be bothered with fiddling rounf with the internals of structures.

Point well taken, though I don't agree with "can't be bothered" -
some will and some won't. There are a couple of lessons here
though:

(a) The usage should be consistent with common practice. It is
very common in C to use loops like:
while(line = foo()) {...}
and
while((len = foo()) >= 0) {...}

so the interface should provide for it.

(b) The desires of the "the can't be bothered" school of
programming should be respected - at least to the extent of
providing usage options that produce results that aren't too
gross.
You're passing in a maximum size, so caller might as well make the line a
buffer in his scope, as in fgets().

I disagree. A common usage is read all of the lines of a file
and keep them in memory. If the buffers holding the lines are
all maximum size then you get memory bloat. The purpose of
maxsize is to serve as a sanity check.
Is this relevant? Apart for the problem of malicious, massive lines designed
to tie up resources, I don't think most people care whether the function can
be implemented efficiently or not.

I care. I like library code to be mean, lean, and efficient. I
don't like library code produces bloated resource usage.
[ clen, transient, dirty copies ]
When there is a significant choice as to the kind of
output being delivered the library routine should let
the user make the choice rather than dictating the
choice unless offering the choice unduly complicates
usage.
And the choice does unduly complicate usage. Caller doesn't want to be
bothered with all this.

This user does. I dare say that most users don't know, don't
care, and don't understand that their code is bloated and uses an
order of magnitude more resources than necessary. Often it
doesn't matter - the cost of writing more efficient code can
be greater than the savings.

That said, programmers should (in my personal opinion) at least
understand the costs of various choices. Part of programming is
about understanding tradeoffs. I feel that one ought to know and
appreciate that it is cheaper to reuse space rather than
continually allocate and deallocate it from the heap.

This is a real problem. Too many return values. You can't get away from
allocation failures unless you also redo your entire allocation strategy.
You can't prevent IO errors, though in practise they are unlikely. Premature
EOF is neither unlikely nor something you can avoid.

These aren't return values - they are information flags, bits in
the status word. At least that's the way I would do it. I tend
to take the view that library code ought to have a way to inform
the user of all the happenings of interest. Of course it
shouldn't be shoved down the user's throat.

Why not just extend fgets().
You're passing a maximum size, which 99% of the time will be less than 1024
bytes, so caller can easily declare it in his scope. He can malloc a massive
buffer if expecting a massive line. The only thing we lose is that if there
is huge but legitimate line of unknown length, caller has to allocate more
memory.

That's one way to do things but I really don't like it in part
for reasons already discussed, and because I don't trust coding
decisions made on the " most of the time it won't be a problem"
theory. More precisely my feeling is that code "down at the
bottom" should be very solid.

Having done the motherhood and apple pie commercial we now return
to your regular station.
The problem of line length is easily solved, pass in a length pointer /
return it. That leaves only the problem of what to do on overlong lines. The
ideal thing is to take as much error handling code out of caller as
possible.
So caller might want to exit, he might want to print a message to stderr, he
might want the function to refuse to return any data, or he might want to do
what fgets() does and essentially ignore the problem.

So

/*
fgetsx() - fgets extension.
Params; buff - buffer to hold line
maxlen - length of buffer
flags - how to handle overlong lines and IO errors
fp - pointer to an open file
Returns: number of characters placed in buff, or errorcode
*/
int fgetsx(char *buff, size_t maxlen, int flags, FILE *fp);

So now we've just got to decide the flags.

FGETS_EXIT - call exit on error
FGETS_CODE - return an error code on error
FGETS_STDERR - print a message to stderr on error
FGETS_IO - report IO errors
FGETS_OVERFLOW - report overflow errors
FGETS_EOF - report premature EOF as an error

FGETS_ALL - report all errors

This idea I like - pass flags to the routine that tell it what to
do with errors.

Return value -1 = premature EOF
-2 - IO error
-3 - buffer overflow error

I dunno know about this one - using special values as flags is an
ancient programming trick, but in my experience it is error prone
and maintainance problems prone. The problems come from having
two distinct meanings for a variable.
Typical call

while(len = fgetsx(buff, 1024, FGETS_EXIT| FGETS_STDERR | FGETS_ALL, fp))
{
/* here we know that buff is a valid normal line. Else program quits with
an error message */
}

while(ret = fgets(buff, 1024, FGETS_CODE | FGETS_ALL, fp) )
{
if(ret < 0)
{
switch(ret)
{
case EOF: /* premature EOF, handle */
case FGETS_IOERROR: /* io error, handle */
case FGETS_OVERFLOW: /* line too long, handle */
}
}
}

I assume that second call should be ret = fgetsx.

Where is buff? If it is on the stack you're making transient
copies; if it is allocated each time (and you have to do the
allocation in the user code) you're making clean bloated copies.
Notice that in one call you've got len in one call and ret in
another.

Thanks for the comments - you've given me things to think about.



Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
 
M

Malcolm McLean

Richard Harter said:
I disagree. A common usage is read all of the lines of a file
and keep them in memory. If the buffers holding the lines are
all maximum size then you get memory bloat. The purpose of
maxsize is to serve as a sanity check.
That's a good point.
If we're going to support reading the whole file, efficiently, which is
worth doing, it means we should also support reading in a file structure in
only a few lines.

while(1)
{
fg = fgetline(fptr, 0, 0, FG_MAXSIZE);
if(fg.status != fg_normal)
{
if(fg.status == end_of_file)
break;
/* error handling code */
}
temp = realloc(lines, (N+1) * sizeof(char *));
if(!temp)
{
/* error handling code */
}
lines = temp;
lines[N] = fg.line;
N++;
}

it's not an utter disaster, but it is too much.

How about.

while(line = fg_getline(fptr, 0, 0, FG_MAXSIZE))
{
temp = realloc(lines, (N+1) * sizeof(char *));
if(!temp)
{
/* error-handline code here */
}
lines = temp;
lines[N] = line;
N++;
}
/* error handler if fg_getline encounters a problem */
if(fg_error(fptr))
{
switch(fg_error(fptr))
{
}
}

Don't really like that either. We've got the while() actually testing the
condition, and we've moved some but not all of the error handling code out
of the loop, but it still isn't a boiler-plate idiom.

The other alternative is to have two functions, one for getting a single
line, one for reading a whole file.
 
C

CBFalconer

Malcolm said:
That's a good point. If we're going to support reading the whole
file, efficiently, which is worth doing, it means we should also
support reading in a file structure in only a few lines.
.... snip code example ...

Here is another example, from the testing suite for ggets. Note
the simplicity and safety of capturing the entire file in memory
(assuming sufficient memory exists). No excess memory is used.
ggets.zip is available at:

<http://cbfalconer.home.att.net/download>

/* File freverse.c - testing ggets */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ggets.h"

typedef struct node {
struct node *next;
char *ln;
} node, *nodeptr;

/* ======================= */
/* reverse string in place */
/* return length of string */
size_t revstring(char *stg)
{
char *last, temp;
size_t lgh;

lgh = strlen(stg);
if (lgh > 1) {
last = stg + lgh; /* points to '\0' */
while (--last > stg) {
temp = *stg; *stg++ = *last; *last = temp;
}
}
return lgh;
} /* revstring */

/* ======================= */
/* reverse file */
int main(void)
{
char *line;
nodeptr root, temp;
int lncount, chcount;

fprintf(stderr, "Reversing stdin to stdout\n");
root = NULL;
lncount = chcount = 0;
while (0 == ggets(&line)) {
if (NULL == (temp = malloc(sizeof *temp))) {
fprintf(stderr, "Out of memory\n");
return EXIT_FAILURE;
}
temp->next = root;
chcount += revstring(line);
lncount++;
temp->ln = line;
root = temp;
}

/* file captured, and in reverse ordered list */

while (root) {
(void)puts(root->ln);
temp = root;
root = root->next;
free(temp->ln);
free(temp);
}
fprintf(stderr, "%d chars in %d lines\n", chcount, lncount);
return 0;
} /* main */
/* END freverse.c - testing ggets */
 
¬

¬a\\/b

In data Sun, 14 Oct 2007 11:23:16 -0400, CBFalconer scrisse:
/* File freverse.c - testing ggets */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ggets.h"

typedef struct node {
struct node *next;
char *ln;
} node, *nodeptr;

/* ======================= */
/* reverse string in place */
/* return length of string */
size_t revstring(char *stg)
{
char *last, temp;
size_t lgh;

lgh = strlen(stg);
if (lgh > 1) {
last = stg + lgh; /* points to '\0' */
while (--last > stg) {
temp = *stg; *stg++ = *last; *last = temp;
}
}
return lgh;
} /* revstring */

/* ======================= */
/* reverse file */
int main(void)
{
char *line;

so it is ok even if here line!=0 ?
 
R

Richard Harter

Richard Harter said:
I disagree. A common usage is read all of the lines of a file
and keep them in memory. If the buffers holding the lines are
all maximum size then you get memory bloat. The purpose of
maxsize is to serve as a sanity check.
That's a good point.
If we're going to support reading the whole file, efficiently, which is
worth doing, it means we should also support reading in a file structure in
only a few lines.

while(1)
{
fg = fgetline(fptr, 0, 0, FG_MAXSIZE);
if(fg.status != fg_normal)
{
if(fg.status == end_of_file)
break;
/* error handling code */
}
temp = realloc(lines, (N+1) * sizeof(char *));
if(!temp)
{
/* error handling code */
}
lines = temp;
lines[N] = fg.line;
N++;
}

it's not an utter disaster, but it is too much.

How about.

while(line = fg_getline(fptr, 0, 0, FG_MAXSIZE))
{
temp = realloc(lines, (N+1) * sizeof(char *));
if(!temp)
{
/* error-handline code here */
}
lines = temp;
lines[N] = line;
N++;
}
/* error handler if fg_getline encounters a problem */
if(fg_error(fptr))
{
switch(fg_error(fptr))
{
}
}

Don't really like that either. We've got the while() actually testing the
condition, and we've moved some but not all of the error handling code out
of the loop, but it still isn't a boiler-plate idiom.

The other alternative is to have two functions, one for getting a single
line, one for reading a whole file.

To be honest, I don't much like either alternative. However that
may merely be a case of I like my wheel better. I have come up
with an alternative that I like, but I haven't gone through the
work of writing it all up. The essence is to have three
routines, an open, a read, and a close, along with an include
file, and an opaque struct to hold state. The usage looks
something like this:

#include "rdfline.h"
....
struct rdf_info *rdf; /* Opaque, allocated by fg_open */
FILE * fptr;
char * line;
size_t len;
int status;

...

if (!rdf_open(rdf,fptr,&line,&len,&status,maxsize,flags)) {
/* Can't open - probably fopen failed */
/* Do something about problem and goaway */
}
...
while (rdf_read(rdf)) {
/* Optional error test code */
/* Do stuff with line */
}
/* Optional error test code */
rdf_close(rdf);

The flags variable uses various bits to set flags that control
things like error handling, memory management, and whether to
produce clean or transient copies. Passing in a 0 uses the
defaults which are:

(a) The package does nothing about errors (writing to a log file,
etc) except set an error bit in the status word.

(b) A clean copy is produced; the user is responsible for freeing
it.

(c) rdf_read reads as much as it can and doesn't worry about a
premature EOF or memory failure - set a flag bit if you want it
to care.

More later.




Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
M

Malcolm McLean

CBFalconer said:
Here is another example, from the testing suite for ggets. Note
the simplicity and safety of capturing the entire file in memory
(assuming sufficient memory exists). No excess memory is used.
ggets.zip is available at:
The ggets() interface isn't too bad at all. I didn't think so before this
thread, but looking at the alternatives, its deficiencies are quite minor.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,982
Messages
2,570,189
Members
46,734
Latest member
manin

Latest Threads

Top