Designing fgetline - a perspective

R

Richard Harter

On Mon, 12 Nov 2007 02:17:13 GMT, David Thompson

[snip]
You can do
while( (fg = fgetline (...)) . status == 0 ){ ... }
/* or != 0, or > 0, or whatever your semantics is */

I thought of this, but I have never used it and didn't know
whether it would work. Is there a reason you put spaces around
the dot?
This is not usual practice (in C), and I'd want to get some experience
with it before deciding whether I PREFER it, but it definitely works.

FWIW it could be argued that it's analogous to the much more idiomatic
use in LISP of a multiple-valued return that silently collapses to the
first value if the caller doesn't ask for the rest.




Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
R

Richard Harter

No, the thing is that there is only one signal needed for 'OK', but
it is quite possible to return various error forms. The 0 == OK
matches this. It is a general practice in the standard library,
and avoids returning a separate error value.

I opine you're missing the point. Of course one can do all sorts
of things but a common C idiom is

while (somefunc(/* args */)) {/*body */}

If 0 were true and non zero false you could have all sorts of
termination codes. The convention would match the idiom. What
you did is of the form

while (somefunc(...) == SUCCESS_CODE)) {...}

which, while correct C, is clumsier and somewhat of a hack.


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
P

Paul Hsieh

The following is an exercise in thinking out a design.
Comments and thoughts are welcome.

[...]
What sort of return values should be available in the status
field? These are the successful ones that occur to me:

end_of file The last read (if any) found an EOL
marker. The current read finds an
immediate EOF.

If the line contains a '\n' just before a '\0' doesn't that tell you
the same thing?
no_increase Normal read - either buffer was null or
else it was not increased.

If you have some sort of out-parameter that gives the target memory
length, then the user can deduce this themselves by comparing the
starting size with the ending size. This seems to be of incredibly
marginal use (I can't even think of a use) to justify explicitly
pointing this out redundant to other ways of determining it.
increase Normal read - the buffer size has been
increased.

This is not redundant with the above flag?
abn_no_increase Abnormal read - an EOF was found without
an EOL. Buffer was not increased.

So if instead a line was read that ended without a '\n' before the
'\0' don't we already know this?
abn_increase Abnormal read - an EOF was found without
an EOL. Buffer was increased.

It just looks like you are conflating what the flags are trying to
indicate.

Similarly, file reading failures can be determined by looking at
errno() and ferror (). I don't know why you want to wrap all these
error easily deducible error conditions into a flag set.
In addition there are numerous kinds of errors that can occur.
Calling sequence arguments include:

no_file The file pointer is null.
bad_buffer One and only one of buffere and size s
zero.
bad_size Size is 0 or is greater than maxsize
bad_maxsize Maxsize is 0

These are clearly redundant with information available at the call
site.
Then there are the memory allocation failures:

bad_allocate Malloc or realloc failure
big_line Line length is greater than maxsize.

Well ok, now *this* is a new error condition that is not obviously
deducible by other means from the call site.

Your emphasis of capturing every single kind of error seems misplaced
to me. You can already determine pretty much all these errors except
for a failure to allocate memory. Give that, a prototype that looks
more like:

long readline (FILE * fp,
char ** pline,
size_t * curralloc,
size_t maxlen);

where the length of the line read is returned on success and -1 or
some other negative number is returned on failure, seem more
appropriate. If *pline is NULL, then *curralloc must be 0, and a new
buffer will be allocated, otherwise realloc will be used to resize the
buffer and curralloc will be updated to the final parameter passed to
realloc. A maxlen of 0 indicates that there is no maximum length. If
pline itself is NULL then curralloc must be NULL and the input is not
stored anywhere.

So:

char * line = NULL;
size_t lalloc = 0, oldlalloc = 0;
size_t llen;

/* Naturally reuse a line buffer */
while (0 < (llen = readline (fp, &line, &lalloc, 0))) {
if (oldalloc < lalloc) {
/* Allocation was increased */
}
/* Check line[llen-1] for '\n' to see if an EOL was read */
/* Process the string line which was read */
oldlalloc = lalloc;
}
free (line);
if (llen < 0) {
if (ferror (fp)) {
/* IO error */
} else {
/* malloc or realloc failure or else the programmer
fed bad parameters that readline() could detect. */
}
}
/* else llen == 0 means EOF was reached with no other issues. */

So you can see that all the relevant error conditions are captured,
with only 4 simple parameters. While bad parameters and memory
allocation failures are aliased together, one requires just
development time debugging, while other may require deployment time
resolution. So the two cases should be naturally separated as the
ordinary process of development.

But this misses another point as well. By getting mired in this sort
of minutia you are missing out on other issues. What about inputting
passwords or reading from a socket? By only servicing FILE *'s you
are limiting your capabilities. Also what about inputing using
delimiters other than EOL? For example, a simplified CSV-like file
could have rows separated by '\n's , but then columns separated by
','s. It makes sense to be able to feed the output (a string) of a
line read into the input of another field read with different
parameters.

These issues are dealt with by taking a more primitive approach and
writing wrappers that are specialized to the programmer's requirement
here:

http://www.azillionmonkeys.com/qed/userInput.html
 
C

CBFalconer

Richard said:
I opine you're missing the point. Of course one can do all sorts
of things but a common C idiom is

while (somefunc(/* args */)) {/*body */}

If 0 were true and non zero false you could have all sorts of
termination codes. The convention would match the idiom. What
you did is of the form

while (somefunc(...) == SUCCESS_CODE)) {...}

which, while correct C, is clumsier and somewhat of a hack.

As long as 0 reports no error, you use:

while (!(err = somefunc(...))) { ... }
if (err != DONE) fixupcode();

:)
 
R

Richard Harter

Richard Harter wrote:
[snip]
Here we may have to disagree a bit. The trouble with a KISS
approach in this case is that error and efficiency go by the
wayside. The simple way is

char * getalife(FILE *);

Oddly enough, that's the signature of my line-getter.

Oddly enough I have one stashed away that I have used for
decades. I dare say you did a better job of coding it than I
did. It's the obvious thing to do.
Well, no. Mine reuses (and perhaps expands) the buffer at
each call. You only get to keep one line at at time "internal"
to the line-getter, which doesn't seem to be a problem: My usual
pattern is not to save the lines verbatim, but to parse them and
extract data. If you *do* want to save the lines verbatim, you
can't re-use the same buffer in any case (although in this case
mine would incur a string copy yours might be able to avoid).


True. (Shrug.) I have seldom found the line length an
interesting datum.

I accept that you have seldom found ....
True. (Shrug, perhaps with a touch less assuredness.) If
you know an upper limit on line length, fgets() is available.


"No place for an error return?" Mine returns NULL for end-of-
input, I/O error, or malloc() failure, which can be disambiguated
(if desired) by calling feof() and ferror(). Usually the program
only cares about normal-vs.-other, and the code looks like

while ((line = getline(stream)) != NULL) {
...
}
if (! feof(stream)) {
die_horribly();
}

Note that *some* kind of test is necessary in any event,
since the underlying getc() lumps end-of-input and error into
a single EOF return value.

I stand by "no place for an error return". Now it is quite true
for getline that errors disambiguate them with some code in the
user space. Although this is a somewhat a matter of style, my
view is that leaving error checking to the user is a bad idea.
There are two objections: The first is that error checking (and
disambiguation) is replicated throughout programs. Every user
must (should) do that which the utility routine could have done
once. The second is that all too often the error checking simply
isn't done, or is incomplete.

Incidentally, in your routine, what do you do about a premature
EOF (no final \n). Do you accept that prematurely terminated
final line as a valid line or do you treat it as an error? If
the latter, how is the user supposed to diambiguate it? One way
to do that is to include that final \n if it is present, but then
what if the user doesn't want a final \n?

Not true of mine.

My apologies, I should have worded that better. According to the
article on Heathfield's page, you return what I call "a dirty
copy" of the line. I've implemented that method myself. At some
recent point I was going to reuse some existing code and
repackage it and I looked at it and said to myself, self this
puppy isn't thread safe. It isn't even single thread safe. Very
simply, the certified integrity of the line vanishes the moment
the user calls any other user defined function.

One way to make it safer is to for getline to maintain a table of
streams and let each stream have its own buffer. I gather you
chose not to do so. I didn't either, but in retrospect I think
that not doing so was a mistake.

I acknowledge that my "simplicity trumps all" approach is not
The Answer for all situations, nor (obviously) for all programmers.
My misgiving about the design you are working on is that it seems
to be trying to be The Answer, and I doubt that's a reasonable goal.

Well, I like to think that I'm trying to think of everything that
one should think of first, and then whittle it down to size.
Maybe I need a better knife.



Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
R

Richard Harter

There is a middle path here. Consider:

char *get_a_line(FILE *file, struct whatever *control_block);

where the "control_block" parameter is optional, and may be
given as NULL. The "simple interface user" then writes:

while ((line = get_a_line(fp, NULL)) != NULL)
... work with the line ...

and ignores all distinctions between "complete" and "incomplete"
lines, "ordinary EOF" and "read error", and so on. The "complex
interface with control block" user creates and populates the control
block, and does:

while (get_a_line(fp, &cb), cb.status == OK)
... etc ...

or whatever is appropriate. (Note: the above assumes that the
return value used in the "simple interface" case is duplicated
somewhere in the control block, in this case.)

That's an excellent idea; thanks for the suggestion. One thing
that strikes me as a good idea is to hide a lot of stuff with an
opaque pointer. That is, the "control block" looks like:

struct control_block (
size_t maxlen; /* Maximum line length permitted */
size_t length; /* Length of returned lines */
long flags; /* Might use bit fields here */
struct private_data * pvt;
};

That last puppy holds things like buffer pointers and lengths
that the implementation needs as state information that persists
from call to call. I opine that it should be set NULL when the
control block is initially populated.

As a further point it might be best to also return the line
pointer and not have it otherwise visible to the user, i.e.,
usage would look like:

while(line = get_a_line(fp, &cb))


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
R

Richard Harter

On Mon, 12 Nov 2007 11:11:12 -0800, Paul Hsieh


[snip excellent comments]

I appreciate your comments for which many thanks. The discussion
has moved on a bit from the post that you commented upon. I have
a few general thoughts here.

My theory is that service routines should do error detection and
report errors. I opine that placing the onus of error detection
on the user of the service leads to a lot of unnecessary
replication of code and all, too often, failure to actually test
for errors.

Your suggestions for generalizing are interesting. In the
implementation I think it best to distinguish between different
kinds of "streams". Offering the potential for different kinds
of separator characters is probably a good idea.

One final thought: just because one has created a "swiss army
knife" utility as a basis doesn't mean that one is committed to
using it in all its multifaceted glory. One can always write
wrappers.


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
R

Richard Harter

[snip original post]

Here (unless I am persuaded otherwise) is my final take on the
interface. Many thanks to all that commented.

The getfline.h include file defines four things:

(1,2) There are two variables, flags and errno, that replace the
status word (they will be inside a structure.)

They are supported by two enumerated types called gfl_flags and
gfl_errors. These replace the macro definitions. After some
thought I decided that bit fields were a bad idea. The
enumerated types still invade the user space but in a nicer way.
The gfl_flags enum declaration looks like this:

enum gfl_flags {
gfl_clean = 0x1, gfl_eofok = 0x2, gfl_cut = 0x4,
gfl_trunc = 0x8, gfl_omit = 0x20, gfl_exit = 0x40,
gfl_log = 0x80, gfl_nomax = 0x100
};

The gfl_flags type expresses the same values as the corresponding
macros and are to be used in the same way. For example

cb.flags = gfl_eofok | gfl_cut | gfl_log;

The "nomax" flag is an addition. It turns off bounds checking.
The "clean" flag is a replacement for the "TRANS" flag. What is
happening here is that the default is changed to transient
copies. The gfl_errors enum declaration looks like this:

enum gfl_errors {
gfl_ok = 0, gfl_stream, gfl_bufsize, gfl_flags,
gfl_io, gfl_storage, gfl_corrupt
};

The usage is different from before. The errno variable will have
one of these values. Most of the argument errors are no longer
needed. The gfl_corrupt return has been added because it may be
possible for to corrupt the hidden state. The get_ok value for
errno says that was no error; a positive value says that there
was an error.

(3) A structure that looks like this:

struct gfl_cb {
struct gfl_private * pvt;
size_t length;
gfl_errors errno;
size_t maxlen;
int flags;
};

The gfl_private struct is new; it is an opaque pointer to a
struct created by getfline that is used to hold the state data
from one call to the next. The gfl_cb struct does not have to be
used (see prototypes below) but if it is it should be populated
as follows:

struct gfl_cb = {0,0,0,MAXLEN,FLAGS};

where MAXLEN is the maximum line length permitted, and FLAGS is
set as the or of the selected flags. MAXLEN may be 0 if the
nomax flag is selected. Arguments two and three don't need
initializing,; setting them to zero might be good practice.

The pvt pointer must be 0 (NULL) when the first line is
extracted; if it is not bad things may happen. It will be
cleared by getfline after the last line is extracted.

(4,5) Two prototypes that look like this:

char * getfline (FILE * fptr, struct gfl_cb * cb);
int gfl_terminate ( struct gfl_cb * cb);

Getfline is the principle function. If the second argument is
NULL getfline will run in clean copy mode with no bounds check
and will ignore a missing final EOL. It is the appropriate
choice for simple usage. If the second argument is not NULL it
must be correctly initiialized.

Gfl_terminate is there for a special case. Ordinarily getfline
will clean up its private data after the last line is extracted.
However the user may break out of the read loop before the last
line is read. If so, gfl_terminate should be called to clean up
the private data. Failure to do is not fatal; however it will
leak some memory. Return of zero is okay, nonzero is an error.
Example usage:

while (line = getfline(fptr, &cb)) {
... do stuff ...
if (something_happened) {
if (gfl_terminate(cb)) die_horribly();
break;
}
... do other stuff ...
}






Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
D

David Thompson

On Mon, 12 Nov 2007 02:17:13 GMT, David Thompson

[snip]
You can do
while( (fg = fgetline (...)) . status == 0 ){ ... }
/* or != 0, or > 0, or whatever your semantics is */

I thought of this, but I have never used it and didn't know
whether it would work. Is there a reason you put spaces around
the dot?
Only to emphasize it as the solution to the issue posed. They aren't
required lexically, nor are the others in this (small) example.

If I did/do use this in real code I would probably keep the space on
the left since the left operand (in this case) is a moderately complex
subexpression and I like it visually separated; I might well drop the
space on the right and use
(fg = blah) .status == 0

But that's only style and I wouldn't object to dropping both.

- formerly david.thompson1 || achar(64) || worldnet.att.net
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,982
Messages
2,570,190
Members
46,736
Latest member
zacharyharris

Latest Threads

Top