fgets - design deficiency: no efficient way of finding last character read

S

santosh

Eric Sosman said:
    Tell me, though: Are you using a QWERTY keyboard, despite all its
drawbacks?  Legend[*] has it that QWERTY was chosen *on purpose* to
slow down typists in the days when too much speed led to mechanical
jams.  On today's keyboards that's not a problem -- So, are you still
using a nineteenth-century keyboard layout? If so, ponder your reasons
for not changing to something more modern, and see if those reasons
shed any light on why people still put up with the Standard Warts And
All Library.

The same topic came up here back in 2002.  Here's a new copy of
what I posted back then:

Have you used a mechanical typewriter? I have. <snip>

Yes, we've still got one! A 1932 manufactured Remington. Used it a lot
during the 90s. It's still in excellent condition except that I'm
unable to acquire a ribbon anywhere.

<Good explanation!>
 
B

BartC

santosh said:
Eric Sosman said:
Tell me, though: Are you using a QWERTY keyboard, despite all its
drawbacks? Legend[*] has it that QWERTY was chosen *on purpose* to
slow down typists in the days when too much speed led to mechanical
jams. On today's keyboards that's not a problem -- So, are you still
using a nineteenth-century keyboard layout? If so, ponder your reasons
for not changing to something more modern, and see if those reasons
shed any light on why people still put up with the Standard Warts And
All Library.

The same topic came up here back in 2002. Here's a new copy of
what I posted back then:

Have you used a mechanical typewriter? I have. <snip>

Yes, we've still got one! A 1932 manufactured Remington. Used it a lot
during the 90s. It's still in excellent condition except that I'm
unable to acquire a ribbon anywhere.

I've got an Underwood No. 5 - from 1931. I use it for addressing envelopes,
as it would probably take a week of trial and error (and a wastepaper bin
full of trashed envelopes) to do it on my laser printer.
 
L

lawrence.jones

Ben Pfaff said:
To conclude: mechanical QWERTY typewriters are at the same time
an example of optimization for the common case and inherently
flawed because of the remaining race condition. This is a great
example of a tradeoff that you should not make when you design a
program!

I would assert that it's a great example of the kind of tradeoffs you
frequently *have* to make when designing programs! :)
 
J

Jorgen Grahn

Correct, but I would not read those huge lines, because the '\n' is
not the logical divider.

True; you use fgets() if you can expect lines to be reasonably short.
Even then you have to handle the case where one extra long doesn't fit
into your buffer.

For the original complaint, I can see many scenarios where you:

fgets()
parse the line (in whatever way your application needs)
during the parsing, discover the lack of an \n
try to fgets() again and restart the parsing
I however want a nice routine (which I have to code myself), which
uses realloc, to adjust a buffer to fit everything until the '\n'.
C standard lib does not have anything like this - so I have to code it
myself.

I bet C++ has something useful that one could use. It seems that many
went into C++, to make it the huge bloated monster that it is! But
still seems worth a look, to relieve me from having to handle this
stuff at the basic level.

This is the second time you've made this threat. You can't do *both*
that and insulting the C++ programmers here with things like "huge
bloated monster".

Anyway, I can't see that with your extreme attention to micro-
inefficiencies, you would be comfortable with C++ -- it solves these
things with dynamic memory allocations. Do you really prefer that to
a strlen() in an already warm data cache?

(And do you really believe a strlen() is significant compared to the
I/O that preceded it?)

/Jorgen
 
J

jononanon

***How can one EFFICIENTLY determine if the last character is '\n'??



That's relatively easy - so long as you don't need to know where the

'\n' is.






The following approach uses strchr() rather than strlen(), so it

technically meets your specification. However, I presume you would have

the same objections to strchr() as you do to strlen(). I'd like to point

out, however, that it uses strchr() only once per file, which seems

efficient enough for me. If you're doing so little processing per file

that a single call to strchr() per file adds significantly to the total

processing load, I'd be more worried about the costs associated with

fopen() and fclose() than those associated with strchr().



The key point is that a successful call to fgets() can fail to read in

an '\n' character only if fgets() meets the end of the input file, or

the end of your buffer, both of which can be checked for quite

efficiently. If it reaches the end of your buffer, there's one and only

one place where the '\n' character can be, if one was read in.

Therefore, it's only at the end of the file that a search is required.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])

char buf[6];
FILE *fp = stdin;



buf[(sizeof buf)-1] = 1; // any non-zero value will do.


while (fgets(buf, sizeof(buf), fp)) {

const char *prefix =

(buf[(sizeof buf)-1] == '\0' && buf[(sizeof buf)-2] != '\n'

|| feof(fp) && !strchr(buf, '\n')) ? "no " : "";



printf("Got a line which ends with %snewline: %s\n",

prefix, buf);



buf[(sizeof buf)-1] = 1;
}


return EXIT_SUCCESS;


Please note that this can be optimized a bit (and improved in understandability) like this:

const char *prefix =
(buf[(sizeof buf)-1] == '\0'
? buf[(sizeof buf)-2] != '\n'
: feof(fp) && !strchr(buf, '\n'))
? "no "
: "";


Or like this (here I've switched "no " and ""):
const char *prefix =
(buf[(sizeof buf)-1] == '\0'
? buf[(sizeof buf)-2] == '\n'
: !feof(fp) || strchr(buf, '\n'))
? ""
: "no ";



Note that pure boolean logic like this
(a && b) || (!a && c)
can be optimized in C to this:
a ? b : c


Ahh... the beauty and confusion of Boolean Logic!
 
J

jononanon

That's relatively easy - so long as you don't need to know where the

'\n' is.







The following approach uses strchr() rather than strlen(), so it

technically meets your specification. However, I presume you would have

the same objections to strchr() as you do to strlen(). I'd like to point

out, however, that it uses strchr() only once per file, which seems

efficient enough for me.
In fact, strchr() is called AT MOST once per file. It can happen that it isnot called at all! This is when fgets(buf, sizeof(buf), fp) returns as its_last_ non-NULL pointer (pointing at buf), a string for which the following holds: strlen(buf) + 1 == sizeof(buf)

The reason: this last non-NULL return from fgets, does not yet set EOF (since buf could be filled completely). The next call from fgets will return NULL, and set EOF; but then we don't even go into the while loop, so in this case we would call strchr zero times... since on the last iteration throughthe while-loop, we'd get feof to be 0 (false) if it gets called at all.


[PS: Relating to my previous post - Note that when using my code suggestion, we NEVER call feof on an iteration for which buf is filled to its entirety]
If you're doing so little processing per file

that a single call to strchr() per file adds significantly to the total

processing load, I'd be more worried about the costs associated with

fopen() and fclose() than those associated with strchr().



The key point is that a successful call to fgets() can fail to read in

an '\n' character only if fgets() meets the end of the input file, or

the end of your buffer, both of which can be checked for quite

efficiently. If it reaches the end of your buffer, there's one and only

one place where the '\n' character can be, if one was read in.

Therefore, it's only at the end of the file that a search is required.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])

char buf[6];
FILE *fp = stdin;



buf[(sizeof buf)-1] = 1; // any non-zero value will do.


while (fgets(buf, sizeof(buf), fp)) {
/*
const char *prefix =

(buf[(sizeof buf)-1] == '\0' && buf[(sizeof buf)-2] != '\n'

|| feof(fp) && !strchr(buf, '\n')) ? "no " : "";
*/
// improvement (see prev. post)
const char *prefix =
(buf[(sizeof buf)-1] == '\0'
? buf[(sizeof buf)-2] != '\n'
: feof(fp) && !strchr(buf, '\n'))
? "no "
: "";
printf("Got a line which ends with %snewline: %s\n",

prefix, buf);



buf[(sizeof buf)-1] = 1;
}


return EXIT_SUCCESS;
 
M

Malcolm McLean

Yes, we've still got one! A 1932 manufactured Remington. Used it a lot
during the 90s. It's still in excellent condition except that I'm
unable to acquire a ribbon anywhere.
My great aunt was a secretary with the admiraltry during the war. I inherited
her typewriter, which I used to type out my first book (a maths textbook
for 11 year olds). Eventually it gummed up. The man in the typewriter repair
shop refused to take any money for fixing it.
My parents eventually chucked it out, together with the manuscript of the book.
The irony is that my father is a terrible hoarder, and all the cellars and
sheds are stuffed with junk.
 
J

jononanon

Please note that this can be optimized a bit (and improved in understandability) like this:



const char *prefix =

(buf[(sizeof buf)-1] == '\0'

? buf[(sizeof buf)-2] != '\n'

: feof(fp) && !strchr(buf, '\n'))

? "no "

: "";

Ah what the heck. Actually neither strlen() nor strchr() is needed.
We can always do this in constant time, irrespective of the length of buf.

Here's how:

const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""

The reason is this:
fgets will NEVER cause EOF to be set, if a line contains '\n'.
Even if the last character of a file is '\n', then the last non-null-returning call of fgets will deliver this '\n', but not yet set EOF.

fgets only ever causes EOF to be set in one of these conditions:
1) when returning NULL, because of having encountered EOF. Here buf delivers no string and is unchanged. {{Just take care that it can also return NULLon a read error, so NULL is not unique to EOF.}}
2) when returning non-NULL, AND delivering a string in buf, that does not fully fill buf (the element buf[sizeof(buf)-1] being untouched by fgets: '\0' being set before the end of buf), AND if the string in buf does NOT contain a '\n'.

Case 2) is only when reading a file that does NOT end in '\n'; and where the final character... when eventually read by fgets, is part of the string that does not fully fill buf (i.e. where the terminating '\0' is not at buf[sizeof(buf)-1]).

Additional: if the string fully fills buf ->
If we read a file that does NOT end in '\n'; and the final character... when eventually read by fgets, results in a string that fully fills buf (with buf[sizeof(buf)-1] being '\0'), then that call of fgets, returns non-NULL and does not yet set EOF. Here only the next call of fgets will cause if to return NULL and simultaneously set EOF.


Which all just goes to show: stdio.h and its functions are not as simple asthey seem.


Nevertheless: fgets is still deficient if you want to QUICKLY determine thelength of the delivered string (or get a pointer to its end).
 
B

Barry Schwarz

Please note that this can be optimized a bit (and improved in understandability) like this:
const char *prefix =
(buf[(sizeof buf)-1] == '\0'
? buf[(sizeof buf)-2] != '\n'
: feof(fp) && !strchr(buf, '\n'))
? "no "
: "";

Ah what the heck. Actually neither strlen() nor strchr() is needed.
We can always do this in constant time, irrespective of the length of buf.

Here's how:

const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""

It would work a whole lot better if it compiled without error. On the
other hand, even after adding the missing semicolon, the algorithm
fails in those cases where fgets reads a string significantly shorter
than buf.
 
J

James Kuyper

....
const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""

It would work a whole lot better if it compiled without error. On the
other hand, even after adding the missing semicolon, the algorithm
fails in those cases where fgets reads a string significantly shorter
than buf.

If it reads a string significantly shorter than buf, then
buf[sizeof(buf)-1] == '\0' will be false, and string pointed at by
prefix will depend upon the value of feof(fp). If feof(fp), then the
input stream ended without a terminating newline, and prefix will be "no
"; otherwise, the line ended with a newline, and prefix will be "".
Except for the missing semi-colon, that looks right to me. How does your
analysis differ from mine?
 
B

Barry Schwarz

...
const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""

It would work a whole lot better if it compiled without error. On the
other hand, even after adding the missing semicolon, the algorithm
fails in those cases where fgets reads a string significantly shorter
than buf.

If it reads a string significantly shorter than buf, then
buf[sizeof(buf)-1] == '\0' will be false, and string pointed at by
prefix will depend upon the value of feof(fp). If feof(fp), then the
input stream ended without a terminating newline, and prefix will be "no
"; otherwise, the line ended with a newline, and prefix will be "".
Except for the missing semi-colon, that looks right to me. How does your
analysis differ from mine?

If the input is significantly shorter than the buffer, the contents of
the last two elements are residual from a previous read.
 
I

Ike Naar

...
const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""

It would work a whole lot better if it compiled without error. On the
other hand, even after adding the missing semicolon, the algorithm
fails in those cases where fgets reads a string significantly shorter
than buf.

If it reads a string significantly shorter than buf, then
buf[sizeof(buf)-1] == '\0' will be false, and string pointed at by
prefix will depend upon the value of feof(fp). If feof(fp), then the
input stream ended without a terminating newline, and prefix will be "no
"; otherwise, the line ended with a newline, and prefix will be "".
Except for the missing semi-colon, that looks right to me. How does your
analysis differ from mine?

If the input is significantly shorter than the buffer, the contents of
the last two elements are residual from a previous read.

You've probably missed that the call to fgets() was preceeded by

buf[(sizeof buf)-1] = 1; // any non-zero value will do.
 
B

Ben Bacarisse

Ike Naar said:
const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""
You've probably missed that the call to fgets() was preceeded by

buf[(sizeof buf)-1] = 1; // any non-zero value will do.

All of this is complicated by the fact that the buffer contents are
indeterminate if a read error occurs. If you want absolute portable
certainty you must test for ferror before looking in the buffer. The
original post is old enough that I don't have easy access to the whole
thread so this may have already been covered. Sorry if it has.
 
T

Tim Rentsch

Ben Bacarisse said:
Ike Naar said:
const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""
You've probably missed that the call to fgets() was preceeded by

buf[(sizeof buf)-1] = 1; // any non-zero value will do.

All of this is complicated by the fact that the buffer contents are
indeterminate if a read error occurs. If you want absolute portable
certainty you must test for ferror before looking in the buffer.
The original post is old enough that I don't have easy access to the
whole thread so this may have already been covered. Sorry if it
has.

If a read error occurred then fgets() returned null. The test is
done only if the value returned by fgets() is non-null.
 
J

James Kuyper

Ike Naar said:
const char *prefix =
(buf[sizeof(buf)-1] == '\0'
? buf[sizeof(buf)-2] != '\n'
: feof(fp)
) ? "no " : ""
You've probably missed that the call to fgets() was preceeded by

buf[(sizeof buf)-1] = 1; // any non-zero value will do.

All of this is complicated by the fact that the buffer contents are
indeterminate if a read error occurs. If you want absolute portable
certainty you must test for ferror before looking in the buffer. The
original post is old enough that I don't have easy access to the whole
thread so this may have already been covered. Sorry if it has.

The code above was intended as a replacement for code that occurred
inside the body of a loop that I wrote, which starts:

while(fgets(buf, sizeof(buf), fp))
{
...
}

Thus, the body is never entered if there is a read error.
My original code was not intended to be complete; had it been, the
while() loop would have been followed by

if(ferror(fp))
{
// Error Handling
}
else
{
// End of File handling
}

Also, I would ordinarily have written the while condition as
while(fgets(buf, sizeof buf, fp) == buf)

My reasons for preferring that form are surprisingly controversial.
 
J

jononanon

Also, I would ordinarily have written the while condition as

while(fgets(buf, sizeof buf, fp) == buf)

Why?

My reasons for preferring that form are surprisingly controversial.

What reasons?
(Note the endless loop when buf = NULL!)
 
E

Eric Sosman

What reasons?
(Note the endless loop when buf = NULL!)

s/endless loop/undefined behavior/, as per 7.1.4p1. Also,
James Kuyper is not the sort of damp-eared newbie who writes
`sizeof pointer_variable' to get the size of the pointed-at thing.
 
J

James Kuyper

What reasons?

For many functions with return values, the set of values that can be
represented in the return type can be partitioned into three subsets:
A) Values indicating successful operation
B) Values indicating some kind of a problem
C) Values that should not be returned.

If you assume that nothing can go wrong, checking whether the return
value is in set A is the same as checking whether the return value is
not in set B. People routinely write code based upon that assumption,
testing whichever condition is easier to express or evaluate.
I don't like making such assumptions. Among other possibilities:

1. I typed the wrong function name.
2. I remembered incorrectly which return values are in each of the three
categories.
3. The code got linked to a different library than it should have been,
containing a different function with the same name.
4. Some other part of the code contains a defect rendering the behavior
of the code undefined, and the first testable symptom of that fact is
that this particular function returns a value it's not supposed to be
able to return.
5. Other.

Therefore, I prefer to write my code so that values in set C get treated
the same way as values in set B. The problem I'm trying to deal with is
very rare, so I don't bother doing this if doing so would make the code
significantly more complicated. However, adding "== buf" falls below my
threshold for "significantly more complicated".

It was argued, when I explained this preference before, that if anything
was going so badly wrong that a standard library function was returning
a value is should not have been able to return, there was nothing useful
I could do about it - the problem could be arbitrarily bad, rendering
anything I tried to do about it pointless.

While it is technically true that the any one of the problems I
mentioned above could have arbitrarily bad consequences, that's not the
right way to bet. That argument reflects an all-or-nothing attitude that
is out of sync with reality. In my experience, most such defects, if
they don't cause your program to fail immediately, will allow it to keep
running, at least for a little while, with something close to what would
have been the standard-defined behavior, if it hadn't been for the
defect. During that window of opportunity, it's possible to detect the
invalid return value and handle it as an error condition, making it
easier to investigate the actual cause of the problem.

It was also argued that my preferred approach produces code that is
obscure and counter-intuitive. To my mind if(func()==expected_value) is
just as clear as if(func()!=error_value). However, to an extent that
reflects the fact that I'm very familiar with what the expected result
is. If the expected result is not very familiar to a given programmer,
because it's normally ignored, then my approach might seem very obscure.
(Note the endless loop when buf = NULL!)

buf was an array, not a pointer. Undefined behavior could, in principle,
cause buf==NULL to be true, but it's far harder to come up with any
reasonable low-cost alternative to cover that extremely unlikely
possibility.
 
J

James Kuyper

On 11/14/2013 01:56 PM, James Kuyper wrote:
....
I don't like making such assumptions. Among other possibilities:

1. I typed the wrong function name.
2. I remembered incorrectly which return values are in each of the three
categories.
3. The code got linked to a different library than it should have been,
containing a different function with the same name.
4. Some other part of the code contains a defect rendering the behavior
of the code undefined, and the first testable symptom of that fact is
that this particular function returns a value it's not supposed to be
able to return.

I was working so hard to cover more plausible possibilities that I
forgot to include the simplest possibility: that the function is
implemented incorrectly, so that it can in fact return a value it's not
supposed to be able to return.
For standard library functions, that's rather implausible - but it's
still not impossible. It's more plausible if the defect comes up only in
obscure situations that might not have been properly tested.
 
M

Malcolm McLean

On 11/14/2013 12:47 PM, (e-mail address removed) wrote:

For many functions with return values, the set of values that can be
represented in the return type can be partitioned into three subsets:

A) Values indicating successful operation
B) Values indicating some kind of a problem
C) Values that should not be returned.

If you assume that nothing can go wrong, checking whether the return
value is in set A is the same as checking whether the return value is
not in set B. People routinely write code based upon that assumption,
testing whichever condition is easier to express or evaluate.

I don't like making such assumptions. Among other possibilities:
But in this case the NULL return from fgets() isn't an error return, it's
expected as part of normal control flow on correct input.

The other issue is that if(fgets(buff, 1024, fp) == buff) implies that it's
somehow important that fgets() return a pointer to a memory buffer. In fact
this is a frozen design glitch, it should be returning 0 on success or EOF
on failure.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,123
Messages
2,570,735
Members
47,289
Latest member
KathrynSta

Latest Threads

Top