Noob said:
Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.
I went back to the drawing board, with all your comments and suggestions
in mind.
As far as I can tell, there are 5 "situations" to deal with:
1) non-empty line
2) empty line
3) end of stream
4) stream error
5) out of memory
For "typical" text files, getline will deal mostly with 1 and 2, and
one necessary 3 at the end of the stream. 4 and 5 are exceptional
error conditions.
With the aim of keeping the common case simple, and given that I've
stuck with a pointer return value, the simplest strategy seems to be
to return
- a valid pointer for 1 and 2
- NULL for 3, 4, 5
and let the user tell 3, 4, 5 apart using
feof for 3, ferror for 4, otherwise 5
So here's the "formal" description:
char *mygetline(FILE *stream)
mygetline dynamically allocates enough space (using malloc and friends) to
store the next complete line (a valid NUL-terminated string) from 'stream'.
The string must be free'd by the user when it is no longer needed.
mygetline may return NULL
1) when it has reached the end of the stream
2) when there is an error reading from the stream
3) when malloc fails
The user may use feof and ferror to distinguish between these cases.
Here's the code (valid C89 according to gcc):
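#include <stdlib.h>
#include <stdio.h>

/* realloc wrapper that frees the original block on failure */
static char *wrap_realloc(char *s, size_t len)
{
    char *temp = realloc(s, len);
    if (temp == NULL) free(s);
    return temp;
}

char *mygetline(FILE *stream)
{
    char *s = NULL;
    size_t len = 0;                 /* characters stored so far */
    while ( 1 )
    {
        /* start with 500 bytes, then double on every pass */
        size_t max = (len == 0) ? 500 : len*2;
        s = wrap_realloc(s, max);
        if (s == NULL) return NULL;
        while (len < max)
        {
            int c = getc(stream);
            if (c == EOF && len == 0)
            {
                /* end of stream or read error before any character */
                free(s);
                return NULL;
            }
            if (c == EOF || c == '\n')
            {
                s[len] = '\0';
                return wrap_realloc(s, len+1);
            }
            s[len++] = c;
        }
    }
}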
P.S. Eric, I did note your remark that len*2 may wrap-around, I'm just
not sure what to do in this situation...
Again, suggestions and criticism are welcome.
The problem with writing getline functions is that there are a wide
variety of semantics that people desire in given scenarios.
1. Do you read into a fixed buffer (for character arrays in
structured binary files), or attempt to grow a buffer (reading from
stdin or lines of text from an arbitrary file)?
2. Do you strip newline characters out, or leave them in?
3. If reading into a fixed buffer, what do you do when the string
terminator is not found within the expected length?
3a. Do you terminate the buffer or leave the buffer as is?
3b. How do you inform the user that the buffer contains a string
fragment (an unterminated string)? Is it an error or allowed?
3c. Do you flush any remaining characters in the stream until you hit
the delimiter or EOF?
3d. If the stream is seekable, do you reset the file pointer to its
original location if the string read is unterminated? (I've had to do
this to write an algorithm to recover records in files that had
corrupted sections from a hard drive media failure).
4. Do you pass in an allocated buffer and its size to reuse a single
buffer allocation for all line reads, or does every line get its own
allocation?
4a. If reading into a growing buffer, what kind of allocation
strategy do you use (doubling the size, fixed increments)?
4b. Do you resize the buffer down to the length of the string at the
end?
4c. Do you impose an arbitrary maximum limit to guard against
resource exhaustion in errant or dubious input, or to prevent some
type overflow condition?
5. How important is it to identify the various errors (resource
exhaustion, running out of disk space, other stream errors), and how
do you distinguish error scenarios from the end-of-file condition?
5a. How are errors communicated? Is it by return value, global state
like errno, or another parameter of the function?
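For 4a and 4c, and for the len*2 wrap-around worry in your P.S., here is a minimal sketch of one possible approach. The name next_capacity, its 'limit' parameter and the starting size of 64 are all made up for illustration. The idea is to cap the buffer at a caller-chosen limit and compare against limit/2 before doubling, so the multiplication can never wrap.

```c
#include <stddef.h>

/* Hypothetical helper: returns the next buffer capacity to request,
   or 0 if the buffer is already at 'limit' and may not grow further. */
static size_t next_capacity(size_t cur, size_t limit)
{
    if (cur == 0)                    /* first allocation */
        return limit < 64 ? limit : 64;
    if (cur >= limit)                /* already at the cap */
        return 0;
    if (cur > limit / 2)             /* doubling would pass the cap (or wrap) */
        return limit;
    return cur * 2;                  /* safe: cur * 2 <= limit */
}
```

A return value of 0 would then be handled like any other "cannot grow" condition, for example by freeing the buffer and returning NULL.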
As one can see, there are a lot of choices to make when designing a
'getline' function. For your situation, I'm particularly fond of the
semantics of the POSIX version of 'getline':
ssize_t getline(char **lineptr, size_t *n, FILE *stream);
ssize_t getdelim(char **lineptr, size_t *n, int delim, FILE *stream);
Note that getline is implemented as getdelim where [delim = '\n'].
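In other words, something along these lines. This is a sketch only: my_getline is a made-up name chosen so it doesn't clash with the real function, and actual C libraries may implement it differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical wrapper: a line is simply a record delimited by '\n'. */
static ssize_t my_getline(char **lineptr, size_t *n, FILE *stream)
{
    return getdelim(lineptr, n, '\n', stream);
}
```

Passing a different delimiter to getdelim (say ',' or '\0') gives the same buffer-growing behavior for other record formats.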
The problem that I have with your version of getline is that a new
line buffer is allocated for each line read. This may be what you
want now for your current situation, but maybe not for something
else. The POSIX semantics allow you to pass in your own allocated
buffer via 'lineptr' and 'n'; the function internally uses 'realloc'
if the number of characters read for the current line exceeds the
allocated size (*n) of your buffer.
The POSIX semantics include the newline character in the output, and
return the number of characters read, including the newline but
excluding the terminating nul character. The 'n' parameter maintains
the allocated buffer size. Here's a slightly modified example from the
man page.
\code
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    size_t buf_len = 0;
    ssize_t read;

    fp = fopen("your_filename_here", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);

    /* getline grows 'line' as needed and keeps buf_len up to date */
    while ((read = getline(&line, &buf_len, fp)) != -1) {
        printf("Retrieved line of length %zd :\n", read);
        printf("%s", line);
    }

    free(line);
    fclose(fp);
    exit(EXIT_SUCCESS);
}
\endcode
Note that only one free at the end is required if you don't need to
store all the lines in a container. Empty lines are represented as
lines with length 1 that consist of '\n'.
If you want to strip newline characters (the last line of a file may
not end in one, hence the check):
    if (read > 0 && line[read - 1] == '\n') line[read - 1] = '\0';
If you want each line to have its own allocated buffer (strdup is
declared in <string.h>)...
    char* new_line = strdup( line );
If you don't want newlines in your final line output, you do the above
two lines in order and you'll have an allocated line without the
trailing newline for every line in your file that you can store in
your choice of container.
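Put together, the two steps might look like the sketch below. The helper name next_line_copy is mine, not part of any library, and it assumes the lines contain no embedded nul characters, since strdup stops at the first one.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: reads the next line into the reusable buffer
   (*line, *buf_len), strips a trailing newline if present, and returns
   a freshly allocated copy that the caller must free.
   Returns NULL on end of file, read error, or allocation failure. */
static char *next_line_copy(char **line, size_t *buf_len, FILE *fp)
{
    ssize_t read = getline(line, buf_len, fp);

    if (read == -1)
        return NULL;
    if (read > 0 && (*line)[read - 1] == '\n')
        (*line)[read - 1] = '\0';
    return strdup(*line);
}
```

The reusable buffer still needs a single free once the loop is done; each returned copy is owned by whatever container you store it in.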
I personally have two classes of 'getline' functions. I use the
naming convention of 'getline' to represent the scenario where one
wants to grow a buffer to read a line, and 'readline' for semantics
that look to read a line (or C string from a file) into a fixed size
buffer, as both are useful in given contexts.
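To give an idea of the 'readline' flavor, here is a rough sketch built on fgets. The name readline_fixed and the RL_* codes are placeholders of mine. The answers it picks for 3a-3c (always terminate the buffer, report truncation through the return value, discard the rest of an overlong line) are just one possible set of choices, and it assumes a buffer of at least 2 bytes.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes for this sketch. */
enum { RL_OK = 0, RL_TRUNCATED = 1, RL_END = -1 };

/* Reads one line into buf (size bytes), without the newline.
   RL_END means end of stream, read error, or a buffer smaller than
   2 bytes; the caller can use feof/ferror to tell the first two apart. */
static int readline_fixed(char *buf, size_t size, FILE *fp)
{
    size_t len;
    int c;
    int n = size > INT_MAX ? INT_MAX : (int)size;

    if (size < 2 || fgets(buf, n, fp) == NULL)
        return RL_END;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';          /* whole line fit, strip newline */
        return RL_OK;
    }

    /* No newline stored: either the buffer filled up or the stream ended. */
    c = getc(fp);
    if (c == EOF || c == '\n')
        return RL_OK;                 /* nothing was actually cut off */

    do {                              /* discard the rest of the line */
        c = getc(fp);
    } while (c != EOF && c != '\n');
    return RL_TRUNCATED;
}
```

A caller that treats truncation as an error can simply check for RL_TRUNCATED; one that needs the 3d behavior would instead record ftell before the call and fseek back.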
Best regards,
John D.