Variable-sized lines of text in linked list

R

Richard Heathfield

CBFalconer said:

However you carefully snipped the portion where I pointed out that
this is no different than using the malloc family in any program.

I carefully snipped out the portion where you mentioned malloc because I
didn't see how it was relevant to what I was saying. I *always* remove (or
at least I always try to remember to remove) text from my reply if it is
irrelevant to the point I am making about the article to which it relates.

It is certainly true that, every time you call malloc, you are responsible
for freeing the memory thus allocated. It is for *precisely* this reason
that it's a good idea to reduce the number of calls to malloc to the bare
minimum. The ggets function fails to do this.
 
C

CBFalconer

Bartc said:
.... snip ...

Reading an entire file into memory is an admirable idea, but is
frowned on in this newsgroup for various reasons: the size of a
file is impossible to determine, or could change while you're
reading the file, or because it might be a 100GB monster you
therefore should never attempt this even for tiny files.

That doesn't mean you shouldn't do it; just keep quiet about it
here :)

Disagree. For some operations, it is virtually necessary. For
example the freverse exercise used as a test for ggets, which
reverses the content of an entire text file, byte by byte.

At the same time bear in mind that many, if not most, general
systems today supply virtual memory. That means that malloc won't
run out of space until the disk does, to all practical purposes.
 
J

John Bode

I am trying to read a text file into memory without any knowledge of
how long each line will be. I am looking to store each line in a
linked list structure, however, I am unsure of how to dynamically
allocate space for each line.

This is how I set up the linked list...

typedef struct node {
char *line;
struct node *next;

} linkedlist;

linkedlist* createlinkedlist(void) {
linkedlist* head;
head = (linkedlist *)malloc(sizeof(linkedlist));

head = malloc(sizeof *head);

You don't need to cast the result of malloc() in C, and in fact doing
so may mask a type mismatch error if you forget to #include stdlib.h
or otherwise don't have a prototype for malloc() in scope.

You also don't have to worry about syncing up typenames everywhere.

Finally, *always* check the result of malloc().

if (!head)
/* handle out-of-memory condition */
head->line = NULL;
head->next = NULL;
return head;

}

void addnode(linkedlist* list, char *line) {
linkedlist* freespot;
linkedlist* newnode;
freespot = list;
while (freespot->next != NULL)
freespot = freespot->next;
newnode = (linkedlist *)malloc(sizeof(linkedlist));

newnode = malloc(sizeof *newnode);

Same comment as above.
newnode->line = line;

If it were me, I'd create a new copy of the line, rather than just
storing the pointer; if the pointer you passed in later gets free()'d
or changed, then you've lost your data.

newnode->line = malloc(strlen(line) + 1);
if (newnode->line)
strcpy(newnode->line, line);
else
/* handle out-of-memory condition */
newnode->next = NULL;
freespot->next = newnode;

}

So with this in place, how can I read in variable length lines,
malloc() the proper storage for each and pass the pointer to
addnode()?

Here's how I've done it in the past (untested, probably some holes):

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define BUFSIZE ... /* Big enough to handle most lines of text,
but not so big as to be a waste of space.
Exact value will depend on what you're reading.
*/

typedef enum {
NEXTLINE_OK,
NEXTLINE_MEM, /* Could not allocate or extend buffer */
NEXTLINE_ERR, /* Read error on input */
NEXTLINE_EOF /* Hit end of file */
} NextlineStatus;

NextlineStatus GetNextLine(FILE *stream, char **line, size_t *length)
{
char inputBuffer[BUFSIZE];
int done = 0;
NextLineStatus status;

*length = 0;

/**
* Read the next BUFSIZE-1 characters from the input
* stream.
*/
while (!done && fgets(inputBuffer, sizeof inputBuffer, stream))
{
/**
* Attempt to extend the buffer by the number of
* characters read plus 1.
*
* The first time through, this just allocates
* a new buffer.
*
* We use tmp because it's not guaranteed that
* realloc will succeed, and we don't want
* to overwrite line in that case.
*/
char *tmp = realloc(*line, *length + strlen(inputBuffer) + 1);
if (!tmp)
{
/**
* realloc failed. For now, leave line and length alone,
* and just return the out of memory status.
*/
status = NEXTLINE_MEM;
done = 1;
}
else
{
*line = tmp;
*length += strlen(inputBuffer) + 1;

/**
* Look for the newline character in the
* input buffer. If it's present,
* then we've read the whole line;
* replace the newline with a null
* terminator and set the done
* flag.
*/
if (strchr(inputBuffer, '\n'))
{
done = 1;
status = NEXTLINE_OK;
*strchr(inputBuffer, '\n') = 0;
}

/**
* Append the buffer to the line
*/
strcat(*line, inputBuffer);
}
}

/**
* If fgets() fails before the done flag is set, then we've
* either hit an end-of-file condition or an error.
*/
if (!done)
{
if (feof(stream))
{
status = NEXTLINE_EOF;
}
else
{
status = NEXTLINE_ERR;
}
}

return status;
}

int main(void)
{
linkedlist *myList = createlinkedlist();
NextLineStatus status = NEXTLINE_OK;
FILE *inFile;
char *nextLine = NULL;
size_t lineLength = 0;
...
while (status == NEXTLINE_OK)
{
char *nextLine;
size_t lineLength;
status = GetNextLine(inFile, &nextLine, &lineLength);
if (status == NEXTLINE_OK)
{
addnode(myList, nextLine);
free(nextLine);
lineLength = 0;
}
}
...
}
 
B

Ben Pfaff

CBFalconer said:
Disagree. For some operations, it is virtually necessary. For
example the freverse exercise used as a test for ggets, which
reverses the content of an entire text file, byte by byte.

If reversing a file bigger than memory is a likely application,
then I would use a different approach, for example reading and
reversing as large a block of data as will fit in memory at once
and writing it to a file, repeating this with additional files
while data is still arriving, then outputting the files in
reverse order of their creation when the end of the input is
reached.
At the same time bear in mind that many, if not most, general
systems today supply virtual memory. That means that malloc won't
run out of space until the disk does, to all practical purposes.

My PC here has 160 GB of free disk space, but only 3 GB of
virtual address space is available to any given application at a
time, so that is not remotely true.
 
R

Richard Tobin

At the same time bear in mind that many, if not most, general
systems today supply virtual memory. That means that malloc won't
run out of space until the disk does, to all practical purposes.
[/QUOTE]
My PC here has 160 GB of free disk space, but only 3 GB of
virtual address space is available to any given application at a
time, so that is not remotely true.

And virtual memory is not necessarily all-you-can-eat. Many systems
have pre-allocated swap space.

-- Richard
 
C

CBFalconer

Richard said:
CBFalconer said:



I carefully snipped out the portion where you mentioned malloc because I
didn't see how it was relevant to what I was saying. I *always* remove (or
at least I always try to remember to remove) text from my reply if it is
irrelevant to the point I am making about the article to which it relates.

It is certainly true that, every time you call malloc, you are responsible
for freeing the memory thus allocated. It is for *precisely* this reason
that it's a good idea to reduce the number of calls to malloc to the bare
minimum. The ggets function fails to do this.

The ggets documentation specifically states that memory is
allocated, and that the ggets caller is responsible for freeing
it. Just as for malloc.
 
J

Joe Wright

Richard said:
CBFalconer said:



I carefully snipped out the portion where you mentioned malloc because I
didn't see how it was relevant to what I was saying. I *always* remove (or
at least I always try to remember to remove) text from my reply if it is
irrelevant to the point I am making about the article to which it relates.

It is certainly true that, every time you call malloc, you are responsible
for freeing the memory thus allocated. It is for *precisely* this reason
that it's a good idea to reduce the number of calls to malloc to the bare
minimum. The ggets function fails to do this.
Now that I have you and Chuck together, does ggets or getline treat nul
characters in the input stream? I would declare the nul illegal in a
text stream and be done with it. What do you two do about it?
 
C

CBFalconer

Joe said:
Now that I have you and Chuck together, does ggets or getline treat nul
characters in the input stream? I would declare the nul illegal in a
text stream and be done with it. What do you two do about it?

In ggets/fggets, nothing special. That means the nul will go in
the string, and effectively means "ignore everything following up
to the line ending '\n'" However, it is not a printing char, and I
have no idea what the i/o system will do with it.

At least I think that is the right answer. You have the complete
source available, and can do whatever you wish with it. For
instance, a "if (!ch) continue;" in the right place would ignore
it.
 
C

CBFalconer

Ben said:
.... snip ...


My PC here has 160 GB of free disk space, but only 3 GB of
virtual address space is available to any given application at a
time, so that is not remotely true.

Yes, I went overboard there.
 
B

Bill Reid

Bartc said:
Reading an entire file into memory is an admirable idea, but is frowned on
in this newsgroup for various reasons: the size of a file is impossible to
determine, or could change while you're reading the file, or because it
might be a 100GB monster you therefore should never attempt this even for
tiny files.

Yes, and even if that wasn't true, it would be frowned upon by several
frequent posters here because the Sun is going to burn out in a few
billion years so why even bother and we're all gonna die no matter what
and they cancel all the good TV shows and my butt itches etc. etc.
etc. etc. ad nauseum...

I like the way they always find a way NOT to do something, or anything
really, in public, and for some even under what I presume are their actual
names, and for those that actually have jobs, sometimes posting from their
places of "employment"! Now THAT'S a dedicated and unabashed "union
man"! Not a speck of usable code for four years, but they dare you to
fire them!

The truth is, I try to do as much file processing right in the file streams,
not for ANY of the above reasons, but that since I open several thousands
of files every day (programmatically), I save measurable amounts of time
by not reading them into memory (and some memory too!).
That doesn't mean you shouldn't do it; just keep quiet about it here :)

Yeah, I'm used to it, I can take it...I just keep finding practical
programming solutions to the "problems" the frequent posters
say are "impossible", but then, since I'm not a programmer, I'm
not a member of the "union"!
 
R

Richard Heathfield

CBFalconer said:

The ggets documentation specifically states that memory is
allocated, and that the ggets caller is responsible for freeing
it. Just as for malloc.

It's very nice of you to document the problem in a text file included in
the distribution zip file (which means people don't get to find out about
this responsibility until *after* they've downloaded the package), but
that's not the point I'm making - which I am now beginning to despair of
explaining to you. I've tried words of one syllable. Perhaps words of no
syllables would do the trick? Here goes:
 
R

Richard Heathfield

Joe Wright said:

I would declare the nul illegal in a
text stream and be done with it. What do you two do about it?

Why do anything about it? The C Standard doesn't forbid null characters
from appearing in a text stream, so I see no reason to forbid it or even
especially to test for it.
 
C

CBFalconer

Richard said:
CBFalconer said:




It's very nice of you to document the problem in a text file
included in the distribution zip file (which means people don't
get to find out about this responsibility until *after* they've
downloaded the package), but that's not the point I'm making -
which I am now beginning to despair of explaining to you. I've
tried words of one syllable. Perhaps words of no syllables would
do the trick? Here goes:

Here is an extract from ggets.h. Why do you feel things are not
covered?

/* fggets and ggets [which is fggets(ln, stdin)] set *ln to
a buffer filled with the next complete line from the text
stream f. The storage has been allocated within fggets,
and is normally reduced to be an exact fit. The trailing
\n has been removed, so the resultant line is ready for
dumping with puts. The buffer will be as large as is
required to hold the complete line.

Note: this means a final file line without a \n terminator
has an effective \n appended, as EOF occurs within the read.

If no error occurs fggets returns 0. If an EOF occurs on
the input file, EOF is returned. For memory allocation
errors some positive value is returned. In this case *ln
may point to a partial line. For other errors memory is
freed and *ln is set to NULL.

Freeing of assigned storage is the callers responsibility
*/
 
R

Richard Heathfield

CBFalconer said:
Here is an extract from ggets.h. Why do you feel things are not
covered?

Okay, here are the issues as I see them:

1) the function doesn't allow the user to minimise memory management issues
by re-using existing buffers.
2) the function doesn't allow the user to resist denial-of-memory attacks.
3) the function doesn't allow the user to specify whether the buffer should
be resized to fit.
4) the function documentation doesn't mention any of these problems.
5) in Usenet articles where you recommend ggets, you don't mention any of
these problems.
 
R

Richard Tobin

I would declare the nul illegal in a
text stream and be done with it. What do you two do about it?
[/QUOTE]
Why do anything about it? The C Standard doesn't forbid null characters
from appearing in a text stream, so I see no reason to forbid it or even
especially to test for it.

With a function that relies on null-termination to tell you the length
of a string, you'll never be able to detect that a null was present and
will silently "lose" the rest of the line. This is almost never what
you want (and I think the "almost" is redundant).

Using such a function strongly implies that there shouldn't be any
nulls in the stream. Their presence means your data is corrupted,
and since you can't check for them afterwards the function should
return an error indication, just as if it got a read error.

-- Richard
 
R

Randy Howard

Randy Howard said:



Well, they /are/ in the same order - but one takes an extra parameter.

I didn't mean to imply that the common arguments weren't in the same
order, but rather to keep them in order if rearranged.
So the question is where the delimiters parameter should go.
Yes.

At the beginning? No - that's a really good place for the thing that
actually gets changed (following in the footsteps of *printf, strcpy,
strcat, memcpy, memset, etc).

More to the point, putting it up front isn't consistent.
At the end? No - that's a really good place for frequently-unused optional
information.

It's also a way to keep the calling argument order consistent between
similar functions. This would be my choice.
So it has to go somewhere in the middle. And once we've determined that,
it's really a matter of taste.

If you can come up with a compelling argument for putting the parameter
elsewhere, I'm all ears.

It's preference really, but I'd much rather have arguments appear in
the same location in similar calls, rather than prioritized by how
often they are used, if a choice has to be made. I find it makes
memorization of parameter lists easier and requires fewer trips to the
man pages. YMMV.

I don't expect you or anyone else to agree with me necessarily, but
that is how I would do it.
 
R

Richard Heathfield

Randy Howard said:
It's also a way to keep the calling argument order consistent between
similar functions. This would be my choice.

It's not an unreasonable choice, either. Goodbye, One Right Answer, and
hello, trade-off.

I'll think about it. Fair enough?
 
M

Michael Mair

Richard said:
CBFalconer said:

I hope you do not suggest to put a "risks, potentially harmful
side effects, and other vulnerabilities you contract when using
this piece of code" declaration on top of the download link --
this just makes a couple of useful functions very unattractive
by unwise advertising.
Okay, here are the issues as I see them:

1) the function doesn't allow the user to minimise memory management issues
by re-using existing buffers.
2) the function doesn't allow the user to resist denial-of-memory attacks.
3) the function doesn't allow the user to specify whether the buffer should
be resized to fit.
4) the function documentation doesn't mention any of these problems.
5) in Usenet articles where you recommend ggets, you don't mention any of
these problems.

I agree with points 1) through 4), though I would not want to get
the full one-page advertisement every time Chuck or someone else
recommends ggets() or fggets().

Would extending the family of functions by more-control-to-the-user
variants in combination with an explanation of the shortcomings of
the variants with the "nicer" user interface address these issues
sufficiently? I see that the download page mentions your fgetline()
which certainly is more flexible but lacks the ease of use; IMO,
filling the "space" in between may be sufficient.


As an aside, the link to "Morris Dovey's getsm function."
(http://www.iedu.com/mrd/c/getsm.c) on
http://www.cpax.org.uk/prg/writings/fgetdata.php
seems to be dead.


Cheers
Michael
 
R

Randy Howard

Randy Howard said:


It's not an unreasonable choice, either. Goodbye, One Right Answer, and
hello, trade-off.

I'll think about it. Fair enough?

For the record, I wasn't asking you to change it, I was asking about
why you got there, which you explained.
 
R

Richard Heathfield

Michael Mair said:

Would extending the family of functions by more-control-to-the-user
variants in combination with an explanation of the shortcomings of
the variants with the "nicer" user interface address these issues
sufficiently? I see that the download page mentions your fgetline()
which certainly is more flexible but lacks the ease of use; IMO,
filling the "space" in between may be sufficient.

That's an interesting idea, although there is a risk that people will take
one look at this huge "family" of functions and run screaming for their
nearest available fgets!
As an aside, the link to "Morris Dovey's getsm function."
(http://www.iedu.com/mrd/c/getsm.c) on
http://www.cpax.org.uk/prg/writings/fgetdata.php
seems to be dead.

MORRIS! Can you shed any light on this?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,997
Messages
2,570,241
Members
46,831
Latest member
RusselWill

Latest Threads

Top