trim whitespace v3


Keith Thompson

John Kelly said:
What else do you need to know?

1. What is the "system defined limit"?

2. What does the following do?
char not_a_string[5] = "hello";
trim(not_a_string);

3. For that matter, what does trim(NULL) do? You've said before that
trim() should check for a null pointer argument, but your specification
doesn't mention that case.

4. More generally, for what arguments is the behavior undefined?

5. How does it "report failure"?

6. What does "quits" mean? One could reasonably infer either that it
terminates your program or that it returns to the caller.

7. What does it return? Does it return a pointer to the trimmed
string? If so, if the input string has leading whitespace, does it
return a pointer to the first non-whitespace character, or does it
shift the existing characters and return the original pointer value?

A decent specification would answer all these questions.
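
Question 2 turns on the fact that the initializer fills all five elements and leaves no room for a terminating null character, so the array does not hold a string. A minimal sketch of how a caller could check for that before handing a buffer to a trim function; has_terminator is a hypothetical helper, not part of any trim() posted in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf[0..size-1] contains a '\0',
   i.e. if the array really holds a string, and 0 otherwise. */
static int has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

static void demo_question2(void)
{
    char not_a_string[5] = "hello";   /* valid C, but no room for '\0' */
    char a_string[6]     = "hello";   /* one extra element for '\0'    */

    assert(!has_terminator(not_a_string, sizeof not_a_string));
    assert(has_terminator(a_string, sizeof a_string));
}
```

The check only works when the caller knows the size of the array, which a function taking a bare char * does not.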
 

John Kelly

What else do you need to know?

1. What is the "system defined limit"?

2. What does the following do?
char not_a_string[5] = "hello";
trim(not_a_string);

3. For that matter, what does trim(NULL) do? You've said before that
trim() should check for a null pointer argument, but your specification
doesn't mention that case.

4. More generally, for what arguments is the behavior undefined?

5. How does it "report failure"?

6. What does "quits" mean? One could reasonably infer either that it
terminates your program or that it returns to the caller.

7. What does it return? Does it return a pointer to the trimmed
string? If so, if the input string has leading whitespace, does it
return a pointer to the first non-whitespace character, or does it
shift the existing characters and return the original pointer value?

A decent specification would answer all these questions.


#2 is answered in the text already provided.

#5 is answered in the full text of v4

#7 could be elaborated. I'll see about that.

#1, #3, #5 could be improved. I'll see about that.

#4

a) for NULL, returns -1 after setting errno to EINVAL

b) finds a string and trims it in place

c) exhausts the search space defined by "system limit" and returns -1
after setting errno to EOVERFLOW

d) produces a segfault, bus error, or some other implementation
defined fault
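
Cases a) through c) can be sketched roughly as follows. This is an illustration only, not the posted trim(): the name trim_sketch and the explicit limit parameter (standing in for the "system limit") are assumptions, and the caller is assumed to pass a limit no larger than the object being searched. Case d) cannot be expressed in portable code. Note that EOVERFLOW comes from POSIX rather than ISO C.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch. `limit` stands in for the "system limit"; the
   caller must ensure it does not exceed the size of the object that
   s points into. EOVERFLOW is a POSIX errno value, not ISO C. */
static int trim_sketch(char *s, size_t limit)
{
    char *end, *first, *last;

    if (s == NULL) {                       /* case a) */
        errno = EINVAL;
        return -1;
    }
    end = memchr(s, '\0', limit);
    if (end == NULL) {                     /* case c): no terminator found */
        errno = EOVERFLOW;
        return -1;
    }

    first = s;                             /* case b): trim in place */
    while (isspace((unsigned char)*first))
        ++first;
    last = end;
    while (last > first && isspace((unsigned char)last[-1]))
        --last;

    memmove(s, first, (size_t)(last - first));
    s[last - first] = '\0';
    return 0;
}

static void trim_sketch_demo(void)
{
    char text[] = " \t hello world \n";

    assert(trim_sketch(text, sizeof text) == 0);
    assert(strcmp(text, "hello world") == 0);
    assert(trim_sketch(NULL, 1) == -1 && errno == EINVAL);
}
```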

a), b), and c) are well understood. d) is implementation defined, not
"undefined." How would you explain (d)?
 

Ben Bacarisse

Seebs said:
This specifies that TWO things are modifiable -- the parameters and the
strings. But there are THREE things -- the parameters, the strings, and
the pointers to the strings that one of the parameters points to.

Sneaky, huh.

Not really. The text also means (literally) that halfway through the
program argv[3] can change its value. Have you been writing your
programs to cope with this? In fact, I don't think there is a way to
cope with it so you have to accept that many of your programs are
undefined.

The meaning is clear despite the language being imprecise.
 

Seebs

Not really. The text also means (literally) that halfway through the
program argv[3] can change its value. Have you been writing your
programs to cope with this?

Not really, no. :)
In fact, I don't think there is a way to
cope with it so you have to accept that many of your programs are
undefined.
The meaning is clear despite the language being imprecise.

I believe I have used at least one system on which changing argv[n]
produced surprising results. So I am inclined to think that if
the intent were to state that it was defined, it would have been
defined.

-s
 

Ben Bacarisse

Seebs said:
Not really. The text also means (literally) that halfway through the
program argv[3] can change its value. Have you been writing your
programs to cope with this?

Not really, no. :)
In fact, I don't think there is a way to
cope with it so you have to accept that many of your programs are
undefined.
The meaning is clear despite the language being imprecise.

I believe I have used at least one system on which changing argv[n]
produced surprising results. So I am inclined to think that if
the intent were to state that it was defined, it would have been
defined.

Do you feel the same about my point about argv[n] not holding its last
stored value? If you came across a system where

#include <stdio.h>
int main(int argc, char **argv) { if (argc > 1) puts(argv[1]); }

produced a segmentation fault because argv[1] had changed just before
the puts, would you be saying if the intent were for argv[1] to hold its
value, it would have been stated?
 

Seebs

Do you feel the same about my point about argv[n] not holding its last
stored value? If you came across a system where

Hmm.

Actually, I think maybe I do.
#include <stdio.h>
int main(int argc, char **argv) { if (argc > 1) puts(argv[1]); }

produced a segmentation fault because argv[1] had changed just before
the puts, would you be saying if the intent were for argv[1] to hold its
value, it would have been stated?

I don't think so, because argc isn't allowed to change, and argv's first
N members have to be valid pointers to strings. But I have used systems
where there existed library calls that would change the values, and I
could conceive of a debugger or system utility being able to change a
process's command line... Hmm.

I guess I view it the way I view string literals: They're not const,
but you're not allowed to change them, but that doesn't mean they can
change on their own.

-s
 

Ben Bacarisse

Seebs said:
Do you feel the same about my point about argv[n] not holding its last
stored value? If you came across a system where

Hmm.

Actually, I think maybe I do.
#include <stdio.h>
int main(int argc, char **argv) { if (argc > 1) puts(argv[1]); }

produced a segmentation fault because argv[1] had changed just before
the puts, would you be saying if the intent were for argv[1] to hold its
value, it would have been stated?

I don't think so, because argc isn't allowed to change, and argv's first
N members have to be valid pointers to strings. But I have used systems
where there existed library calls that would change the values, and I
could conceive of a debugger or system utility being able to change a
process's command line... Hmm.

OK, so that was not a good example.

#include <stdio.h>

int main(int argc, char **argv)
{
const char *argv1 = argc > 1 ? argv[1] : "";
if (argv1 != argv[1]) puts("gotcha!");
}

A conforming implementation may print "gotcha"?
I guess I view it the way I view string literals: They're not const,
but you're not allowed to change them, but that doesn't mean they can
change on their own.

I thought your point was that the standard does not say that they don't
change. Well, that was my point that you don't seem to be dissenting
from. I raised it because it seemed more like an oversight than a
deliberate omission.
 

Seebs

OK, so that was not a good example.

#include <stdio.h>

int main(int argc, char **argv)
{
const char *argv1 = argc > 1 ? argv[1] : "";
if (argv1 != argv[1]) puts("gotcha!");
}

A conforming implementation may print "gotcha"?

I don't know. I wouldn't think so.
I thought your point was that the standard does not say that they don't
change. Well, that was my point that you don't seem to be dissenting
from. I raised it because it seemed more like an oversight than a
deliberate omission.

The standard doesn't say they're modifiable. And maybe it's an oversight,
rather than a deliberate omission, but:
1. I believe I've used systems where modifying them could have unexpected
results (if not results typically observable from within the program).
2. I don't see much reason to make an assumption either way.

Hmm. Here's the thing. They're things of some sort, and thus, unless
the abstract machine says they change, or they're volatile, their values
are assumed not to change. But that doesn't mean they're modifiable.

-s
 

Ben Bacarisse

Seebs said:
OK, so that was not a good example.

#include <stdio.h>

int main(int argc, char **argv)
{
const char *argv1 = argc > 1 ? argv[1] : "";
if (argv1 != argv[1]) puts("gotcha!");
}

A conforming implementation may print "gotcha"?

I don't know. I wouldn't think so.

So why not? Your interpretation of 5.1.2.2.1 is that it does not apply
to the pointers themselves. I agree (though I think it is an
oversight). That interpretation must also extend to the fact that argv
and the strings pointed to "retain their last-stored values between program
startup and program termination" but the pointers need not. What is the
point of singling out argc, argv and the pointed to strings as holding
their values (and being modifiable) if the pointers in argv are not
being implicitly excluded from that guarantee (as you take them to be as
far as modifiability is concerned)?

I certainly understand your view, but I don't see where it comes from.
What prevents them from changing on their own? If it is some other part
of the standard that gives a blanket assurance about non-volatile
objects, then I don't see why 5.1.2.2.1 needs to say anything about argv
and friends holding their last stored values.

Why are argc, argv and the pointed to strings specifically stated to
hold their last stored values?
The standard doesn't say they're modifiable. And maybe it's an oversight,
rather than a deliberate omission, but:
1. I believe I've used systems where modifying them could have unexpected
results (if not results typically observable from within the program).
2. I don't see much reason to make an assumption either way.

Hmm. Here's the thing. They're things of some sort, and thus, unless
the abstract machine says they change, or they're volatile, their values
are assumed not to change. But that doesn't mean they're modifiable.

Yes, I accept your view about their modifiability. If they are protected
from spontaneous modification by general prohibitions on such changes,
why are argc, argv and the pointed-to strings singled out as holding
their values?

My opinion is that, since argc and friends are the interface between the
program and "the system" they need special mention. We must be told
they are modifiable and that "the system" does not change them behind
our back. My guess is that this statement is intended to apply to the
pointers in argv as well, but if you read one part of the special statement
as not applying to the pointers, then I think you have to read the other
part as not applying to them as well. The reason for making this point
is not that I think argv[1] can change spontaneously, but that a reading
of 5.1.2.2.1 that prevents its modification has worrying consequences
that suggest there was an oversight rather than a deliberate exclusion.
 

John Kelly

Hmm. Here's the thing. They're things of some sort, and thus, unless
My opinion is that, since argc and friends are the interface between the
program and "the system" they need special mention. We must be told
they are modifiable and that "the system" does not change them behind
our back. My guess is that this statement is intended to apply to the
pointers in argv as well, but if you read one part of the special statement
as not applying to the pointers, then I think you have to read the other
part as not applying to them as well. The reason for making this point
is not that I think argv[1] can change spontaneously, but that a reading
of 5.1.2.2.1 that prevents its modification has worrying consequences
that suggest there was an oversight rather than a deliberate exclusion.

I previously had trim() adjust the argv pointers instead of moving the
string data.

Maybe Seebs can't find his zero and he's stuck in an infinite loop.
 

Seebs

So why not? Your interpretation of 5.1.2.2.1 is that it does not apply
to the pointers themselves. I agree (though I think it is an
oversight). That interpretation must also extend to the fact that argv
and the strings pointed to "retain their last-stored values between program
startup and program termination" but the pointers need not.

Yes, but not to the more general statement that ALL objects which aren't
volatile-qualified have to retain their last-stored values. :)
I certainly understand your view, but I don't see where it comes from.
What prevents them from changing on their own? If it is some other part
of the standard that gives a blanket assurance about non-volatile
objects, then I don't see why 5.1.2.2.1 needs to say anything about argv
and friends holding their last stored values.

I don't think it does.
Why are argc, argv and the pointed to strings specifically stated to
hold their last stored values?

My guess is that this is there because of implementations where the space
in which those strings might be stored by default would be writeable.
So you have to copy them into private space or something.

-s
 

Nick

John Kelly said:
With a 1,000,000 byte string having one space at the front, your fancy
"state machine" performs 999,999 individual reads and stores. My code
moves the whole block all at once. Memmove() may be library optimized
to a single machine instruction.

See what the others have said. I agree with it.
Blindness to performance considerations is a mark of novice programmers,
Seebs' pseudo-analysis notwithstanding.

And personal attacks on people trying to helpfully contribute to a
discussion are the mark of an arse.

I may not be a great programmer, but I'm far from a novice. Just for
interest, the website in my signature is written in C. I'd slightly
bashfully claim that you have to be slightly more than a novice to have
knocked that up.

For nothing more than idle curiosity, I've just cut-and-pasted the
following from the source:

dstring *dstrtrim(dstring *s) {
char *p;
size_t n=0;

assert(s!=NULL);
CHECKTRUE(s->used);
if(s->length == 0)
return s;
for(p=s->value;isspace(*p);++p)
++n;
s->length -=n;
memmove(s->value,p,s->length);
if(s->length == 0) {
*(s->value) = '\0';
return s;
}
p=s->value+s->length-1;
n=0;
while(isspace(*p)) {
--p;
++n;
}
s->length -= n;
*(++p)='\0';
return(s);
}

As you can probably tell, even without all the definitions, I get round
the whole thing by using counted strings. If you /really/ care about
string performance in C, that's almost certainly what you should be
doing. This lets me use memmove without all that ptrdiff_t twaddle.

You'll note I don't cast the value to isspace. That's because I "know"
that p only contains valid characters rather than small bytes (he
handwaves furiously).

Richard H - I apologise for using both assert and early return in there.
We clearly have very different coding styles!

I wrote this a while ago; I'd put more white space in these days.
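
For readers without the definitions, the counted-string idea can be sketched as below. The cstr type and cstr_trim are hypothetical stand-ins, not the real dstring code; the sketch assumes the buffer has room for a terminator at value[length] and casts the argument to isspace rather than relying on the data being well behaved.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string; the real dstring has more fields. */
typedef struct {
    char  *value;    /* buffer, kept null-terminated at value[length] */
    size_t length;   /* number of characters, excluding the '\0'      */
} cstr;

/* Trim leading and trailing whitespace in place; returns s. */
static cstr *cstr_trim(cstr *s)
{
    size_t lead = 0;

    if (s == NULL || s->length == 0)
        return s;

    /* Count leading whitespace, then shift the rest down in one move. */
    while (lead < s->length && isspace((unsigned char)s->value[lead]))
        ++lead;
    s->length -= lead;
    memmove(s->value, s->value + lead, s->length);

    /* Drop trailing whitespace by shrinking the stored length. */
    while (s->length > 0 && isspace((unsigned char)s->value[s->length - 1]))
        --s->length;
    s->value[s->length] = '\0';
    return s;
}

static void cstr_trim_demo(void)
{
    char buf[] = "  counted strings \t";
    cstr s = { buf, sizeof buf - 1 };

    cstr_trim(&s);
    assert(s.length == 15);
    assert(strcmp(s.value, "counted strings") == 0);
}
```

Because the length is stored, no scan for the terminator is needed and the memmove size falls out of simple subtraction.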
 

Seebs

dstring *dstrtrim(dstring *s) {

I'd be interested in seeing the definition of dstring.
assert(s!=NULL);
CHECKTRUE(s->used);

Interesting -- presumably CHECKTRUE is weaker than "assert".
if(s->length == 0)
return s;
for(p=s->value;isspace(*p);++p)
++n;
s->length -=n;
memmove(s->value,p,s->length);

So this is an in-place trim.
if(s->length == 0) {
*(s->value) = '\0';
return s;
}
p=s->value+s->length-1;
n=0;
while(isspace(*p)) {
--p;
++n;
}
s->length -= n;
*(++p)='\0';
return(s);

Two quibbles here:

1. It is not obvious that modifying p accomplishes anything here.
2. No () on return, it's not a function. :)
As you can probably tell, even without all the definitions, I get round
the whole thing by using counted strings. If you /really/ care about
string performance in C, that's almost certainly what you should be
doing. This lets me use memmove without all that ptrdiff_t twaddle.

Interestingly, I reached the exact same conclusion in my sz string library.
You'll note I don't cast the value to isspace. That's because I "know"
that p only contains valid characters rather than small bytes (he
handwaves furiously).
Heh.

I wrote this a while ago; I'd put more white space in these days.

I thought about commenting on that but decided it hardly mattered.

-s
 

Ben Bacarisse

Seebs said:
Yes, but not to the more general statement that ALL objects which aren't
volatile-qualified have to retain their last-stored values. :)


I don't think it does.

I think we've got lost. My copy of n1256.pdf says exactly that and my
point is based solely on that wording being there.
My guess is that this is there because of implementations where the space
in which those strings might be stored by default would be writeable.
So you have to copy them into private space or something.

Sorry, I see no connection between this and what I've been arguing. The
writability of the strings is not in question by either of us but I
don't see how their writability affects argc, for example.
 

Nick

Seebs said:
I'd be interested in seeing the definition of dstring.


Interesting -- presumably CHECKTRUE is weaker than "assert".

It's much like an assert - it's a development time check that prints a
more useful diagnostic.

I keep free'd dstrings in a resource pool and reallocate them. As I do
so I toggle "used". That helps me catch attempts to use things that
have been freed. It doesn't catch every error, but it catches the
equivalent of:
fp=fopen...
fclose(fp);
fprintf(fp,"...
So this is an in-place trim.

Yes, inside a structure with a length, maxlength, allocated string etc.
Two quibbles here:

1. It is not obvious that modifying p accomplishes anything here.

That's a polite way of putting it! Utterly pointless and almost
certainly a hang-over from something or lazy cut-and-paste.
2. No () on return, it's not a function. :)

Wow - that must be /ancient/ code! I've not put brackets on return for
years.
 

Seebs

I think we've got lost. My copy of n1256.pdf says exactly that and my
point is based solely on that wording being there.

Sorry, I mean, "I don't think it needs to." I agree that it says that, but
I'm not sure it needs to.
Sorry, I see no connection between this and what I've been arguing. The
writability of the strings is not in question by either of us but I
don't see how their writability affects argc, for example.

Sorry, dropped another packet: I meant writeable *by someone else*.
Basically, it's a warning to the implementor that if other people can
modify the space in which command lines are passed to an application, you
are obliged to copy them into an internal buffer of some sort that won't
be changing randomly.

-s
 

Seebs

That's a polite way of putting it! Utterly pointless and almost
certainly a hang-over from something or lazy cut-and-paste.

Ahh, it's not quite *utterly* pointless.

Wait a second. I just realized that I was wrong to begin with.
I was unconsciously assuming a "p++". But of course, this isn't a
p++, it's a ++p, and p was pointing to the last character, so ++p
is right.

If p had been pointing just past the last character, you could make a
case for "*p++" on the grounds that p should always point to the NEXT
character.

But now that I'm less sleepy, I actually think the ++ is almost certainly
correct and mandatory.
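
The pre-increment point can be checked with a small sketch of just the trailing-whitespace step. chop_trailing is a hypothetical helper written for illustration, and it assumes the buffer contains at least one non-whitespace character so the backward scan cannot run off the front:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Precondition: len > 0 and buf[0..len-1] contains
   at least one non-whitespace character. Returns the new length. */
static size_t chop_trailing(char *buf, size_t len)
{
    char *p = buf + len - 1;                 /* p is ON the last character  */

    while (isspace((unsigned char)*p))
        --p;                                 /* stops ON the last non-space */
    *(++p) = '\0';                           /* step past it, then terminate */
    return (size_t)(p - buf);
}

static void chop_trailing_demo(void)
{
    char text[] = "abc  ";

    assert(chop_trailing(text, strlen(text)) == 3);
    assert(strcmp(text, "abc") == 0);        /* "*p++" would have left "ab" */
}
```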

-s
 

Ben Bacarisse

Seebs said:
Sorry, I mean, "I don't think it needs to." I agree that it says that, but
I'm not sure it needs to.

Well that makes sense. If it needed to say it you'd have to accept my
argument! The fact is it /does/ single out argc and argv (and the
pointed-to strings) for two special mentions: that they are modifiable
and that they hold their last stored values. You are happy to accept
that the exclusion of the elements of argv is deliberate with respect to
one of these (modifiability) but not the other. That seems arbitrary.
Sorry, dropped another packet: I meant writeable *by someone else*.
Basically, it's a warning to the implementor that if other people can
modify the space in which command lines are passed to an application, you
are obliged to copy them into an internal buffer of some sort that won't
be changing randomly.

Except, it seems, the elements of argv itself. To meet the spec in the
standard, an implementation must protect the parameters of main from
external change (probably simply done by putting them on the stack as
per any function call) and it must protect the pointed-to strings from
external change by copying them, but the content of the vector pointed
to by argv need not be copied into safe, non-changing memory. That
seems to me absurd and a simple omission. There is no point in making sure
that argv[2][4] does not change if argv[2] can change at any time.
Surely the simplest explanation is that the mention of argv is intended
to cover it (the parameter) and the array to which it points?

I won't keep banging on about this. I am not sure we are getting any
further and, in truth, does either of us care? My lingering curiosity
is now as to why you read a sentence that excludes argv[x] from the two
properties it imparts to argv and to argv[x][y] as if one of the two
properties need not be mentioned at all.
 

Seebs

Well that makes sense. If it needed to say it you'd have to accept my
argument! The fact is it /does/ single out argc and argv (and the
pointed-to strings) for two special mentions: that they are modifiable
and that they hold their last stored values. You are happy to accept
that the exclusion of the elements of argv is deliberate with respect to
one of these (modifiability) but not the other. That seems arbitrary.

My thoughts are:
1. Unless otherwise specified, everything retains its last stored value.
2. Unless otherwise specified, things are not necessarily modifiable.
3. It is not a violation of any rule for the standard to occasionally
specify something which was implicit in other information already available.

I think that the statement that the contents of the strings retain their
last stored values is probably harmless but not necessary -- I think it
would have to be true anyway.
To meet the spec in the
standard, an implementation must protect the parameters of main from
external change (probably simply done by putting them on the stack as
per any function call) and it must protect the pointed-to strings from
external change by copying them, but the content of the vector pointed
to by argv need not be copied into safe, non-changing memory.

Not so. They're not volatile-qualified, therefore, they must not change
WHETHER OR NOT there's a restatement of that in this section.

If you took out the claim that the contents of the strings don't change,
I don't think the meaning of the spec would change.
I won't keep banging on about this. I am not sure we are getting any
further and, in truth, does either of us care? My lingering curiosity
is now as to why you read a sentence that excludes argv[x] from the two
properties it imparts to argv and to argv[x][y] as if one of the two
properties need not be mentioned at all.

Because the standard has an explicit statement elsewhere that objects
retain their last stored values, with some exceptions. This was never
identified as an exception. So it didn't need to be mentioned at all.

However, there's no general rule that all things you can have
pointers to are modifiable. String literals give us a nice example
of something that's not declared const, but which is not modifiable.
There might be others. This could be one of them; there's nothing
clearly stating that it isn't, and I wouldn't want to assume it was,
because I've seen things in the past which could have been
counterexamples. (Or might not, I don't know that I ever tested
it.)

It is perhaps worth noting that there is a fair amount of code out
there which attempts to overwrite the arguments to main in order
to make visible changes in a process list on Unix-like systems, and
this code is usually error-prone and unreliable, as though modifying
these pointers had unexpected effects.
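
A program that wants to adjust argument pointers without depending on whether the elements of argv are modifiable can copy the vector first. A minimal sketch, assuming a hypothetical helper named copy_arg_vector; only the pointers are copied, the strings themselves are shared with the original:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd array of argc + 1 pointers
   holding copies of argv[0..argc-1] plus a terminating NULL, or NULL
   on failure. The caller owns the array and releases it with free(). */
static char **copy_arg_vector(int argc, char **argv)
{
    char **copy;
    int i;

    if (argc < 0 || argv == NULL)
        return NULL;
    copy = malloc(((size_t)argc + 1) * sizeof *copy);
    if (copy == NULL)
        return NULL;
    for (i = 0; i < argc; ++i)
        copy[i] = argv[i];        /* share the strings, copy the pointers */
    copy[argc] = NULL;
    return copy;
}

static void copy_arg_vector_demo(void)
{
    char a0[] = "prog", a1[] = "  padded";
    char *av[] = { a0, a1, NULL };
    char **mine = copy_arg_vector(2, av);

    assert(mine != NULL);
    mine[1] = a1 + 2;             /* skip leading blanks in the copy only */
    assert(av[1] == a1);          /* the original vector is untouched     */
    free(mine);
}
```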

-s
 

Ben Bacarisse

Seebs said:
My thoughts are:
1. Unless otherwise specified, everything retains its last stored value.
2. Unless otherwise specified, things are not necessarily modifiable.
3. It is not a violation of any rule for the standard to occasionally
specify something which was implicit in other information already
available.

OK. Let's agree to differ. I find extra statements (especially when
they have an implicit omission) highly suggestive but I agree there is
no rule against them.
I think that the statement that the contents of the strings retain their
last stored values is probably harmless but not necessary -- I think it
would have to be true anyway.


Not so. They're not volatile-qualified, therefore, they must not change
WHETHER OR NOT there's a restatement of that in this section.

I am not 100% sure that this is a simple restatement. The reason is
that volatile-qualified objects *also* hold their last-stored value
throughout their lifetime -- it is just that the last store may be
external to the program. 6.2.4 is quite clear about that. The wording
in question seems to be saying something new that is not directly
related to volatile objects. I.e. it does not need to be said about any
object -- volatile or not and that makes me less sure that it can simply
be regarded as restating the obvious.
If you took out the claim that the contents of the strings don't change,
I don't think the meaning of the spec would change.
I won't keep banging on about this. I am not sure we are getting any
further and, in truth, does either of us care? My lingering curiosity
is now as to why you read a sentence that excludes argv[x] from the two
properties it imparts to argv and to argv[x][y] as if one of the two
properties need not be mentioned at all.

Because the standard has an explicit statement elsewhere that objects
retain their last stored values, with some exceptions. This was never
identified as an exception. So it didn't need to be mentioned at all.

However, there's no general rule that all things you can have
pointers to are modifiable. String literals give us a nice example
of something that's not declared const, but which is not modifiable.

This would be a more persuasive line if it were not for the fact that
all the (other) situations where an object isn't modifiable seem to be
very explicitly stated. I suspect that the non-modifiability of argv[0]
through argv[argc-1] will turn out to be the only one that is not
explicit.

<snip>
 
