trim whitespace


Seebs

Can you say how you reach this conclusion? I can see no reason why two
pointers into one object can't be more than PTRDIFF_MAX elements apart.
The standard (rightly) shies away from requiring a signed type that has
twice the range of size_t (except when size_t is minimal, in which case
it does require it).

Okay, the largest difference will obviously come when the two pointers
are bytes.

.... I think you are probably right about ptrdiff_t, though. I was still
thinking about the size_t case where the calculation was (later - earlier).

Interestingly, this leads to a conclusion: ptrdiff_t is adequate to hold
any possible pointer subtraction result for a type with a size greater
than one. Perhaps more interestingly, though, I betcha it is possible
to create a setup on at least one implementation where you can do the
calculation, and you can KNOW that the result is valid, but the result
you actually get is bogus.

Imagine that size_t and ptrdiff_t are both 32 bits, and that you have access
to an object 3GB in size, that being an array of one and a half billion
shorts (assume two-byte shorts).

ptrdiff_t p = &ary[1500000000] - &ary[0];

This calculation produces an intermediate value of 3 billion (not
representable in ptrdiff_t), which is then divided by two. The
resulting value (1.5 billion) is representable in ptrdiff_t, but it
is quite possible that instead you'll end up with a number which
is on the close order of negative five hundred million.
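
A minimal sketch of that experiment, for anyone who wants to try it (it
assumes an implementation with 32-bit size_t/ptrdiff_t that will actually
grant a 3 GB allocation; on most such hosts the malloc() will simply fail):

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* 1.5 billion two-byte shorts: a 3 GB object. */
    short *ary = malloc(1500000000u * sizeof *ary);
    if (ary == NULL)
        return EXIT_FAILURE;          /* the likely outcome */

    /* The intermediate byte difference is 3 billion, which does not
       fit in a 32-bit ptrdiff_t; the final result (1.5 billion) does. */
    ptrdiff_t p = &ary[1500000000] - &ary[0];
    printf("%ld\n", (long)p);         /* correct answer: 1500000000 */

    free(ary);
    return 0;
}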

Anyone got access to a machine with the described characteristics?

-s
 

Seebs

[1] Re "Then there is no badness", I presume Seebach references
only the little microcosm in which such a program compiles and
runs, rather than the world as a whole.

Hee. Yes, it was contextual; I was referring to the "badness" Kelly
had been talking about, not to the broader category.

-s
 

Seebs

I threw it together to try and understand what happens when pointers
and/or pointer arithmetic overflow.

And the answer is, undefined behavior happens.

I don't have a good understanding
of that yet, but once I do, I think I can make trim() bullet-proof.

Nope. Once undefined behavior is in the picture, there is nothing you can
do. You can't even guarantee that the pointers in question can be loaded
into address registers without killing your program.

-s
 

Seebs

Black hats don't respect the standard. They're looking for any hole
they can find.

Yes. And if you write code that allows buffer overruns, that "undefined
behavior" can be a compromise of the user's machines. So don't allow
buffer overruns.

Standards conformance has its place, but how can you protect yourself if
you don't explore UB and learn how things really work?

By not reaching undefined behavior in the first place.

Furthermore, "how things really work" is semantically invalid. They really
work however it made sense to the implementor for them to work on a given
target. That can vary from one compiler to another, from one target to
another, even by compiler flags of various sorts.

-s
 

Seebs

We had a lengthy discussion about that not long ago. The standard
merely says that size_t is the type of the result of the sizeof
operator; there's no explicit statement that it can represent the
size of any object.

True. But in practice, I'd consider an object larger than the range of
size_t to be pathological. I think if it came down to it, I'd call it
undefined behavior to create such an object, even if it happened to
work.

In real life, I think that any *sane* implementation will choose a
size_t type big enough to represent the size of any object it can
support, and calloc() will fail and return a null pointer for any
request to create such a huge object.
Change "By definition" to "In practice", or even to "By what the
definition *should* be", and I agree.

Pretty much.
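
For illustration, a quick sketch of that expectation: a sane calloc()
notices that the element count times the element size doesn't fit in
size_t and fails:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Request nmemb * size bytes that size_t cannot represent. */
    void *p = calloc(SIZE_MAX, 2);
    printf("calloc(SIZE_MAX, 2) -> %s\n",
           p == NULL ? "NULL, as hoped" : "non-null!?");
    free(p);
    return 0;
}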

-s
 

Seebs

Problem is, trim() can be called with an arbitrary pointer. Who knows
what it points to? It's not limited to "objects." Maybe it's a region
of memory larger than size_t, which contains no \0 bytes. An algorithm
must protect itself at all times.

No, it must protect itself for valid inputs, clearly define invalid inputs,
and hope for the best.

void *v = malloc(1);
free(v);
trim(v);

There is NOTHING you can do about that one. You can never know whether
v was recently freed. You can never prevent yourself from accessing its
contents, and since we don't know what they are, it's quite possible that
they don't represent a string at all. It's also theoretically possible
that 'v' points to a series of space characters, followed by the contents
of another allocation.

There are always gonna be things you cannot detect or test for.

-s
 

John Kelly

There is no way for your trim() function, as specified, to
protect itself against all bad inputs. trim() can't tell the
difference between, say, a pointer to a 1000-byte object with no
'\0' characters and a pointer to a 2000-byte object with a '\0'
somewhere near the end. As it scans for the terminating '\0',
it has no way to know when to give up.

But I'm going to fix it, so that it gives up after reaching some
predetermined limit.

Specifically, there is no portable way to do this in C.

The only question is what limit to choose. I could pick some arbitrary
number like 32767 that will work on the vast majority of platforms, and
satisfy the vast majority of trim() use cases.

But for the sake of good design, I'm looking for a limit that may vary
from one platform to another. Hints are welcome.

If you have sufficient control over *all* potential callers of
your trim() function, then you can avoid undefined behavior.

No. It should be callable by anybody with any arbitrary pointer,
without going into an infinite loop or producing wrong results. If it
can't handle the data, it should set errno and return -1.

All string functions should be like that. Do you mean to tell me the
standard C library functions can go into an infinite loop on unexpected
conditions?


If you don't, you can't.

I'm not convinced of that yet.
 

Keith Thompson

John Kelly said:
But I'm going to fix it, so that it gives up after reaching some
predetermined limit.

Sure, you can do that. It's certainly not what I'd do, but it's your
function. (I urge you to clearly document the limit, and the function's
behavior if the limit is exceeded.)

The only question is what limit to choose. I could pick some arbitrary
number like 32767 that will work on the vast majority of platforms, and
satisfy the vast majority of trim() use cases.

But for the sake of good design, I'm looking for a limit that may vary
from one platform to another. Hints are welcome.

Whatever limit you choose, there will be arguments for which the
function blows up because the limit was too big, and arguments for which
it falsely reports an error because the limit was too small.

No. It should be callable by anybody with any arbitrary pointer,
without going into an infinite loop or producing wrong results. If it
can't handle the data, it should set errno and return -1.

Sorry, that's not possible in general.

All string functions should be like that. Do you mean to tell me the
standard C library functions can go into an infinite loop on unexpected
conditions?

Certainly. More precisely, their behavior can be undefined for certain
arguments.

I'm not convinced of that yet.

Consider this program:

#include <string.h>
#include <stdio.h>

int main(void) {
    char not_a_string[5] = "hello";
    size_t len = strlen(not_a_string);
    printf("len = %zu\n", len);
    return 0;
}

How can the implementation of strlen() detect that there's a problem
and avoid undefined behavior? Would imposing a limit avoid the
problem? The same considerations apply to your trim() function.

No answer?
 

Ben Bacarisse

Seebs said:
Okay, the largest difference will obviously come when the two pointers
are bytes.

... I think you are probably right about ptrdiff_t, though. I was still
thinking about the size_t case where the calculation was (later - earlier).

Interestingly, this leads to a conclusion: ptrdiff_t is adequate to hold
any possible pointer subtraction result for a type with a size greater
than one.

That seems to me to be a good bet rather than something that is guaranteed.

<snip>
 

John Kelly

Whatever limit you choose, there will be arguments for which the
function blows up because the limit was too big, and arguments for which
it falsely reports an error because the limit was too small.

There is only one argument. A pointer. It may point to a valid string,
or it may point to whatever garbage is in some region of memory.

Certainly.

I haven't looked at the C library code to see if that's true or not.
But I hate to think it is.

Consider this program:

#include <string.h>
#include <stdio.h>

int main(void) {
    char not_a_string[5] = "hello";
    size_t len = strlen(not_a_string);
    printf("len = %zu\n", len);
    return 0;
}

How can the implementation of strlen() detect that there's a problem
and avoid undefined behavior?

On a computer there's no such thing as an object of infinite size. Your
address bits define a limit. Once you reach the limit, stop looking for
end of string. It's not there.

Would imposing a limit avoid the problem?

I want to prevent infinite loops and report errors. I'm not sure what
you want.

The same considerations apply to your trim() function.

No answer?

Maybe later.
 

John Kelly

That seems to me to be a good bet rather than something that is guaranteed.

Seebs, I see pieces of your posts that others quote.

You might have some good ideas, but you seem inflamed by my posts and
you're too confusing for me to read.

Life is short. Try and enjoy it.
 

Seebs

But I'm going to fix it, so that it gives up after reaching some
predetermined limit.

That's not fixing it, that's just replacing one failure mode with another.

Some years ago, I was using a GUI system which allowed you to specify
scrolling list items by passing an array of items into a function. They
"helpfully" ignored any call with more than 250 items, because OBVIOUSLY
you didn't mean to do that. This behavior cost me a number of hours of
debugging, and cost them a call through to their engineers.

But for the sake of good design, I'm looking for a limit that may vary
from one platform to another. Hints are welcome.

Only "hints" that agree with your preconceived notion seem to be welcome.

No. It should be callable by anybody with any arbitrary pointer,
without going into an infinite loop or producing wrong results. If it
can't handle the data, it should set errno and return -1.

This is not possible. It is not even CLOSE to possible.

I would recommend that you pick a language other than C; your expectations
are incompatible with C's design.

All string functions should be like that. Do you mean to tell me the
standard C library functions can go into an infinite loop on unexpected
conditions?

They can dump core, they can loop forever, they can loop until they've
destroyed all the memory you gave them access to.

Consider:

char *s = malloc(25);
strcpy(s, "this is not going to work");
strcpy(s + 10, s);

This will, on many implementations, write over every byte of memory that
it can reach until it crashes. On older implementations, that can
mean wiping out ALL of the contents of memory INCLUDING OTHER PROGRAMS.

Again.

YOU CANNOT CHECK A POINTER FOR VALIDITY IN C. It's that simple. You
can't check it for validity, you can't check it for bounds. Trying to
outsmart the user by enforcing "obvious" limits is usually a dead end.

You cannot write a function in C which does something useful with a valid
pointer and does not do something unuseful with an invalid pointer. You
can check for null pointers, but you can't check for freed pointers,
pointers with bounds too small, or any of that. You just can't, and no
amount of good intentions will make it possible.

-s
 

Peter Nilsson

Ian Collins said:
Setting errno without any other indication of failure
is counter intuitive.  Why not return a success or failure?

Because sometimes the return value has another purpose,
e.g. maths functions like sin() and cos(). Even in fread(),
a zero result does not mean an error occurred. There is
another condition (captured inside FILE) which flags input
error.
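
For instance, the usual idiom with the math functions is to clear errno
before the call, since any return value might be legitimate (this assumes
an implementation where math_errhandling includes MATH_ERRNO):

#include <errno.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    errno = 0;                   /* clear any stale error */
    double r = log(-1.0);        /* domain error */
    if (errno == EDOM)
        printf("log(-1.0): domain error\n");
    else
        printf("log(-1.0) = %f\n", r);
    return 0;
}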
 

Ian Collins

Because sometimes the return value has another purpose,
e.g. maths functions like sin() and cos(). Even in fread(),
a zero result does not mean an error occurred. There is
another condition (captured inside FILE) which flags input
error.

You forgot the strtol(l) functions.

Maybe, but that isn't the case here.
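
For reference, strtol() is the classic case: every long value is a
possible legitimate result, so an out-of-range input can only be
reported through errno:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    errno = 0;
    long v = strtol("999999999999999999999999", NULL, 10);
    if (errno == ERANGE)
        printf("out of range; result clamped to %ld\n", v);  /* LONG_MAX */
    return 0;
}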
 

lawrence.jones

Seebs said:
Imagine that size_t and ptrdiff_t are both 32 bits, and that you have access
to an object 3GB in size, that being an array of one and a half billion
shorts (assume two-byte shorts).

ptrdiff_t p = &ary[1500000000] - &ary[0];

This calculation produces an intermediate value of 3 billion (not
representable in ptrdiff_t), which is then divided by two. The
resulting value (1.5 billion) is representable in ptrdiff_t, but it
is quite possible that instead you'll end up with a number which
is on the close order of negative five hundred million.

I don't think that's allowed -- if the result is representable, the
implementation is obliged to get it right. (Since you only need one
extra bit and most machines have some kind of carry flag that can be
used to get the right answer easily, it's not a great burden.)
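
In C terms, a sketch of why that's cheap: form the byte difference in
unsigned arithmetic and divide by the element size *before* converting
to signed, and the representable result comes out right (32-bit types
assumed for concreteness):

#include <stdio.h>

int main(void) {
    /* What the implementation effectively does for (p2 - p1) on
       two-byte elements: subtract as unsigned, then divide by the
       element size before reinterpreting as signed. */
    unsigned long byte_diff = 3000000000u;   /* < 2**32: held exactly */
    long elements = (long)(byte_diff / 2);   /* 1500000000: fits */
    printf("%ld\n", elements);
    return 0;
}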
 

Seebs

Seebs, I see pieces of your posts that others quote.

That's nice.

You might have some good ideas, but you seem inflamed by my posts and
you're too confusing for me to read.

That's fine.

Life is short. Try and enjoy it.

I do.

I don't post answers to your questions because I think you'll read them
or that you'd understand them if you did, but because some of your
questions are questions which other people would benefit from answers
to. I'm sure it's accidental.

-s
 

Shao Miller

John said:
Hmmm.

The memmove length argument is a size_t. Can I assume (keep - hast)
will never overflow size_t? Can I assume anything about the result of
the pointer subtraction?

If you have any concerns about subtracting pointers and 'size_t', how
about this? (Some parentheses are redundant but included as visual aids):

/**
 * Trim whitespace on the left and right of a string
 */
#include <stdlib.h>
#include <ctype.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
    char *i = string;

    /* Trim left */
    while (isspace((unsigned char)(*string = *i)))
        ++i;
    if (!*string)
        /* Empty string or only spaces */
        return string;

    /* Copy remaining string */
    while ((*string = *i)) {
        ++string;
        ++i;
    }

    /* Enable for security */
#if 0
    /* Truncate with erasure */
    while (i != string)
        *(i--) = 0;
#endif

    /* Trim right */
    --string;
    while (isspace((unsigned char)*string))
        *(string--) = 0;
    ++string;

    /* Return a pointer to the terminator */
    return string;
}

char *trim(char *string) {
    return string ? trim_unsafe(string) : string;
}

/**
 * Testing trim function
 */
#include <stdio.h>
#include <string.h>

/* Handy for arrays */
#define NUM_OF_ELEMENTS(array_) \
    (sizeof (array_) / sizeof *(array_))
#define FOR_EACH_ELEMENT(index_, array_) \
    for ((index_) = 0; (index_) < NUM_OF_ELEMENTS(array_); (index_)++)

int main(void) {
    char *tests[] = {
        "",
        " ",
        "  ",
        "f",
        " f",
        "  f",
        " f ",
        "  f  ",
        "f ",
        "f  ",
        "foo bar baz",
        " foo bar baz",
        "  foo bar baz",
        " foo bar baz ",
        "  foo bar baz  ",
        "foo bar baz ",
        "foo bar baz  "
    };
    size_t i;

    FOR_EACH_ELEMENT(i, tests) {
        char buf[80];
        strcpy(buf, tests[i]);
        printf("BEFORE: \"%s\"\n", buf);
        trim(buf);
        printf(" AFTER: \"%s\"\n\n", buf);
    }
    /* printf("%p\n", trim(NULL)); */

    return 0;
}
 

Keith Thompson

John Kelly said:
There is only one argument. A pointer. It may point to a valid string,
or it may point to whatever garbage is in some region of memory.



I haven't looked at the C library code to see if that's true or not.
But I hate to think it is.

You snipped the part where I explained that invalid arguments cause
undefined behavior. It won't *necessarily* show up as an infinite loop;
that's just one of many possibilities. You don't need to look at the
code that implements strlen() (on a particular system) to understand
this.

Consider this program:

#include <string.h>
#include <stdio.h>

int main(void) {
    char not_a_string[5] = "hello";
    size_t len = strlen(not_a_string);
    printf("len = %zu\n", len);
    return 0;
}

How can the implementation of strlen() detect that there's a problem
and avoid undefined behavior?

On a computer there's no such thing as an object of infinite size. Your
address bits define a limit. Once you reach the limit, stop looking for
end of string. It's not there.

First off, there is no portable way to tell when you've run out of
address bits. The standard says very little about how addresses are
represented.

But ok, given that an address is 32 bits, you could stop looking after
2**32 bytes. But that will still take you beyond the bounds of the
object you're examining.

Look again at the strlen() example above. Suppose the array is
followed in the machine's address space by a chunk of memory that
your process doesn't own. How can strlen() or trim() detect this
and avoid blowing up?

If you have a pointer to the beginning of a 100-byte array with a
'\0' in the last position, you must scan for 100 bytes. If you have
a pointer to a 10-byte array, not containing any '\0' characters,
immediately followed in the address space by memory not owned by
your process, you must not scan for more than 10 bytes. You cannot
tell the difference in any portable manner, and you very likely
cannot tell the difference even in some non-portable manner.

I want to prevent infinite loops and report errors. I'm not sure what
you want.

I want to explain to you how this stuff is actually defined.

[...]

Note that some languages treat arrays and/or strings as first-class
objects whose values carry their bounds with them. In such languages,
you can avoid these problems. For example, in Ada passing an array
parameter implicitly passes the array's bounds, which can be retrieved
from the parameter; in Perl, strings are scalar objects. In C, you
just have to be careful.
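
To make "be careful" concrete, the usual C remedy is to have the caller
pass the bound explicitly. A minimal sketch of that approach
(trim_bounded and its contract are illustrative, not John's actual API):

#include <ctype.h>
#include <stddef.h>

/* The caller supplies the buffer size, so the scan can never run past
   it.  Returns the trimmed length, or (size_t)-1 if no '\0' appears
   within cap bytes. */
size_t trim_bounded(char *s, size_t cap) {
    size_t len = 0;
    size_t start = 0;
    size_t j;

    while (len < cap && s[len] != '\0')
        len++;
    if (len == cap)
        return (size_t)-1;              /* not a string within bounds */

    /* Trim right */
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';

    /* Trim left: shift the remainder (and its '\0') down */
    while (isspace((unsigned char)s[start]))
        start++;
    if (start > 0) {
        for (j = 0; start + j <= len; j++)
            s[j] = s[start + j];
        len -= start;
    }
    return len;
}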
 

Seebs

Seebs said:
Imagine that size_t and ptrdiff_t are both 32 bits, and that you have access
to an object 3GB in size, that being an array of one and a half billion
shorts (assume two-byte shorts).

ptrdiff_t p = &ary[1500000000] - &ary[0];

This calculation produces an intermediate value of 3 billion (not
representable in ptrdiff_t), which is then divided by two. The
resulting value (1.5 billion) is representable in ptrdiff_t, but it
is quite possible that instead you'll end up with a number which
is on the close order of negative five hundred million.

I don't think that's allowed -- if the result is representable, the
implementation is obliged to get it right. (Since you only need one
extra bit and most machines have some kind of carry flag that can be
used to get the right answer easily, it's not a great burden.)

It seems to me like you SHOULD get the right result, but I won't be
surprised if this is a bug in at least one implementation, just
because I can't imagine it coming up very often.

-s
 

Shao Miller

Keith said:
John Kelly said:
There is only one argument. A pointer. It may point to a valid string,
or it may point to whatever garbage is in some region of memory.

I haven't looked at the C library code to see if that's true or not.
But I hate to think it is.

You snipped the part where I explained that invalid arguments cause
undefined behavior. It won't *necessarily* show up as an infinite loop;
that's just one of many possibilities. You don't need to look at the
code that implements strlen() (on a particular system) to understand
this.

Consider this program:

#include <string.h>
#include <stdio.h>

int main(void) {
    char not_a_string[5] = "hello";
    size_t len = strlen(not_a_string);
    printf("len = %zu\n", len);
    return 0;
}

How can the implementation of strlen() detect that there's a problem
and avoid undefined behavior?

On a computer there's no such thing as an object of infinite size. Your
address bits define a limit. Once you reach the limit, stop looking for
end of string. It's not there.

First off, there is no portable way to tell when you've run out of
address bits. The standard says very little about how addresses are
represented.

We do have the pigeonhole principle though, I think. 'CHAR_BIT * sizeof
(void *)' should give a reasonable maximum number of bits, shouldn't it?
If any pointer can be converted to a 'void *', assuming there isn't
any meta-data outside of the pointer's object representation, and
assuming an implementation must not allow two 'void *' values to
represent two separate objects, it seems like a fair upper bound.
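
A tiny sketch of that bound, useful only under the assumptions just
listed:

#include <limits.h>
#include <stdio.h>

int main(void) {
    /* Pigeonhole bound: a void * has CHAR_BIT * sizeof (void *) bits,
       so it can distinguish at most 2**bits byte addresses; no single
       object can be larger than that. */
    unsigned bits = (unsigned)(CHAR_BIT * sizeof (void *));
    printf("at most 2**%u byte addresses\n", bits);
    return 0;
}
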
But ok, given that an address is 32 bits, you could stop looking after
2**32 bytes. But that will still take you beyond the bounds of the
object you're examining.

I would be very hopeful that an implementation that actually checks
bounds also offers a documented means for the programmer to access those
bounds. If an implementation does not satisfy this hope, that would be
unfortunate.

Look again at the strlen() example above. Suppose the array is
followed in the machine's address space by a chunk of memory that
your process doesn't own. How can strlen() or trim() detect this
and avoid blowing up?

Let alone anyone. If there's no bounds information anywhere, what
actually determines the bounds? Intention? :) If an implementation or
the environment has traps for such things, the information exists
somewhere. That somewhere might or might not be accessible to the
programmer, and even if accessible, might be a lot of work.

If you have a pointer to the beginning of a 100-byte array with a
'\0' in the last position, you must scan for 100 bytes. If you have
a pointer to a 10-byte array, not containing any '\0' characters,
immediately followed in the address space by memory not owned by
your process, you must not scan for more than 10 bytes. You cannot
tell the difference in any portable manner, and you very likely
cannot tell the difference even in some non-portable manner.

Well, is John making a string-trimming function or a 'char[]'-trimming
function? For the latter, passing a count or a size in bytes might be a
good idea.

I want to prevent infinite loops and report errors. I'm not sure what
you want.

I want to explain to you how this stuff is actually defined.

[...]

Note that some languages treat arrays and/or strings as first-class
objects whose values carry their bounds with them. In such languages,
you can avoid these problems. For example, in Ada passing an array
parameter implicitly passes the array's bounds, which can be retrieved
from the parameter; in Perl, strings are scalar objects. In C, you
just have to be careful.

And if a pointer value includes bounds information, perhaps the
implementation would be kind enough to document either the
representation or how to access the information. Maybe not.

For full portability but great effort, John could force users of the
library to only pass references to objects which themselves were created
by other of John's functions. One could provide macros instead of
declarations, or use allocated storage exclusively, or include signature
checks to ensure the programmer used the provided functions to create
their objects. Seems complex, though. :)
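
A sketch of that last idea, with hypothetical names (trimlib_new and
friends are illustrative only); note the check is merely a heuristic,
since inspecting the header of a foreign pointer is itself undefined:

#include <stdlib.h>

/* Strings are created only through the library, which prepends a
   magic header that other library calls can inspect. */
#define TRIMLIB_MAGIC 0x54524Du   /* "TRM" */

struct trimlib_hdr {
    unsigned magic;
    size_t cap;                   /* usable bytes that follow */
};

char *trimlib_new(size_t cap) {
    struct trimlib_hdr *h = malloc(sizeof *h + cap);
    if (h == NULL)
        return NULL;
    h->magic = TRIMLIB_MAGIC;
    h->cap = cap;
    return (char *)(h + 1);
}

/* Heuristic only: if s did not come from trimlib_new(), reading the
   header is itself undefined behavior. */
int trimlib_valid(const char *s) {
    const struct trimlib_hdr *h = (const struct trimlib_hdr *)s - 1;
    return h->magic == TRIMLIB_MAGIC;
}

void trimlib_free(char *s) {
    if (s != NULL)
        free((struct trimlib_hdr *)s - 1);
}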
 
