trim whitespace

John Kelly

First off, there is no portable way to tell when you've run out of
address bits. The standard says very little about how addresses are
represented.

But ok, given that an address is 32 bits, you could stop looking after
2**32 bytes. But that will still take you beyond the bounds of the
object you're examining.

For the purposes of making trim() robust against bad input, there's no
such thing as an "object" to be examined. If the user points me to some
random garbage memory, then I will try and trim the garbage. If I can
find a \0 out there in the garbage, I'll treat the garbage as a string
and trim it.

I don't care what pointer the user gives me as input. That's his
problem. I just want to guarantee that no matter what he gives me, I
won't be stuck in an infinite loop.

Look again at the strlen() example above. Suppose the array is
followed in the machine's address space by a chunk of memory that
your process doesn't own. How can strlen() or trim() detect this
and avoid blowing up?
If you have a pointer to the beginning of a 100-byte array with a
'\0' in the last position, you must scan for 100 bytes. If you have
a pointer to a 10-byte array, not containing any '\0' characters,
immediately followed in the address space by memory not owned by
your process, you must not scan for more than 10 bytes. You cannot
tell the difference in any portable manner, and you very likely
cannot tell the difference even in some non-portable manner.
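
Since the bound cannot be recovered from the pointer, the only portable way to stop at 10 bytes is for the caller to supply the size. A minimal sketch, assuming a hypothetical helper `bounded_len` that takes the array size as a second argument (which is exactly what a pointer-only interface rules out):

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the trim() discussed in this thread:
 * report the length of the string in buf, examining at most size bytes.
 * Returns size if no '\0' was found, meaning buf does not hold a string.
 */
static size_t bounded_len(const char *buf, size_t size)
{
    const char *end = memchr(buf, '\0', size);
    return end ? (size_t)(end - buf) : size;
}
```

With a 100-byte array terminated in its last position this returns 99. With a 10-byte array containing no '\0' it returns 10 without touching the eleventh byte.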

Segfaults are the caller's problem, not mine. If he gives me a garbage
pointer, I'll try and do what the dummy asks.

I want to explain to you how this stuff is actually defined.

I think I understand it. But I don't think you understand what my
objective is, or why.

I just finished up the bullet proof version. Here it comes ...
 
Seebs

For the purposes of making trim() robust against bad input, there's no
such thing as an "object" to be examined. If the user points me to some
random garbage memory, then I will try and trim the garbage. If I can
find a \0 out there in the garbage, I'll treat the garbage as a string
and trim it.

You're still not understanding.

The mere FACT of accessing random garbage memory is undefined behavior.

You have lost. You cannot prevent undefined behavior if you are given
bad inputs.
I don't care what pointer the user gives me as input. That's his
problem. I just want to guarantee that no matter what he gives me, I
won't be stuck in an infinite loop.

Again, not possible. ***ANY*** interaction, ANY AT ALL, with the
hypothetical invalid pointer, is undefined behavior. Merely checking
its value to see whether it is a null pointer is undefined behavior.
Looking at even a single byte of its contents is undefined behavior.

And undefined behavior can, among other things, put you in an infinite
loop.
Segfaults are the caller's problem, not mine. If he gives me a garbage
pointer, I'll try and do what the dummy asks.

Exactly.

And that's why all this talk about guarantees is nonsense -- your
code, given particular bad inputs, could do just about anything.
I think I understand it. But I don't think you understand what my
objective is, or why.

You've said repeatedly that your goal is to always produce a valid
response no matter the input. That's impossible.

You've sometimes fallen back on just not looping forever -- but I
can't imagine why, given that this is one of the least likely failure
modes one could conceive of. (Indeed, I don't know of any real-world
systems where it would happen, ever.)

-s
 
Geoff

The mere FACT of accessing random garbage memory is undefined behavior.

You have lost. You cannot prevent undefined behavior if you are given
bad inputs.

GIGO still applies. IIRC, the term was invented for C. :)
 
Shao Miller

Seebs said:
You're still not understanding.

The mere FACT of accessing random garbage memory is undefined behavior.

You have lost. You cannot prevent undefined behavior if you are given
bad inputs.
You can force a user of your function to create good inputs, but that
would seem like a lot of effort.
Again, not possible. ***ANY*** interaction, ANY AT ALL, with the
hypothetical invalid pointer, is undefined behavior. Merely checking
its value to see whether it is a null pointer is undefined behavior.
Looking at even a single byte of its contents is undefined behavior.
You can pass unions like:

union char_ptr {
    char *p;
    unsigned char bytes[sizeof (char *)];
};

and you can examine the bytes via the 'bytes' member, no? Just don't
use 'p'. You can even compare these bytes to a null pointer object
representation. That would seem to be a lot of effort, though.
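
A rough sketch of that byte-wise comparison, assuming a hypothetical helper named `has_null_representation`. It reads only the `bytes` member. A null pointer may have more than one object representation, so a result of 0 does not prove the pointer is non-null:

```c
#include <string.h>

union char_ptr {
    char *p;
    unsigned char bytes[sizeof (char *)];
};

/*
 * Hypothetical check: do the stored bytes match the representation this
 * implementation produces when NULL is assigned to a char *?
 * The p member of *u is never read.
 */
static int has_null_representation(const union char_ptr *u)
{
    union char_ptr null_ref;

    null_ref.p = NULL;
    return memcmp(u->bytes, null_ref.bytes, sizeof null_ref.bytes) == 0;
}
```

This says nothing about whether a non-null value points at a string, so it does not help trim() itself.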
And undefined behavior can, among other things, put you in an infinite
loop.


Exactly.

And that's why all this talk about guarantees is nonsense -- your
code, given particular bad inputs, could do just about anything.
You can get your guarantees, given sufficient effort. Force all inputs
through an opaque interface. Good and bad is then yours to define. You
can digitally sign "good" references to objects.
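
One way such an opaque interface might look, as a sketch only. The names `str_register` and `str_get`, the integer handles, and the fixed pool sizes are all assumptions made for illustration, and the signing idea is left out:

```c
#include <stddef.h>
#include <string.h>

#define MAX_STRINGS 8
#define MAX_LEN 64

/* Storage is private to this file; callers only ever hold an int handle. */
static char pool[MAX_STRINGS][MAX_LEN];
static int in_use[MAX_STRINGS];

/* Copy text into the pool. Returns a handle, or -1 if it does not fit.
   The text argument itself must still be a valid string. */
static int str_register(const char *text)
{
    size_t len = strlen(text);

    if (len >= MAX_LEN)
        return -1;
    for (int h = 0; h < MAX_STRINGS; ++h) {
        if (!in_use[h]) {
            memcpy(pool[h], text, len + 1);
            in_use[h] = 1;
            return h;
        }
    }
    return -1;
}

/* Any int is an acceptable argument: bad handles are rejected by a range
   check and are never dereferenced. */
static const char *str_get(int handle)
{
    if (handle < 0 || handle >= MAX_STRINGS || !in_use[handle])
        return NULL;
    return pool[handle];
}
```

Every possible `int` is a defined input to `str_get`, which is the property a raw `char *` parameter cannot offer.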
You've said repeatedly that your goal is to always produce a valid
response no matter the input. That's impossible.
Outside of the context of the 'trim' function, could this function
produce a valid response no matter the input?:

int func(char *cp) {
    return 0;
}
You've sometimes fallen back on just not looping forever -- but I
can't imagine why, given that this is one of the least likely failure
modes one could conceive of. (Indeed, I don't know of any real-world
systems where it would happen, ever.)
What would cause one to loop forever, wrap-around? How about multiple
'memmove's, each within a manageable range?
 
Keith Thompson

Shao Miller said:
Keith Thompson wrote: [...]
But ok, given that an address is 32 bits, you could stop looking after
2**32 bytes. But that will still take you beyond the bounds of the
object you're examining.
I would be very hopeful that an implementation that actually checks
bounds also offers a documented means for the programmer to access those
bounds. If an implementation does not satisfy this hope, that would be
unfortunate.

Who says the implementation checks bounds?
Let alone anyone. If there's no bounds information anywhere, what
actually determines the bounds? Intention? :)

More or less. Violating bounds causes undefined behavior, but the
language provides no way to detect those bounds. If that's not
acceptable, you might consider a language other than C.

Yes, C implementations *can* perform bounds checking, but most don't,
and portable code cannot depend on it.

[...]
Well is John making a string-trimming function or a 'char[]'-trimming
function? For the latter, passing a count or a size in bytes might be a
good idea.

I won't try to speak for him.

[...]
 
Keith Thompson

John Kelly said:
I don't care what pointer the user gives me as input. That's his
problem. I just want to guarantee that no matter what he gives me, I
won't be stuck in an infinite loop.

For some bad inputs, you cannot avoid undefined behavior.
One possible consequence of undefined behavior is an infinite loop.

[...]
Segfaults are the caller's problem, not mine. If he gives me a garbage
pointer, I'll try and do what the dummy asks.

That's perfectly reasonable. The documentation for your trim()
function establishes a contract between the function and the caller.
If the caller violates that contract by passing something that
doesn't point to a string, the behavior is undefined.
I think I understand it. But I don't think you understand what my
objective is, or why.

I think I understand what your objective is. I'm just telling you that
it's not possible to achieve it.
 
John Kelly

For some bad inputs, you cannot avoid undefined behavior.
One possible consequence of undefined behavior is an infinite loop.

I don't know about you, but I can always find a way to guarantee my code
will never be stuck in an infinite loop.

I think I understand what your objective is. I'm just telling you that
it's not possible to achieve it.

Despite what you believe, I did.
 
John Kelly

Well is John making a string-trimming function or a 'char[]'-trimming
function? For the latter, passing a count or a size in bytes might be a
good idea.

NO!

A pointer, and nothing more.
 
Shao Miller

Keith said:
Shao Miller said:
Keith Thompson wrote: [...]
But ok, given that an address is 32 bits, you could stop looking after
2**32 bytes. But that will still take you beyond the bounds of the
object you're examining.
I would be very hopeful that an implementation that actually checks
bounds also offers a documented means for the programmer to access those
bounds. If an implementation does not satisfy this hope, that would be
unfortunate.

Who says the implementation checks bounds?
Nobody. I was agreeing.
More or less. Violating bounds causes undefined behavior, but the
language provides no way to detect those bounds.
I was agreeing.
If that's not
acceptable, you might consider a language other than C.
"One" might, I agree. Personally, I accept that C allows for an
implementation to check bounds and does not provide for a programmer to
determine if bounds are checked, and what the bounds are, if they are
checked.
Yes, C implementations *can* perform bounds checking, but most don't,
and portable code cannot depend on it.
If you're referring to pointer arithmetic resulting in an overflow, I
fully agree that it would be difficult for portable code to prevent
overflow when the notion of the reality of a choice for "number of
elements" in an "array object" is not even defined. Declarations come
close for automatic and static storage, 'calloc' comes pretty close for
allocated storage, but does that prevent 'malloc' from producing "array
objects"? Doubtful. "If the array is large enough" seems a lot like
"intention".

Thank goodness most don't. :)

Portable code can still force good inputs, with the cost of complexity,
perhaps.
[...]
Well is John making a string-trimming function or a 'char[]'-trimming
function? For the latter, passing a count or a size in bytes might be a
good idea.

I won't try to speak for him.
It seems like 'trim' could be usefully implemented for each of the two
circumstances, but perhaps not as a single implementation for both.
 
John Kelly

Well is John making a string-trimming function or a 'char[]'-trimming
function? For the latter, passing a count or a size in bytes might be a
good idea.

I won't try to speak for him.
It seems like 'trim' could be usefully implemented for each of the two
circumstances, but perhaps not as a single implementation for both.

Well at first I said NO, but on second thought ...

You could have another argument that's optional. If it's NULL, operate
without it, and treat the data as a string. If it has a value, treat it
as an array.
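
A sketch of what that optional argument might look like. The name `trim_opt`, the `size_t *` parameter type, and the choice to write the new length back through it are assumptions, not the code referred to in this thread. It also assumes `s` is non-null:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical interface: if len is NULL, s is treated as a string and
 * strlen() finds its end. Otherwise *len is the number of bytes in the
 * array. Trims in place and returns the trimmed length. In string mode
 * the result is re-terminated with '\0'. In array mode *len is updated
 * and no terminator is written.
 */
static size_t trim_opt(char *s, size_t *len)
{
    size_t n = len ? *len : strlen(s);
    size_t start = 0;

    while (start < n && isspace((unsigned char)s[start]))
        ++start;
    while (n > start && isspace((unsigned char)s[n - 1]))
        --n;

    n -= start;
    memmove(s, s + start, n);
    if (len)
        *len = n;
    else
        s[n] = '\0';
    return n;
}
```

In array mode the scan never reads past the count the caller supplied, so an unterminated buffer is handled without looking for a '\0'.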

Should be easy to integrate with the code I already wrote. Do I have to
do everything? Get to it!

:)
 
Seebs

I don't know about you, but I can always find a way to guarantee my code
will never be stuck in an infinite loop.

No, you can't. Not if you are given invalid inputs.
Despite what you believe, I did.

No, you didn't.

You produced something which, given a suitable invalid pointer input,
will do crazy things such as core dumping *or looping forever* on
some targets.

-s
 
Uno

Seebs said:
Yes. And if you write code that allows buffer overruns, that "undefined
behavior" can be a compromise of the user's machines. So don't allow
buffer overruns.


By not reaching undefined behavior in the first place.

Oh no. Undefined behavior. Therefore I must shit my pants.
Furthermore, "how things really work" is semantically invalid. They really
work however it made sense to the implementor for them to work on a given
target. That can vary from one compiler to another, from one target to
another, even by compiler flags of various sorts.

'"how things really work" is semantically invalid.'

That must sound just as right as ugly german monarchs and shepherd's
pie, which, I hear, tastes good.

I was interested that you claimed bp to be less than british now.
 
Kelsey Bjarnason

[snips]

For the purposes of making trim() robust against bad input, there's no
such thing as an "object" to be examined. If the user points me to some
random garbage memory, then I will try and trim the garbage. If I can
find a \0 out there in the garbage, I'll treat the garbage as a string
and trim it.

Assume an implementation where a pointer consists of two parts: a segment
identifier and an offset.

On allocation, the OS reserves a virtual memory region - which may well
not have any actual memory behind it. On read or write, the OS detects a
fault (no memory page mapped), sorts out how to handle it, digs up a
usable page of memory, maps it in and away you go, reading or writing.

Then you free the memory, telling the OS that the virtual memory region
is now invalid, that it does not exist.

What happens now, if you try to - as you say - "try and trim the
garbage"? You read a byte, the system faults, the OS's memory manager
kicks in to load the associated page of memory, but there _is no_
associated page of memory. You are trying to access memory which _does
not exist_. The app - if you're lucky - is summarily killed by the OS
for trying to poke its nose into memory it doesn't own.

In C terms, by calling "free", you disposed of the object in question,
but then attempted to examine the object after the fact. C's answer to
this is "undefined behaviour", where *any* outcome is perfectly
acceptable: crashing, "working", trashing memory belonging to other
processes, setting your CPU on fire, causing your kitten to tie you up
and flog you with damp tea bags, it's _all_ perfectly acceptable as far
as C is concerned.

Once that pointer has become invalid, there is simply _no manner_ in
which you can use it, for any meaningful operation, other than assigning
a new value to it - which does not help one bit in trying to do what
you're seeking to accomplish.
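
The "assign a new value" case can be sketched as follows, assuming a hypothetical helper named `release`. It resets only the one pointer it is handed. Any other copies of the old value stay invalid, and a callee such as trim() has no means of noticing:

```c
#include <stdlib.h>

/*
 * Hypothetical helper: free the buffer and immediately give the caller's
 * pointer a new, valid value (NULL), so later code does not read the
 * indeterminate value left behind by free().
 */
static void release(char **pp)
{
    free(*pp);      /* free(NULL) is a no-op, so calling twice is harmless */
    *pp = NULL;
}
```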
 
Nick

John Kelly said:
But I'm going to fix it, so that it gives up after reaching some
predetermined limit.



The only question is what limit to choose. I could pick some arbitrary
number like 32767 that will work on the vast majority of platforms, and
satisfy the vast majority of trim() use cases.

But for the sake of good design, I'm looking for a limit that may vary
from one platform to another. Hints are welcome.

I just wouldn't do that. If you make it at least as safe as strlen (and
you're making it safer because it's resistant to a null pointer as
input) that's good enough for any program that's likely to use your
routine. Far better than dying arbitrarily in the future when asked to
trim the OED on a machine with 10PB of memory or similar.
 
John Kelly

Assume an implementation where a pointer consists of two parts: a segment
identifier and an offset.

On allocation, the OS reserves a virtual memory region - which may well
not have any actual memory behind it. On read or write, the OS detects a
fault (no memory page mapped), sorts out how to handle it, digs up a
usable page of memory, maps it in and away you go, reading or writing.

Then you free the memory, telling the OS that the virtual memory region
is now invalid, that it does not exist.

What happens now, if you try to - as you say - "try and trim the
garbage"? You read a byte, the system faults, the OS's memory manager
kicks in to load the associated page of memory, but there _is no_
associated page of memory. You are trying to access memory which _does
not exist_. The app - if you're lucky - is summarily killed by the OS
for trying to poke its nose into memory it doesn't own.

In C terms, by calling "free", you disposed of the object in question,
but then attempted to examine the object after the fact. C's answer to
this is "undefined behaviour", where *any* outcome is perfectly
acceptable: crashing, "working", trashing memory belonging to other
processes, setting your CPU on fire, causing your kitten to tie you up
and flog you with damp tea bags, it's _all_ perfectly acceptable as far
as C is concerned.

Once that pointer has become invalid, there is simply _no manner_ in
which you can use it, for any meaningful operation, other than assigning
a new value to it - which does not help one bit in trying to do what
you're seeking to accomplish.

This is the same objection as Keith's, which I already answered.
 
John Kelly

Oh no. Undefined behavior. Therefore I must shit my pants.

'"how things really work" is semantically invalid.'

That must sound just as right as ugly german monarchs and shepherd's
pie, which, I hear, tastes good.

I was interested that you claimed bp to be less than british now.

Heh.

Another reason I don't read Seebs. He wastes time with nebulous ideas
and opinion. Life is too short for that.
 
John Kelly

If you have any concerns about subtracting pointers and 'size_t', how
about this? (Some parentheses are redundant but included as visual aids):

/**
 * Trim whitespace on the left and right of a string
 */
#include <stdlib.h>
#include <ctype.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
    char *i = string;

    /* Trim left: copy forward until the first non-space lands at *string.
       The cast keeps isspace() defined when plain char is signed. */
    while (isspace((unsigned char)(*string = *i)))
        ++i;
    if (!*string)
        /* Empty string or only spaces */
        return string;

    /* Copy remaining string */
    while ((*string = *i) != '\0') {
        ++string;
        ++i;
    }

    /* Enable for security */
#if 0
    /* Truncate with erasure */
    while (i != string)
        *(i--) = 0;
#endif

    /* Trim right */
    --string;
    while (isspace((unsigned char)*string))
        *(string--) = 0;
    ++string;

    /* Return a pointer to the terminator */
    return string;
}

char *trim(char *string) {
    return string ? trim_unsafe(string) : string;
}

It looks like you're making two passes of the right side space, once
when copying, and again when trimming. I like to avoid any extra work.

Think hard, work smart.
 
Lew Pitcher

I haven't followed this thread, and my suggestions may be redundant...

If you have any concerns about subtracting pointers and 'size_t', how
about this? (Some parentheses are redundant but included as visual aids):

/**
* Trim whitespace on the left and right of a string
*/
#include <stdlib.h>
#include <ctype.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
char *i = string;
[snip]

It looks like you're making two passes of the right side space, once
when copying, and again when trimming. I like to avoid any extra work.

Think hard, work smart.

ISTM that you are correct, and a good programmer could reduce this whole
process to /one/ read pass through the input string, followed by (assuming
the "unsafe" method of reusing the input string as the output) a judicious
placement of a '\0', and the return of an offset-adjusted pointer.

Here's how I would do it...

In a single loop from the start of the string to the end,
count the number of characters in the string,
count the number of leading whitespace characters, and
count the number of trailing whitespace characters

Now, place a '\0' at string[length_of_string - trailing_whitespace_count]
to truncate the string and discard the trailing whitespace characters.

Finally, return to the caller the value (string + leading_whitespace_count),
to return a pointer to the first non-whitespace of the string.

Granted, this approach modifies the input string, and cannot be used on a
const string type (i.e. " something ").
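
The three-count description above could be sketched like this. The function name `trim_counts` and the `seen_text` flag are assumptions made for illustration, and the sketch assumes a non-null pointer to a modifiable string:

```c
#include <ctype.h>
#include <stddef.h>

/*
 * One read pass over the string. Truncates trailing whitespace by writing
 * a single '\0' and returns a pointer to the first non-whitespace
 * character (or to the terminator if there is none).
 */
static char *trim_counts(char *string)
{
    size_t length = 0;    /* characters seen so far */
    size_t leading = 0;   /* whitespace before the first non-space */
    size_t trailing = 0;  /* whitespace since the last non-space */
    int seen_text = 0;    /* set once a non-space has been seen */

    for (; string[length] != '\0'; ++length) {
        if (isspace((unsigned char)string[length])) {
            if (seen_text)
                ++trailing;
            else
                ++leading;
        } else {
            seen_text = 1;
            trailing = 0;
        }
    }

    string[length - trailing] = '\0';
    return string + leading;
}
```

Because the returned pointer is offset from the original, a caller that obtained the buffer from malloc() has to keep the original pointer around for free().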

HTH
 
Seebs

Granted, this approach modifies the input string, and cannot be used on a
const string type (i.e. " something ").

Pickiness moment: String literals aren't const, they're just not
safely modifiable.
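
A small illustration of the distinction, with hypothetical helper names. An array initialized from a literal is an ordinary modifiable object, whereas writing through a pointer to the literal itself is undefined even though the type is not const-qualified:

```c
#include <string.h>

/* Hypothetical helper: overwrite trailing spaces with '\0' in a writable buffer. */
static void chop_trailing_spaces(char *s)
{
    size_t n = strlen(s);

    while (n > 0 && s[n - 1] == ' ')
        s[--n] = '\0';
}

/* Returns 1 if the writable copy was modified as expected. */
static int demo_writable_copy(void)
{
    char writable[] = " something ";  /* array copy of the literal: modifiable */
    /* char *lit = " something ";        compiles, but writing via lit is undefined */

    chop_trailing_spaces(writable);
    return strcmp(writable, " something") == 0;
}
```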

-s
 
John Kelly

I haven't followed this thread, and my suggestions may be redundant...
ISTM that you are correct, and a good programmer could reduce this whole
process to /one/ read pass through the input string, followed by (assuming
the "unsafe" method of reusing the input string as the output) a judicious
placement of a '\0', and the return of an offset-adjusted pointer.

Here's how I would do it...

In a single loop from the start of the string to the end,
count the number of characters in the string,
count the number of leading whitespace characters, and
count the number of trailing whitespace characters

Now, place a '\0' at string[length_of_string - trailing_whitespace_count]
to truncate the string and discard the trailing whitespace characters.

Finally, return to the caller the value (string + leading_whitespace_count),
to return a pointer to the first non-whitespace of the string.

Granted, this approach modifies the input string, and cannot be used on a
const string type (i.e. " something ").

My objective is to modify the string in-place and leave the pointer
untouched, so I don't need to return a pointer. I want to return the
count of characters in the new string.

My technique is similar to your suggested three count approach. But I
use three pointers instead of three counters. It's a natural setup for
using memmove() at the end.
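
For readers following along, a three-pointer version might look roughly like the sketch below. This is not the code referred to above, which is not shown in this part of the thread. The name `trim_ptrs` and the pointer names are assumptions, and it assumes a non-null pointer to a modifiable string:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Trim in place, leaving the caller's pointer unchanged.
 * Returns the number of characters in the trimmed string.
 */
static size_t trim_ptrs(char *s)
{
    char *first = NULL;  /* first non-whitespace character */
    char *last = NULL;   /* last non-whitespace character */
    char *p;             /* scanning pointer */
    size_t n;

    for (p = s; *p != '\0'; ++p) {
        if (!isspace((unsigned char)*p)) {
            if (first == NULL)
                first = p;
            last = p;
        }
    }

    if (first == NULL) {  /* empty, or whitespace only */
        *s = '\0';
        return 0;
    }

    n = (size_t)(last - first) + 1;
    memmove(s, first, n);  /* regions may overlap, hence memmove */
    s[n] = '\0';
    return n;
}
```

The single scan finds both ends, and the one memmove() call does all the copying, so the right-hand whitespace is read only once.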
 
