trim whitespace

I

ImpalerCore


Would you be more willing to accept it if instead of a defensive
programming style, it had a contract programming style? You could
consider the version that silently crashes a contract programming
style, except that the terms of the contract are implicit, discovered
either through the documentation or spending time in the debugger.

\code snippet
#define C_REQUIRE( expr ) \
do \
{ \
if ( !(expr) ) { \
gc_contract_violation( "Precondition failed: " #expr, \
__FILE__, __LINE__ ); \
} \
} while(0)

char* trim( char* s )
{
C_REQUIRE( s != NULL );

... rest of trim ...

return s;
}
\endcode

where C_REQUIRE would basically do what assert does, except with the
added connotation that it's a precondition of the function, rather
than a generic assertion.
No.  Here's the thing:

If I were DESIGNING the library, I'd insist on it being robust in that
way, probably.  (There's a case to be made, in some environments, for
not doing that -- if things get passed around enough, you're testing
the same pointer for null/not-null dozens of times.)

But since it ISN'T that way, and it's a standard part of the language,
I expect users to know that.

The problem is that a NULL pointer argument can represent valid or
invalid semantics. If the client provides a NULL argument where it is
invalid semantics (like passing a NULL pointer to the printf format
string), it's a software defect and should in an ideal world be
corrected rather than let slide. These viewpoints about function
design seem to be a tradeoff between robustness and correctness.

I ask myself, is the NULL pointer viewed as a valid semantic
representation of a string object? If I looked at it from the point
of designing strlen, I think there _is_ a semantic difference between
strlen( NULL ) and say strlen( "" ). The expression 'strlen( NULL )'
doesn't make any sense because in my mind, NULL isn't a representation
of any valid string buffer, and it's not considered an alternate form
of strlen( "" ). Why should clients be given a false impression that
a NULL pointer is a valid string representation in some string
functions (like trim) and not in others (like strlen, where it results
in a crash)? It gives mixed signals of what a NULL char* pointer
represents.

On the other side, one example that can define a NULL pointer as
semantically valid is a linked list, where the NULL pointer is
commonly used to indicate an empty list.
I think it sometimes does.


Mine is that I basically always do it in my code, and never assume anyone
else does it.  Except free() because I'm emotionally attached to the
sanity of free(malloc(1));

My string functions currently do check for null arguments and skip any
processing in their current form, much like your version of trim, but
I'm weighing whether to convert the behavior to use an explicit
precondition like that shown in C_REQUIRE above, to keep the same kind
of no-nonsense pressure the standard library provides to force
developers to use string functions the right way, rather than giving
them the freedom to do it the wrong way.
 
J

John Kelly

I don't have a problem in general with functions that pass NULL
arguments through and do nothing to avoid a crash, but does a function
that crashes on a NULL pointer imply a design flaw?

God's algorithms never crash. I, though imperfect, desire the same
excellence.
 
J

John Kelly

My string functions currently do check for null arguments and skip any
processing in their current form, much like your version of trim, but
I'm weighing whether to convert the behavior to use an explicit
precondition like that shown in C_REQUIRE above, to keep the same kind
of no-nonsense pressure the standard library provides to force
developers to use string functions the right way, rather than giving
them the freedom to do it the wrong way.

Are you saying crashing is "no-nonsense pressure"? You also said:

I would go with other people's recommendation and use a return value
for error if possible, and errno to indicate the kind of error.

That's what I've decided to do. Functions should NEVER crash no matter
how bad the input.
 
T

Thad Smith

Dann said:
#include <ctype.h>
#include <string.h>

char *
alltrim (char *s)
{
char *start = s;
char *trimend;
size_t end;
if (!s)
return s;
end = strlen (s);
if (!end)
return s;
while (isspace (*start))
start++;
trimend = &s[end-1];
while (isspace (*trimend))
{
*trimend-- = 0;
}
return start;
}
#ifdef UNIT_TEST
#include <stdio.h>
static char s0[] = " I need trimming on both ends. ";
static char s1[] = "I need trimming on far end. ";
static char s2[] = " I need trimming on near end.";
static char s3[] = " I need more trimming on both ends. ";
static char s4[] = "\n\t\rI need special trimming on both ends.\n\t\r ";
static char s5[] = " \n\t\r I need special trimming on near end.";
static char s6[] = "I need special trimming on far end. \n\t\r ";
static char s7[] = "I need no trimming";
static char s8[] = "";

static char *strings[12] =
{
s0, s1, s2, s3, s4, s5, s6, s7, s8, NULL,
};

int
main (void)
{
int i;
for (i = 0; i < 10; i++)
{
printf ("Original string:[%s]\n", strings);
printf ("Trimmed string:[%s]\n", alltrim (strings));
puts("---------------------------------------");
}
return 0;
}
#endif


Which elements of
struct {
char a[2];
char b[2];
} s = {" ", " "};

are set to 0 by
alltrim(s.b);
?
 
T

Thad Smith

Seebs said:
I would check ts before checking *ts. (It's also not obvious why
you're passing ts in as a "char **".)


It is rare but conceivable that printf will itself alter errno
even if it succeeds; check the FAQ for the famous question of
why errno has changed to ENOTTY after a successful call to printf.

I would usually send errors to stderr, not stdout, but perhaps
more importantly, I would find it pretty upsetting for a "trim
whitespace" function to produce a diagnostic error if I handed
it an empty string!

True, but that tests for a null pointer to the string, not an empty string.
 
J

John Kelly

No. Overlapping strcpy is *never* safe.

What was I thinking.

But you have pointers to the beginning and end of the string so you know
its length and thus can use memmove instead, which *is* safe.

I suspect memmove involves a temporary buffer.

If I save a copy of the base pointer, I can also calculate how much was
trimmed on the left. Knowing that, I think I can use memcpy two times,
without overlap in either copy operation. Seems more clever than using
a temporary buffer.
 
S

Seebs

The problem is that a NULL pointer argument can represent valid or
invalid semantics. If the client provides a NULL argument where it is
invalid semantics (like passing a NULL pointer to the printf format
string), it's a software defect and should in an ideal world be
corrected rather than let slide. These viewpoints about function
design seem to be a tradeoff between robustness and correctness.
Somewhat.

I ask myself, is the NULL pointer viewed as a valid semantic
representation of a string object? If I looked at it from the point
of designing strlen, I think there _is_ a semantic difference between
strlen( NULL ) and say strlen( "" ). The expression 'strlen( NULL )'
doesn't make any sense because in my mind, NULL isn't a representation
of any valid string buffer, and it's not considered an alternate form
of strlen( "" ). Why should clients be given a false impression that
a NULL pointer is a valid string representation in some string
functions (like trim) and not in others (like strlen, where it results
in a crash)? It gives mixed signals of what a NULL char* pointer
represents.

Here's my thought: The difference is whether the return makes
additional representations, such as "that was a valid string". If
the behavior is
if (s == NULL)
return NULL;

then I'm not converting an invalid string into valid data; I'm leaving
it invalid.

For strlen(), there's no value (maybe a -1, if you didn't use
size_t) that communicates "that wasn't even a string, your
question is invalid".

-s
 
S

Seebs

I suspect memmove involves a temporary buffer.
If I save a copy of the base pointer, I can also calculate how much was
trimmed on the left. Knowing that, I think I can use memcpy two times,
without overlap in either copy operation. Seems more clever than using
a temporary buffer.

Stop trying to outsmart the library.

In everything I've ever seen, memmove is much much MUCH faster than
anything you can write, and does not use a temporary buffer -- it just
ensures that it copies in an order that produces the desired results.

You've proposed a much more complicated and error-prone solution
which is virtually always slower...

-s
 
J

James Waldby

I suspect memmove involves a temporary buffer.

If I save a copy of the base pointer, I can also calculate how much was
trimmed on the left. Knowing that, I think I can use memcpy two times,
without overlap in either copy operation. Seems more clever than using
a temporary buffer.

Using memcpy twice might be more clever than using a temporary buffer
(although I don't see just how you expect two memcpy's to work; perhaps
you could illustrate your method, moving a long string a small ways
left and a small ways right?) Anyhow, your rationale is wrong: memmove
needs no temporary buffer. It merely decides which way to move source
bytes -- near-end first or far-end first.

Note, following URL has some timing with memmove faster than memcopy.
<http://vgable.com/blog/2008/05/24/memcopy-memmove-and-speed-over-safety>
 
J

John Kelly

Using memcpy twice might be more clever than using a temporary buffer
(although I don't see just how you expect two memcpy's to work

You're right. That idea was bogus. The first segfault hit me like a
hammer. You could do it with a loop, moving little pieces, but that's
not worth the trouble.

memmove needs no temporary buffer. It merely decides which way to move
source bytes -- near-end first or far-end first.

memmove it is then.
 
N

Nick

ImpalerCore said:
That's a fair enough answer. The question is, would you still accept/
use a version of trim without the NULL argument check?

Hey guys. It must be at least three weeks since we did "assert() is
evil/assert() is the best C feature in the world". Shall we do it again?
 
S

Shao Miller

John said:
static void
trim (char **ts)
{
unsigned char *exam;
unsigned char *keep;

if (!*ts) {
errno = EINVAL;
printf ("trim: %s\n", strerror (errno));
return;
}
exam = (unsigned char *) *ts;
while (isspace (*exam)) {
++exam;
}
*ts = (char *) exam;
if (!*exam) {
return;
}
keep = exam;
while (*++exam) {
if (!isspace (*exam)) {
keep = exam;
}
}
if (*++keep) {
*keep = '\0';
}
}


Anyone see bugs? It's not a trick question, I use this code. Just
wondering if I overlooked anything.
Just missing headers. Heheheh. While we're sharing 'trim's and asking
for review (for bugs), here's one:

/**
* Trim whitespace on the left and right of a string
*/
#include <stdlib.h>
#include <ctype.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
size_t i = 0;

/* Trim left */
while (isspace(string[0] = string))
++i;
if (!string[0])
/* Empty string or only spaces */
return string;

/* Copy remaining string */
while (string[0] = string)
++string;

/* Enable for security */
#if 0
/* Truncate with erasure */
while (i)
string[i--] = 0;
#endif

/* Trim right */
--string;
while (isspace(string[0]))
string--[0] = 0;
++string;

/* Return a pointer to the terminator */
return string;
}

char *trim(char *string) {
return string ? trim_unsafe(string) : string;
}

/**
* Testing trim function
*/
#include <stdio.h>

/* Handy for arrays */
#define NUM_OF_ELEMENTS(array_) \
(sizeof (array_) / sizeof *(array_))
#define FOR_EACH_ELEMENT(index_, array_) \
for ((index_) = 0; (index_) < NUM_OF_ELEMENTS(array_); (index_)++)

int main(void) {
char *tests[] = {
"",
" ",
" ",
"f",
" f",
" f",
" f ",
" f ",
"f ",
"f ",
"foo bar baz",
" foo bar baz",
" foo bar baz",
" foo bar baz ",
" foo bar baz ",
"foo bar baz ",
"foo bar baz "
};
int i;

FOR_EACH_ELEMENT(i, tests) {
char buf[80];
strcpy(buf, tests);
printf("BEFORE: \"%s\"\n", buf);
trim(buf);
printf(" AFTER: \"%s\"\n\n", buf);
}
/* printf("%p\n", trim(NULL)); */

return 0;
}
 
J

John Kelly

So if you were to allocate a buffer, read something into it, trim it,
print it and free it, you'd be in trouble. I tend to take the overhead
of shuffling it all down for the convenience of the pointer not moving.

Thanks Nick, for the motivator, and others, for your good input.

The double ** is because I use it for argv option arguments like this:

trim (&optarg);

and argv non-options like this:

trim (argv);

I fixed up the code to do the memmove shuffle, elminate the printf, and
return an int value 0 or -1, with errno set accordingly. And now, let's
reboot this thread:



# include <ctype.h>
# include <errno.h>
# include <stdio.h>
# include <string.h>

static int
trim (char **ts)
{
unsigned char *exam;
unsigned char *hast;
unsigned char *keep;

if (!*ts) {
errno = EINVAL;
return -1;
}
exam = (unsigned char *) *ts;
while (isspace (*exam)) {
++exam;
}
hast = keep = exam;
while (*exam) {
if (!isspace (*exam)) {
keep = exam;
}
++exam;
}
if (*keep) {
*++keep = '\0';
}
if (hast != (unsigned char *) *ts) {
(void) memmove (*ts, hast, keep - hast + 1);
}
return 0;
}


static char s0[] = " I need trimming on both ends. ";
static char s1[] = "I need trimming on far end. ";
static char s2[] = " I need trimming on near end.";
static char s3[] = " I need more trimming on both ends. ";
static char s4[] = "\n\t\rI need special trimming on both ends.\n\t\r ";
static char s5[] = " \n\t\r I need special trimming on near end.";
static char s6[] = "I need special trimming on far end. \n\t\r ";
static char s7[] = "I need no trimming";
static char s8[] = " ";
static char s9[] = "";
static char *strings[12] = {
s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, NULL,
};

int
main (void)
{
int i;
for (i = 0; i < 11; i++) {
printf ("Original string:[%s]\n", strings);
trim (&strings);
printf ("Trimmed string:[%s]\n", strings);
puts ("---------------------------------------");
}
return 0;
}
 
J

John Kelly

/* Enable for security */
#if 0
/* Truncate with erasure */
while (i)
string[i--] = 0;
#endif


The original string is still in memory. Erasing its old artifacts won't
improve security.
 
S

Shao Miller

John said:
/* Enable for security */
#if 0
/* Truncate with erasure */
while (i)
string[i--] = 0;
#endif


The original string is still in memory. Erasing its old artifacts won't
improve security.
The pattern of whitespace is removed. Perhaps one encodes secrets in
whitespace patterns. If your secure string eraser looks for a
terminator, it won't help to erase your whitespace patterns which are
still sitting past that terminator.
 
J

John Kelly

Why bother checking if 's' refers to NULL? The other C string
functions in the standard library fail hard when passed a NULL
pointer, so why should trim be any different?

Because it will crash if you do pass a NULL pointer. It should be
different because this behavior is better. Here is my implementation:

#include <ctype.h>
#include <string.h>

char *
alltrim (char *s)
{
char *start = s;
char *trimend;
size_t end;
if (!s)
return s;
end = strlen (s);
if (!end)
return s;
while (isspace (*start))
start++;
trimend = &s[end-1];
while (isspace (*trimend))
{
*trimend-- = 0;
}
return start;
}


For efficiency, I avoided strlen(). Please see my revised version, it
has several improvements.

And thanks for the test code:

#ifdef UNIT_TEST
#include <stdio.h>
static char s0[] = " I need trimming on both ends. ";
static char s1[] = "I need trimming on far end. ";
static char s2[] = " I need trimming on near end.";
static char s3[] = " I need more trimming on both ends. ";
static char s4[] = "\n\t\rI need special trimming on both ends.\n\t\r ";
static char s5[] = " \n\t\r I need special trimming on near end.";
static char s6[] = "I need special trimming on far end. \n\t\r ";
static char s7[] = "I need no trimming";
static char s8[] = "";

static char *strings[12] =
{
s0, s1, s2, s3, s4, s5, s6, s7, s8, NULL,
};

int
main (void)
{
int i;
for (i = 0; i < 10; i++)
{
printf ("Original string:[%s]\n", strings);
printf ("Trimmed string:[%s]\n", alltrim (strings));
puts("---------------------------------------");
}
return 0;
}
#endif
 
J

John Kelly

if (hast != (unsigned char *) *ts) {
(void) memmove (*ts, hast, keep - hast + 1);
}

Hmmm.

The memmove length argument is a size_t. Can I be assume (keep - hast)
will never overflow size_t? Can I assume anything about the result of
the pointer subtraction?
 
D

Dann Corbit

Which elements of
struct {
char a[2];
char b[2];
} s = {" ", " "};

are set to 0 by
alltrim(s.b);
?

Needs a check for and pointer less than or equal to start.
 
S

Seebs

The memmove length argument is a size_t. Can I be assume (keep - hast)
will never overflow size_t? Can I assume anything about the result of
the pointer subtraction?

If you know the pointer subtracted-from is later in an object, then you can
assume that the result of the pointer subtraction is a valid size_t. By
definition, size_t can represent the size of any object. Thus, in the most
extreme case (a maximally-large object which is an array of characters),
size_t can hold the difference between the last and first addresses into
that object, therefore, you're fine.

.... as long as you're sure that keep is after hast.

-s
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,091
Messages
2,570,605
Members
47,225
Latest member
DarrinWhit

Latest Threads

Top