trim whitespace

S

Shao Miller

Lew said:
I haven't followed this thread, and my suggestions may be redundant...

If you have any concerns about subtracting pointers and 'size_t', how
about this? (Some parentheses are redundant but included as visual aids):

/**
* Trim whitespace on the left and right of a string
*/
#include <stdlib.h>
#include <ctype.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
char *i = string;
[snip]
It looks like you're making two passes of the right side space, once
when copying, and again when trimming.
It's a security feature. If you do not desire any security at all, it's
easy enough to replace that function with:

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
char *i = string;
char *end = string;

/* Trim left */
while (isspace(*string = *i))
++i;

/* Copy remaining string */
while (*string = *i) {
if (!isspace(*string))
end = string + 1;
++string;
++i;
}

/* Trim right */
*end = 0;

/* Return a pointer to the terminator */
return end;
}
Like what, the erasure? Ok. At the cost of an additional pointer, the
code above shouldn't bother with erasure.

If you wish to have a single read pass and a single write pass, the
single read pass must walk the entirety of the string. How much extra
work is it to have the write pass copy any spaces on the right? How
much extra work is involved in calling 'memmove' after you've performed
your own read pass to determine end-points?
Agreed.


ISTM that you are correct, and a good programmer could reduce this whole
process to /one/ read pass through the input string, followed by (assuming
the "unsafe" method of reusing the input string as the output) a judicious
placement of a '\0', and the return of an offset-adjusted pointer.

Here's how I would do it...

In a single loop from the start of the string to the end,
count the number of characters in the string,
count the number of leading whitespace characters, and
count the number of trailing whitespace characters
Or rather than counting, keep pointers to those interesting bounds
(amounting to the same thing).
Now, place a '\0' at string[length_of_string - trailing_whitespace_count]
to truncate the string and discard the trailing whitespace characters.

Finally, return to the caller the value (string + leading_whitespace_count),
to return a pointer to the first non-whitespace of the string.
But is returning a pointer to within the string really "trimming" the
string? If other parts of a program have copies of the original
pointer, they will still point to a string which has not been left-trimmed.

By returning a pointer to the terminator, one can make it the caller's
problem if they want to count how many chars remain; with associated
'ptrdiff_t' value limitations.
Granted, this approach modifies the input string, and cannot be used on a
const string type (i.e. " something ").

HTH
Right. Maybe someone doesn't want to destructively trim, but wants:

int trim_bounds(const char *string, const char **left, const char **right);
 
J

John Kelly

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
char *i = string;
char *end = string;

/* Trim left */
while (isspace(*string = *i))
++i;

/* Copy remaining string */
while (*string = *i) {
if (!isspace(*string))
end = string + 1;
++string;
++i;
}

/* Trim right */
*end = 0;

/* Return a pointer to the terminator */
return end;
}

Like what, the erasure? Ok. At the cost of an additional pointer, the
code above shouldn't bother with erasure.

If you wish to have a single read pass and a single write pass, the
single read pass must walk the entirety of the string. How much extra
work is it to have the write pass copy any spaces on the right? How
much extra work is involved in calling 'memmove' after you've performed
your own read pass to determine end-points?

You do two C operations for every character. A read, and a store.

I only read each character. The memmove() may be library optimized with
assembly language.

My code will not run slower than yours, but it may run faster.


Thinking hard includes performance considerations.
 
S

Seebs

Thinking hard includes performance considerations.

While this is true, I'd advise that for something like this, you start
by writing for clarity, and don't come back to trying to optimize until
you've got some reason to believe it matters. It's quite easy for the
time spent debugging a "clever" implementation to vastly exceed the sum
of all the CPU time the routine will ever spend.

-s
 
S

Shao Miller

John said:
You do two C operations for every character. A read, and a store.
Right. Well, a read for every character, and a store for every
character after the left trim, plus one store for the terminator.
I only read each character. The memmove() may be library optimized with
assembly language.
And if it isn't?
My code will not run slower than yours, but it may run faster.
Suppose the 'memmove' implementation is little more than byte-at-a-time
copy. How many read passes do you then have?
Thinking hard includes performance considerations.
Agreed.

Suppose 'memmove' "remaps" the pointer to point directly to the target.
Well that'd certainly be a nearly instantaneous "write pass".

Maybe 'strchr' could quickly give you a pointer to the end of the string
and you can 'isspace' your way backwards to find the right-hand trim
boundary.

The implementations of 'memmove' and 'strchr' are unknowns. How can you
guarantee that your own read pass followed by a 'memmove' will be no
slower than your own read pass and your own write pass?

Function calls to C Standard Library functions can have overhead.
Checking overlap conditions in a 'memmove' implementation can have
overhead. Using a buffer in a 'memmove' implementation can have
overhead. Calculating the number of times to copy X bytes-at-a-time
followed by the remaining bytes can have overhead. I'm not suggesting
that these concerns are valid for your target implementation(s), but
they are unknowns, aren't they?
 
J

John Kelly

Right. Well, a read for every character, and a store for every
character after the left trim, plus one store for the terminator.

Are you saying this code


does not do a read and store on every iteration?

And if it isn't?

Then mine is no worse than yours.

Suppose 'memmove' "remaps" the pointer to point directly to the target.
Well that'd certainly be a nearly instantaneous "write pass".

Maybe 'strchr' could quickly give you a pointer to the end of the string
and you can 'isspace' your way backwards to find the right-hand trim
boundary.

The implementations of 'memmove' and 'strchr' are unknowns. How can you
guarantee that your own read pass followed by a 'memmove' will be no
slower than your own read pass and your own write pass?

Seems intuitive to me.

Function calls to C Standard Library functions can have overhead.
Checking overlap conditions in a 'memmove' implementation can have
overhead. Using a buffer in a 'memmove' implementation can have
overhead. Calculating the number of times to copy X bytes-at-a-time
followed by the remaining bytes can have overhead. I'm not suggesting
that these concerns are valid for your target implementation(s), but
they are unknowns, aren't they?

Maybe we should race.
 
J

John Kelly

pete said:
void
alltrim(char *s)
{
size_t shift = 0;
size_t length = strlen(s);

while (length != 0 && isspace(s[length - 1])) {
--length;
}
s[length] = '\0';
while (shift != length && isspace(s[shift])) {
++shift;
}
memmove(s, s + shift, length + 1);


The above line should be:

memmove(s, s + shift, length - shift + 1);

instead.

Thanks for the post Pete. I'm trying to avoid using strlen(), and use
only pointers. If I wanted to, I could tell the caller what the old and
new string lengths are.
 
S

Shao Miller

John said:
Are you saying this code



does not do a read and store on every iteration?
Oops. Missed that change in the paste. My claim should be accurate for:

/* Trim left */
while (isspace(*i))
++i;

Thanks, Mr. J. Kelly. Sorry about that.
Then mine is no worse than yours.
Why not? Would you not have two read passes and one write pass?
Seems intuitive to me.
Ok. Well perhaps you are right. :)
Maybe we should race.
Ok. Gathering statistics seems like a fair suggestion. But really I
offered the code because it doesn't worry about 'ptrdiff_t' or 'size_t'
and you seemed concerned about that.

If you expect that 'memmove' calls will significantly outperform this
code, it's still possible to avoid 'ptrdiff_t'. During your search for
the terminator, you could increment a count up to 'SIZE_MAX' and perform
a 'memmove' on either termination condition or 'SIZE_MAX' count reached
condition.
 
J

John Kelly

John Kelly wrote:
Why not? Would you not have two read passes

There are *not* two read passes. The full read is split among two
sections, but each section reads a separate piece. Seems like some
people are confused by the pointers. You have to read it carefully.

and one write pass?

Probably, memmove() is not equivalent to a "pass." If it's one assembly
instruction, zoom!

Ok. Gathering statistics seems like a fair suggestion. But really I
offered the code because it doesn't worry about 'ptrdiff_t' or 'size_t'
and you seemed concerned about that.

I was concerned about that, but it's all ironed out now. Thanks for the
help!

If you expect that 'memmove' calls will significantly outperform this
code, it's still possible to avoid 'ptrdiff_t'. During your search for
the terminator, you could increment a count up to 'SIZE_MAX' and perform
a 'memmove' on either termination condition or 'SIZE_MAX' count reached
condition.

True. But it's so cool using nothing but pointers.

:)
 
K

Keith Thompson

John Kelly said:
[...]
and one write pass?

Probably, memmove() is not equivalent to a "pass." If it's one assembly
instruction, zoom!

On most CPUs (especially modern ones), memmove() cannot be
implemented in one assembly instruction. Even if it can, that
instruction must perform a pass over the data; it's going to be O(N),
where N is the number of bytes moved. There's nothing magical about
microcode.

memmove() is likely to be faster than an equivalent C loop, but
only by a constant factor, and probably a fairly small one.
 
J

John Kelly

John Kelly said:
[...]
and one write pass?

Probably, memmove() is not equivalent to a "pass." If it's one assembly
instruction, zoom!

On most CPUs (especially modern ones), memmove() cannot be
implemented in one assembly instruction. Even if it can, that
instruction must perform a pass over the data; it's going to be O(N),
where N is the number of bytes moved. There's nothing magical about
microcode.

memmove() is likely to be faster than an equivalent C loop, but
only by a constant factor, and probably a fairly small one.

Why are you indugling in off topic rants?
 
S

Seebs

Probably, memmove() is not equivalent to a "pass." If it's one assembly
instruction, zoom!

This stays just as stupid no matter how often you say it. At a hardware
level, it is going to be equivalent to a pass. It may be accelerated by
some factor due to, say, being able to copy 4 or 8 bytes at a time, but
no matter what, it has to go over all that memory whether it's one assembly
instruction or not -- and as previously pointed out, there are well documented
cases of single dedicated assembly instructions being SLOWER than a
well-considered loop written in C.
True. But it's so cool using nothing but pointers.

Also such a rich source of possible bugs. You can test integers for
whether they're positive; you can't test a pointer for whether it's fallen
off the beginning of a string, for instance.

-s
 
S

Seebs

Why are you indugling in off topic rants?

He's not. You made the assertion that memmove was "probably" not
equivalent to a "pass" -- even though it's trivially obvious that
it MUST be in order for its operation to be carried out. Keith
correctly pointed out that this somewhat topical assertion was wrong.
His post is not a rant, and is in no way off topic; questions of
what memmove() is or isn't are pretty topical. Programmers who
wish to make good use of C should know that, while usually efficient,
memmove() is not a magical tool for bypassing performance
considerations. In particular, this magical "one assembly
instruction" is both implausible (in general) and misleading -- you
seem to think that performance is purely a function of number
of instructions, but again, in real-world cases, user-provided C
loops which compile into complicated loops in assembler have been
known to outperform dedicated hardware functions by a large margin.

-s
 
J

John Kelly

John Kelly wrote:
Ok. Gathering statistics seems like a fair suggestion.

I don't know how far you want to go with this, but please use the most
recent version. Here's Johnny ...





/*

Define author
John Kelly, October 6, 2007

Define copyright
Copyright John Kelly, 2007. All rights reserved.

Define license
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this work except in compliance with the License.
You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0

Define symbols and (words)
exam ......... temporary char *
hast ......... temporary char * to alpha !isspace
hath ......... temporary char **
keep ......... temporary char * to omega !isspace
trimlen ...... trimmed string length
ts ........... temporary char * to string
ts ........... temporary string []
tu ........... temporary size_t
xu ........... fail safe size_t


Define ideas: trim()

Trims leading and trailing whitespace from a null terminated string.
Reports failure and quits if a null terminator is not located within
system defined limit, beginning at the pointer argument.

Parameters
char * to string
NULL, or size_t * for trimlen result

On success returns 0

On failure returns -1 after setting errno to one of
EINVAL
EOVERFLOW


*/

# ifndef _GNU_SOURCE
# define _GNU_SOURCE 1
# endif

# if __STDC_VERSION__ >= 199901L
# include <stdint.h>
# endif
# include <stddef.h>
# include <stdlib.h>
# include <limits.h>
# include <ctype.h>
# include <errno.h>
# include <stdio.h>
# include <string.h>
# include <signal.h>
# include <time.h>

# ifndef SIZE_MAX
# define SIZE_MAX ((size_t)-1)
# endif

extern char **environ;

# ifndef PTRDIFF_MAX
static ptrdiff_t
ptrdiff_max_ (void)
{
ptrdiff_t last = 32767;

while (last < last * 2u + 2) {
last = last * 2 + 1;
}
return last;
}
# endif

static int
trim (char *const ts, size_t *trimlen)
{
size_t tu;
unsigned char *exam;
unsigned char *hast;
unsigned char *keep;
# ifdef PTRDIFF_MAX
ptrdiff_t const ptrdiff_max = PTRDIFF_MAX;
# else
ptrdiff_t const ptrdiff_max = ptrdiff_max_ ();
# endif
size_t const xu = ptrdiff_max < SIZE_MAX ? ptrdiff_max : SIZE_MAX;

if (!ts) {
errno = EINVAL;
return -1;
}
tu = 0;
exam = (unsigned char *) ts;
while (++tu < xu && isspace (*exam)) {
++exam;
}
if (tu == xu) {
errno = EOVERFLOW;
return -1;
}
tu = 0;
hast = keep = exam;
while (++tu < xu && *exam) {
if (!isspace (*exam)) {
keep = exam;
}
++exam;
}
if (tu == xu) {
errno = EOVERFLOW;
return -1;
}
if (*keep) {
*++keep = '\0';
}
tu = keep - hast;
if (hast != (unsigned char *) ts) {
(void) memmove (ts, hast, tu + 1);
}
if (trimlen) {
*trimlen = tu;
}
return 0;
}

/*
Test code courtesy c.l.c contributors
*/

static char s0[] = " I need trimming on both ends. ";
static char s1[] = "I need trimming on far end. ";
static char s2[] = " I need trimming on near end.";
static char s3[] = " I need more trimming on both ends. ";
static char s4[] = "\n\t\rI need special trimming on both ends.\n\t\r";
static char s5[] = " \n\t\r I need special trimming on near end.";
static char s6[] = "I need special trimming on far end. \n\t\r ";
static char s7[] = "I need no trimming";
static char s8[] = " ";
static char s9[] = "";
static char *strings[12] = {
s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, NULL,
};

# define CHUNK 100000

int
main (void)
{
int i;
char *cp;
size_t length;

for (i = 0; i < 11; i++) {
printf ("Original string:[%s]\n", strings);
trim (strings, &length);
printf ("Trimmed string:[%s]\n", strings);
puts ("---------------------------------------");
}

/* insanity tests */
printf ("trim returned: %i\n", trim (NULL, NULL));
printf ("trim returned: %i\n", trim (NULL, &length));

cp = malloc (CHUNK);
if (cp) {
memset (cp, 'x', CHUNK);
memset (cp + CHUNK - 3, ' ', 1);
memset (cp + CHUNK - 2, ' ', 1);
memset (cp + CHUNK - 1, '\0', 1);
memset (cp + 0, ' ', 1);
memset (cp + 1, ' ', 1);
memset (cp + 2, ' ', 1);
} else {
printf ("malloc failed\n");
return (1);
}

printf ("Original big string length %u\n", strlen (cp));
printf ("trim returned: %i\n", trim (cp, &length));
printf ("New string length %u\n", strlen (cp));
printf ("Double check length returned %u\n", length);
free (cp);
printf ("trim returned %i on a freed block!\n", trim (cp, &length));
printf ("The length returned was %d\n", length);
puts ("\nSUCCESS testing trim function");
return 0;
}
 
S

Shao Miller

John said:
There are *not* two read passes. The full read is split among two
sections, but each section reads a separate piece. Seems like some
people are confused by the pointers. You have to read it carefully.
A second read pass in 'memmove' is what I meant.
Probably, memmove() is not equivalent to a "pass." If it's one assembly
instruction, zoom!
A write pass in 'memmove' is what I meant. I'd still consider copying
the memory to be a write pass. If the implementation did some
copy-on-write pointer remapping or other magic, then sure, the second
read pass and the write pass would be avoided.

The disconnect we had is clear now. :)
I was concerned about that, but it's all ironed out now. Thanks for the
help!
Thanks for interesting C discussion.
True. But it's so cool using nothing but pointers.

:)
Sure.

For reference, here's the code again:

/**
* Trim whitespace on the left and right of a string
*/
#include <stdlib.h>
#include <ctype.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
char *i = string;
char *end = string;

/* Trim left */
while (isspace(*i))
++i;

/* Copy remaining string */
while (*string = *i) {
if (!isspace(*string))
end = string + 1;
++string;
++i;
}

/* Trim right */
*end = 0;

/* Return a pointer to the terminator */
return end;
}

char *trim(char *string) {
return string ? trim_unsafe(string) : string;
}

/**
* Testing trim function
*/
#include <stdio.h>

/* Handy for arrays */
#define NUM_OF_ELEMENTS(array_) \
(sizeof (array_) / sizeof *(array_))
#define FOR_EACH_ELEMENT(index_, array_) \
for ((index_) = 0; (index_) < NUM_OF_ELEMENTS(array_); (index_)++)

int main(void) {
char *tests[] = {
"",
" ",
" ",
"f",
" f",
" f",
" f ",
" f ",
"f ",
"f ",
"foo bar baz",
" foo bar baz",
" foo bar baz",
" foo bar baz ",
" foo bar baz ",
"foo bar baz ",
"foo bar baz "
};
int i;

FOR_EACH_ELEMENT(i, tests) {
char buf[80];
strcpy(buf, tests);
printf("BEFORE: \"%s\"\n", buf);
trim(buf);
printf(" AFTER: \"%s\"\n\n", buf);
}
/* printf("%p\n", trim(NULL)); */

return 0;
}
 
J

John Kelly

For reference, here's the code again:

It's clean, concise, and looks good.

But you have no safeguard against memory having no \0 byte. Your code
can go into an infinite loop. Mine cannot.

My code may not be as pretty, but I don't think you can match its run
time performance.
 
J

James

John Kelly said:
It's clean, concise, and looks good.

But you have no safeguard against memory having no \0 byte.

Well, neither do you John. If I pass your function memory that does not have
a terminating NUL byte, then it will gladly read right past the bounds and
off into never never land...

Your code
can go into an infinite loop. Mine cannot.

They will both launch nukes if I pass them non-NUL terminated memory.
 
J

John Kelly

Well, neither do you John. If I pass your function memory that does not have
a terminating NUL byte, then it will gladly read right past the bounds and
off into never never land...

You are thinking of C strings. There are no "bounds" to a pointer.
You can increment a pointer until it wraps back to 0, and repeat that
loop forever.

If you try to read that memory, you may segfault and terminate. But
that is beside the point. My code copes with an abstract machine where
one process owns all the memory, from 0 to the highest address.

That may not be a real world scenario, but again, that is beside the
point. My code copes with abstract near impossibilities. That's the
way a programmer should think.
 
S

Seebs

They will both launch nukes if I pass them non-NUL terminated memory.

I think you're talking to a wall here. Kelly's in all the world's a
VAX mode -- he's assuming a flat memory model where you can do whatever
you want to pointers, as long as you "avoid overflow", which is totally
incoherent -- but since he dismisses any platform which doesn't conform
to his /a priori/ assumptions as broken or uninteresting, it hardly
matters.

Basically, he figures that since the results he's seen so far didn't
include infinite loops, it can never happen.

-s
 
S

Seebs

You are thinking of C strings. There are no "bounds" to a pointer.
Incorrect.

You can increment a pointer until it wraps back to 0, and repeat that
loop forever.

No, you can't.

1. It is not necessarily the case that pointers wrap.
2. It is quite possible that your code will die trying to increment that
pointer.
If you try to read that memory, you may segfault and terminate.

Even forming the pointer may cause you to abort or otherwise do crazy
things.
But that is beside the point.

Not really.
My code copes with an abstract machine where
one process owns all the memory, from 0 to the highest address.

But not with a real machine where looking at memory you don't own
or even forming addresses for it can result in termination. And since
those machines actually exist, but I'm not aware of any machines
in production where you can actually use every possible address,
it seems to me you're rather missing the point.
That may not be a real world scenario, but again, that is beside the
point. My code copes with abstract near impossibilities. That's the
way a programmer should think.

Not when it comes at the expense of coping with common and everyday
systems that actually exist. Which in this case it does.

Designing for edge cases is good, but it's important to understand the
limitations and design of the language and systems you're working with,
and recognize what you can and can't fix. Coping with something which
is completely irrelevant, at the cost of additional complexity in your
code, increases the chances of bugs that actually cause failures on
real machines.

Congratulations. You've developed an automobile which is completely
secure against bipedal tigers armed with chainguns which fire
potatoes; as a side-effect of this crucial safety feature, the
car inevitably goes into an uncontrolled spin any time it encounters
a bit of ice.

-s
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,954
Messages
2,570,116
Members
46,704
Latest member
BernadineF

Latest Threads

Top