trim whitespace

K

Keith Thompson

Nick Keighley said:
so? Before you carry out an operation that might effect errno then
zero it. Afterwards test if it is still zero. No standard library
function is permitted to set errno to zero.

But they're permitted to set it to a non-zero value even on success.
Check the value of errno *only* after a function indicates failure
by some other means.
so what? Just assume it does.

Why assume anything? malloc() returns NULL on failure; that's all you
need to know.

[...]
 
J

John Kelly

Define symbols and (words)
exam ......... temporary char *
hast ......... temporary char * to alpha !isspace
hath ......... temporary char **
keep ......... temporary char * to omega !isspace
trimlen ...... trimmed string length
ts ........... temporary char * to string
ts ........... temporary string []
tu ........... temporary size_t
xu ........... fail safe size_t

Why don't you just your variables meaningful names, rather then having
to annotate them? It's bad enough in old C having to scroll to the
top to find declarations, without having to go even further to see
what they do.

"temporary char *" is not much help. The simplest code inspection tells
me it's a char * and the fact that it is local to function makes it
temporary. It does not help that the types are not exact (exam is, for
example, unsigned char *)

The purpose is to provide a hint about its use, not fully specify it.

and ts seems to have two types.

That's a cut and paste from other code where I use ts differently, in
different sections of code. I reuse short names that way. Which one is
which, depends on the context.

I appreciate your attention to detail, and your advice has been helpful.
But I don't think we can agree on every point.
 
K

Keith Thompson

John Kelly said:
John Kelly said:
[...]
memmove() is likely to be faster than an equivalent C loop, but
only by a constant factor, and probably a fairly small one.

Why are you indugling in off topic rants?

I'm curious -- what led you to think that the above was either off topic

You seem to have a stringent view of topicality.

More than some, yes. What I wrote above was, I believe, quite
topical. If you think my views are too stringent, just say so.
Accusing me of being off-topic when I'm not doesn't get your point
across very effectively.
A constant factor is significant when scaled to large use. It's easy to
overlook that.

I'm perfectly well aware of that, and I specifically acknowledged
that "memmove() is likely to be faster than an equivalent C loop".

I wasn't claiming that memmove() is slower than an explicit loop.
I was refuting your claim that

Probably, memmove() is not equivalent to a "pass." If it's one
assembly instruction, zoom!

[...]
 
J

John Kelly

The world of 64-bit integers also introduces another problem: almost
infinite loops (if you'll permit such a monstrous phrase). I don't see
the point in preventing unbounded loops when waiting for a loop to hit
PTRDIFF_MAX might take a year or two.

In the abstract, you may have a much faster processor. But I see that
programming abstract machines is not interesting to everyone.
 
J

John Kelly

More than some, yes. What I wrote above was, I believe, quite
topical. If you think my views are too stringent, just say so.

People think as they please. I'm not trying to change your mind.

Accusing me of being off-topic when I'm not doesn't get your point
across very effectively.

I don't think you care about my point of view, no matter how I make a
point.
 
S

Seebs

what about blit chips? Could they beat ordinary assemble? You'd think
doing it completely in hardware *could* be quicker

This is where you get into truly crazy performance questions.

On the last machine I used that had a hardware blitter:

1. The blitter was mildly slower than the slowest CPUs used, and a LOT
slower than the fastest CPUs used.
2. Since the blitter could run simultaneously with the CPU, it was usually
much faster to use the blitter asynchronously while doing other processing
on the CPU.

One of the reasons performance analysis is usually pointless is that there
is more than one meaningful "performance" question to be asked. Making a
single task faster may make other tasks slower...

-s
 
S

Seebs

In the abstract, you may have a much faster processor. But I see that
programming abstract machines is not interesting to everyone.

FWIW, I pretty much code exclusively to the abstract machine. However,
the abstract machine of C explodes, killing all passengers, when you
create an invalid pointer.

-s
 
S

Seebs

I was refuting your claim

And this would imply that he's wrong, which is why he immediately dismissed
your post as an "off-topic rant". If a statement has the logical implication
that he's not superior and excellent, it is instantaneously converted into
an irrelevant statement that you shouldn't have made because it had no
relevant semantic content. It's not that he disagrees with such statements;
it's that he can't even consciously perceive them.

-s
 
K

Kenny McCormack

People think as they please. I'm not trying to change your mind.



I don't think you care about my point of view, no matter how I make a
point.

You two really, really, really need to get a room.
 
J

John Kelly

In addition to a vector table at the beginning of memory, M68000 had
(has, I guess; it isn't really dead) 24-bit addressing accessed through
32-bit registers. The problems of that detail may fall under the "avoid
overflow" proviso, I guess...

My code is not attempting to prevent overflow in the integer sense of
the word.

I used the EOVERFLOW symbolic constant because it suggests the idea of
something too large, or out of bounds. In this case, an infinite loop.

[ELOOP]
Too many levels of symbolic links.

Could possibly substitute, but "Too many levels of symbolic links" does
not clearly suggest the general idea of an infinite loop.

My choices are limited to common POSIX symbolic constants. Finding a
more precise connotation would require making a new one up.

EUNBOUNDED perhaps.
 
S

Seebs

My code is not attempting to prevent overflow in the integer sense of
the word.

In which case, EOVERFLOW is the wrong errno.
I used the EOVERFLOW symbolic constant because it suggests the idea of
something too large, or out of bounds. In this case, an infinite loop.

That's not really how errno values are used. They are generally fairly
specific.

I usually treat "something went horribly wrong with memory" as EFAULT, since
that's the canonical errno value for access to memory you don't own... At
least in Unixy systems.
[ELOOP]
Too many levels of symbolic links.
Could possibly substitute, but "Too many levels of symbolic links" does
not clearly suggest the general idea of an infinite loop.

Indeed it does not.
My choices are limited to common POSIX symbolic constants. Finding a
more precise connotation would require making a new one up.

I suggest you use EISOLVEDTHEHALTINGPROBLEM.

-s
 
J

John Kelly

The world of 64-bit integers also introduces another problem: almost
infinite loops (if you'll permit such a monstrous phrase). I don't see
the point in preventing unbounded loops when waiting for a loop to hit
PTRDIFF_MAX might take a year or two.

For the likely uses cases of trim(), at some point the size of a string
does become absurd.

I could allow an optional compile time constant named TRIM_MAX.

If defined, the loop limit would be the lesser of

TRIM_MAX, SIZE_MAX, PTRDIFF_MAX.

else if not defined, the lesser of

SIZE_MAX, PTRDIFF_MAX

Thus the user could define a practical limit, in terms of whatever seems
practical to them. Otherwise, the system limits would apply.

But that in no way diminishes the value of programming for the abstract.
 
J

John Kelly

Sure, if you've got the right abstraction.

Just because you don't like it doesn't mean it's not right.

Why don't you devise one and post it? Never mind, I think I already
know that answer to that question.
 
S

Shao Miller

John said:
It's clean, concise, and looks good.

But you have no safeguard against memory having no \0 byte. Your code
can go into an infinite loop. Mine cannot.
I'd call that user error. They should pass a pointer to a string. If
you attempt to prevent an infinite loop by choosing an upper bound, then
you have the impossibility of success for a programmer trying to trim a
string where that constraint isn't met.

We'll likely just disagree about it; I think the responsibility should
be with the programmer using the function, so that they can trim any
string using well-defined operations, such as pointer arithmetic without
overflow.
My code may not be as pretty,
Says who? :)
but I don't think you can match its run
time performance.
I don't know of a guarantee that the read and write passes (if any) in
'memmove' are faster than one's own read and write passes.

Without regarding security, the possibility that 'memmove' might be
slower than one's own single read and write passes, and prevention of an
infinite loop because of a programmer error, here is another version
which calls 'memmove' multiple times and uses an arbitrary window size
of one's choice (up to the maximum value that can fit inside a
'size_t'). It'll keep trying to trim forever if no null terminator is
found, but it should allow one to trim a string many times greater than
'SIZE_MAX'. Since 'SIZE_MAX' isn't in C89, I guess one would have to
pick a window size by some other means for such an implementation.

Also at http://codepad.org/D4JqjCNf

/**
* Trim whitespace on the left and right of a string
*/
#include <stdlib.h>
#include <ctype.h>
#include <stdint.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
#define TRIM_WINDOW_SIZE SIZE_MAX
char *right = string;
char c;
char *left = string;
size_t window = 0;
int found_left = 0;

while (*right) {
c = left[window];
if (!isspace(c)) {
/* Do not search for the left trim-point */
found_left = 1;
if (c)
/* Note a possible right trim-point */
right = string + window + 1;
}
if (!found_left) {
/* Search for the left trim-point */
++left;
continue;
}
if (!c || window == TRIM_WINDOW_SIZE) {
/* Prevent wasted effort */
if (left != string)
memmove(string, left, window);
/* Advance source and destination windows */
string += window;
left += window;
window = 0;
} else
++window;
if (!c)
/* Trim the right and we're done */
*right = 0;
}

/* Return a pointer to the terminator */
return right;
#undef TRIM_WINDOW_SIZE
}

char *trim(char *string) {
return string ? trim_unsafe(string) : string;
}

/**
* Testing trim function
*/
#include <stdio.h>

/* Handy for arrays */
#define NUM_OF_ELEMENTS(array_) \
(sizeof (array_) / sizeof *(array_))
#define FOR_EACH_ELEMENT(index_, array_) \
for ((index_) = 0; (index_) < NUM_OF_ELEMENTS(array_); (index_)++)

int main(void) {
char *tests[] = {
"",
" ",
" ",
"f",
" f",
" f",
" f ",
" f ",
"f ",
"f ",
"foo bar baz",
" foo bar baz",
" foo bar baz",
" foo bar baz ",
" foo bar baz ",
"foo bar baz ",
"foo bar baz "
};
int i;
char *cp;

FOR_EACH_ELEMENT(i, tests) {
char buf[80];
strcpy(buf, tests);
printf("BEFORE: \"%s\"\n", buf);
cp = trim(buf);
printf(" AFTER: \"%s\"\nTERMINATOR: %p\n\n", buf, cp);
}
/* printf("%p\n", trim(NULL)); */

return 0;
}
 
K

Keith Thompson

John Kelly said:
Just because you don't like it doesn't mean it's not right.

Why don't you devise one and post it? Never mind, I think I already
know that answer to that question.

Because I don't care as much about the problem you're trying to solve as
you do. Depending on how you define the problem on a given day, I find
it to be insoluble or uninteresting, perhaps both. I'm willing to offer
occasional suggestions and to encourage you to state the problem more
clearly, but not much more.

And if I did devise some abstraction, it's not likely you'd accept it
anyway.
 
S

Seebs

Because I don't care as much about the problem you're trying to solve as
you do.

IMHO, it's some combination of not specific enough and too specific. I'd
point out that a function like this is common in string-processing languages,
where it's idiomatic and you don't have to answer questions like "whose
job is it to validate that the input is a string".

-s
 
S

Shao Miller

Shao said:
John said:
It's clean, concise, and looks good.

But you have no safeguard against memory having no \0 byte. Your code
can go into an infinite loop. Mine cannot.
I'd call that user error. They should pass a pointer to a string. If
you attempt to prevent an infinite loop by choosing an upper bound, then
you have the impossibility of success for a programmer trying to trim a
string where that constraint isn't met.

We'll likely just disagree about it; I think the responsibility should
be with the programmer using the function, so that they can trim any
string using well-defined operations, such as pointer arithmetic without
overflow.
My code may not be as pretty,
Says who? :)
but I don't think you can match its run
time performance.
I don't know of a guarantee that the read and write passes (if any) in
'memmove' are faster than one's own read and write passes.

Without regarding security, the possibility that 'memmove' might be
slower than one's own single read and write passes, and prevention of an
infinite loop because of a programmer error, here is another version
which calls 'memmove' multiple times and uses an arbitrary window size
of one's choice (up to the maximum value that can fit inside a
'size_t'). It'll keep trying to trim forever if no null terminator is
found, but it should allow one to trim a string many times greater than
'SIZE_MAX'. Since 'SIZE_MAX' isn't in C89, I guess one would have to
pick a window size by some other means for such an implementation.

Also at http://codepad.org/D4JqjCNf

/**
* Trim whitespace on the left and right of a string
*/
#include <stdlib.h>
#include <ctype.h>
#include <stdint.h>

/* Return a pointer to the terminator for the trimmed string */
static char *trim_unsafe(char *string) {
#define TRIM_WINDOW_SIZE SIZE_MAX
char *right = string;
char c;
char *left = string;
size_t window = 0;
int found_left = 0;

while (*right) {
c = left[window];
if (!isspace(c)) {
/* Do not search for the left trim-point */
found_left = 1;
if (c)
/* Note a possible right trim-point */
right = string + window + 1;
}
Oops. For better performance, put an 'else' right after the above right
curly brace.
if (!found_left) {
/* Search for the left trim-point */
++left;
continue;
}
if (!c || window == TRIM_WINDOW_SIZE) {
/* Prevent wasted effort */
if (left != string)
memmove(string, left, window);
/* Advance source and destination windows */
string += window;
left += window;
window = 0;
} else
++window;
if (!c)
/* Trim the right and we're done */
*right = 0;
}

/* Return a pointer to the terminator */
return right;
#undef TRIM_WINDOW_SIZE
}

char *trim(char *string) {
return string ? trim_unsafe(string) : string;
}

/**
* Testing trim function
*/
#include <stdio.h>

/* Handy for arrays */
#define NUM_OF_ELEMENTS(array_) \
(sizeof (array_) / sizeof *(array_))
#define FOR_EACH_ELEMENT(index_, array_) \
for ((index_) = 0; (index_) < NUM_OF_ELEMENTS(array_); (index_)++)

int main(void) {
char *tests[] = {
"",
" ",
" ",
"f",
" f",
" f",
" f ",
" f ",
"f ",
"f ",
"foo bar baz",
" foo bar baz",
" foo bar baz",
" foo bar baz ",
" foo bar baz ",
"foo bar baz ",
"foo bar baz "
};
int i;
char *cp;

FOR_EACH_ELEMENT(i, tests) {
char buf[80];
strcpy(buf, tests);
printf("BEFORE: \"%s\"\n", buf);
cp = trim(buf);
printf(" AFTER: \"%s\"\nTERMINATOR: %p\n\n", buf, cp);
}
/* printf("%p\n", trim(NULL)); */

return 0;
}
 
A

Alan Curry

I used the EOVERFLOW symbolic constant because it suggests the idea of
something too large, or out of bounds. In this case, an infinite loop.

Almost all of the standard (POSIX) uses of EOVERFLOW are about a file size
that is too large to fit into a signed 32-bit field, a problem which only
occurs in programs that aren't compiled with large file support (which makes
all the file size fields at least 64 bits).

There are some oddball uses... snprintf, for example, takes a buffer size as
a size_t but unfortunately has to return the result size in an int, which can
be a problem snprintf'ing a huge string whose length doesn't fit in an int,
and POSIX reports this as an EOVERFLOW.

In both of those cases, the function reporting the error is not claiming that
an operation failed for any internal reason, but that it is incapable of
reporting success because the successful result was too big to fit in the
specific integer type that the interface provides. "Value too large for
defined data type", as the strerror() says.
[ELOOP]
Too many levels of symbolic links.

Could possibly substitute, but "Too many levels of symbolic links" does
not clearly suggest the general idea of an infinite loop.

My choices are limited to common POSIX symbolic constants. Finding a
more precise connotation would require making a new one up.

EUNBOUNDED perhaps.

I consider it a mistake to define any new functions with errno as their
error-reporting mechanism, unless you are the kernel or libc implementor.

Look at some higher-level libraries and how they handle error codes. Most of
them don't touch errno. zlib, BerkeleyDB, and OpenSSL all have error codes,
but they leave errno alone.

Even getaddrinfo(), as a fresh addition to libc, didn't mess with errno. It
returns an error code and provides a separate gai_strerror() to map the error
code to a string.
 
B

Ben Bacarisse

John Kelly said:
I meant, a pointer can wrap to 0, and if your data has no terminating
\0, the pointer won't stop you either. You have a theoretical infinite
loop.

Yes, I know this is what you meant. I was referring to the bounds
within which you must keep to have defined behaviour.
Aside from the ptrdiff_t issue, how can it fail?

It could do all manner of things. For example, it might trim a string
that belongs to another process or thread. It might write it's trimming
null into a memory-mapped register that does almost anything you can
think of.
Segfaulting is a caller's error, not a defect in trim(). What other
failure scenario is there?

All of the above are caller errors but they make it hard to defend the
idea that the function is bullet-proof or has no undefined behaviour.

Why not save yourself some trouble and define whatever happens when the
pointer "wraps" as a caller error and then you can be done with this?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,083
Messages
2,570,591
Members
47,212
Latest member
RobynWiley

Latest Threads

Top