trim whitespace, bullet proof version

B

Ben Bacarisse

John Kelly said:
The instances of setting errno and returning -1 are the contract
terms.

Ah, your code is correct by definition! I was asking about what you
thought you were protecting against so I could tell if you had met that
goal.

ssize_t tz counts the iterations looking for \0. SSIZE_MAX limits the
iterations looking \0. To be sure of calling memmove with a positive
number, the chosen limit must be such that (keep - hast) will always be
positive.

That's besides the point.
From what I understood of the ptrdiff discussion, ssize_t and SSIZE_MAX
satisfy the requirement for the memmove length to be positive . Or did
I misunderstand?

It seems so. Your code has:

tx = keep - hast;

Pointer subtraction produces a value of type prtdiff_t. You can't alter
that fact. If ssize_t has a narrower range that ptrdiff_t you have
introduced a case that can fail by using ssize_t. If it has a larger
range you will be able to generate pointers where the subtraction goes
wrong. There is no advantage at all in using ssize_t, and it has the
disadvantage that your code becomes dependent on a type that is not a
standard C type.
 
G

Geoff

OK.

I guess that means Windows would be a problem. What else?

Windows is not a problem, as I pointed out in your "ptrdiff_t
overflow/underflow (was: trim whitespace)" thread that you apparently
missed.

But you are using POSIX constants in what is supposed to be a standard
C function, apparently because your compiler implementation doesn't
support the standard SIZE_MAX and size_t so you used <sys/types.h>
which automatically exposes that you are introducing system
dependencies and you left that header out of your original source
listings but confessed to in your posts.

Using POSIX SSIZE_MAX limits your manageable string length to the
arbitrary limit of 32767 bytes so your function will automatically
fail for strings greater than this length. A problem you don't
document in your source.

You also declare static int trim(char **ts) but return ssize_t tz upon
success which implies an unsigned return type.

A truly robust, portable implementation of trim would be able to
survive usages or attacks like:

Test Kelly's trim function

#include <malloc.h>
#define CHUNK 100000

int main (void)
{
char *cp;
int status;

cp = malloc(CHUNK);
if (cp)
{
memset(cp,'x', CHUNK);
memset(cp+CHUNK-3,' ',1);
memset(cp+CHUNK-2,' ',1);
memset(cp+CHUNK-1,'\0',1);
}
else
{
printf("malloc failed\n");
return (1);
}
printf("%u\n", strlen(cp));
status = trim(&cp);
if (status < 0)
{
puts("Oopsie!");
}
printf("%u\n", strlen(cp));
free(cp);
return 0;
}
 
J

James

John Kelly said:
OK.

I guess that means Windows would be a problem. What else?

Mabey a hack like:


#if ! defined (SIZE_MAX)
# define SIZE_MAX ((size_t)(-1L))
#endif


could work out okay.
 
J

John Kelly

Windows is not a problem, as I pointed out in your "ptrdiff_t
overflow/underflow (was: trim whitespace)" thread that you apparently
missed.

It must be a POSIX extension on Windows. Linux seems to have it without
any special defines.

But you are using POSIX constants in what is supposed to be a standard
C function, apparently because your compiler implementation doesn't
support the standard SIZE_MAX and size_t so you used <sys/types.h>
which automatically exposes that you are introducing system
dependencies and you left that header out of your original source
listings but confessed to in your posts.

I didn't know which header it was in, until the question was asked.

Using POSIX SSIZE_MAX limits your manageable string length to the
arbitrary limit of 32767 bytes so your function will automatically
fail for strings greater than this length. A problem you don't
document in your source.

On Interix it's defined to INT_MAX which is 2147483647.

You also declare static int trim(char **ts) but return ssize_t tz upon
success which implies an unsigned return type.

/usr/include/sys/types.h:94:typedef signed int ssize_t;

ssize_t is signed. Presumably that's why it's called (s)size_t.

A truly robust, portable implementation of trim would be able to
survive usages or attacks like:

Test Kelly's trim function

#include <malloc.h>
#define CHUNK 100000

int main (void)
{
char *cp;
int status;

cp = malloc(CHUNK);
if (cp)
{
memset(cp,'x', CHUNK);
memset(cp+CHUNK-3,' ',1);
memset(cp+CHUNK-2,' ',1);
memset(cp+CHUNK-1,'\0',1);
}
else
{
printf("malloc failed\n");
return (1);
}
printf("%u\n", strlen(cp));
status = trim(&cp);
if (status < 0)
{
puts("Oopsie!");
}
printf("%u\n", strlen(cp));
free(cp);
return 0;
}

I'll see what I can do. So much feedback, so little time. I appreciate
the help.
 
J

John Kelly

It seems so. Your code has:

tx = keep - hast;

Pointer subtraction produces a value of type prtdiff_t. You can't alter
that fact.

I don't need to. SSIZE_MAX limits the iterations which increment keep,
so the distance between the two pointers will never overflow the lhs of
the assignment.

If ssize_t has a narrower range that ptrdiff_t you have
introduced a case that can fail by using ssize_t.

No, see above.

If it has a larger range you will be able to generate pointers where
the subtraction goes wrong.

Good point. I need to use the lesser of the two for my iteration limit.

There is no advantage at all in using ssize_t, and it has the
disadvantage that your code becomes dependent on a type that is not a
standard C type.

That can handled with macros, but I'll leave that for later.

I have some pending fixes and I'll post v3 soon.
 
J

James

John Kelly said:
trim whitespace, bullet proof version
[...]

Why not use a trim function that is much less invasive? Perhaps something
that returns the offset and length of the trimmed string residing within the
target string. Here is some example code:


http://codepad.org/PnIVzrUB
(this is the code and execution output)


<code>

#include <stddef.h>
#include <assert.h>
#include <limits.h>
#include <ctype.h>


#if ! defined (SIZE_MAX)
# define SIZE_MAX ((size_t)-1)
#endif


#define isspacex(c) isspace((unsigned char)(c))


int trim_ex(char const* buf,
size_t* phead,
size_t* plen)
{
size_t head, tail, tail_tmp;

if (! buf || ! buf[0]) return 0;

for (head = 0; isspacex(buf[head]); ++head)
if (! buf[head] || head == SIZE_MAX - 1UL) return 0;

for (tail = head, tail_tmp = head + 1; buf[tail_tmp]; ++tail_tmp)
{
if (tail_tmp == SIZE_MAX - 1UL) return 0;
if (! isspacex(buf[tail_tmp])) tail = tail_tmp;
}

assert(tail >= head && tail_tmp > head && tail <= tail_tmp);

if (phead) *phead = head;
if (plen) *plen = tail - head + 1;

return 1;
}




#include <stdio.h>


int trim_ex_test(char const* buf)
{
size_t head, len, i;

if (! trim_ex(buf, &head, &len)) return 0;

printf("origin: <%s>\n", buf);
printf("trimmed: <");

for (i = 0; i < len; ++i) putchar(buf[head + i]);

puts(">\n_______________________________________________\n");

return 1;
}


int main(void)
{
trim_ex_test(NULL);
trim_ex_test("");
trim_ex_test("Hello");
trim_ex_test("H");
trim_ex_test("H H H");
trim_ex_test(" H ");
trim_ex_test(" Hello ");
trim_ex_test("Hello ");
trim_ex_test(" Hello");
trim_ex_test(" Hello World 1 2 3 ");
trim_ex_test(" Hello World 1 2 3");
trim_ex_test("Hello World 1 2 3 ");
trim_ex_test("Hello World 1 2 3 4");

return 0;
}

</code>



You should be able to pass this a constant string that is longer than
SIZE_MAX without hitting any undefined behavior; the function just returns
failure. The trim_ex() function will stop scanning if the head or tail_tmp
variables stay within range of SIZE_MAX. What type of input, besides a
non-null terminated string, should break this? I hope there are no damn
bugs/typos in there. It look's alright to me at first glance...
 
J

James

James said:
You should be able to pass this a constant string that is longer than
SIZE_MAX without hitting any undefined behavior; the function just returns
failure. The trim_ex() function will stop scanning if the head or tail_tmp
variables stay within range of SIZE_MAX.

Ummm, the sentence above should read as:


The trim_ex() function will stop scanning if the head or tail_tmp variables
attempt to violate the range of SIZE_MAX.
 
J

John Kelly

Since strlen can handle a string 100k bytes long without trouble, as a
user of your library function I'd expect trim to handle whatever
string I threw at it without limitations. In order to get your code to
compile I have to throw this at it:

#define _POSIX_
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <BaseTsd.h>

static
int trim (char **ts)
{
SSIZE_T tz;
unsigned char *exam;

...

and it fails to handle an arbitrarily large, properly terminated
string.

I'll see what I can do.

I also don't understand the need to report the error in errno, it
seems you are demanding the return value serve a dual purpose of
returning a reserved error value or the new length of the string.

That's a linux library and syscall tradition.

If
you are going to go to all the trouble to pass the address of the
string, why not add a second argument to the function to pass the
address of a variable for the new length of the string and return 0
for no error or the error code directly?

I don't know. Is that a Windows tradition?

int trim(char **ts, size_t *newlength){...}

This way you can still use your errno values and users can write:

if( trim(&string, &trimlength) == EINVAL)
puts("invalid string passed in");
else
if trim(&string, &trimlength) == EOVERFLOW)
puts("string overflowed or not terminated");

Then you can dispense with all this tz = 0; and
errno = EOVERFLOW;
return -1;

nonsense.

It makes sense on linux.
 
J

John Kelly

John Kelly said:
trim whitespace, bullet proof version
[...]

Why not use a trim function that is much less invasive? Perhaps something
that returns the offset and length of the trimmed string residing within the
target string. Here is some example code:

I'll try and take a look at this. Thanks for the feedback. It's hard
to keep up with it all.
 
J

James

James said:
John Kelly said:
trim whitespace, bullet proof version
[...]

Why not use a trim function that is much less invasive? Perhaps something
that returns the offset and length of the trimmed string residing within
the target string. Here is some example code:

[...]

ARGHGHG! I found a bug that crops up when you pass trim_ex() a string made
up of all whitespaces (e.g., " "). Here is the fixed version:


http://codepad.org/LjNJLQaw

<code>

#include <stddef.h>
#include <assert.h>
#include <limits.h>
#include <ctype.h>


#if ! defined (SIZE_MAX)
# define SIZE_MAX ((size_t)-1)
#endif


#define isspacex(c) isspace((unsigned char)(c))


int trim_ex(char const* buf,
size_t* phead,
size_t* plen)
{
size_t head, tail, tail_tmp;

if (! buf || ! buf[0]) return 0;

for (head = 0; isspacex(buf[head]); ++head)
if (head == SIZE_MAX - 1UL) return 0;

if (buf[head])
{
for (tail = head, tail_tmp = head + 1; buf[tail_tmp]; ++tail_tmp)
{
if (tail_tmp == SIZE_MAX - 1UL) return 0;
if (! isspacex(buf[tail_tmp])) tail = tail_tmp;
}

assert(tail >= head && tail_tmp > head && tail <= tail_tmp);

if (plen) *plen = tail - head + 1;
}

else if (plen)
{
*plen = 0;
}

if (phead) *phead = head;

return 1;
}




#include <stdio.h>


int trim_ex_test(char const* buf)
{
size_t head, len, i;

if (! trim_ex(buf, &head, &len)) return 0;

printf("origin: <%s>\n", buf);
printf("trimmed: <");

for (i = 0; i < len; ++i) putchar(buf[head + i]);

puts(">\n_______________________________________________\n");

return 1;
}


int main(void)
{
trim_ex_test(NULL);
trim_ex_test("");
trim_ex_test(" ");
trim_ex_test("Hello");
trim_ex_test("H");
trim_ex_test("H H H");
trim_ex_test(" H ");
trim_ex_test(" Hello ");
trim_ex_test("Hello ");
trim_ex_test(" Hello");
trim_ex_test(" Hello World 1 2 3 ");
trim_ex_test(" Hello World 1 2 3");
trim_ex_test("Hello World 1 2 3 ");
trim_ex_test("Hello World 1 2 3 4");

return 0;
}


</code>




Sorry about that damn non-sense!!!!!

;^(...



BTW, can you find any more bugs?

;^)
 
J

James

pete said:
Why
((size_t)(-1L))
instead of
((size_t)(-1))
?

What effect do you think that the 'L' has?

Umm, now that I think about it, I actually don't know why I put it in there!

;^o
 
G

Geoff

That's a linux library and syscall tradition.

Ah, more of that cargo cult stuff from UNIX, then.
Are we writing a Linux library or a C library function?
Is it really bullet proof?
I don't know. Is that a Windows tradition?

Just a better programming practice. Don't mix data with error returns.
Just because something was done in 1970 on a PDP 11, doesn't mean it
should be done in 2010 on every other platform.

Consider getchar(), how many times have novices been bitten by

char c;

c = getchar();
if (c == EOF)
....

Can you spot the error without looking it up? Can you tell me which
platforms it might work on?

errno is not thread safe, what will you do if the user is calling trim
in a multi-threaded scenario?
 
G

Geoff

Submitted for your amusement.

#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <string.h>

/***********************************************
Trim whitespace from a zero-terminated string.
Inputs:
the address of the string
the address for return of new string length.
Returns:
0 on success
EINVAL or EOVERFLOW on error.
************************************************/

int trim (char **ts, size_t *length)
{
size_t tz;
unsigned char *exam;
unsigned char *hast;
unsigned char *keep;

/* make sure caller is sane */
if (!ts || !*ts || !length) {
return EINVAL;
}
/* first pass, trim front of string */
tz = 0;
exam = (unsigned char *) *ts;
while (++tz < SIZE_MAX && isspace (*exam)) {
++exam;
}
if (tz == SIZE_MAX) {
return EOVERFLOW;
}
/* second pass, trim back of string */
tz = 0;
hast = keep = exam;
while (++tz < SIZE_MAX && *exam) {
if (!isspace (*exam)) {
keep = exam;
}
++exam;
}
if (tz == SIZE_MAX) {
return EOVERFLOW;
}
if (*keep) {
*++keep = '\0';
}
if (keep - hast < 0) {
return EOVERFLOW;
}

/* move the trimmed string to where caller expects it to be */
tz = keep - hast;
if (hast != (unsigned char *) *ts) {
(void) memmove (*ts, hast, tz + 1);
}

*length = tz; /* new length of string */
return 0;
}

/* Test the trim function */

static char s0[] = " I need trimming on both ends. ";
static char s1[] = "I need trimming on far end. ";
static char s2[] = " I need trimming on near end.";
static char s3[] = " I need more trimming on both ends. ";
static char s4[] = "\n\t\rI need special trimming on both ends.\n\t\r
";
static char s5[] = " \n\t\r I need special trimming on near end.";
static char s6[] = "I need special trimming on far end. \n\t\r ";
static char s7[] = "I need no trimming";
static char s8[] = " ";
static char s9[] = "";
static char *strings[12] = {
s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, NULL,
};

#include <stdio.h>
#include <malloc.h>

#define CHUNK 100000

int main (void)
{
int i;
char *cp;
size_t length;

for (i = 0; i < 11; i++)
{
printf ("Original string:[%s]\n", strings);
trim (&strings, &length);
printf ("Trimmed string:[%s]\n", strings);
puts ("---------------------------------------");
}

/* insanity tests */
printf("trim returned: %i\n", trim(&cp, NULL));
printf("trim returned: %i\n", trim(NULL, &length));

cp = malloc(CHUNK);
if (cp)
{
memset(cp,'x', CHUNK);
memset(cp + CHUNK-3,' ',1);
memset(cp + CHUNK-2,' ',1);
memset(cp + CHUNK-1,'\0',1);
}
else
{
printf("malloc failed\n");
return (1);
}

printf("Original big string length %u\n", strlen(cp));
printf("trim returned: %i\n", trim(&cp, &length));
printf("New string length %u\n", strlen(cp));
printf("Double check length returned %u\n", length);
free(cp);
printf("trim returned %i on a freed block!\n", trim(&cp,
&length));
printf("The length returned was %d\n", length);
puts("\nSUCCESS testing trim function");
return 0;
}
 
J

John Kelly

Just a better programming practice. Don't mix data with error returns.

What's "better" depends. I like using this idiom:

if (trim(p) < 0)
printf ("we have a problem\n");

Just because something was done in 1970 on a PDP 11, doesn't mean it
should be done in 2010 on every other platform.

Are Windows idioms superior to UNIX idioms?

Consider getchar(), how many times have novices been bitten by

char c;

c = getchar();
if (c == EOF)
...

Can you spot the error without looking it up? Can you tell me which
platforms it might work on?
No.


errno is not thread safe, what will you do if the user is calling trim
in a multi-threaded scenario?

linux fork() is fast enough that many apps don't need threads. Someone
who needs threads can try writing trim_r().
 
J

John Kelly

Submitted for your amusement.

OK. I see. As for the idiom I mentioned:

What's "better" depends. I like using this idiom:

if (trim(p) < 0)
printf ("we have a problem\n");


What I meant to write was:

if ((newsize = trim(p)) < 0)
printf ("we have a problem\n");

where I set the variable and test for errors all in one line of code.

But I can see why you might prefer the second argument approach. I will
consider it. Adapting my idiom usage would be simple enough.
 
K

Kenny McCormack

errno is not thread safe, what will you do if the user is calling trim
in a multi-threaded scenario?
[/QUOTE]

The obvious CLC answer to this is: What are threads?

My point, of course, is that once you go off-topic, anything is
possible (in terms of what are acceptable answers).
linux fork() is fast enough that many apps don't need threads. Someone
who needs threads can try writing trim_r().

Indeed. The previous poster was playing the usual CLC game - which is
to dream up some obscure scenario where there could, conceivably, be a
problem, then act as if YOU are a fucking idiot for not covering that
case.

--
One of the best lines I've heard lately:

Obama could cure cancer tomorrow, and the Republicans would be
complaining that he had ruined the pharmaceutical business.

(Heard on Stephanie Miller = but the sad thing is that there is an awful lot
of direct truth in it. We've constructed an economy in which eliminating
cancer would be a horrible disaster. There are many other such examples.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,083
Messages
2,570,591
Members
47,212
Latest member
RobynWiley

Latest Threads

Top