trim whitespace

Nick Keighley

static void
trim (char **ts)
{
    unsigned char *exam;
    unsigned char *keep;

    if (!*ts) {
        errno = EINVAL;
        printf ("trim: %s\n", strerror (errno));
        return;
    }
    exam = (unsigned char *) *ts;
    while (isspace (*exam)) {
        ++exam;
    }
    *ts = (char *) exam;
    if (!*exam) {
        return;
    }
    keep = exam;
    while (*++exam) {
        if (!isspace (*exam)) {
            keep = exam;
        }
    }
    if (*++keep) {
        *keep = '\0';
    }

}

Anyone see bugs?  It's not a trick question, I use this code.  Just
wondering if I overlooked anything.

here's my attempt

static char *trim (char *result, const char *input)
{
    const char *start;
    const char *end;
    size_t result_length;

    assert (result != NULL);
    assert (input != NULL);

    for (start = input; *start != '\0' && isspace (*start); start++)
        continue;

    for (end = input + strlen (input) - 1; end > start && isspace (*end);
         end--)
        continue;

    result_length = end - start + 1;

    if (result_length > 0)
    {
        memcpy (result, start, result_length);
        *(result + result_length) = '\0';
    }
    else
        *result = '\0';

    return result;
}
 
Ben Bacarisse

Keith Thompson said:
Ben Bacarisse said:
BruceS said:
... Ignoring all the other problems
with this idea, is it actually possible to have no 0 anywhere in
memory?

Well, either argc must == 0 or argv[0] must be a null-terminated string
giving the program name or argv[0][0] must be a null character. Of
course, there is no guarantee that this zero can be found by repeated
manipulation of an unrelated pointer.

Ah, but all-bits-zero is only *a* representation of 0, not
necessarily the *only* representation of 0.

Yes, but there are very limited options...
Suppose argc == 0, but the system uses a 1's-complement
representation and the value stored in argc is represented as
all-bits-1.

I think that's the only possibility[1] unless char and int are the same
size (when they are the same size, the negative zeros of the other two
permitted representations are also not all zero bits). That
representation is either a negative zero or it is a trap representation
and argc can't be the latter.

As a result it is only negative zero that can make argc == 0 without
there being CHAR_BIT zero bits available to be seen[1 again!]. I
concluded that 6.2.6.2 p3 did not permit argc to use negative zero, at
least not initially. If it did, surely it would have to be mentioned
there since p3 seems to be about how and when negative zero might come
into your program.
errno is 0 at program startup, but the same thing applies.

errno highlights the absurdity of the original project. It might be
defined to be *__error_register_addr with that address being a special
one which can't be obtained by any manipulation of an "ordinary"
pointer. The same could apply to argc, of course and to any two other
pointers to separate objects!

<snip>

[1] I suspect there is another flaw in my reasoning since it is probably
possible that padding bits could be peppered about the representation.
For example, int might be 4 bytes with only 28 value bits -- the
low-order 7 bits from each byte. There would then be lots of ways for
no byte to be zero no matter how many int zeros there were in the data.
 
Ben Bacarisse

Nick Keighley said:
here's my attempt

static char *trim (char *result, const char *input)
{
const char *start;
const char *end;
size_t result_length;

assert (result != NULL);
assert (input != NULL);

for (start = input; *start != '\0' && isspace (*start); start++)

That's a bit belt-and-braces since isspace('\0') is false. Also there
is the issue of signed chars. You need isspace((unsigned char)*start)
to stay the right side of the law.
continue;

for (end = input + strlen (input) - 1; end > start && isspace
(*end); end--)

That's undefined when the original string is empty.
continue;

result_length = end - start + 1;

if (result_length > 0)
{
memcpy (result, start,result_length);
*(result + result_length) = '\0';
}
else
*result = '\0';

This seems rather wordy -- the two arms do the same thing.

Did you mean to use memmove here, or are your users warned that they
can't pass the same pointer in both argument positions? I'd either use
memmove or I'd 'restrict' qualify the pointers so the prototype will do
the warning for me.
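
For illustration, here is one way the points above might be folded back into the function. This is only a sketch, not code posted in the thread. It assumes the caller supplies a result buffer of at least strlen(input) + 1 bytes, it drops the static qualifier so the function can be called from another file, and it picks memmove so that result and input may be the same pointer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy input to result with leading and trailing whitespace removed.
   Assumption: result has room for strlen(input) + 1 bytes.
   result may alias input because memmove is used for the copy. */
char *trim(char *result, const char *input)
{
    const char *start;
    const char *end;
    size_t len;

    assert(result != NULL);
    assert(input != NULL);

    /* isspace('\0') is false, so no separate terminator test is needed.
       The cast keeps negative char values away from isspace. */
    for (start = input; isspace((unsigned char)*start); start++)
        continue;

    /* end points one past the last kept character, so it never has to
       move below start, even for an empty or all-whitespace string. */
    for (end = start + strlen(start);
         end > start && isspace((unsigned char)end[-1]);
         end--)
        continue;

    len = (size_t)(end - start);
    memmove(result, start, len);   /* safe if result overlaps input */
    result[len] = '\0';            /* one arm covers the empty case too */
    return result;
}
```

Keeping end one past the last character, rather than on it, is what removes the input - 1 computation for an empty string.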
 
Nick Keighley

That's a bit belt-and-braces since isspace('\0') is false.

good point

 Also there
is the issue of signed chars.  You need isspace((unsigned char)*start)
to stay the right side of the law.



That's undefined when the original string is empty.

and I thought of that when I started coding!

This seems rather wordy -- the two arms do the same thing.
doh!

Did you mean to use memmove here, or are your users warned that they
can't pass the same pointer in both argument positions?  I'd either use
memmove or I'd 'restrict' qualify the pointers so the prototype will do
the warning for me.

I don't usually code to C99. But yes the documentation ought to warn
them. (or I use memmove()!)

the code was fairly thoroughly tested and yet you found a number of
issues...
 
John Kelly

[1] I suspect there is another flaw in my reasoning since it is probably
possible that padding bits could be peppered about the representation.
For example, int might be 4 bytes with only 28 value bits -- the
low-order 7 bits from each byte. There would then be lots of ways for
no byte to be zero no matter how many int zeros there were in the data.

That contradicts your earlier helpful advice:


In C99 you can work backwards. The XXX_MAX macros let you determine the
number of padding bits in the XXX type.


C99 is foobar[1].


[1] Foobar may have derived from the military acronym FUBAR and gained
popularity due to the fact that it is pronounced the same.
 
Seebs

[1] I suspect there is another flaw in my reasoning since it is probably
possible that padding bits could be peppered about the representation.
For example, int might be 4 bytes with only 28 value bits -- the
low-order 7 bits from each byte. There would then be lots of ways for
no byte to be zero no matter how many int zeros there were in the data.
That contradicts your earlier helpful advice:

No it doesn't.

This does not contradict what he said above, because determining the NUMBER
of padding bits doesn't determine their LOCATION.

Take the example of the 4-7-bit-values int. You can determine by looking
at INT_MAX that it's a 28-bit value, but you can't determine which 28 bits.

-s
 
BruceS

Realistically, of course, there's almost certain to be a zero byte
*somewhere* in memory -- not that it matters.

OK, it's "almost" certain. Since the problem includes needing
absolute certainty about what happens with code invoking UB, and
requires use of an abstract machine that doesn't have to follow any
standards, I'll take the "almost" as a "no". If we use our
imaginations enough to make Mr. Kelly's problem a meaningful one, we
have to accept the possibility that getting input that isn't a proper
string, and then "wrapping" the pointer, could lead to an infinite
loop. In this case, truly infinite, since we're far enough under hill
that the imagined machine could run forever with no power, and outlast
time itself.

Thanks, Keith and Ben. I thought I had certainty almost within grasp,
but the void between absolute and near certainty is like that between
the infinite and the really big.
 
Seebs

OK, it's "almost" certain.

I think it would be an interesting exercise to try to build a machine
and operating system for which Kelly's code actually accomplishes something.
At a bare minimum, we need an environment which doesn't crash when you
get outside your own memory space, and where there aren't a lot of \0s
floating around, such that it is ever possible to keep running that far,
and so on.

Here's my theory. If you gave me about $50M for development budget, I
could hire a team of silicon designers who would be able to design a custom
CPU with the necessary traits, and we could probably get a hacked-up port
of some operating system running on it. My guess is that the simplest
way to do it would be to build something that intercepts all attempts by
a process to read memory, and creates a virtual address space for that
process populated entirely by \xff, and then maintains separation of code
and data; we then give it a "data" space such that the object we pass
to trim happens to be the very last object so that all the space after
it is \xff, and make sure that the object isn't itself null-terminated.

But I think we can all agree that $50M and a few years of engineering time
are a small price to pay to support Kelly's ceaseless efforts to denigrate
languages, computers, and operating systems currently available.

-s
 
Keith Thompson

John Kelly said:
[1] I suspect there is another flaw in my reasoning since it is probably
possible that padding bits could be peppered about the representation.
For example, int might be 4 bytes with only 28 value bits -- the
low-order 7 bits from each byte. There would then be lots of ways for
no byte to be zero no matter how many int zeros there were in the data.

That contradicts your earlier helpful advice:


In C99 you can work backwards. The XXX_MAX macros let you determine the
number of padding bits in the XXX type.

I don't see a contradiction.

For a given signed type T, sizeof(T) * CHAR_BIT gives you the
total number of bits (including value bits, padding bits if any,
and the single sign bit), and you can compute the number of value
bits from T_MAX. Doing this latter calculation at compile time is
challenging; doing it at run time is easy. C90's problem is that T_MAX
isn't defined for all predefined typedefs T; in C99 it is.
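
As a rough illustration of the run-time calculation for int, something like the following should do. The function names int_value_bits and int_padding_bits are made up for this sketch, and it covers only int, using INT_MAX from <limits.h>.

```c
#include <limits.h>

/* Count the value bits of int by shifting INT_MAX down to zero.
   INT_MAX is 2^N - 1 for N value bits, so this loop runs N times. */
int int_value_bits(void)
{
    int bits = 0;
    int max = INT_MAX;

    while (max != 0) {
        max >>= 1;
        bits++;
    }
    return bits;
}

/* Padding bits = total bits - value bits - the single sign bit. */
int int_padding_bits(void)
{
    return (int)(sizeof(int) * CHAR_BIT) - int_value_bits() - 1;
}
```

This gives the number of padding bits only; it says nothing about where in the object representation they sit.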

Ben's "flaw in my reasoning" paragraph was about the possibility of
storing a 0 value of some integer type without storing a 0 byte,
not about determining the number of padding bits for a given type.

[...]
 
John Kelly

[1] I suspect there is another flaw in my reasoning since it is probably
possible that padding bits could be peppered about the representation.
For example, int might be 4 bytes with only 28 value bits -- the
low-order 7 bits from each byte. There would then be lots of ways for
no byte to be zero no matter how many int zeros there were in the data.
For a given signed type T, sizeof(T) * CHAR_BIT gives you the
total number of bits (including value bits, padding bits if any,
and the single sign bit), and you can compute the number of value
bits from T_MAX.

But you can't derive a mask for the padding bits. You have no standard
method of knowing where they are. That's a hole in the standard.
 
Seebs

But you can't derive a mask for the padding bits. You have no standard
method of knowing where they are. That's a hole in the standard.

You also can't force the C compiler to tell you where you can get a really
nice, authentic, Italian dinner.

It is arguably a hole in the standard, but the thing is, it really turns
out not to be a problem for sane code. If you're writing code that needs
to know, you aren't writing anything that could meaningfully be portable
anyway, so why expect the standard to help you?

-s
 
Keith Thompson

John Kelly said:
On Fri, 27 Aug 2010 13:45:32 +0100, Ben Bacarisse wrote:
[1] I suspect there is another flaw in my reasoning since it is probably
possible that padding bits could be peppered about the representation.
For example, int might be 4 bytes with only 28 value bits -- the
low-order 7 bits from each byte. There would then be lots of ways for
no byte to be zero no matter how many int zeros there were in the data.
For a given signed type T, sizeof(T) * CHAR_BIT gives you the
total number of bits (including value bits, padding bits if any,
and the single sign bit), and you can compute the number of value
bits from T_MAX.

But you can't derive a mask for the padding bits. You have no standard
method of knowing where they are. That's a hole in the standard.

We were talking about determining the *number* of padding bits, not
their location.

You could do something like this:

For each power of 2 from 1 not exceeding T_MAX:
    Use memset to zero-fill an object of type T
    Store the power of 2 in the object
    Use memcpy to copy the object's representation to an array of
    unsigned char
    Examine the unsigned char values to determine which bit was set

This assumes that the padding bits are always 0; it's possible that
storing a value will also set some of the padding bits to 1.
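
In C, for T = int, the steps above might look roughly like this. It is a sketch under the same assumption (padding bits stay 0 when a value is stored), and the function name int_value_bit_mask is invented for the example.

```c
#include <limits.h>
#include <string.h>

/* Fill mask with a 1 in every bit position of int's object
   representation that is seen set after storing a power of 2.
   Returns the number of powers of 2 tried, i.e. the value-bit count.
   Assumption: storing a value leaves the padding bits at 0. */
int int_value_bit_mask(unsigned char mask[sizeof(int)])
{
    int tried = 0;
    int p;

    memset(mask, 0, sizeof(int));
    for (p = 1; ; p *= 2) {
        int obj;
        unsigned char bytes[sizeof(int)];
        size_t i;

        memset(&obj, 0, sizeof obj);      /* zero-fill the object */
        obj = p;                          /* store the power of 2 */
        memcpy(bytes, &obj, sizeof obj);  /* copy out the representation */
        for (i = 0; i < sizeof obj; i++)
            mask[i] |= bytes[i];          /* record which bit was set */
        tried++;

        if (p > INT_MAX / 2)              /* next doubling would overflow */
            break;
    }
    return tried;
}
```

Any bit left clear in mask is then either the sign bit or a padding bit; telling those two apart would need a further step, such as storing a negative value.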

But ok, it's something the standard doesn't give you an easy way
to do, and possibly doesn't give you any way to do. So what?
You can call it a "hole in the standard", but how serious is it?
If the standard had a mechanism to determine exactly which bits
are padding bits, how many pages would it take to describe it,
and would it be worth the effort?

The standard makes guarantees about the *values* of integer objects.
If you care about the bits, you can use arrays of unsigned char.
What more do you need?
 
David Thompson

I consider it a mistake to define any new functions with errno as their
error-reporting mechanism, unless you are the kernel or libc implementor.
And maybe not even then.
Look at some higher-level libraries and how they handle error codes. Most of
them don't touch errno. zlib, BerkeleyDB, and OpenSSL all have error codes,
but they leave errno alone.
A minor point: OpenSSL doesn't _set_ errno to indicate its errors; it
mostly returns a generic indication e.g. SSL_read(...) == -1 and has
the error information (uint32 of 3 bitfields + sometimes other stuff)
in a (per-thread) data structure accessed by ERR_get_error etc. But for
SSL_ calls that map to socket I/O using the 'standard' bio_sock*, it
does _clear_ errno before doing socket I/O so that if a socket error
occurs OpenSSL returns (only) the error flag and the app can and
usually should look at errno -- or on Windows [WSA]GetLastError().

(* This isn't the only use of OpenSSL. You can use libssl with a
different I/O layer instead of bio_sock; and you can use lots of the
libcrypto part without using libssl at all. Such other uses _that I've
looked at_ just use the ERR_ mechanism and no errno.)
Even getaddrinfo(), as a fresh addition to libc, didn't mess with errno. It
returns an error code and provides a separate gai_strerror() to map the error
code to a string.

FWIW, the original netdb stuff like gethostbyname uses h_errno instead
though the 'real' socket calls from connect/accept 'down' are errno.

But pthreads since ~1995 use return values.
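
For what it's worth, the return-value style applied to the pointer-to-pointer trim at the top of this thread could look something like the sketch below. The names trim_in_place, trim_strerror, TRIM_OK and TRIM_ENULL are invented for the illustration; the shape loosely follows the getaddrinfo()/gai_strerror() pairing, and errno is never touched.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch. */
enum trim_error { TRIM_OK = 0, TRIM_ENULL = 1 };

/* Map an error code to a message, in the manner of gai_strerror(). */
const char *trim_strerror(int code)
{
    switch (code) {
    case TRIM_OK:    return "success";
    case TRIM_ENULL: return "null string pointer";
    default:         return "unknown error";
    }
}

/* Trim *ts in place: advance *ts past leading whitespace and cut the
   string after the last non-space character. Returns an error code
   instead of setting errno or printing anything. */
int trim_in_place(char **ts)
{
    unsigned char *exam;
    unsigned char *keep;

    if (ts == NULL || *ts == NULL)
        return TRIM_ENULL;

    exam = (unsigned char *)*ts;
    while (isspace(*exam))
        ++exam;
    *ts = (char *)exam;
    if (*exam == '\0')
        return TRIM_OK;            /* empty or all whitespace */

    keep = exam;                   /* last non-space character seen */
    while (*++exam) {
        if (!isspace(*exam))
            keep = exam;
    }
    if (*++keep)                   /* write only if there is a tail */
        *keep = '\0';
    return TRIM_OK;
}
```

A caller would check the return value and, if it cares about the text, pass it to trim_strerror().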
 
