substring

Jeremy Yallop · Nov 5, 2003

Dan said:
An implementation doing array bounds checking *can* detect that the end
of the array has been reached without encountering any null character.
At this point, the implementation is free to do anything it wants,
including making demons fly out of your nose.

I find this a bit upsetting, if true. This means that we can have two
pointers that compare equal, one of which is known to point to a valid
object, and yet dereferencing the other has undefined behaviour. For
example, in the following, loop 2 has (according to the above)
undefined behaviour, while loop 3 does not.

char s1[3] = "123";
char s2[4] = "456";

if (s2 == s1 + sizeof s1) {
    char *p = s1, *q = s2;

    /* loop 1 */
    for (; p != q; p++) {
        putchar(*p);
    }

    assert (p == q);

    /* loop 2 */
    for (; p != s2 + sizeof s2; p++) {
        putchar(*p);
    }

    /* loop 3 */
    for (; q != s2 + sizeof s2; q++) {
        putchar(*q);
    }
}

Jeremy.

rihad · Nov 5, 2003

rihad said:

Given this:

char s[] = "123456", (*p3)[3] = &s;

Incompatible pointer types. &s is of type (*)[6].

It's actually of type (char (*)[7]), but nonetheless I hoped it would be a valid
assignment. Alas... It would be neat, though, if it were.

is calling

printf("%s\n", p3[0]);

Illegal, %s expects a pointer to char not a pointer to
char[3]

p3[0] is an expression of type (char [3]) which decays into (char *).

illegal, but

printf("%s\n", p3[1]);

Same here.

Same here.

puts and printf are different because puts prints a string, while
printf explicitly takes a null terminated array.

Gosh! What is the difference between a string and a zero-terminated array (of
chars)?! Please help the desperate!

pete · Nov 5, 2003

rihad said:

&s1[0] points to an array of objects.
The array is ended by a ((char) 0).

The array is terminated by a ((char)'3')

char s1[3] = "123";

It is the array of objects that is terminated by a ((char) 0), not s1.
char s1[3] = "123";
char s2[4] = "456";

s1 and s2 are two distinct arrays.
s2 is not part of s1.
s2 ends in a null character.
s1 ends in '3'.

rihad · Nov 5, 2003

Dan said:

An implementation doing array bounds checking *can* detect that the end
of the array has been reached without encountering any null character.
At this point, the implementation is free to do anything it wants,
including making demons fly out of your nose.

I find this a bit upsetting, if true. This means that we can have two
pointers that compare equal, one of which is known to point to a valid
object, and yet dereferencing the other has undefined behaviour. For
example, in the following, loop 2 has (according to the above)
undefined behaviour, while loop 3 does not.

char s1[3] = "123";
char s2[4] = "456";

if (s2 == s1 + sizeof s1) {
    char *p = s1, *q = s2;

    /* loop 1 */
    for (; p != q; p++) {
        putchar(*p);
    }

    assert (p == q);

    /* loop 2 */
    for (; p != s2 + sizeof s2; p++) {
        putchar(*p);
    }

    /* loop 3 */
    for (; q != s2 + sizeof s2; q++) {
        putchar(*q);
    }
}

If loop 2 is undefined, there's no point in living. Thanks for the eye-opening
example, Jeremy.

rihad · Nov 5, 2003

I don't believe this is true. Consider the following text from C99
7.1.4 ("Use of library functions"):

If a function argument is described as being an array, the pointer
actually passed to the function shall have a value such that all
address computations and accesses to objects (that would be valid if
the pointer did point to the first element of such an array) are in
fact valid.

In the library section of the standard the word "array" is just a
convenient shorthand to denote array-like objects (including the
object returned from malloc(), for example). You can't draw any
conclusions from the fact that the description of fprintf() uses the
word "array" to describe the pointer-to-string passed as argument and
the description of puts() doesn't.

That's what I've felt since my first followup to Dan Pop! Maybe I haven't been
thinking in terms of the standard's wording, but nonetheless I'm glad to see that
you happen to share my opinion, even though yours is far more educated, while
mine is based on what "makes sense to me".

Robert Stankowic · Nov 6, 2003

Dan Pop said:
Of course. Any implementation doing array bound checking *properly*
should object in the s1/s2 case. The big challenge of such an
implementation is NOT to object to the puts call.

Thank you for the clarification
regards
Robert

Dan Pop · Nov 6, 2003

Jeremy said:
I don't believe this is true. Consider the following text from C99
7.1.4 ("Use of library functions"):

If a function argument is described as being an array, the pointer
actually passed to the function shall have a value such that all
address computations and accesses to objects (that would be valid if
the pointer did point to the first element of such an array) are in
fact valid.

In the library section of the standard the word "array" is just a
convenient shorthand to denote array-like objects (including the
object returned from malloc(), for example).

The word "array" being defined by the standard, cannot be interpreted in
any other way when used in the standard. The object returned by malloc
satisfies the standard's definition of array.

You can't draw any
conclusions from the fact that the description of fprintf() uses the
word "array" to describe the pointer-to-string passed as argument and
the description of puts() doesn't.

Of course you can. If you ignore the definitions of the terms used by
the standard in a purely arbitrary way (i.e. according to your own
preconceptions about the language), the standard becomes a useless
document.

The *real* issue is whether the current wording of the standard accurately
reflects the intent of those who wrote it. If it doesn't, the wording
needs to be fixed, but until then, one cannot take arbitrary liberties
in interpreting the text of the standard.

Dan

Dan Pop · Nov 6, 2003

Jeremy said:
I find this a bit upsetting, if true. This means that we can have two
pointers that compare equal, one of which is known to point to a valid
object, and yet dereferencing the other has undefined behaviour.

Yup, C99 *explicitly* mentions this possibility:

6 Two pointers compare equal if and only if both are null pointers,
both are pointers to the same object (including a pointer to an
object and a subobject at its beginning) or function, both are
pointers to one past the last element of the same array object,
or one is a pointer to one past the end of one array object and
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the other is a pointer to the start of a different array object
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
that happens to immediately follow the first array object in
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the address space.91)
^^^^^^^^^^^^^^^^^
____________________

91) Two objects may be adjacent in memory because they are
adjacent elements of a larger array or adjacent members of
a structure with no padding between them, or because the
implementation chose to place them so, even though they
are unrelated. If prior invalid pointer operations (such as
accesses outside array bounds) produced undefined behavior,
subsequent comparisons also produce undefined behavior.

For
example, in the following, loop 2 has (according to the above)
undefined behaviour, while loop 3 does not.

char s1[3] = "123";
char s2[4] = "456";

if (s2 == s1 + sizeof s1) {
    char *p = s1, *q = s2;

Stylistic issue: the code is much more readable if you name the pointers
p1 and p2, to be consistent with the way they are initialised.

    /* loop 1 */
    for (; p != q; p++) {
        putchar(*p);
    }

    assert (p == q);

What for?!? Don't you trust the compiler to get the exit condition from
loop1 right or do you suspect that both != and == can evaluate to false
on the same pointer operands?

    /* loop 2 */
    for (; p != s2 + sizeof s2; p++) {
        putchar(*p);
    }

You can increment p one past the end of its object, but the
result cannot be either dereferenced or further incremented.

8 When an expression that has integer type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If
the pointer operand points to an element of an array object,
and the array is large enough, the result points to an element
offset from the original element such that the difference of the
subscripts of the resulting and original array elements equals
the integer expression. In other words, if the expression P
points to the i-th element of an array object, the expressions
(P)+N (equivalently, N+(P)) and (P)-N (where N has the value n)
point to, respectively, the i+n-th and i-n-th elements of the
array object, provided they exist. Moreover, if the expression
P points to the last element of an array object, the expression
(P)+1 points one past the last element of the array object, and
if the expression Q points one past the last element of an array
object, the expression (Q)-1 points to the last element of the
array object. If both the pointer operand and the result point to
elements of the same array object, or one past the last element of
the array object, the evaluation shall not produce an overflow;
otherwise, the behavior is undefined. If the result points one
^^^^^^^^^^^^^^^^^^^^^^^^
past the last element of the array object, it shall not be used
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
as the operand of a unary * operator that is evaluated.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is no possible doubt that, according to the standard, your code
invokes undefined behaviour.

    /* loop 3 */
    for (; q != s2 + sizeof s2; q++) {
        putchar(*q);
    }
}

No problems here.

Imagine that you were writing a bounds checking implementation. It is
obvious, from these quotes, that pointer equality checking would have to
ignore the bounds information, but the indirection operator would have to
take it into account, as well as the addition and subtraction operators.

If your implementation would silently execute loop2, it would fail to
report a bound violation related invocation of undefined behaviour.

This is a typical example of how a very common mental image of the C
language is at odds with the C standard. Most people would expect loop2
to work and it will work on most (if not all) implementations without
bounds checking, but it will work by accident, not by design.

Dan

Jeremy Yallop · Nov 6, 2003

Dan said:
The word "array" being defined by the standard, cannot be interpreted in
any other way when used in the standard.

I'm not sure why you say that. The section that I quoted clearly
states that "array" has a broader sense when used in the library
section.

Of course you can. If you ignore the definitions of the terms used by
the standard in a purely arbitrary way (i.e. according to your own
preconceptions about the language), the standard becomes a useless
document.

There's nothing arbitrary about making use of the explicit exception
given in the introduction to the library section of the standard
(quoted above). The word "array" is used in the library section for a
data pointer on which certain operations are valid. Here's another
example:

size_t fread(void * restrict ptr,
size_t size, size_t nmemb,
FILE * restrict stream);

The fread function reads, into the array pointed to by ptr [...]

Now, the following is perfectly valid, although `a' is not an array.

int a;
fread(&a, sizeof a, 1, fp);

Were it not for the exception quoted above such an interpretation
might be questionable. As it is, it's the only reasonable way to
interpret this aspect of the standard.

Jeremy.

Jeremy Yallop · Nov 6, 2003

Dan said:
Yup, C99 *explicitly* mentions this possibility:

It seems that you're right. It is pretty counterintuitive (if you
have the wrong intuitions, I suppose).

What for?!? Don't you trust the compiler to get the exit condition from
loop1 right or do you suspect that both != and == can evaluate to false
on the same pointer operands?

It was just for documentation, really. Perhaps a comment would have
been clearer. I didn't expect the assertion to fail (but then I don't
write assertions that I expect to fail).

Jeremy.

Dan Pop · Nov 6, 2003

Jeremy said:
I'm not sure why you say that. The section that I quoted clearly
states that "array" has a broader sense when used in the library
section.

I don't see any broader sense in your quote. The restrictions about
address computations are exactly the same as those defined in the
paragraph dealing with pointer arithmetic.

There's nothing arbitrary about making use of the explicit exception
given in the introduction to the library section of the standard
(quoted above). The word "array" is used in the library section for a
data pointer on which certain operations are valid.

Precisely my point!

Here's another example:

size_t fread(void * restrict ptr,
size_t size, size_t nmemb,
FILE * restrict stream);

The fread function reads, into the array pointed to by ptr [...]

Now, the following is perfectly valid, although `a' is not an array.

int a;
fread(&a, sizeof a, 1, fp);

Every scalar can be considered as either an array of 1 of its type, or as
an array of sizeof(scalar) unsigned characters. This is explained in
other parts of the standard.

Were it not for the exception quoted above such an interpretation
might be questionable. As it is, it's the only reasonable way to
interpret this aspect of the standard.

No exception is needed. The standard explains how any object can be
accessed on a byte by byte basis by treating it as an array of characters.
This is enough for your example.

Dan

Jeremy Yallop · Nov 6, 2003

Dan said:
I don't see any broader sense in your quote. The restrictions about
address computations are exactly the same as those defined in the
paragraph dealing with pointer arithmetic.

Perhaps, but I was responding to the claim that:

puts and printf are different because puts prints a string, while
printf explicitly takes a null terminated array.

A null terminated "array" in the library section is no different from
a string. In particular, the address computations that can be
performed on each are precisely the same.

Every scalar can be considered as either an array of 1 of its type, or as
an array of sizeof(scalar) unsigned characters. This is explained in
other parts of the standard.

True. The guarantee is slightly stronger, I think: all the character
types can be used to access any object.

No exception is needed.

Well, why is it there, then? I agree that the guarantees elsewhere in
the standard can be taken as sufficient to allow the current meaning,
but I don't think that they're unambiguous enough.

Jeremy.

Thomas Stegen · Nov 6, 2003

Jeremy said:
puts and printf are different because puts prints a string, while
printf explicitly takes a null terminated array.

A null terminated "array" in the library section is no different from
a string. In particular, the address computations that can be
performed on each are precisely the same.

The library section defines a string as:

"7.1.1 Definitions of terms
1 A string is a contiguous sequence of characters terminated by and
including the first null character. [...] The length of a string is
the number of bytes preceding the null character and the value of a
string is the sequence of the values of the contained characters, in
order."

It seems to go to great lengths to avoid the term "array", I think.
Furthermore, the descriptions of functions are very careful about
where the term "string" is used and where the term "array" is used.

Not that the outcome of this discussion will have much effect
on my coding style.

Jeremy Yallop · Nov 6, 2003

Thomas said:
Jeremy said:

puts and printf are different because puts prints a string, while
printf explicitly takes a null terminated array.

A null terminated "array" in the library section is no different from
a string. In particular, the address computations that can be
performed on each are precisely the same.

The library section defines a string as:

"7.1.1 Definitions of terms
1 A string is a contiguous sequence of characters terminated by and
including the first null character. [...] The length of a string is
the number of bytes preceding the null character and the value of a
string is the sequence of the values of the contained characters, in
order."

It seems to go to great lengths to avoid the term "array", I think.
Furthermore, the descriptions of functions are very careful about
where the term "string" is used and where the term "array" is used.

It seems to me that "string" is used wherever the "array" argument is
null-terminated. This is entirely in keeping with the way these terms
are used elsewhere in the standard: "array" denotes the properties of
the object; "string" describes the value that the object has when
accessed as a sequence of char.

Consequently, the description for [f]printf() uses "array" rather than
"string" (in the 's' specifier section) because the argument is not
necessarily null-terminated. I don't think there's any other
significant difference between "string" and "array" in the library
section. "Array" tends to be used for output parameters for obvious
reasons.

For example:

size_t strxfrm(char * restrict s1,
const char * restrict s2,
size_t n);

The strxfrm function transforms the string pointed to by s2 and
places the resulting string into the array pointed to by s1.

Nobody can seriously claim that this description means that `s1' must
be an actual array whereas `s2' may be split across two or more
objects.

Jeremy.

Dan Pop · Nov 7, 2003

Jeremy said:
Perhaps, but I was responding to the claim that:

puts and printf are different because puts prints a string, while
printf explicitly takes a null terminated array.

A null terminated "array" in the library section is no different from
a string. In particular, the address computations that can be
performed on each are precisely the same.

This is the root of your misunderstanding. The definition of array you
have quoted yourself *explicitly* requires pointer arithmetic to work
inside the array.

The definition of string contains NO such requirement:

1 A string is a contiguous sequence of characters terminated by
and including the first null character. The term multibyte
string is sometimes used instead to emphasize special processing
given to multibyte characters contained in the string or to
avoid confusion with a wide string. A pointer to a string is a
pointer to its initial (lowest addressed) character. The length
of a string is the number of bytes preceding the null character
and the value of a string is the sequence of the values of the
contained characters, in order.

A direct consequence of this anomaly is that NO function expecting a
string parameter that is not explicitly required to be contained in an
array can be *portably* implemented in C, because a C implementation
would necessarily rely on pointer arithmetic working inside the string.
But the definition of string quoted above provides no such guarantee.

Dan

Dan Pop · Nov 7, 2003

Jeremy said:
It seems that you're right. It is pretty counterintuitive (if you
have the wrong intuitions, I suppose).

It's not that counterintuitive to people familiar with segmented memory
systems. Imagine what happens if s1 is allocated at the end of a segment
and s2 at the beginning of another segment and there is one byte of
overlap (the first byte of s2) between the two segments...

Dan

Ben Pfaff · Nov 7, 2003

It's not that counterintuitive to people familiar with segmented memory
systems. Imagine what happens if s1 is allocated at the end of a segment
and s2 at the beginning of another segment and there is one byte of
overlap (the first byte of s2) between the two segments...

It's unclear to me why, if it can't make inter-segment pointer
arithmetic work properly, a compiler would go to the trouble of
ensuring that inter-segment pointer comparisons work properly.

Dan Pop · Nov 7, 2003

Ben said:
It's unclear to me why, if it can't make inter-segment pointer
arithmetic work properly, a compiler would go to the trouble of
ensuring that inter-segment pointer comparisons work properly.

Maybe because the compiler has nothing to do for that: the underlying
hardware may implement address comparisons this way.

Dan

Jeremy Yallop · Nov 10, 2003

Dan said:
This is the root of your misunderstanding.

I prefer "This is the point under discussion".

The definition of array you have quoted yourself *explicitly*
requires pointer arithmetic to work inside the array.

The definition of string contains NO such requirement:

You may be right, although that would make "contiguous" a rather
unhelpful word to describe the bytes that contain a string. If two
pointers into a string cannot be compared for equality and if no valid
pointer arithmetic on one will yield a pointer equivalent to the other
then the bytes aren't contiguous in any useful sense. Just to be
clear, though, are you claiming that in the following:

#include <string.h>
char *strcpy(char * restrict s1, const char * restrict s2);

The strcpy function copies the string pointed to by s2 (including
the terminating null character) into the array pointed to by s1.

`s1' *must* point to a single object, whereas `s2' may point to two
adjacent objects spanned by a single string?

A direct consequence of this anomaly is that NO function expecting a
string parameter that is not explicitly required to be contained in an
array can be *portably* implemented in C, because a C implementation
would necessarily rely on pointer arithmetic working inside the string.
But the definition of string quoted above provides no such guarantee.

Again, you may well be right according to the letter of the standard,
but that this sort of absurdity is a consequence shows (to me) that
this is not its intent.

Jeremy.
