how to replace a substring in a string using C?

P

Paul

hi, there,

for example,

char *mystr="##this is##a examp#le";

I want to replace all the "##" in mystr with "****". How can I do this?

I checked all the string functions in C, but did not find one.

thanks.
 
M

Mike Wahler

Paul said:
hi, there,

for example,

char *mystr="##this is##a examp#le";

I want to replace all the "##" in mystr with "****". How can I do this?

With the above construct, you can't. It's not
allowed to modify a string literal (more specifically,
an attempt to do so produces 'undefined behavior'.

However, if you change the code to store your string
in an array, it can be modified.
I checked all the string functions in C, but did not find one.

thanks.

#include <stdio.h>
#include <string.h>

char *replace(char *s, char old, char new)
{
char *p = s;

while(*p)
{
if(*p == old)
*p = new;

++p;
}

return s;
}

int main()
{
char mystr[] = "##this is##a examp#le";
puts(mystr);
puts(replace(mystr, '#', '*'));
return 0;
}

-Mike
 
N

Netocrat

Your code looked fine Mike but I gathered the OP wanted to deal more
generally with string rather than character replacement, so here's an
extended version.

Output:

##this is##a examp#le
****this is****a examp#le

Code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *replace(const char *s, const char *old, const char *new)
{
char *ret;
int i, count = 0;
size_t newlen = strlen(new);
size_t oldlen = strlen(old);

for (i = 0; s != '\0'; i++) {
if (strstr(&s, old) == &s) {
count++;
i += oldlen - 1;
}
}

ret = malloc(i + count * (newlen - oldlen));
if (ret == NULL)
exit(EXIT_FAILURE);

i = 0;
while (*s) {
if (strstr(s, old) == s) {
strcpy(&ret, new);
i += newlen;
s += oldlen;
} else
ret[i++] = *s++;
}
ret = '\0';

return ret;
}

int main(void)
{
char mystr[] = "##this is##a examp#le";
char *newstr = NULL;

puts(mystr);
newstr = replace(mystr, "##", "****");
printf("%s\n", newstr);

free(newstr);
return 0;
}
 
S

Skarmander

Netocrat wrote:
char *replace(const char *s, const char *old, const char *new)
{
ret = malloc(i + count * (newlen - oldlen));
if (ret == NULL)
exit(EXIT_FAILURE);

Whoa, what's this? Just return NULL. It's massively rude to terminate
the program in a function like this. It's *likely* the caller can't
respond to out-of-memory either (and won't check for it if they're
sloppy), but making the decision here is silly.

S.
 
N

Netocrat

Netocrat wrote:



Whoa, what's this? Just return NULL. It's massively rude to terminate
the program in a function like this. It's *likely* the caller can't
respond to out-of-memory either (and won't check for it if they're
sloppy), but making the decision here is silly.

In a generic library function I agree 100%. When it's part of a specific
simple application, a policy to abort on memory failure isn't
unreasonable. The possibility did occur to me, and the reason I didn't
take it up is that it would have added extra lines to the (demonstration)
code. OTOH, this is intended to be a portable function, so your objection
is sound.
 
S

Skarmander

Netocrat said:
In a generic library function I agree 100%. When it's part of a specific
simple application, a policy to abort on memory failure isn't
unreasonable. The possibility did occur to me, and the reason I didn't
take it up is that it would have added extra lines to the (demonstration)
code. OTOH, this is intended to be a portable function, so your objection
is sound.

The portability (that is, portability of a function across applications)
wasn't actually my primary concern; my primary concern was separation of
concerns. :) Side effects of a function should be minimized to those
that are documented. A function like `replace' has no business
terminating the program, unless it's clearly stated to do just that if
it can't allocate memory.

A "policy to abort on memory failure" is reasonable only if it really is
a policy, not something the function happens to do because it's
acceptable in the (implicit, unstated) context. You shouldn't have to be
required to know this. Doing this in `main' or a major subroutine
without drawing attention to it is fine, but not a string replacement
function, no matter what program or library it occurs in.

I'm not pretending I'm telling you something new, but especially since
this is demonstration code, I'm spelling it out for the innocent.
Encouraging good habits does pay off even if things become a little more
complicated to understand. Better that than writing code that's easy,
but contains a trap. We should at least point out the trap, then. As
we've just done. :)

S.
 
W

websnarf

Paul said:
char *mystr="##this is##a examp#le";

I want to replace all the "##" in mystr with "****". How can I do this?

I checked all the string functions in C, but did not find one.

That is because the C library doesn't have one. (BTW, The Better
String Library, http://bstring.sf.net/ does contain a function
"bfindreplace()" that does this.) You are also pointing mystr to a
constant string, which you cannot modify at all. So you would either
have to change the storage or allow the creation of new storage for
your desired result.

Ok, there is a big difference in general between the three cases of
expanding, shrinking, or having the same sized before and after
strings. The first may require additional storage, while the other two
never do. In this case you are expanding, so you have to solve the
problem of requiring additional storage for your result.

Also you have to define what you mean by "replace all". For example,
if you want to replace "##" with "***#", then what happens when you
apply it to "foo###bar"? Does it become "foo***##bar" or does it
become "foo******#bar"?
 
J

Jordan Abel

In a generic library function I agree 100%. When it's part of a
specific simple application, a policy to abort on memory failure
isn't unreasonable.

Then I'd do it in a malloc wrapper so that it'd be done everywhere.
 
D

Dave Thompson

char *replace(const char *s, const char *old, const char *new)
{
char *ret;
int i, count = 0;
size_t newlen = strlen(new);
size_t oldlen = strlen(old);

for (i = 0; s != '\0'; i++) {
if (strstr(&s, old) == &s) {
count++;
i += oldlen - 1;
}
}


This is wasting a good deal of work in strstr() finding (or excluding)
matches that you throw away. Either something like:
for( i = 0; s != '\0'; )
if( strncmp (&s, old, oldlen) == 0 ) ++count, i += oldlen;
else ++i;
/* or can probably use memcmp since in practice it will safely
stop on or near null != nonnull and return nonmatch, but not
AFAICT guaranteed, and being not str* may alarm some */
or something like:
const char * p = s;
while( (p = strstr (p, old)) != NULL )
++count, p += oldlen;
ret = malloc(i + count * (newlen - oldlen));

As already noted, + 1. For my strstr form, strlen(s).
if (ret == NULL)
exit(EXIT_FAILURE);

i = 0;
while (*s) {
if (strstr(s, old) == s) {

Similarly to above.
strcpy(&ret, new);


Could be memcpy ( , , newlen); you don't need a null terminator
each time through the loop as you do one below after the loop.
i += newlen;
s += oldlen;
} else
ret[i++] = *s++;
}
ret = '\0';


<snip>
- David.Thompson1 at worldnet.att.net
 
N

Netocrat

char *replace(const char *s, const char *old, const char *new)
{
char *ret;
int i, count = 0;
size_t newlen = strlen(new);
size_t oldlen = strlen(old);

for (i = 0; s != '\0'; i++) {
if (strstr(&s, old) == &s) {
count++;
i += oldlen - 1;
}
}


This is wasting a good deal of work in strstr() finding (or excluding)
matches that you throw away. Either something like:
for( i = 0; s != '\0'; )
if( strncmp (&s, old, oldlen) == 0 ) ++count, i += oldlen;
else ++i;


Yes that's the behaviour I intended; the loop evolved from something
like your second suggestion below (requiring a call to strlen) and I
neglected to replace strstr with a more optimal function.

Moving the loop increment into the else statement makes a lot of sense by
avoiding the subtraction of one when strstr (/strncmp/memcmp) succeeds.

The comma operator is helpful here to remove braces, although I prefer the
statements on a new indented line as it requires less eye horizontal eye
movement, more clearly separates the condition from the subsequent
statement and thus to me is more readable.
/* or can probably use memcmp since in practice it will safely
stop on or near null != nonnull and return nonmatch, but not
AFAICT guaranteed, and being not str* may alarm some */

You seem to be correct - "7.21.4.1 The memcmp function" doesn't require it
to stop as long as it ultimately returns the correct result - but it's in
the implementation's interests to make it as efficient as possible.

It looks more suitable to me since strncmp requires an additional test for
'\0' on every character comparison that in theory reduces its efficiency.
or something like:
const char * p = s;
while( (p = strstr (p, old)) != NULL )
++count, p += oldlen;


As already noted, + 1. For my strstr form, strlen(s).
^^^^^^^^^

For that reason I've preferred the strncmp (actually memcmp) version. The
whole point of the looping construct I used was to avoid calling strlen -
it seems to be a waste to iterate over the string twice. This could be
partly avoided by doing something like:

const char *last = s;
const char *sorg = s;
size_t orglen;

s = strstr(s, old);
while (s != NULL) {
count++;
s += oldlen;
last = s;
s = strstr(s, old);
}
orglen = last - &sorg[0] + strlen(last);
ret = malloc(orglen + 1 + count * (newlen - oldlen));

but this is still iterating redundantly over the string to determine
strlen(last).

Two arguments in favour of this approach are that firstly it's possible
that the library functions will be so much faster than explicit iteration
that despite the redundancy this approach wins.

Secondly if the final replacement is near the end of the string and
the library functions are - even slightly - faster, then it's also likely
to be a faster approach.

One would have to benchmark each implementation for typical strings and
replacements to see to what extent these hold.
Similarly to above.

Yes, again strncmp/memcmp are clearly preferable; and again modifying the
loop to use strstr less wastefully would introduce redundant iteration
over the string whilst copying, for which arguments similar to the above
apply.
strcpy(&ret, new);


Could be memcpy ( , , newlen); you don't need a null terminator
each time through the loop as you do one below after the loop.


Good call.
i += newlen;
s += oldlen;
} else
ret[i++] = *s++;
}
ret = '\0';


<snip>


Insightful comments as always Dave.

Something else that no one has pointed out yet is that the type of i and
count should be size_t rather than int.

One other minor optimisation I've made to occasionally save a few function
calls is to check whether newlen and oldlen differ and if not, simply
calculate the string's length without looking for matches.

The other minor optimisation is to use pointer arithmetic instead
of indexing in the second loop. Whether this is any more optimal in
practice depends on the compiler/hardware but it's more optimal on the
generic abstract machine I tend to envisage.

This doesn't apply to the first loop since "i" must exist regardless -
by 6.5.6#9 if the result of subtracting two pointers does not fit in a
ptrdiff_t type the behaviour is undefined.

Here's a fixed-up version incorporating all fixes suggested in the thread
so far:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Preconditions:
* s, old and new are valid non-NULL pointers
* strlen(old) >= 1
*/
char *replace(const char *s, const char *old, const char *new)
{
char *ret, *sr;
size_t i, count = 0;
size_t newlen = strlen(new);
size_t oldlen = strlen(old);

if (newlen != oldlen) {
for (i = 0; s != '\0'; ) {
if (memcmp(&s, old, oldlen) == 0)
count++, i += oldlen;
else
i++;
}
} else
i = strlen(s);

ret = malloc(i + 1 + count * (newlen - oldlen));
if (ret == NULL)
return NULL;

sr = ret;
while (*s) {
if (memcmp(s, old, oldlen) == 0) {
memcpy(sr, new, newlen);
sr += newlen;
s += oldlen;
} else
*sr++ = *s++;
}
*sr = '\0';

return ret;
}

int main(void)
{
char mystr[] = "##this is##an examp#le";
char *newstr = NULL;

puts(mystr);

newstr = replace(mystr, "##", "****");
if (newstr == NULL)
puts("replace returned NULL");
else
puts(newstr);

free(newstr);
return 0;
}
 
S

Skarmander

Netocrat wrote:
The other minor optimisation is to use pointer arithmetic instead
of indexing in the second loop. Whether this is any more optimal in
practice depends on the compiler/hardware but it's more optimal on the
generic abstract machine I tend to envisage.

sr = ret;
while (*s) {
if (memcmp(s, old, oldlen) == 0) {
memcpy(sr, new, newlen);
sr += newlen;
s += oldlen;
} else
*sr++ = *s++;
}
*sr = '\0';
The way you've written it, it actually matters far more how memcmp() is
implemented and how fast repeated single-byte copying is.

Alternate version that does not operate on individual bytes and which
may or may not be faster (but likely is):

const char* t;

sr = ret;
while ((t = strstr(s, old)) != NULL) {
/* Copy non-matching prefix */
ptrdiff_t l = t - s;
memcpy(sr, s, l);
sr += l;
s += l;

/* Copy replacement text */
memcpy(sr, new, newlen);
sr += newlen;
s += oldlen;
}
/* Copy non-matching tail */
strcpy(sr, s);

AFAICT the standard doesn't actually guarantee a (non-negative)
ptrdiff_t will fit in a size_t, but this does seem to be a fairly sane
assumption in this case. Even so, on implementations where ptrdiff_t is
very small and your strings are very large (or pointer arithmetic is
just plain inefficient), this may not work. You could always write a
custom strstr() that returns the offset as a size_t.

S.
 
N

Netocrat

Netocrat wrote:

Aside from the fact that indexing is not possible using the alternative
you provide, this quote seems to be unrelated to the rest of your post,
which is really about library functions vs manual iteration, not indexing
vs pointer arithmetic.
The way you've written it, it actually matters far more how memcmp() is
implemented and how fast repeated single-byte copying is.

Alternate version that does not operate on individual bytes and which
may or may not be faster (but likely is):

That's the sort of code I had in mind when I wrote:
[restored snippage]
IOW one reason I decided against this implementation is that it iterates
over the string redundantly - once for strstr to find a match and once for
memcpy to copy the searched-over part of the string. The relevant
argument that I was referencing is:
[restored snippage]
So we agree - it may or may not be faster. I'll grant that for large
strings it may well be faster, because the implementation can likely
make use of implementation-specific hardware to copy large blocks of
memory. However (and this is the second reason that I decided against
this approach) for large strings there is the problem that...
const char* t;

sr = ret;
while ((t = strstr(s, old)) != NULL) {
/* Copy non-matching prefix */
ptrdiff_t l = t - s;
memcpy(sr, s, l);
sr += l;
s += l;

/* Copy replacement text */
memcpy(sr, new, newlen);
sr += newlen;
s += oldlen;
}
/* Copy non-matching tail */
strcpy(sr, s);

AFAICT the standard doesn't actually guarantee a (non-negative)
ptrdiff_t will fit in a size_t, but this does seem to be a fairly sane

....the standard doesn't even guarantee that taking the difference of two
pointers is defined. Restoring more snippage:
assumption in this case. Even so, on implementations where ptrdiff_t is
very small and your strings are very large (or pointer arithmetic is
just plain inefficient), this may not work. You could always write a
custom strstr() that returns the offset as a size_t.

Coming full-circle, such a custom strstr would require the use of indexing...

However it wouldn't have the advantage of being a library function and in
that case I _won't_ grant much likelihood that this approach will be
faster - the redundancy would not be offset by library function speed.

Another possibility were one convinced that one's library string functions
were genuinely fast enough to warrant redundant iteration over the string
would be to pass in a non-const qualified string, and split it into
segments of size PTRDIFF_MAX, saving the character overwritten by the '\0'
terminator and restoring it after processing the segment.

This also requires some special-case handling for the possibility of a
match running across a segment. Something like this, although as a
single function, replace is starting to get too long for my liking - even
though I've moved the initial buffer creation code into a separate
function:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Preconditions:
* s and old are valid non-NULL pointers
* strlen(old) == oldlen
*/
char *getreplbuff(const char *s, const char *old, size_t oldlen,
size_t newlen, size_t *len)
{
size_t i, count = 0;

if (newlen != oldlen) {
for (i = 0; s != '\0'; ) {
if (memcmp(&s, old, oldlen) == 0)
count++, i += oldlen;
else
i++;
}
*len = i;
} else
*len = strlen(s);

return malloc(*len + 1 + count * (newlen - oldlen));
}

/* Preconditions:
* s, old and new are valid non-NULL pointers
* strlen(old) >= 1
* strlen(old) <= PTRDIFF_MAX
*/
char *replace(char *s, const char *old, const char *new)
{
char *ret, *sr, *segstart = s;
const char *t, *last;
size_t len, newlen = strlen(new);
size_t trailchars, oldlen = strlen(old);

assert(strlen(old) <= PTRDIFF_MAX); /* inf loop is possible if this fails */

if ((ret = getreplbuff(s, old, oldlen, newlen, &len)) == NULL)
return NULL;

sr = ret;
for (;;) {
char restore;

if (len > PTRDIFF_MAX) {
/* Oversize string - create a segment */
restore = s[PTRDIFF_MAX];
s[PTRDIFF_MAX] = '\0';
last = NULL;
}

while ((t = strstr(s, old)) != NULL) {
last = t;

/* Copy non-matching prefix */
ptrdiff_t l = t - s;
memcpy(sr, s, l);
sr += l;
s += l;

/* Copy replacement text */
memcpy(sr, new, newlen);
sr += newlen;
s += oldlen;
}
/* Copy non-matching tail */
strcpy(sr, s);

/* Quit when there are no more segments */
if (len <= PTRDIFF_MAX)
break;

/* Increment dest string pointer by length of non-matching tail */
sr += segstart + PTRDIFF_MAX - s;

/* Restore original char at segment termination */
segstart[PTRDIFF_MAX] = restore;

/* Calculate the minimum number of char posns that we need to
* decrement the start of the next segment by to ensure that
* we don't miss matches */
if (last == NULL || ((trailchars =
(&segstart[PTRDIFF_MAX] - last) - oldlen) >= oldlen))
trailchars = oldlen - 1;


/* Set the start of the next segment */
s = &segstart[PTRDIFF_MAX] - trailchars;
segstart = s;

/* Decrement the dest str pointer by the same amount as the segment */
sr -= trailchars;

/* Update the number of chars unchecked in the original string */
len -= PTRDIFF_MAX - trailchars;
}

return ret;
}

int main(void)
{
char mystr[] = "##this is##an examp#le";
char *newstr = NULL;

puts(mystr);

newstr = replace(mystr, "#", "****");
if (newstr == NULL)
puts("replace returned NULL");
else
puts(newstr);

free(newstr);
return 0;
}
 
S

Skarmander

Netocrat said:
Aside from the fact that indexing is not possible using the alternative
you provide, this quote seems to be unrelated to the rest of your post,
which is really about library functions vs manual iteration, not indexing
vs pointer arithmetic.
Yes. I probably highlighted the wrong bit. Not to worry, right bits
coming up.
The way you've written it, it actually matters far more how memcmp() is
implemented and how fast repeated single-byte copying is.

Alternate version that does not operate on individual bytes and which
may or may not be faster (but likely is):


That's the sort of code I had in mind when I wrote:
[restored snippage]


IOW one reason I decided against this implementation is that it iterates
over the string redundantly - once for strstr to find a match and once for
memcpy to copy the searched-over part of the string.

This conveniently ignores the fact that you issue a call to memcmp() for
every byte you're iterating over. Of course this call (or macro, or
inline function, or intrinsic, whatever) will on average terminate
immediately after comparing the first character, but the overhead for
this is not factored in.

The iteration of my version is no more redundant than the fact that you
have do a memcmp() on the first byte, then a copy if it doesn't match
(that's two reads, not necessarily optimizable into one, depending on
how memcmp() works). Arguments for why two iterations of low-level
functions may have *less* overhead than doing two things in one
high-level iteration coming up.
The relevant argument that I was referencing is: [restored snippage]


So we agree - it may or may not be faster. I'll grant that for large
strings it may well be faster, because the implementation can likely
make use of implementation-specific hardware to copy large blocks of
memory.

For small strings it may well be faster too, because the cache then
becomes your friend. Copying blocks is a *very* efficient procedure on
most implementations. Compare-and-copy-if-not-equal-repeat, as expressed
in C, typically is not.

The interesting aspect of this is that I didn't see the redundancy at
all, because I don't think in terms of how strstr() and memcpy() are
implemented. Technically, yes, bytes will have to be read twice, once
for comparing, once for copying. This doesn't inspire me to write a copy
loop in C, however. I'd go straight to assembler if I thought that was
an issue.
However (and this is the second reason that I decided against
this approach) for large strings there is the problem that...




...the standard doesn't even guarantee that taking the difference of two
pointers is defined. Restoring more snippage:
Hold on, I did read it, but let's not exaggerate this. Of course the
standard doesn't guarantee it, but there will be some meaningful range
of values for which it holds. Otherwise there would be absolutely no
point to pointer arithmetic ever, no?

Please note that ptrdiff_t has to be capable of holding values of at
least 65535. Yes, conceivably, you could run into trouble if matches are
over 64K bytes away. But I for one would not switch to manually copying
bytes because of this potential restriction, unless I were made aware of
an *actual* restriction.

The restriction in this case is having a ptrdiff_t which is much smaller
than your size_t, and strings with matches that are more than ptrdiff_t
bytes apart. There will undoubtedly be applications on DeathStation 9000
models that have just this (and maybe some 8086 memory models,
actually), and for those my code might not work -- granted. So the 100%
portability prize won't be going to me this year, too bad. :)
Coming full-circle, such a custom strstr would require the use of indexing...

However it wouldn't have the advantage of being a library function and in
that case I _won't_ grant much likelihood that this approach will be
faster - the redundancy would not be offset by library function speed.
You're still underestimating memcpy(). Though I agree you'd get a big
hit on a custom strstr(). While writing such a strstr() in assembly
would be easy, this defeats the point of portable code.

It's actually unfriendly of the standard library not to give us mem*()
functions that can operate on pointer segments, or str*() functions that
return indices. There seems to be no good reason for this impedance
mismatch, other than the historical "oh, everything ought to fit in an
int, don't worry" assumption.
Another possibility were one convinced that one's library string functions
were genuinely fast enough to warrant redundant iteration over the string
would be to pass in a non-const qualified string, and split it into
segments of size PTRDIFF_MAX, saving the character overwritten by the '\0'
terminator and restoring it after processing the segment.
<snip code>

OK, now, I agree, this goes too far. This reaches the point of "not
worth it". *If* things are such that ptrdiff_t isn't large enough, a
custom strstr() should be used. If that slows things down too much, the
byte-copying variant should be used. This code is too complex, and
needlessly constrained on platforms that have no ptrdiff_t problem.

S.
 
N

Netocrat

Netocrat said:
Netocrat wrote:
[...]
[O]ne reason I decided against [an implementation of a string
replacement loop using strstr and strcpy] is that it iterates over the
string redundantly - once for strstr to find a match and once for
memcpy to copy the searched-over part of the string.

This conveniently ignores the fact that you issue a call to memcmp() for
every byte you're iterating over.

Yes, if you look at it like that, the first version has redundant
operations too.

[...]
For small strings it may well be faster too, because the cache then
becomes your friend. Copying blocks is a *very* efficient procedure on
most implementations. Compare-and-copy-if-not-equal-repeat, as expressed
in C, typically is not.

If I get inspired I'll do some benchmarking and post results.

[...]
Hold on, I did read it, but let's not exaggerate this. Of course the
standard doesn't guarantee it, but there will be some meaningful range
of values for which it holds. Otherwise there would be absolutely no
point to pointer arithmetic ever, no?

Sure, it just means you can't rely on it for objects of arbitrary size
unless you do a check that PTRDIFF_MAX >= SIZE_MAX (I know I'm not telling
you anything new, just clarifying it). It's possible for static objects
to be larger than SIZE_MAX but a program containing such objects is not
strictly conforming (it exceeds an environmental limit).

[...]
It's actually unfriendly of the standard library not to give us mem*()
functions that can operate on pointer segments,

I'm not sure what you mean by a pointer segment.
or str*() functions that return indices.

Sometimes such as for this function that would be useful, yes.

[a string replacement function variant that splits the non-const qualified
source string into segments of PTRDIFF_MAX]
OK, now, I agree, this goes too far. This reaches the point of "not
worth it". *If* things are such that ptrdiff_t isn't large enough, a
custom strstr() should be used.

Unless the constraint against const qualified strings and string literals
were a problem, I'd prefer the segmented function to a custom strstr
since, as you accept, that one is subject to potential performance
reduction that the segmented one isn't.
If that slows things down too much, the byte-copying variant should be
used. This code is too complex, and needlessly constrained on platforms
that have no ptrdiff_t problem.

It is complex although the final result is readable (by me at least) and
understandable by others through commenting (I hope).
 
M

Mark F. Haigh

Netocrat said:
Your code looked fine Mike but I gathered the OP wanted to deal more
generally with string rather than character replacement, so here's an
extended version.

There are a couple of problems here. First, this code will malfunction
on many 16 bit platforms if you pass it strings with length between 32K
and 64K (ie INT_MAX < length < SIZE_MAX). Your signed 'i' variable
will overflow (undefined behavior), and most likely wrap to -32768,
which would cause the subscripting to access memory it should not be
meddling with.

Second, this code performs poorly even though it may look at a glance
to be somewhat optimized. The first red flag is the extensive use of
subscripting. The second red flag is the unusual modification of a
loop variable ('i') within a loop, even though it is also modified by
the loop ('i++'). The third is the unusual use of an equality operator
(==) with strstr(), and the fourth is the character-by-character
parsing (which can be quite slow).
Output:

##this is##a examp#le
****this is****a examp#le

Code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *replace(const char *s, const char *old, const char *new)
{
char *ret;
int i, count = 0;
size_t newlen = strlen(new);
size_t oldlen = strlen(old);

for (i = 0; s != '\0'; i++) {
if (strstr(&s, old) == &s) {
count++;
i += oldlen - 1;
}
}


Note that strstr() can be called for nearly every character.
ret = malloc(i + count * (newlen - oldlen));
if (ret == NULL)
exit(EXIT_FAILURE);

As other posters have noted, just return NULL to the caller.
i = 0;
while (*s) {
if (strstr(s, old) == s) {
strcpy(&ret, new);
i += newlen;
s += oldlen;
} else
ret[i++] = *s++;
}
ret = '\0';


Again, strstr() can be called for nearly every character.
return ret;
}
<snip>

To give an idea of the impact of string parsing missteps can make on
program performance, I fed "Much Ado About Nothing" (the official play
of comp.lang.c) through your replace function, then through mine. The
entire file (121K) was loaded into a buffer, then fed 10 times to the
replace function as such:

/* ... */
fread(str, 1, size, fp);
str[size] = '\0';
for(i = 0; i < 10; i++) {
newstr = replace(str, "DON PEDRO", "NETOCRAT");
free(newstr);
}
/* ... */

The somewhat naieve implementation I whipped up quickly is:

char *replace(const char *str, const char *old, const char *new)
{
char *ret, *r;
const char *p, *q;
size_t len_str = strlen(str);
size_t len_old = strlen(old);
size_t len_new = strlen(new);
size_t count;

for(count = 0, p = str; (p = strstr(p, old)); p += len_old)
count++;

ret = malloc(count * (len_new - len_old) + len_str + 1);
if(!ret)
return NULL;

for(r = ret, p = str; (q = strstr(p, old)); p = q + len_old) {
count = q - p;
memcpy(r, p, count);
r += count;
strcpy(r, new);
r += len_new;
}
strcpy(r, p);
return ret;
}

Here are the timings on my machine. Both were compiled with the same
flags on gcc 4 (-Wall -O2 -ansi -pedantic)

[mark@icepick ~]$ time ./netocrat

real 0m39.901s
user 0m39.877s
sys 0m0.008s

[mark@icepick ~]$ time ./mark

real 0m0.033s
user 0m0.026s
sys 0m0.007s

Even though the somewhat naieve program traverses the input string
multiple times, it's still a 99.9% execution time decrease.


Mark F. Haigh
(e-mail address removed)
 
N

Netocrat

Netocrat wrote: [...]
There are a couple of problems here. First, [i should be size_t not int].

Well spotted - no one else caught that one (I only picked it up after Dave
Thompson's critique).
Second, this code performs poorly even though it may look at a glance to
be somewhat optimized.

Yes; I've since posted code that's an improvement - it doesn't use
subscripting in the replacement loop and it uses memcmp rather than
strstr - but I realise that both you and Skarmander are correct that
character-by-character operations should be avoided. As Skarmander points
out my attempt to avoid redundant iteration did not in fact do that since
memcmp is called on every character.

[...]
To give an idea of the impact of string parsing missteps can make on
program performance, I fed "Much Ado About Nothing" (the official play
of comp.lang.c) through your replace function, then through mine. The
entire file (121K) was loaded into a buffer, then fed 10 times to the
replace function as such:

/* ... */
fread(str, 1, size, fp);
str[size] = '\0';
for(i = 0; i < 10; i++) {
newstr = replace(str, "DON PEDRO", "NETOCRAT"); free(newstr);
}
/* ... */

I've written a benchmark program, which I've made available here:
http://members.dodo.com.au/~netocrat/c/source/replacebench.c

It incorporates my fixed-up function, your function below and Skarmander's
suggested snippet (which has an identical algorithm to yours although the
expression is slightly different, and it uses memcpy where you've used
strcpy). It also incorporates an optimised version which avoids calling
strlen on the entire original string where possible - I posted a similar
but less readable snippet in response to Dave T's post and as you and S
have pointed out, it's a better approach.

Inspired by the appropriateness of the play, I downloaded a copy. Here
are some typical results (using the same compilation options as you except
with -std=c99 and gcc version 3.3.6):

Filename : /data/doc-dl/literature/Much_Ado_About_Nothing.txt
Iterations : 1000
Old : Leonato
New : Haigh
Function : replace_net
Pre-length : 139007
Post-length: 138883
Duration : 17.280000 seconds

Filename : /data/doc-dl/literature/Much_Ado_About_Nothing.txt
Iterations : 1000
Old : Leonato
New : Haigh
Function : replace_haigh
Pre-length : 139007
Post-length: 138883
Duration : 1.250000 seconds

Filename : /data/doc-dl/literature/Much_Ado_About_Nothing.txt
Iterations : 1000
Old : Leonato
New : Haigh
Function : replace_opt
Pre-length : 139007
Post-length: 138883
Duration : 1.160000 seconds

The fixed-up version of my original function is still about 15 times
slower, but it's a lot better than it was.

The optimisation of not calling strlen on the entire string cuts about 7%
off the running time.
The somewhat naieve implementation I whipped up quickly is:

I like the clarity with which you've used looping expressions.
char *replace(const char *str, const char *old, const char *new) {
char *ret, *r;
const char *p, *q;
size_t len_str = strlen(str);
^^^^^^^^^^^^^^

This can be avoided (I know you've already qualified this as a somewhat
naive implementation).
size_t len_old = strlen(old);
size_t len_new = strlen(new);
size_t count;

If len_old == len_new, there's no need to run this loop and you can simply
malloc len_str + 1 bytes.
for(count = 0, p = str; (p = strstr(p, old)); p += len_old)
count++;

ret = malloc(count * (len_new - len_old) + len_str + 1);
if(!ret)
return NULL;

for(r = ret, p = str; (q = strstr(p, old)); p = q + len_old) {
count = q - p;

As discussed elsethread, under C99 this invokes undefined behaviour when q
- p > PTRDIFF_MAX. I vaguely recall discussions on the semantics under
C89, but can't recall the conclusion. Also under C99 count would be
better declared ptrdiff_t but it appears that you're writing to C89.
memcpy(r, p, count);
r += count;
strcpy(r, new);

As Dave T pointed out memcpy is more appropriate here; but as compiled
under gcc on my platform strcpy's very slightly faster than using memcpy -
go figure.
r += len_new;
}
strcpy(r, p);
return ret;
}

Here are the timings on my machine. Both were compiled with the same
flags on gcc 4 (-Wall -O2 -ansi -pedantic) [results omitted]
Even though the somewhat naieve program traverses the input string
multiple times, it's still a 99.9% execution time decrease.

Here's the optimised C99 version of the function used for the benchmark
results above. The #error seems to me to be the best way of dealing with
the possibility of UB - it alerts the programmer to it at compile time and
encourages him/her to examine the comments in the code whilst removing it.

#include <string.h>
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

# if PTRDIFF_MAX < SIZE_MAX
/* If the following line causes compilation to fail,
* comment it out after taking note of its message.
*/
# error The replace function might invoke undefined \
behaviour for strings with length greater than \
PTRDIFF_MAX - see comments in the function body
# endif

char *replace(const char *str, const char *old, const char *new)
{
char *ret, *r;
const char *p, *q;
size_t oldlen = strlen(old);
size_t count, retlen, newlen = strlen(new);

if (oldlen != newlen) {
for (count = 0, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen)
count++;
/* this is undefined if p - str > PTRDIFF_MAX */
retlen = p - str + strlen(p) + count * (newlen - oldlen);
} else
retlen = strlen(str);

ret = malloc(retlen + 1);

for (r = ret, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen) {
/* this is undefined if q - p > PTRDIFF_MAX */
ptrdiff_t l = q - p;
memcpy(r, p, l);
r += l;
memcpy(r, new, newlen);
r += newlen;
}
strcpy(r, p);

return ret;
}
 
P

pete

Netocrat said:
As discussed elsethread,
under C99 this invokes undefined behaviour when q - p > PTRDIFF_MAX.
I vaguely recall discussions on the semantics under
C89, but can't recall the conclusion. Also under C99 count would be
better declared ptrdiff_t but it appears that you're writing to C89.

Whether or not to declare count ptrdiff_t,
is not really an 89 vs 99 issue.
The type of the expression, (q - p), is ptrdiff_t.
If there's going to be undefined behavior,
it's going to be from that expression.

While there is no PTRDIFF_MAX macro in C89,
there is still a maximum limit to the ptrdiff_t type,
there's just no portable way to tell what is.
I suppose you can assume that it's at least 127.
 
N

Netocrat

On Thu, 10 Nov 2005 13:12:40 +0000, pete wrote:

[when is the positive result of subtracting pointers UB in C89?]
While there is no PTRDIFF_MAX macro in C89,
there is still a maximum limit to the ptrdiff_t type,
there's just no portable way to tell what is.
I suppose you can assume that it's at least 127.

That's clarified things. I don't have access to a C89 draft right now and
wasn't sure that the ptrdiff_t type even existed in C89. So for C89 it's
not possible to perform a specific compile-time check for UB. It seems
like the best way to go about it is this:

#if (__STDC_VERSION__ >= 199901L)
#include <stdint.h>
#if PTRDIFF_MAX < SIZE_MAX
# error Under C99 and above the replace function might \
invoke undefined behaviour for strings with length greater \
than PTRDIFF_MAX - see comments in the function body
#endif
#else
#ifdef __STDC__
# error Under C89 the replace function might invoke \
undefined behaviour for strings with length greater than the \
maximum positive value of the ptrdiff_t type, but there is no \
portable way to determine this value (check your compiler \
documentation) - see comments in the function body
#endif
#endif
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,997
Messages
2,570,240
Members
46,828
Latest member
LauraCastr

Latest Threads

Top