Replacing a word in a string

J

jacob navia

Hi guys!

I like C because is fun. So, I wrote this function for the lcc-win32
standard library: strrepl.

I thought that with so many "C heads" around, maybe we could improve it
in a collective brainstorming session.

Let's discuss some C here, for a change :)

Specs:
-----
Function: strrepl

Synopsis

#include <string.h>
char *strrepl(char *InputString, char *StringToFind,
char *Replacement, char *Output);
Description

The strrepl function replaces in InputString all occurrences of
StringToFind by Replacement, writing the modified contents into the
Output string. The original input string is not modified.

If the Output argument is NULL, strrepl will return the space that would
be needed (including the terminating zero) for the replacement.

If Replacement is NULL and Output is not NULL, all occurrences of
StringToFind will be erased.

Returns

The strrepl function returns the needed length for the replacements if
its Output argument is NULL. If not, it returns the number of
replacements done.

Code:
-----
#include <string.h>
int strrepl(char *InputString,char *StringToFind,
char *StringToReplace,char *output)
{
char *offset = NULL, *CurrentPointer = NULL;
int insertlen;
int findlen = strlen(StringToFind);
int result = 0;

if (StringToReplace)
insertlen = strlen(StringToReplace);
else
insertlen = 0;
if (output) {
if (output != InputString)
memmove(output,InputString,strlen(InputString)+1);
InputString = output;
}
else
result = strlen(InputString)+1;

while (*InputString) {
offset = strstr(!offset?InputString:CurrentPointer,StringToFind);
if (offset == NULL)
break;
CurrentPointer = (offset + (output ? insertlen : findlen));
if (output) {
strcpy (offset, (offset + findlen));
memmove (offset + insertlen,
offset, strlen (offset) + 1);
if (insertlen)
memcpy (offset, StringToReplace, insertlen);
result++;
}
else {
result -= findlen;
result += insertlen;
}
}
return result;
}

All kinds of comments are welcome, regarding the code, the interface
design, the documentation, etc etc.

jacob
 
E

Eric Sosman

jacob navia wrote On 12/06/05 15:41,:
Synopsis

#include <string.h>
char *strrepl(char *InputString, char *StringToFind,
char *Replacement, char *Output);
[...]

Code:

A diagnostic is required if both the declaration
and the definition appear in the same translation unit.
(In C, anyhow. That other language you're fond of may
have different rules.)
 
B

Ben Pfaff

Eric Sosman said:
jacob navia wrote On 12/06/05 15:41,:
#include <string.h>
char *strrepl(char *InputString, char *StringToFind,
char *Replacement, char *Output);

[...]

int strrepl(char *InputString,char *StringToFind,
char *StringToReplace,char *output)

A diagnostic is required if both the declaration
and the definition appear in the same translation unit.

What? Have you never heard of a "function prototype"? They've
been all the rage since 1989 or so. Even before that, function
declarations before definitions were, I believe, allowed.

The real problem here is the differing return types.
 
M

Michael Mair

jacob said:
Hi guys!

I like C because is fun. So, I wrote this function for the lcc-win32
standard library: strrepl.

I thought that with so many "C heads" around, maybe we could improve it
in a collective brainstorming session.

Let's discuss some C here, for a change :)

Specs:
-----
Function: strrepl

Synopsis

#include <string.h>
char *strrepl(char *InputString, char *StringToFind,
^^^^^^
ITYM: int or size_t
char *Replacement, char *Output);

In order to avoid name clashes, I'd rather call the function
str_repl or strRepl or replaceSubstring.
Description

The strrepl function replaces in InputString all occurrences of
StringToFind by Replacement, writing the modified contents into the
Output string. The original input string is not modified.

Then your prototype is wrong: You want
size_t str_repl(const char* restrict InputString,
const char* StringToFind,
const char* Replacement,
char* restrict Output);
to avoid this. Or do away with the restrict if you want
to express that the original string is not modified via
InputString.
If the Output argument is NULL, strrepl will return the space that would
be needed (including the terminating zero) for the replacement.

If Replacement is NULL and Output is not NULL, all occurrences of
StringToFind will be erased.

Returns

The strrepl function returns the needed length for the replacements if
its Output argument is NULL. If not, it returns the number of
replacements done.

I would rather like a size parameter as in snprintf() so you
can give the size of the storage output points to.
All kinds of comments are welcome, regarding the code, the interface
design, the documentation, etc etc.

- I would rather aim for a standard order of the parameters,
say leading
destination, size,
followed by either
1) replacestr, source, findstr
or
2) source, replacestr, findstr

1) has the advantage that all parameters which can have special
values are gathered to the start.

- I would rather return size_t as strlen() returns size_t

- I would not "overload" the return value semantics depending
on destination -- the user cannot detect errors this way.
I'd rather settle for returning the number of bytes needed for
the "fully replaced" destination string and returning 0 on
error.
If the number of replacements is needed, use an additional
size_t* parameter.

- give a str_n_repl variant allowing to restrict the number of
replacements.


Cheers
Michael
 
R

Robert Gamble

Ben said:
Eric Sosman said:
jacob navia wrote On 12/06/05 15:41,:
#include <string.h>
char *strrepl(char *InputString, char *StringToFind,
char *Replacement, char *Output);

[...]

int strrepl(char *InputString,char *StringToFind,
char *StringToReplace,char *output)

A diagnostic is required if both the declaration
and the definition appear in the same translation unit.

What? Have you never heard of a "function prototype"? They've
been all the rage since 1989 or so. Even before that, function
declarations before definitions were, I believe, allowed.

The real problem here is the differing return types.

I think it is pretty safe to assume that he was referring to the
declaration and definition that the OP presented.

Robert Gamble
 
B

buda

Michael Mair said:
I would rather like a size parameter as in snprintf() so you
can give the size of the storage output points to.

I think that wouldn't necessarily make things any better. As I imagine, you
would rarely know how much storage could be needed by the function in
advance without calling it first with NULL as the Output argument. Also, if
you knew that, for example, strlen(StringToFind)>=strlen(Replacement) then
passing the size of storage that Output points to is pointless (since you
know how much you need, you just allocate it and that's it). While I like
the overall idea, I find this size issue very inpractical and error prone (I
imagine this is one of the reasons something like this isn't in the standard
library in the first place - but there are probably many reasons far more
important than this). It might be better for str_repl to allocate the needed
space and return a pointer to it (with the proper output written to it
ofcourse).
 
J

jacob navia

jacob said:
char *strrepl(char *InputString, char *StringToFind,
char *Replacement, char *Output);

OOOOPS
That should have been:

int strrepl (...)
^^^
of course!

Excuse me for this error.
 
J

jacob navia

Robert said:
I think it is pretty safe to assume that he was referring to the
declaration and definition that the OP presented.

Robert Gamble

Yes, I think Eric was misled by a confusion in the declaration:
it is int strrepl, not char *strrepl, as I mistakenly wrote.

Excuse me for this error.
jacob
 
J

jacob navia

Michael said:
^^^^^^
ITYM: int or size_t



In order to avoid name clashes, I'd rather call the function
str_repl or strRepl or replaceSubstring.



Then your prototype is wrong: You want
size_t str_repl(const char* restrict InputString,
const char* StringToFind,
const char* Replacement,
char* restrict Output);
to avoid this. Or do away with the restrict if you want
to express that the original string is not modified via
InputString.



I would rather like a size parameter as in snprintf() so you
can give the size of the storage output points to.




- I would rather aim for a standard order of the parameters,
say leading
destination, size,
followed by either
1) replacestr, source, findstr
or
2) source, replacestr, findstr

1) has the advantage that all parameters which can have special
values are gathered to the start.

- I would rather return size_t as strlen() returns size_t

Yes, this is a good remark.

- I would not "overload" the return value semantics depending
on destination -- the user cannot detect errors this way.
I'd rather settle for returning the number of bytes needed for
the "fully replaced" destination string and returning 0 on
error.

Yes, but how would the user know how much to allocate?
The problem is that the user can't know how long the result string
will be.
If the number of replacements is needed, use an additional
size_t* parameter.

I think the number of replacements is only useful for knowing
if there was any replaceements at all, i.e; != 0
- give a str_n_repl variant allowing to restrict the number of
replacements.


Cheers
Michael

Thanks for your input.
 
F

Flash Gordon

jacob said:
Yes, but how would the user know how much to allocate?
The problem is that the user can't know how long the result string
will be.

The user can always know the upper bound by knowing the length of the
input string, length of search string and length of replacement string
then assuming if the replacement is longer than the search that the
input string consists entirely of strings to be replaced.
I think the number of replacements is only useful for knowing
if there was any replaceements at all, i.e; != 0

A counter example: implementing a text editor doing a search and replace
that tells the user how many instances have been replaced. I've used
such text editors and it is a nice thing to see.

I agree with Michael that the string length and count of items replaced
should be seperate returned values.

The main use I see for that is to replace only the first instance.

I think being able to specify the buffer size would be good.
 
J

Joe Wright

jacob said:
Hi guys!

I like C because is fun. So, I wrote this function for the lcc-win32
standard library: strrepl.

I thought that with so many "C heads" around, maybe we could improve it
in a collective brainstorming session.

Let's discuss some C here, for a change :)

Specs:
-----
Function: strrepl

Synopsis

#include <string.h>
char *strrepl(char *InputString, char *StringToFind,
char *Replacement, char *Output);
Description

The strrepl function replaces in InputString all occurrences of
StringToFind by Replacement, writing the modified contents into the
Output string. The original input string is not modified.

If the Output argument is NULL, strrepl will return the space that would
be needed (including the terminating zero) for the replacement.

If Replacement is NULL and Output is not NULL, all occurrences of
StringToFind will be erased.

Returns

The strrepl function returns the needed length for the replacements if
its Output argument is NULL. If not, it returns the number of
replacements done.

Code:
-----
#include <string.h>
int strrepl(char *InputString,char *StringToFind,
char *StringToReplace,char *output)
{
char *offset = NULL, *CurrentPointer = NULL;
int insertlen;
int findlen = strlen(StringToFind);
int result = 0;

if (StringToReplace)
insertlen = strlen(StringToReplace);
else
insertlen = 0;
if (output) {
if (output != InputString)
memmove(output,InputString,strlen(InputString)+1);
InputString = output;
}
else
result = strlen(InputString)+1;

while (*InputString) {
offset = strstr(!offset?InputString:CurrentPointer,StringToFind);
if (offset == NULL)
break;
CurrentPointer = (offset + (output ? insertlen : findlen));
if (output) {
strcpy (offset, (offset + findlen));
memmove (offset + insertlen,
offset, strlen (offset) + 1);
if (insertlen)
memcpy (offset, StringToReplace, insertlen);
result++;
}
else {
result -= findlen;
result += insertlen;
}
}
return result;
}

All kinds of comments are welcome, regarding the code, the interface
design, the documentation, etc etc.

jacob

I'm late here and I apologize. Thank you Jacob for bringing up the
subject. Clipper and FoxPro (which pay my current rent) have several
fairly rich string functions. I have determined to 'copy' them to C.
This (stuff) is one of them. In xBASE we would simply assign the value
to a variable..

var1 = "Now is the time!"
var2 = stuff(var1, 8, 0, "NOT ")

...but in C we can't assign strings so I have determined to use the
destination/source construct of C string functions.

/*
Allows 0 or more charcters to be removed from a string at a
given point and 0 or more characters to be inserted at that point.
*/
char *stuff(char *dst, char *src, int beg, int rem, char *ins) {
char tmp[256];
int i, j, k;
for (i = 0; i < beg; ++i)
tmp = src; /* Copy the 'head' of src to tmp */
j = i + rem; /* Index for 'tail' of src */
for (k = 0; ins[k];)
tmp[i++] = ins[k++];
for (; src[j];)
tmp[i++] = src[j++];
tmp = 0;
return strcpy(dst, tmp);
}
 
S

slebetman

jacob said:
Yes, but how would the user know how much to allocate?
The problem is that the user can't know how long the result string
will be.

I would rather not depend on the user to allocate the memory. If it
were me, and I've done it before, I would let the function realloc() as
necessary. Of course, this is not typical C library behavior. But I
like to make my string functions behave like scripting languages as
much as possible.

The only problem with this kind of implementation is that this may
modify the pointer to the string so all other references to the string
will be invalid after the function call. And of course, realloc() may
fail so the function must also return a fail value when realloc()
fails.
 
M

Michael Mair

buda said:
I think that wouldn't necessarily make things any better. As I imagine, you
would rarely know how much storage could be needed by the function in
advance without calling it first with NULL as the Output argument. Also, if
you knew that, for example, strlen(StringToFind)>=strlen(Replacement) then
passing the size of storage that Output points to is pointless (since you
know how much you need, you just allocate it and that's it). While I like
the overall idea, I find this size issue very inpractical and error prone (I
imagine this is one of the reasons something like this isn't in the standard
library in the first place - but there are probably many reasons far more
important than this). It might be better for str_repl to allocate the needed
space and return a pointer to it (with the proper output written to it
ofcourse).

outsize = str_repl(NULL, 0, replace, source, find, NULL);
if (outsize)
{
char* outbuf = malloc(outsize);
if (outbuf) {
outsize = str_repl(outbuf, outsize, replace, source, find, &numrep);
....
}
....
}
.....

Same as snprintf() but easier to use due to the size_t return of
the necessary number of bytes.

If you want, you can of course make an allocating variant, say
astr_repl(&outbuf, outsize, ....)


Cheers
Michael
 
M

Michael Mair

I expected some comment here.
Yes, this is a good remark.

Unfortunately, it is the one point completely open to
debate (think of snprintf())
Yes, but how would the user know how much to allocate?
The problem is that the user can't know how long the result string
will be.

But of course he can. Using a call to str_repl!
Really, just the same as snprintf() but better
I think the number of replacements is only useful for knowing
if there was any replaceements at all, i.e; != 0

Nope. However, if you think so, then returning 0/non-zero
is enough (even an empty destination string takes up at
least 1 byte).
Thanks for your input.

You're welcome :)

Cheers
Michael
 
J

jacob navia

(e-mail address removed) a écrit :
I would rather not depend on the user to allocate the memory. If it
were me, and I've done it before, I would let the function realloc() as
necessary. Of course, this is not typical C library behavior. But I
like to make my string functions behave like scripting languages as
much as possible.

The only problem with this kind of implementation is that this may
modify the pointer to the string so all other references to the string
will be invalid after the function call. And of course, realloc() may
fail so the function must also return a fail value when realloc()
fails.
Yes, if you enter memory allocation you open a can of
worms. You can't realloc if InputString was NOT allocated
with malloc!

The InputString could be a static buffer or a local array
This code would fail
int fn(void)
{
char buf[256];
strcpy(buf,"a sentence with a Word");
strrepl(buf,"Word","words",buf);
}
Of course within an application that uses malloc always for the strings
a realloc solution is sensible, but within a C library it is not.

jacob
 
N

Netocrat

Hi guys!

I like C because is fun. So, I wrote this function for the lcc-win32
standard library: strrepl.

I thought that with so many "C heads" around, maybe we could improve it
in a collective brainstorming session.

Let's discuss some C here, for a change :)

Specs:
-----
Function: strrepl

Synopsis

#include <string.h>
char *strrepl(char *InputString, char *StringToFind,
char *Replacement, char *Output);

I agree with others that a better return type is size_t rather than int
and that the first three parameters would be better const-qualified.
Also Replacement is a better term than StringToReplace which is used
further down - the latter is easy to confuse with the term StringToFind.
Description

The strrepl function replaces in InputString all occurrences of
StringToFind by Replacement, writing the modified contents into the
Output string. The original input string is not modified.

You could also mention that InputString must be a non-NULL pointer, and
that StringToFind must be a non-empty string since the function doesn't
check for either of these.
If the Output argument is NULL, strrepl will return the space that would
be needed (including the terminating zero) for the replacement.

If Replacement is NULL and Output is not NULL, all occurrences of
StringToFind will be erased.

Don't you think it would be better not to have to deal with that
unnecessary special-case? I found the conditionals this required in the
function distracting. Passing an empty string achieves the same result.
Returns

The strrepl function returns the needed length for the replacements if
its Output argument is NULL. If not, it returns the number of
replacements done.

Others have suggested not having a dual-meaning return; I think it's OK
since it simplifies the function and its prototype.
Code:
-----
#include <string.h>
int strrepl(char *InputString,char *StringToFind,
char *StringToReplace,char *output)
{
char *offset = NULL, *CurrentPointer = NULL;
int insertlen;
int findlen = strlen(StringToFind);
int result = 0;

All three of the above would be better as size_t.
if (StringToReplace)
insertlen = strlen(StringToReplace);
else
insertlen = 0;
if (output) {
if (output != InputString)
memmove(output,InputString,strlen(InputString)+1);

InputString = output;

The assignment of an output variable to a parameter with "Input" in its
name made it harder for me to read the code.
}
else
result = strlen(InputString)+1;

while (*InputString) {
offset = strstr(!offset?InputString:CurrentPointer,StringToFind);
if (offset == NULL)
break;
CurrentPointer = (offset + (output ? insertlen : findlen)); if
(output) {
strcpy (offset, (offset + findlen));

* I have comments on this further down.
memmove (offset + insertlen,
offset, strlen (offset) + 1);

** and this.
if (insertlen)
memcpy (offset, StringToReplace, insertlen);
result++;
}
else {
result -= findlen;
result += insertlen;
}
}
return result;
}

All kinds of comments are welcome, regarding the code, the interface
design, the documentation, etc etc.

Your function allows for the possibility that InputString and output
alias. Unfortunately the code is lowest-common denominator for the
worst-case possibility that they do alias and that the replacement token
is longer than the token to replace.

Many optimisations are possible when neither of these conditions are
true. For example the line I've asterisked is very inefficient for
multiple matches - it copies the entire trailing part of the string for
each match. The line I've double-asterisked is also very inefficient -
it's unnecessary to call strlen as well as move the entire trailing part
of the string when the input and output don't alias.

Below is a version of a function from a previous thread [1] - based on
code and comments from Mark Haigh, Skarmandar, Dave Thompson and myself -
modified for your prototype and desired behaviour except that it /doesn't/
allow input and output to alias. I haven't used the restrict keyword
since it may be compiled under C90. It also doesn't deal with a NULL
replacement string - for that use "".

It has (commented) potential for UB on large strings that as the
implementor you no doubt would know how to avoid.

This function is about 4 times faster than yours on small strings and
about 300 times faster on large strings where there are many small
replacements - tested using the benchmarking code I wrote for the previous
thread using gcc -O2 under Linux on a P4.

So my suggestion is to deal separately with the worst-case codepath
(possibly using a version of strstr that works backwards). The
second-worst case - where input and output strings alias and the
replacement token is shorter than replaced token - can be better handled
by keeping track of how many replacements have occurred and only moving as
much of the string as is necessary on each match.

The following function contains code optimised for all other (i.e.
non-aliasing) situations.

size_t strrepl(const char *str, const char *old, const char *new, char *ret) {
char *r;
const char *p, *q;
size_t count = 0;
size_t oldlen = strlen(old);
size_t newlen = strlen(new);

if (ret == NULL) {
size_t retlen;
if (oldlen != newlen) {
for (p = str; (q = strstr(p, old)) != NULL; p = q + oldlen)
count++;
/* this is undefined if p - str > PTRDIFF_MAX */
retlen = p - str + strlen(p) + count * (newlen - oldlen);
} else
retlen = strlen(str);
return retlen + 1;
}

for (r = ret, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen) {
/* this is undefined if q - p > PTRDIFF_MAX */
ptrdiff_t l = q - p;
memcpy(r, p, l);
r += l;
memcpy(r, new, newlen);
r += newlen;
count++;
}
strcpy(r, p);

return count;
}

[1] http://groups.google.com/group/comp.lang.c/msg/9af5492fdf79656d
 
J

jacob navia

WOW!!!!!!!

Very nice.

The problem with strings bigger than 2GB (PTRDIFF_MAX) is not relevant
in the environment of lcc-win32. There is a 2GB limit on the 4GB
addressing space since windows uses the higher 2GB. So, that
problem is moot.

I will add code in the first part of the strrepl function
to test for the conditions you mention and call your function if
the requirements are met.

Thanks a lot for your input, it is very good.

jacob
 
B

buda

Michael Mair said:
outsize = str_repl(NULL, 0, replace, source, find, NULL);
if (outsize)
{
char* outbuf = malloc(outsize);
if (outbuf) {
outsize = str_repl(outbuf, outsize, replace, source, find, &numrep);
....
}
....
}
....

That was exactly my point. You obtain the required size from the function
itself just prior to acctually calling it to do "the real work", so I don't
see the point of passing that back, since it is guaranteed that it will need
exactly that much.

However, since this is the behaviour of snprintf(), I imagine I am missing
something :) As I see it, snprintf is useful when you really want to write a
maximum of n characters (and you really want to "cut off" the rest). This
sort of usage doesn't seem likely for str_repl since you can't really know
what will be the contents of the first n characters of the generated output.
 
M

Michael Mair

buda said:
That was exactly my point. You obtain the required size from the function
itself just prior to acctually calling it to do "the real work", so I don't
see the point of passing that back, since it is guaranteed that it will need
exactly that much.

However, since this is the behaviour of snprintf(), I imagine I am missing
something :) As I see it, snprintf is useful when you really want to write a
maximum of n characters (and you really want to "cut off" the rest). This
sort of usage doesn't seem likely for str_repl since you can't really know
what will be the contents of the first n characters of the generated output.

True. But you can do things like
if ( (ret = str_repl(outbuf, outsize, .....)) == 0 || ret > outsize )
{ /* error handling, reallocating, whatever */ }
with this information whereas the "no size parameter" variety
of str_repl() forces you to always correctly compute the necessary
amount of storage or use a "large enough" enough buffer.
The latter is about as safe as using gets(), the former at least
admits the danger of miscalculating the required size.
As soon as the overhead of letting str_repl compute the required
size and checking against the maximum number of bytes is too much,
you can consider rolling your own, unsafe function.

As Jacob wants to add str_repl as an implementation specific
extension to <string.h>, it is my opinion that the function must
not have a design flaw allowing for buffer overflow too easily.
If it were only a question of "is this a nice str_repl", then
I'd just caution against the possible dangers.


Cheers
Michael
 
D

Dik T. Winter

>
> Don't you think it would be better not to have to deal with that
> unnecessary special-case? I found the conditionals this required in the
> function distracting. Passing an empty string achieves the same result.

Perhaps, but it does not take much, see below.
> The following function contains code optimised for all other (i.e.
> non-aliasing) situations.

I will change it such that it will work when new == NULL:
> size_t strrepl(const char *str, const char *old, const char *new, char *ret) {
> char *r;
> const char *p, *q;
> size_t count = 0;
> size_t oldlen = strlen(old); Modify:
> size_t newlen = strlen(new);
to:
** size_t newlen;
** if (new == NULL) { new = ""; }
** newlen = strlen(new);
> if (ret == NULL) {
> size_t retlen;
> if (oldlen != newlen) {
> for (p = str; (q = strstr(p, old)) != NULL; p = q + oldlen)
> count++;
> /* this is undefined if p - str > PTRDIFF_MAX */
> retlen = p - str + strlen(p) + count * (newlen - oldlen);
> } else
> retlen = strlen(str);
> return retlen + 1;
> }
>
> for (r = ret, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen) {
> /* this is undefined if q - p > PTRDIFF_MAX */
> ptrdiff_t l = q - p;
> memcpy(r, p, l);
> r += l;
> memcpy(r, new, newlen);
> r += newlen;
> count++;
> }
> strcpy(r, p);
>
> return count;
> }
>
> [1] http://groups.google.com/group/comp.lang.c/msg/9af5492fdf79656d
>
> --
> http://members.dodo.com.au/~netocrat
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,817
Latest member
DicWeils

Latest Threads

Top