Replacing a word in a string

W

websnarf

jacob said:
Yes, but how would the user know how much to allocate?
The problem is that the user can't know how long the result string
will be.

I would rather not depend on the user to allocate the memory. If it
were me, and I've done it before, I would let the function realloc() as
necessary. Of course, this is not typical C library behavior. But I
like to make my string functions behave like scripting languages as
much as possible.

The only problem with this kind of implementation is that this may
modify the pointer to the string so all other references to the string
will be invalid after the function call. [...]

Yeah, you need to go whole hog (unlike the Viega's SafeStr library
which just suffers from exactly the flaw you point out) if you are
going to do this. But it is well worth it. You improve performance,
significantly ease the use, improve your safety, and so on. So it ends
up feeling as powerful as a scripting language, and yet delivers
performance that's even *higher* than what the C library delivers.

You can see for yourself with "the Better String Library" (link below)
which, by the way, has an aliasing-safe, in-place, high performance
"find and replace" string function.
 
N

Netocrat

The problem with strings bigger than 2GB (PTRDIFF_MAX) is not relevant
in the environment of lcc-win32.

I've left the comments in the code for comp.lang.c purposes.
I will add code in the first part of the strrepl function to test for
the conditions you mention and call your function if the requirements
are met.

We can do better than that. I hinted at how but got inspired to implement
it.

The second-hardest scenario (input and output pointers alias; replacement
token shorter than or equal to replaced token) can be dealt with easily by
modifying the function I posted to use memmove instead of memcpy in one of
the loop body calls, and at the end instead of strcpy.

The worst-case scenario (as above but replacement token longer than
replaced token) can be dealt with using a reverse strstr function. I've
implemented a naive portable version but as the compiler author you may be
able to improve on it using asm tricks. Anyhow gcc optimises it pretty
well.

So now every code-path in the function is avoiding redundant operations.
Even for worst-case scenario with the non-library str_rev_str
implementation the new function runs about 150 times faster than the
original for large strings with many small replacements.

In response to Dik T. Winter: yes, adding NULL-handling is trivial, but
IMO it unnecessarily clutters the code and an empty replacement string
is best represented by "" alone. This is consistent with other string
library functions.

Mildly tested:

/** A version of strstr that works in reverse - starting at string end and
* searching towards the beginning.
*
* strend and tokend must point to the last char before the terminating nul,
* not to the nul itself.
*
* Returns NULL on not found, otherwise a pointer to the last char of tok
* within str ("last char" with the above meaning).
*/
const char *str_rev_str(const char *strend, const char *strstart,
const char *tokend, const char *tokstart) {
const char *t, *last, *p;

for (p = strend; ; p = last - 1) {
for (; *p != *tokend; p--) {
if (p == strstart)
return NULL;
}
last = p;
t = tokend;
do {
if (t == tokstart) {
/* this is undefined if tokend - tokstart > PTRDIFF_MAX */
return p + (tokend - tokstart);
} else if (p == strstart)
return NULL;
} while (*--p == *--t);
if (last == strstart)
return NULL;
}
}

/** Call only when the string is being modified in-place and newlen > oldlen
*/
size_t repl_inplace_rev(char *str, const char *old, const char *new,
size_t oldlen, size_t newlen, size_t orglen, size_t retlen) {
const char *oldend = old + oldlen - 1; size_t count = 0;
char *r = str + retlen;
const char *q, *p = str + orglen - 1;

*r = '\0';
for(; (q = str_rev_str(p, str, oldend, old)) != NULL; p = q - oldlen) {
/* this is undefined if p - q > PTRDIFF_MAX */
ptrdiff_t l = p - q;
count++;
memmove(r - l, q + 1, l);
r -= l;
memcpy(r - newlen, new, newlen);
r -= newlen;
if (q - str < oldlen)
/* don't drop off the beginning of the string when there's a
* match at the beginning */
break;
}

return count;
}

size_t repl_get_lengths(const char *str, const char *old, size_t oldlen,
size_t newlen, size_t *orglen)
{
const char *p, *q;
size_t retlen;

if (oldlen != newlen) {
size_t count = 0;
for (p = str; (q = strstr(p, old)) != NULL; p = q + oldlen)
count++;
/* this is undefined if p - str > PTRDIFF_MAX */
*orglen = p - str + strlen(p);
retlen = *orglen + count * (newlen - oldlen);
} else
retlen = *orglen = strlen(str);

return retlen;
}

size_t str_repl(const char *str, const char *old, const char *new, char *ret) {
char *r;
const char *p, *q;
size_t count = 0;
size_t oldlen = strlen(old);
size_t newlen = strlen(new);

if (ret == NULL || (str == ret && newlen > oldlen)) {
size_t orglen;
size_t retlen = repl_get_lengths(str, old, oldlen, newlen, &orglen);
if (ret == NULL)
return retlen + 1;
count = repl_inplace_rev(ret, old, new, oldlen, newlen, orglen, retlen);
} else {
for (r = ret, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen) {
/* this is undefined if q - p > PTRDIFF_MAX */
ptrdiff_t l = q - p;
/* when ret can't alias str, can use memcpy instead */
memmove(r, p, l);
r += l;
memcpy(r, new, newlen);
r += newlen;
count++;
}
/* when ret can't alias str, can instead use strcpy(r, p); */
memmove(r, p, strlen(p) + 1);
}
return count;
}
 
D

Dik T. Winter

> In response to Dik T. Winter: yes, adding NULL-handling is trivial, but
> IMO it unnecessarily clutters the code and an empty replacement string
> is best represented by "" alone. This is consistent with other string
> library functions.

It indeed is consistent. But what bothers me all the time is the
programs that crash because somebody uses NULL in stead of an empty
string somewhere. For the standard library function I do understand
why NULL is not accepted for an empty string. Why use an empty string
for those functions at all? Can you name me a single string library
function that makes sense to use with the empty string constant ""?
 
N

Netocrat

It indeed is consistent. But what bothers me all the time is the
programs that crash because somebody uses NULL in stead of an empty
string somewhere.

Yes, safety is the only argument I can see for it. And performance is
unlikely to be much of a counter-argument - at least for strrepl.
For the standard library function I do understand why NULL is not
accepted for an empty string. Why use an empty string for those
functions at all? Can you name me a single string library function that
makes sense to use with the empty string constant ""?

I'm not sure about "makes sense" - they are redundant - but "has
reasonable meaning", yes:

strcpy(str, "");
strcat(str, "");

This one arguably makes more sense:

strcmp(str, "");

although *str != '\0' is simpler.
 
W

websnarf

Netocrat said:
I've left the comments in the code for comp.lang.c purposes.


We can do better than that. I hinted at how but got inspired to implement
it.

Your idea has a problem:

char out[1024];
size_t l = str_repl ("ababa", "aba", "cccc", out);

This causes an expand, which therefore will invoke your reverse scan
and replace algorithm. However, as you can see, a correct foward
replace should result in:

"ccccba"

but your implementation will output:

"abcccc"
 
T

those who know me have no need of my name

in comp.lang.c i read:
I like C because is fun. So, I wrote this function for the lcc-win32
standard library: strrepl.

any chance you will stop invading the standard namespace with your
extensions?

otherwise, and not counting bugs in the code, it sounds like a fine
function to have in one's personal library.
 
J

Jordan Abel

in comp.lang.c i read:


any chance you will stop invading the standard namespace with your
extensions?

It's the implementation-reserved namespace, not the standard one. It's
no worse than posix strdup()
 
N

Netocrat

]
Your idea has a problem:

char out[1024];
size_t l = str_repl ("ababa", "aba", "cccc", out);

This causes an expand, which therefore will invoke your reverse scan
and replace algorithm. However, as you can see, a correct foward
replace should result in:

"ccccba"

but your implementation will output:

"abcccc"

Ack, good catch. It also requires two search iterations through the
string where jacob's original requires only one: when making a single
replacement, the original function can be slightly faster on my machine.

Another approach that solves both of these problems is recursion. The
depth could be limited if that's a concern - with small penalties to
performance and code complexity. As it is the function completes even
faster on my machine than the reverse scan approach.

Aside from recursion, this code adds a couple of minor optimisations in
the main function to call memmove conditionally:

size_t repl_get_len(const char *str, const char *old, size_t oldlen,
size_t newlen)
{
const char *p, *q;
size_t retlen;

if (oldlen != newlen) {
size_t count = 0;
for (p = str; (q = strstr(p, old)) != NULL; p = q + oldlen)
count++;
/* this is undefined if p - str > PTRDIFF_MAX */
retlen = p - str + strlen(p) + count * (newlen - oldlen);
} else
retlen = strlen(str);

return retlen;
}

/** Call only when the string is being modified in-place and newlen > oldlen
*/
static void str_repl_recurs(char *str, const char *old, const char *new,
size_t oldlen, size_t newlen, size_t *count)
{
char *p;
size_t count_org = *count;
size_t diff = newlen - oldlen;

if ((p = strstr(str, old)) == NULL) {
if (count_org > 0)
memmove(str + diff * count_org, str, strlen(str) + 1);
} else {
(*count)++;
str_repl_recurs(p + oldlen, old, new, oldlen, newlen, count);
if (count_org > 0)
memmove(str + diff * count_org, str, p - str);
memcpy(p + diff * count_org, new, newlen);
}
}

size_t str_repl(const char *str, const char *old, const char *new, char *ret) {
size_t count = 0;
size_t oldlen = strlen(old);
size_t newlen = strlen(new);

if (ret == NULL) {
return repl_get_len(str, old, oldlen, newlen) + 1;
} else if (str == ret && newlen > oldlen) {
str_repl_recurs(ret, old, new, oldlen, newlen, &count);
} else {
char *r;
const char *p, *q;
for (r = ret, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen) {
/* this is undefined if q - p > PTRDIFF_MAX */
ptrdiff_t l = q - p;
if (count > 0 || str != ret) /* to optimise; can be removed */
memmove(r, p, l);
r += l;
memcpy(r, new, newlen);
r += newlen;
count++;
}
if (count > 0 || str != ret) /* to optimise; can be removed */
memmove(r, p, strlen(p) + 1);
}
return count;
}
 
R

Richard Bos

Netocrat said:
I'm not sure about "makes sense" - they are redundant - but "has
reasonable meaning", yes:

strcpy(str, "");
strcat(str, "");

They don't make sense with empty string _constants_, no. However,
strcpy() cannot know whether it got passed a string variable or a string
constant. Copying a string variable which may or may not be empty
depending on user input (e.g., a "middle name" field for personal data)
makes all the sense in the world.
Making a null pointer equal to an empty string - effectively, passing a
"variable" which is always empty, not depending on user input, does not
make sense, though.

Richard
 
F

Flash Gordon

Richard said:
They don't make sense with empty string _constants_, no. However,
strcpy() cannot know whether it got passed a string variable or a string
constant. Copying a string variable which may or may not be empty
depending on user input (e.g., a "middle name" field for personal data)
makes all the sense in the world.

#define BUILD_TYPE ""
where other possible values include "alpha" and "beta".

/* Stuff building version information in to a string */
strcat(ver_str, BUILD_TYPE);

if (strcmp(BUILD_TYPE,"alpha") == 0) {
/* Warn user it is an alpha build */
}

There are, of course, other ways to achieve this, but it does at least
make some sense.

I'm sure there are other situations where an external (to your piece of
the development) header file defines a string with #define that, under
some situations or building on some machines, might be an empty string.
Making a null pointer equal to an empty string - effectively, passing a
"variable" which is always empty, not depending on user input, does not
make sense, though.

Well, in the above example you could have
#define BUILD_TYPE NULL
but this would only work if the string routines were defined to handle
null pointers appropriately.
 
D

Dik T. Winter

>
> They don't make sense with empty string _constants_, no. However,
> strcpy() cannot know whether it got passed a string variable or a string
> constant. Copying a string variable which may or may not be empty
> depending on user input (e.g., a "middle name" field for personal data)
> makes all the sense in the world.
> Making a null pointer equal to an empty string - effectively, passing a
> "variable" which is always empty, not depending on user input, does not
> make sense, though.

But it does make sense in a function as discussed here.
 
M

Mark McIntyre

It's the implementation-reserved namespace, not the standard one. It's
no worse than posix strdup()

.... and arguably, he's allowed to do this ,because ie is writing an
implementation.
 
J

jacob navia

Mark McIntyre a écrit :
... and arguably, he's allowed to do this ,because ie is writing an
implementation.

Yes, the standard explicitely allows for implementations
to define new "strxxx" functions.
 
N

Netocrat

Dik said:
Richard Bos writes:
[string arguments in standard library functions]
Making a null pointer equal to an empty string - effectively, passing
a "variable" which is always empty, not depending on user input, does
not make sense, though.

But it does make sense in a function [strrepl] as discussed here.

When designing a relational database table it's IMO good practice to
minimise the number of columns that may be NULL [*]. I view the string
functions in the same way. A null pointer is not a string, and should
not be accepted in place of one.

You've pointed out a reasonable safety argument. This goes both ways
though: since the Standard doesn't require a null pointer to be acceptable
as an empty string value even where it could be, allowing the programmer
to do so for this individual function may encourage an unsafe habit.

[*] the meaning of NULL is different in this context but it's similar
enough for the analogy to hold.
 
D

Dave Thompson

When designing a relational database table it's IMO good practice to
minimise the number of columns that may be NULL [*]. I view the string
functions in the same way. A null pointer is not a string, and should
not be accepted in place of one.
Agree here, mostly.
[*] the meaning of NULL is different in this context but it's similar
enough for the analogy to hold.

Although I once had trouble explaining to someone the difference in a
C API between a pointer to a variable used to indicate whether a value
is (SQL) NULL or not, and a null pointer to indicator which cannot so
indicate and thus gives an error for (SQL) NULL value.
- David.Thompson1 at worldnet.att.net
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,172
Messages
2,570,934
Members
47,473
Latest member
ChristelPe

Latest Threads

Top