Challenge: tightest code to find-replace a string

B

Ben Bacarisse

BartC said:
I tried something based on your code (but then rewritten). It worked fine,
until I tried it on "ababac", which didn't work.

I tried it on s/a/b/ in input "aaa" and got "bab". From the symptom,
that looks easy to fix but maybe not (I find your code hard to reason
about).

int main(int argc, char **argv) {
#define infile "kkk1"
#define outfile "kkk2"
FILE *fin,*fout;
int i,c;

fin=fopen(infile,"r");
fout=fopen(outfile,"w");
if (!fin || !fout) exit(1);

replace_in_file("abac","ABAC",fin,fout);
fclose(fin);
fclose(fout);
}

Do you really find this sort of test driver to be simpler than one I
posted? I am trying to imagine your testing procedure. Do you have the
source file the data file open in an editor and you keep re-compiling?
How you verify the output? How you build-up a set of regression tests?
 
M

Malcolm McLean

It matters very much if the handling of overlapping cases can trigger an
infinite loop.
it mustn't crash or fail to return on any input. But it probably doesn't matter
if replace 'abcabc' 'x' called on abcabcabc returns xabc xx or even abcabcabc.
And even for use on human-readable text, I fail to see how a precise and
unambiguous definition of the behavior is unimportant.
The requirement for a function can be something like "makes the space invader
crash realistically". That's not reducible to a pure mathematical specification.
 
K

Keith Thompson

Malcolm McLean said:
it mustn't crash or fail to return on any input. But it probably
doesn't matter if replace 'abcabc' 'x' called on abcabcabc returns
xabc xx or even abcabcabc.
Nonsense.

The requirement for a function can be something like "makes the space
invader crash realistically". That's not reducible to a pure
mathematical specification.

Yes, the requirement for *some* functions can be that vague.

But a string replacement function *can* be specified precisely,
and there's no excuse for not doing so. And if it doesn't behave
consistently, I'm not interested in using it; I'll just use a
different function that's properly defined.

Are you perhaps assuming that it doesn't matter because
search-and-replace is performed interactively, and therefore the
user can immediately see the results? Are you familiar with sed?
 
M

Malcolm McLean

Are you perhaps assuming that it doesn't matter because
search-and-replace is performed interactively, and therefore the
user can immediately see the results? Are you familiar with sed?
I'm familiar with the name but I've never used it.
If you're implementing see presumably you have to implement the quirks
as well, to avoid breaking existing scripts.
But if "search and replace" is called on overlapping input, then almost
always it indicates that the caller didn't expect the pattern to be applied
to that situation. There's no ideal behaviour with a "pattern / replacement"
interface. Probably you should pass an extra parameter in to indicate whether
to be greedy, non-greedy, or flag up an error.
 
R

Richard Bos

Malcolm McLean said:
it mustn't crash or fail to return on any input. But it probably doesn't matter
if replace 'abcabc' 'x' called on abcabcabc returns xabc xx or even abcabcabc.

On the contrary. If my word processor gave me anything but "xabc" under
those circumstances, I'd start using another.

Richard
 
B

BartC

Ben Bacarisse said:
I tried it on s/a/b/ in input "aaa" and got "bab". From the symptom,
that looks easy to fix but maybe not

Yes; the next char after a successfully matched sequence is output directly,
instead of being tested for the start of a new sequence.

A carefully placed goto fixed that. (Alternatively, replacing
'outchar(fout,c)' in the following by 'pendcount=0; pendchar=c;' seemed to
work.)

if (*mp==0) { /* Full match obtained ... */
outstrn(fout,repl,repllen);
++nmatches;
outchar(fout,c);

Probably there are other issues; I just wanted to see what the code might
look like (ie. not that elegant).
(I find your code hard to reason about).

I deliberately used a non-nested loop with some state (inside=0 or 1) to
know what it was up to.
Do you really find this sort of test driver to be simpler than one I
posted? I am trying to imagine your testing procedure. Do you have the
source file the data file open in an editor and you keep re-compiling?

I had several versions of the replace_in_file() call, of which all but one
were commented out. For this simple program, I just stuck with the same
input file, although could easily have been switched too within the source.
(For a change, I used actual C to develop this, and it was the braces that
caused most problems.)
How you verify the output?

By visual examination. I had a last line that was system("ed " outfile);
which displayed the output.
How you build-up a set of regression tests?

If this was a serious project rather than just a diversion (from my compiler
projects), then I'd use a different approach. Maybe then it's worth the
effort of setting up the input data from the command line parameters, which
I've always found fiddly.

(Or more likely, I would have the test cases in a table of strings within
the program. The inchar() and outchar() routines lend themselves to being
adapted for string input and output.)
 
S

Stefan Ram

On the contrary. If my word processor gave me anything but "xabc" under
those circumstances, I'd start using another.

You are responding to:

|... called on abcabcabc returns xabc or even abcabcabc.

Malcolm wrote:

|... called on abcabcabc returns xabc xx or even abcabcabc.

Malcolm failed to clearly indicate the boundaries of the string.
Still, taking » xx« to be a part of the string, is the only way
to parse the English sentence, as long as there is no comma after
the »xabc«.
 
S

Stefan Ram

Still, taking » xx« to be a part of the string, is the only way

#include <stdio.h>
#include <string.h> /* for memset */
#include <stdlib.h>

/* streaming search-and-replace */

/* this is an internal helper function for the following srp function. */

static void srp_write( char const * from, size_t length, void( *put )( char ) )
{ char a; while( a = *from++, length-- )put( a ); }

/*
Copies from input to output replacing occurences of from-string by to-string.

@param put this must be a putchar-like function representing the
output to write the result to.

@param get a getchar-like function representing the
input to read from.

@param from the search string. It might contain
0-characters (untested). The function will search
for that string in the input.

@param from_length the length of the search string.
The function will dynamically allocate two buffers
with approximately as many bytes as this from_length.

@param to the replacement string. It might contain
0-characters (untested). The function will replace
instances of the search string by this replacement string.

@param to_length the lenght of the replacement string.

@return 0 for success, other values to indicate lack of
dynamic memory. */

int srp
( void( *put )( char ),
int( *get )(),
char const * const restrict from, size_t const from_length,
char const * const restrict to, size_t const to_length )
{ /* algorithm was invented by Stefan Ram while writing this function */
char * const b = malloc( 2 + from_length ); if( !b )return 1;
char * const s = malloc( 2 + from_length ); if( !s ){ free( b ); return 2; }
int a; size_t o = 0; int t = 0;
while( a =( t > 0 ? s[ --t ]: get() ), a > 0 )
{ if( o >= from_length ){ srp_write( to, to_length, put ); o = 0; }
{ if( a == o[ from ]){ o[ b ]=( char )a; ++o; }
else { if( o ){ put( b[ 0 ]); s[ t++ ]=( char )a;
for( size_t i = o - 1; i > 0; -- i )
s[ t++ ]= i[ b ]; o =( size_t )0; } else put(( char )a ); }}}
if( o ){ if( o >= from_length )srp_write( to, to_length, put );
else srp_write( b, o, put ); } free( s ); free( b ); return 0; }
 
S

Stefan Ram

#include <string.h> /* for memset */

It seems that this was an relict from a previous version and is
not necessary anymore.
while( a =( t > 0 ? s[ --t ]: get() ), a > 0 )

To make this process NUL characters ('\0'), the condition
»a > 0« should be changed into »a >= 0«. As I have not tested
NUL character processing, it might still not work, but this is
one step into this direction.
 
S

Stefan Ram

To make this process NUL characters ('\0'), the condition

The new version now needs just a single buffer:

int srp( void( *put )( char ), int( *get )(),
char const * const restrict from, size_t const from_length,
char const * const restrict to, size_t const to_length )
{ /* algorithm was invented by Stefan Ram while writing this function */
char * const s = malloc( 2 + from_length ); if( !s )return 1;
int a; size_t o = 0; int t = 0;
while( a =( t > 0 ? s[ --t ]: get() ), a >= 0 )
{ if( o >= from_length ){ srp_write( to, to_length, put ); o = 0; }
{ if( a == o[ from ])++o; else { if( o )
{ put( from[ 0 ]); s[ t++ ]=( char )a;
for( size_t i = o - 1; i > 0; -- i )
s[ t++ ]= i[ from ]; o =( size_t )0; } else put(( char )a ); }}}
if( o ){ if( o >= from_length )srp_write( to, to_length, put );
else srp_write( from, o, put ); } free( s ); return 0; }

In the meantime, a single test with a NUL character in the
input and the search string was passed.
 
K

Keith Thompson

Malcolm McLean said:
I'm familiar with the name but I've never used it.

It's a non-interactive tool, common (in fact, nearly universal) on
UNIX-like systems. The name is derived from the phrase "stream editor".
It's commonly used in scripts (i.e., programs written in the language
implemented by a shell).
If you're implementing see presumably you have to implement the quirks
as well, to avoid breaking existing scripts.
But if "search and replace" is called on overlapping input, then
almost always it indicates that the caller didn't expect the pattern
to be applied to that situation. There's no ideal behaviour with a
"pattern / replacement" interface. Probably you should pass an extra
parameter in to indicate whether to be greedy, non-greedy, or flag up
an error.

I've been using sed for decades. As far as I know, it has no flag to
indicate different semantics for overlapping input, and I've never
encountered a need for such a flag.

The POSIX specification is here:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

The relevant command in this context would be:

sed 's/foo/bar/g'

In the documentation, a BRE is a "Basic Regular Expression"; in this
case, we're considering a simple string. The 'g' suffix causes it to
replace all non-overlapping instances of "foo"; by default it only
replaces the first on each line.

An example:

$ echo banana | sed 's/ana/Anana/g'
bAnanana

As far as I can tell, no other output would be consistent with the
specification (there are two occurrences of "ana", but they overlap so
only the first is replaced) -- and I would consider any other output
unacceptable.

Though I can't think of a concrete example, it's likely I've used sed on
text with overlapping occurrences of a substitution pattern and relied
on the way it behaves.

The problem you perceive with overlapping input is not a problem
in practice. There is no real ambiguity in the behavior, and
therefore no need for extra flags to resolve it.
 
S

Stefan Ram

In the meantime, a single test with a NUL character in the
input and the search string was passed.

I should add that the search string's length must be
greater than 0. (This is a part of the interface
specification.)

And here is a collection of test cases, including the
length of each string, in the order: input, replace_from,
replace_to, and expected_output:

test( "aaa", 3, "a", 1, "b", 1, "bbb", 3 );
test( "abcabcabc", 9, "abcabc", 6, "x", 1, "xabc", 4 );
test( "alpha", 5, "alpha", 5, "epsilon", 7, "epsilon", 7 );
test( "ababac", 6, "abac", 4, "ABAC", 4, "abABAC", 6 );
test( "alphabebetagamma", 16, "bebe", 4, "epsilon", 7, "alphaepsilontagamma", 19 );
test( "alphabebetagamma", 16, "beta", 4, "epsilon", 7, "alphabeepsilongamma", 19 );
test( "alphabebebetagamma", 18, "beta", 4, "epsilon", 7, "alphabebeepsilongamma", 20 );
test( "abcdefghi", 9, "def", 3, "DEF", 3, "abcDEFghi", 9 );
test( "__01234__", 9, "01234", 5, "!", 1, "__!__", 5 );
test( "__012301234__", 13, "01234", 5, "!", 1, "__0123!__", 9 );
test( "__0123012301234__", 17, "01234", 5, "!", 1, "__01230123!__", 13 );
test( "__0123401234__", 14, "01234", 5, "!", 1, "__!!__", 6 );
test( "0123401234__", 12, "01234", 5, "!", 1, "!!__", 4 );
test( "__0123401234", 12, "01234", 5, "!", 1, "__!!", 4 );
test( "alpha", 5, "alpha", 5, "", 0, "", 0 );
test( "aalpha", 6, "alpha", 5, "", 0, "a", 1 );
test( "aaalpha", 7, "alpha", 5, "", 0, "aa", 2 );
test( "alphaX", 6, "alpha", 5, "", 0, "X", 1 );
test( "aalphaX", 7, "alpha", 5, "", 0, "aX", 2 );
test( "aaalphaX", 8, "alpha", 5, "", 0, "aaX", 3 );
test( "", 0, "alpha", 5, "", 0, "", 0 );
test( "alpha", 5, "a", 1, "x", 1, "xlphx", 5 );
test( "abcd\0fghi", 9, "d\0f", 3, "DEF", 3, "abcDEFghi", 9 );

The last test case includes a NUL character.
 
C

Chris M. Thomasson

"DFS" wrote in message news:[email protected]...
* reads an existing file
* writes changes to new file
* counts replacements made by line
* counts total replacements made
* no fancy usage of sed!

Can I get the most recent requirements as a summary?

AFAICT, you want a one-to-many transformation.

A simple example:

s = "abcdefgabcdefghijklmnop"


A possible transformation could be:

"abcdefg" => "aaaAAaa"



So:

s_t = "aaaAAaaaaaAAaahijklmnop"


Right?



FWIW, this can be easily rendered into a case-sensitive function.
 
M

Malcolm McLean

As far as I can tell, no other output would be consistent with the
specification (there are two occurrences of "ana", but they overlap so
only the first is replaced) -- and I would consider any other output
unacceptable.
It's slightly easier to write a greedy matcher. So probably that's what
they did, and wrote the specifications afterwards.
The problem you perceive with overlapping input is not a problem
in practice. There is no real ambiguity in the behavior, and
therefore no need for extra flags to resolve it.
No it's not. If you call a pattern on an input with overlapping
sequences, almost certainly you haven't thought through the output
you actually want. The fix is usually to extend the pattern.
 
S

Stefan Ram

Malcolm McLean said:
No it's not. If you call a pattern on an input with overlapping
sequences, almost certainly you haven't thought through the output
you actually want. The fix is usually to extend the pattern.

In the special case that I gave, the replacements where also
overlapping. Only then is it possible to do both replacements
at the same time.

Overlapping strings occur rarely. It might make sense when
information is stored in a compressed manner and strings are
obtained from a large buffer by offset and length. It would
be funny if one could find such as case in the DNA for
protein encoding.

However, my point was something else:

Before the implementation of an interface can be
tackled in source code, its specification in the
English language must be complete and clear.

This means that even requirements that might seem
»self-evident« to some parties still needs to be stated
in a univocal manner.

Dennis Ritchie, Donald E. Knuth, were or are they not all
masters of the English language, too?

»I've found that some of the best [Software ]developers
of all are English majors. They'll often graduate with
no programming experience at all, and certainly without
a clue about the difference between DRAM and EPROM.

But they can write. That's the art of conveying
information concisely and clearly. Software development
and writing are both the art of knowing what you're going
to do, and then lucidly expressing your ideas.«

Paul Potts
http://praisecurseandrecurse.blogspot.com/2007/03/english-majors-as-programmers.html

»Besides a mathematical inclination, an exceptionally
good mastery of one's native tongue is the most vital
asset of a competent programmer.«

Edsgar Dijkstra

»While sloppy writing does not invariably mean sloppy
thinking, we've generally found the correlation to be
strong -- and we have no use for sloppy thinkers.
If you can't yet write competently, learn to.«

Eric Raymond

http://www.catb.org/~esr/faqs/hacker-howto.html#skills4

»The narrative measures of conjunction use, event
content, perspective shift, and mental state reference
were significantly predictive of later Math scores.«

http://www.arts.uwaterloo.ca/~doneill/papers/Storytelling and math.pdf

»I have never, ever, ever seen a great software developer
who does not have amazing attention to detail.«

http://www.softwarebyrob.com/articles/Personality_Traits_of_the_Best_Software_Developers.aspx
 
S

Stefan Ram

However, my point was something else:

And also:

Whenever a part of requirements for a program
is given in the subject line of a Usenet post,
it should be repeated in the body of the message.
 
S

Stefan Ram

information is stored in a compressed manner and strings are
obtained from a large buffer by offset and length. It would
be funny if one could find such as case in the DNA for
protein encoding.

The chromosome of the phi X 174 virus encodes nine different
proteins. But the virus DNA actually seemed to be too small
to encode them all!

In 1977, Fredrick Sanger discovered that the same parts of
DNA of this virus was encoding more than one protein.

An online encyclopedia uses the term:

»its highly overlapping genome«

http://en.wikipedia.org/wiki/Phi_X_174
 
C

Chris M. Thomasson

"Chris M. Thomasson" wrote in message
"DFS" wrote in message news:[email protected]... [...]
Can I get the most recent requirements as a summary?
AFAICT, you want a one-to-many transformation.
A simple example:
s = "abcdefgabcdefghijklmnop"
A possible transformation could be:
"abcdefg" => "aaaAAaa"
So:
s_t = "aaaAAaaaaaAAaahijklmnop"
Right?


WRT "infinitely" overlapping search string in the to be transformed string:

s = "01"

Transform:

"01" => "0101"

s[0] = "01"
s[1] = "0101"
s[2] = "01010101"
s[3] = "0101010101010101"
[...]


You would have to set an iteration threshold to help the
program avoid an infinite loop, and an ever expanding
buffer.


It expands like a L-system fractal.
 
B

BartC

Ben Bacarisse said:
How you verify the output? How you build-up a set of regression tests?

Stefan Ram kindly posted a set of test cases (see elsewhere in the thread).

They all passed in my code (after converting to operating with string i/o),
except two.

One involved an embedded nul character. The other was an odd one, because it
worked with the file version. But I've had to change the code dealing with
the remaining characters on exiting the main loop, to make that work.

The new code is here: http://pastebin.com/wXyUSjcw

This contains one test case. Stefan's list of test cases can be pasted
directly into the main() function. (I didn't want to distribute someone
else's code on that site.)

(BTW if you think my code is difficult to follow, have a look at his
find-replace code, and see if you still have the same opinion...)
 
S

Stefan Ram

Chris M. Thomasson said:
You would have to set an iteration threshold to help the
program avoid an infinite loop, and an ever expanding
buffer.

In C, one can write: »while(1);«.

Not having an iteration threshold in C means that the
language is simple and fast, albeit not without risks.

One can as well decide to specify requirements in this
spirit of C: »If the client calls me to have an infinite
loop then this be his will.« So, there is no need for an
iteration threshold, just to avoid infinite loops.
(In a given context, there still might be other valid
reasons to specify such a threshold.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,091
Messages
2,570,605
Members
47,225
Latest member
DarrinWhit

Latest Threads

Top