automated sub-string search/replace algorithm...


Chris M. Thomasson

Here is the code:

http://clc.pastebin.com/RM3hyqVE


I am testing two algorithms here. One created by me, and the other created
by Edward Nilges. Here is the post where I gained the source of Edward's
algorithm:

http://groups.google.com/group/comp.lang.c/browse_frm/thread/ad9fea19f2f7dd61


This is basically the same as my original automated sub-string search:

http://groups.google.com/group/comp.lang.c/browse_frm/thread/38aee559668ee60c


except it generates a random exchange string and builds an expected result
string. It then runs the source string through the given replacement
algorithm and checks to see if everything is kosher.


So far, both Edward's algorithm and mine are passing 10,000,000 iterations.
That's quite a bit of randomly generated data. I cannot seem to find any
bugs in the algorithms as-is.


Anyway, I created this because I got sick and tired of manually generating
test cases!

;^)
 

Chris M. Thomasson

Chris M. Thomasson said:
Here is the code:

http://clc.pastebin.com/RM3hyqVE


I am testing two algorithms here. One created by me, and the other created
by Edward Nilges. Here is the post where I gained the source of Edward's
algorithm:

http://groups.google.com/group/comp.lang.c/browse_frm/thread/ad9fea19f2f7dd61

I feel that I should point out the fact that I very slightly altered
Edward's code. I added `const' qualifiers to some variables, and reformatted
the C++ style comments so I could cleanly compile it. The test compiles
without warning on Comeau.

[...]
 

Ben Bacarisse

Chris M. Thomasson said:
This is basically the same as my original automated sub-string search:

http://groups.google.com/group/comp.lang.c/browse_frm/thread/38aee559668ee60c


except it generates a random exchange string and builds an expected
result string. It then runs the source string through the given
replacement algorithm and checks to see if everything is Kosher.


So far, both Edward's algorithm and mine are passing 10,000,000
iterations. That's quite a bit of randomly generated data. I cannot
seem to find any bugs in the algorithms as-is.

This is a useful example: 10 million tests and you have not yet found
the bug! It is in a case that I'd test early rather than later and it
survived from the first to the 12th of the 13 posted versions of the
code. Have you tried your automatic tests on even earlier versions?
It would be interesting to see what it can pick up.
Anyway, I created this because I got sick and tired of manually
generating test cases!

Hand-picked tests have a lot going for them.
 

Chris M. Thomasson

Ben Bacarisse said:
This is a useful example: 10 million tests and you have not yet found
the bug!
;^)




It is in a case that I'd test early rather than later and it
survived from the first to the 12th of the 13 posted versions of the
code. Have you tried your automatic tests on even earlier versions?

I have only tested my own algorithms, and Edward's. When I get some more
time, I will hunt down and add one of the other algorithms posted here.
Well, I did add your `fast replace' algorithm; here is the test:


http://clc.pastebin.com/TzsdaUNW


Look for the `ben_replace()' function. I had to slightly tweak your code in
order to get it compiling cleanly on a non-C99 compiler. BTW, it passes the
test.

:^)



It would be interesting to see what it can pick up.

I have been trying to "fool it" into giving a false positive by deliberately
introducing slight errors here and there. The test finds these errors every
damn time. I have not been able to fool it yet. I change a 1 to a 0, or
something little, and BAM! The test fails and spits out the offending datum.



Hand-picked tests have a lot going for them.

They sure do. But, I thought it might be a good idea to throw a shi% load of
random data at it. Who knows, the test might present data that ruins the
algorithm. Data that I did not think about testing. So, perhaps a
combination of hand-picked tests and the automated test is the best option.
Humm...
 

Chris M. Thomasson

Chris M. Thomasson said:
I have only tested my own algorithms, and Edward's. When I get some more
time, I will hunt down and add one of the other algorithms posted here.
Well, I did add your `fast replace' algorithm; here is the test:


http://clc.pastebin.com/TzsdaUNW

FWIW, I should probably point out that the following macro defined on line
272:


#define MATCH_MAX 4


is deliberately set to a low value in order to test certain aspects of my
algorithm. The value represents the total number of matches that can be
acquired before I resort to dynamic memory allocation (e.g., `realloc()').
So, if you want the algorithm to "perform better", well, set this value to a
much higher number, say 4096 or something.


[...]
 

Nick Keighley

This is a useful example: 10 million tests and you have not yet found
the bug!

is it? I'd like some numbers. How big is the space of all possible
tests? What fraction of that space is 10 million tests? How are the
"random" test distributed? Do we have any reason to suppose this
distribution matches the critical parts of the algoritm? How do you
generate the result strings? Doesn't your test result generator look
very similar to your replace function?

Hand-picked tests have a lot going for them.

oh yes.

[consider source strings of up to 26 letters and a replacement string
of 25 letters. I make it about 1e15 different tests (26^6 * 26^5 =
26^11). Now you can do better than this but you have to be careful not
to add some systematic bias]
 

Ben Bacarisse

Nick Keighley said:
is it? I'd like some numbers. How big is the space of all possible
tests? What fraction of that space is 10 million tests? How are the
"random" test distributed? Do we have any reason to suppose this
distribution matches the critical parts of the algoritm? How do you
generate the result strings? Doesn't your test result generator look
very similar to your replace function?

Why are you asking me? I can answer these questions if you like, but
what is the point of them?

It doesn't matter how many tests are done. I suspect that, since the
bug invokes UB, Chris's test program can't find it (at least it can't
be assured of finding it). My point (such as it was) is that the
number of random tests passed bears little relation to the confidence
one should have in a piece of software.
Hand-picked tests have a lot going for them.

oh yes.

[consider source strings of up to 26 letters and a replacement string
of 25 letters. I make it about 1e15 different tests (26^6 * 26^5 =
26^11). Now you can do better than this but you have to be careful not
to add some systematic bias]

I am not sure of the point of this counting. I'm not even sure what
you are counting. I think you agree that carefully chosen tests are
the way to go. This choosing might be done by machine (at least the
last time I looked there was some very promising work in that area)
but they usually have to be picked by humans.
 

Nick Keighley

there is a bug? I think I misunderstood you. I thought you were saying
passing 10M tests significantly increased our confidence in the
program. You were actually saying it didn't!

Why are you asking me?  I can answer these questions if you like, but
what is the point of them?

they are rhetorical. I suspect the tested space is enormously smaller
than the entire test space. Even if we limit our string sizes to very
finite values.
It doesn't matter how many tests are done.

well, some problems might submit to brute force search
I suspect that, since the
bug invokes UB, Chris's test program can't find it (at least it can't
be assured of finding it).  My point (such as it was) is that the
number of random tests passed bears little relation to the confidence
one should have in a piece of software.

I think we agree


the next bit was pure gibberish, so I have changed it

replace:-
[consider source strings of up to 26 letters and a replacement string
[...]

with:-

[consider source strings of up to 6 letters and a replacement string
of 5 letters. I make that about 1e15 different tests (26^6 * 26^5 =
26^11). Now you can do better than this but you have to be careful not
to add some systematic bias]

I think the task of analysing a test program that runs in non-
geological time to show it is a reasonable test is comparable to the
task of proving the correctness of the original program.
I am not sure of the point of this counting.  I'm not even sure what
you are counting.

I'm trying, incompetently, to show just how nasty combinatorial
explosion can be.
 

Ben Bacarisse

Nick Keighley said:
there is a bug? I think I misunderstood you. I thought you were saying
passing 10M tests significantly increased our confidence in the
program. You were actually saying it didn't!

Yup. Chris accidentally took a version of the code that had a known
bug in it so, to me, the test became interesting: Can these random tests
find the bug we know to be there? It seems not (and I am not really
surprised).
they are rhetorical. I suspect the tested space is enormously smaller
than the entire test space. Even if we limit our string sizes to very
finite values.


well, some problems might submit to brute force search

Sure. I meant in this case.
I think we agree

I think we do!

<snip>
 

Malcolm McLean

 My point (such as it was) is that the
number of random tests passed bears little relation to the confidence
one should have in a piece of software.
As an example, consider a naive implementation of qsort(). If you
feed it random permutations, performance will almost always be O(N log N).
It's only when you feed it sorted data that it will overflow the stack.
However, sorted data is quite likely to be fed to the function when
it is used for real.
 

ImpalerCore

As an example, consider a naive implementation of qsort(). If you
feed it random permutations, performance will almost always be O(N log N).
It's only when you feed it sorted data that it will overflow the stack.
However, sorted data is quite likely to be fed to the function when
it is used for real.

I agree in general, that random sets of data do not test for crashes
as well as boundary testing, and passing sorted data is one such
boundary test to a recursive sorting algorithm.

On a side note, if you know that qsort has issues with getting fed
almost sorted data, can't you just do an O(n) random permutation step
on the data to mitigate those problems?

for each i in array
create random index k between 0 and array size
swap array[i] with array[k]

I have a recursive (possibly naive) implementation of quicksort that
I've been using, but I haven't exercised it on data sets large enough
to generate stack overflows (yet). However, I do run into the use
case where quicksort is passed almost sorted data. I'm just wondering
if this is a viable (if hackish) method to encourage better average
performance for almost sorted data sets.
 

Eric Sosman

[...]
On a side note, if you know that qsort has issues with getting fed
almost sorted data, can't you just do an O(n) random permutation step
on the data to mitigate those problems?

That could reduce the likelihood of pathological behavior
in Quicksort (if qsort() uses Quicksort), but would not eliminate
it. It might, in fact, scramble an "easy" sequence into a "hard"
one. Also, there are cheaper ways to obtain a similar effect
than to generate N random numbers (you can do it with just one
random number per partitioning pass).
for each i in array
create random index k between 0 and array size
swap array[i] with array[k]


Note that this scrambling algorithm is "random," but biased:
Some rearrangements are more likely than others.
I have a recursive (possibly naive) implementation of quicksort that
I've been using, but I haven't exercised it on data sets large enough
to generate stack overflows (yet). However, I do run into the use
case where quicksort is passed almost sorted data. I'm just wondering
if this is a viable (if hackish) method to encourage better average
performance for almost sorted data sets.

Read Bentley and McIlroy's "Engineering a Sort Function."
(Note that their final version of qsort() sometimes fails to
satisfy 7.20.5p2. I don't see the reason for 7.20.5p2, and the
Rationale is unenlightening, but it's there nonetheless.)
 

Malcolm McLean

On Mar 1, 8:51 am, Malcolm McLean <[email protected]>
wrote:


On a side note, if you know that qsort has issues with getting fed
almost sorted data, can't you just do an O(n) random permutation step
on the data to mitigate those problems?

for each i in array
  create random index k between 0 and array size
  swap array[i] with array[k]

I have a recursive (possibly naive) implementation of quicksort that
I've been using, but I haven't exercised it on data sets large enough
to generate stack overflows (yet).  However, I do run into the use
case where quicksort is passed almost sorted data.  I'm just wondering
if this is a viable (if hackish) method to encourage better average
performance for almost sorted data sets.

to write shuffle (untested):

void shuffle(void *data, size_t N, size_t width)
{
    size_t i;
    size_t target;

    for (i = 0; i < N - 1; i++)
    {
        target = (size_t)((rand() / (RAND_MAX + 1.0)) * (N - i) + i);
        memswap((unsigned char *)data + (i * width),
                (unsigned char *)data + (target * width), width);
    }
}

Note deliberate bug in this code. Anyone spot it?
Another snag is that memswap isn't a standard library function. In
fact it's hard to provide an efficient, portable implementation.

Quicksort needs a pivot. The most convenient pivot is the first item
in the array. The problem is that if you choose it, and the data is
sorted or almost sorted, the tree always has one branch with a single
cell. Thus stack overflow is likely.
The generally accepted way to get round the problem is by choosing
three random candidates for pivot, then taking the middle one.
 

Chris M. Thomasson

Ben Bacarisse said:
Yup. Chris accidentally took a version of the code that had a known
bug in it so, to me, the test became interesting: Can these random tests
find the bug we know to be there? It seems not (and I am not really
surprised).

Are you referring to Edward's code? I don't know where the most recent
version is. Could you be so kind as to point me toward the bug please? Humm,
is it this one:

http://groups.google.com/group/comp.lang.c/msg/e3677e5f0cb30c43


Here is the version I am working with:

http://groups.google.com/group/comp.lang.c/msg/2c2eb92cd6dbe071


And if I manually feed it:

replace("a", "x", "b");


I am getting a result string of "a".




FWIW, here is a bugged version of Edward's code:

http://groups.google.com/group/comp.lang.c/msg/3b35d28b8e85b2e2

and the test finds an error by seg-faulting...


[...]
 

ImpalerCore

[...]
On a side note, if you know that qsort has issues with getting fed
almost sorted data, can't you just do an O(n) random permutation step
on the data to mitigate those problems?

     That could reduce the likelihood of pathological behavior
in Quicksort (if qsort() uses Quicksort), but would not eliminate
it.  It might, in fact, scramble an "easy" sequence into a "hard"
one.  Also, there are cheaper ways to obtain a similar effect
than to generate N random numbers (you can do it with just one
random number per partitioning pass).
for each i in array
   create random index k between 0 and array size
   swap array[i] with array[k]


     Note that this scrambling algorithm is "random," but biased:
Some rearrangements are more likely than others.


Can you explain what is biased? Is the bias from the RNG as a
quality-of-implementation issue, or from the method of iterating over the array?
     Read Bentley and McIlroy's "Engineering a Sort Function."
(Note that their final version of qsort() sometimes fails to
satisfy 7.20.5p2.  I don't see the reason for 7.20.5p2, and the
Rationale is unenlightening, but it's there nonetheless.)

Thanks for the link. I'm pretty rusty on algorithm stuff now compared
to when I learned it in college.
 

Chris M. Thomasson

Ben Bacarisse said:
Why are you asking me? I can answer these questions if you like, but
what is the point of them?

It doesn't matter how many tests are done. I suspect that, since the
bug invokes UB, Chris's test program can't find it (at least it can't
be assured of finding it).

It will only be able to find the bug if the resulting undefined behavior
ends up spitting out a string that is different than the expected string.
Unfortunately, I find Edward's code a bit cryptic and am having trouble
manually finding the bug which would create a broken result. Can you give me
a hint? lol.

;^)



My point (such as it was) is that the
number of random tests passed bears little relation to the confidence
one should have in a piece of software.

Agreed. However, it did "help me" find a couple of bugs in a sub-string
search algorithm. Apparently, I was not able to manually construct a string
that broke it. Luckily, the random testing did create a couple of them. I
hope I did not imply that passing 10,000,000 tests means that the algorithm
is 100% bug free. If I did, well, I apologize for that total nonsense!
 

Malcolm McLean

On 3/1/2010 10:33 AM, ImpalerCore wrote:
for each i in array
   create random index k between 0 and array size
   swap array[i] with array[k]

     Note that this scrambling algorithm is "random," but biased:
Some rearrangements are more likely than others.

Can you explain what is biased? Is the bias from the RNG as a
quality-of-implementation issue, or from the method of iterating over the array?

The first element should have a 1/N chance of remaining in place.
Initially it does: you swap it with a random element, which may
include itself.
However, you then subject it to further potential swaps, and there's
no way for it to get back to the first position if one of those later
swaps moves it. (I think; as I work it out in my head, if there is a
way it's vanishingly improbable.) So its chance of remaining in first
place is less than 1/N. Quite substantially less, in fact; the chance
that the first element will be chosen in a later random swap is quite
high: p = 1 - ((N-1)/N)^N, which works out at about 0.63 if N is 100.
 

ImpalerCore

On Mar 1, 8:51 am, Malcolm McLean <[email protected]>
wrote:
On a side note, if you know that qsort has issues with getting fed
almost sorted data, can't you just do an O(n) random permutation step
on the data to mitigate those problems?
for each i in array
create random index k between 0 and array size
swap array[i] with array[k]

I have a recursive (possibly naive) implementation of quicksort that
I've been using, but I haven't exercised it on data sets large enough
to generate stack overflows (yet). However, I do run into the use
case where quicksort is passed almost sorted data. I'm just wondering
if this is a viable (if hackish) method to encourage better average
performance for almost sorted data sets.

to write shuffle (untested):

void shuffle(void *data, size_t N, size_t width)
{
    size_t i;
    size_t target;

    for (i = 0; i < N - 1; i++)
    {
        target = (size_t)((rand() / (RAND_MAX + 1.0)) * (N - i) + i);
        memswap((unsigned char *)data + (i * width),
                (unsigned char *)data + (target * width), width);
    }
}

Note deliberate bug in this code. Anyone spot it?
Another snag is that memswap isn't a standard library function. In
fact it's hard to provide an efficient, portable implementation.


I didn't notice the bug from just observation of the code.

rand() - number between 0 and RAND_MAX
RAND_MAX + 1.0 - convert to floating point
x = rand() / (RAND_MAX + 1.0) - should be floating point range between
0 <= x < 1

if i == 0
target = (x) * N, range 0 <= target < N

if i == N-2
target = (x) * 2 + N-2, range N-2 <= target < N

It seems ok, so I'm obviously missing the error.

However, running some test samples seemed to show that the target for
i = 0 never changed with different seeds. I'm not familiar enough
with rand to know why that is. Or maybe I missed the bug.
Quicksort needs a pivot. The most convenient pivot is the first item
in the array. The problem is that if you choose it, and the data is
sorted or almost sorted, the tree always has one brach with a single
cell. Thus stack overflow is likely.
The generally accepted way to get round the problem is by choosing
three random candidates for pivot, then taking the middle one.

I used the "randomizer" when I needed to merge two sorted arrays of
varying sizes into a single sorted array, and the pivot I chose in my
quicksort is the first element in the partition. I'll look into
modifying the algorithm to use a random pivot.

Thanks,
John D.
 
