substring finding problem!

Nick

Ben Bacarisse said:
I've done the same and I get very different results for Chris's code
and for yours. Yours is the fastest on long strings, presumably
because you use strstr. Chris's is next fastest on long strings.


Yes, though it is reasonably predictable: for example, Chris's
code is good for long strings and slow for short ones.

You get little difference between Chris's code and mine, but I get a
huge factor. Could just be a difference in hardware or cache/memory
configuration.

You're both using gcc, but are you using the same library? Someone
posted earlier about an implementation of strstr that uses Boyer-Moore
for long strings. The BSD implementation I found on the web uses strchr
to find the first character of the match, then strncmp to see if they
are the same. So if one of you has one of these and the other the other
....
 

Ben Bacarisse

Nick said:
You're both using gcc, but are you using the same library? Someone
posted earlier about an implementation of strstr that uses Boyer-Moore
for long strings. The BSD implementation I found on the web uses strchr
to find the first character of the match, then strncmp to see if they
are the same. So if one of you has one of these and the other the other
...

Good point. It could certainly be a library issue. My library's
strstr does not use Boyer-Moore, but it does use an enhanced search
algorithm that really pays off for long strings.
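For anyone curious, the strchr-plus-strncmp approach described above can be
sketched in a few lines. This is a hypothetical reconstruction (my_strstr is
my own name for it), not the actual BSD source:

```c
#include <string.h>

/* Hypothetical sketch of the BSD-style strstr described above:
   strchr locates each candidate first character, strncmp checks
   whether the rest of the needle follows.  Not the real BSD code. */
char *my_strstr(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);

    if (n == 0)
        return (char *)haystack;      /* empty needle matches immediately */

    while ((haystack = strchr(haystack, needle[0])) != NULL) {
        if (strncmp(haystack, needle, n) == 0)
            return (char *)haystack;  /* full match at this candidate */
        haystack++;                   /* keep scanning past this char */
    }
    return NULL;                      /* first char never occurs again */
}
```

Like a naive search, this is still O(haystack * needle) in the worst case;
the Boyer-Moore variant mentioned above is what pays off for long needles.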
 

bartc

Ben Bacarisse said:
I assume these results are for 4004 byte long strings with 2 short
replacements. If so, here are my comparative timings:
thomasson (O2) 4.08 seconds
cmt_replace(`cat d4004`, "[]", "xx"):
4162900 calls in 4.000s is 960.8ns/call (9.608e-07s/call)
thomasson (O3) 4.08 seconds
cmt_replace(`cat d4004`, "[]", "xx"):
4202100 calls in 4.000s is 951.9ns/call (9.519e-07s/call)

Etc.

I haven't looked at the code involved (I've lost my way around the thread
and the routines anyway seem to be getting complex). But are you replacing
the 2-char "[]" string with a 2-char "xx" string?

That would suggest a simple optimisation (since the result must be the same
length as the original string?) so wouldn't a replacement string that is
longer and/or shorter than the old one be better?

(So that you don't compare an implementation that does optimise, with one
that doesn't.)
 

Ben Bacarisse

bartc said:
Ben Bacarisse said:
I assume these results are for 4004 byte long strings with 2 short
replacements. If so, here are my comparative timings:
thomasson (O2) 4.08 seconds
cmt_replace(`cat d4004`, "[]", "xx"):
4162900 calls in 4.000s is 960.8ns/call (9.608e-07s/call)
thomasson (O3) 4.08 seconds
cmt_replace(`cat d4004`, "[]", "xx"):
4202100 calls in 4.000s is 951.9ns/call (9.519e-07s/call)

Etc.

I haven't looked at the code involved (I've lost my way around the
thread and the routines anyway seem to be getting complex). But are
you replacing the 2-char "[]" string with a 2-char "xx" string?
Yes.

That would suggest a simple optimisation (since the result must be the
same length as the original string?) so wouldn't a replacement string
that is longer and/or shorter than the old one be better?

(So that you don't compare an implementation that does optimise, with
one that doesn't.)

That particular test is not mine. I was duplicating B L Massingill's
data so I had to use the same pattern, but I did check that this case
is not treated in any special way.
 

blmblm

I said last week that my replace function integrates two different
things.

How does that address my question (about the claim that Seebs's
proposed approach to your challenge was "unethical behavior")?

[ snip ]
It contains a power of two in each element; in that scenario you
wouldn't use * you would use shift. Look, I was just having a little
fun with Seebie.

And giving the impression that perhaps you weren't quite sure what
matrix multiplication was. Well, whatever.

I'm having a little trouble thinking of situations in which I would
use the shift operator when the desired operation is conceptually a
multiplication (as opposed to a bit shift); to me this seems like
a micro-optimization that would make the code more difficult to
understand without necessarily making it faster. *Maybe* if I had
evidence that the speed of this particular operation was critical,
and that the compiler was not already producing shift instructions
(as, experiment suggests, happens at least sometimes) ....

I'm having even more trouble thinking of how one might express
multiplication by a matrix all of whose elements are powers of two
(would the entries be the exponents? -- i.e., if element a[0][0]
was meant to represent 16, would one encode that as 16, or as 4?),
but -- whatever.
 

blmblm

For reference, I pulled out your V3 and ran the same sets of tests I
just posted against Richard Harter's second version. Here are the times
I got:

(Very interesting, and I'm trying to keep track of all of these
results so I can look over them when I have the time/energy to try
to understand them. :) )

[ snip ]
If you want to include my current favourite in your tests it is:

For the record -- yes, this is the version of your code I've been
using.

[ snip ]
 

blmblm

Interesting!

You're both using gcc, but are you using the same library? Someone
posted earlier about an implementation of strstr that uses Boyer-Moore
for long strings. The BSD implementation I found on the web uses strchr
to find the first character of the match, then strncmp to see if they
are the same. So if one of you has one of these and the other the other
...

Times I've posted thus far have been collected on an oldish and
slowish machine, using gcc 4.0.1 and glibc 2.3.5. Details of
hardware configuration on request. I'll try my tests again on
a newer/faster machine, maybe, and report back.
 

blmblm

Thanks. Useful to compare our results.

Indeed (though I'm having a little trouble keeping up, and understanding
what they mean .... ) .

[ snip ]
I used -O3 throughout. I'll post the -O2 times as well, but in my
case I get faster times in all cases using -O3 (gcc 4.4.1).

My results posted thus far were with gcc 4.0.1 (and glibc 2.3.5).
When I repeat the experiment, on a newer/faster system, with
gcc 4.4.1 (and glibc 2.10.1), I also find that -O3 is *almost*
always faster (as for the one exception -- perhaps I need to rerun
the test, in case there was something else going on on the machine
that skewed results). Results below.
<snip>
I assume these results are for 4004 byte long strings with 2 short
replacements.

No, this is with my full test suite (such as it is):

4004-byte input, 2 short replacements (2/2)

4020-byte input, 10 short replacements (2/2)

4100-byte input, 50 short replacements (2/2)

4500-byte input, 50 slightly-longer replacements (10/10)

(Differences in input length are because I was lazy in coding
the generate-test-data function -- its inputs are the length of
the unchanged text, the length of the old/new text, and how many
times to repeat .... )

I've been posting total times only to save space and haven't been
looking carefully at whether code that's faster overall is also
faster for each individual test.

Results with gcc 4.4.1, as described above .... :

bacarisse (O2) 1.74 seconds
bacarisse (O3) 1.75 seconds
blmblm (O2) 2.52 seconds
blmblm (O3) 2.48 seconds
harter-1 (O2) 2.58 seconds
harter-1 (O3) 2.49 seconds
harter-2 (O2) 2.27 seconds
harter-2 (O3) 2.24 seconds
io_x (O2) 9.82 seconds
nilges (O2) 2.36 seconds
nilges (O3) 2.35 seconds
thomasson (O2) 1.69 seconds
thomasson (O3) 1.68 seconds
willem (O2) 2.77 seconds
willem (O3) 4.16 seconds [ can this be right?! ]
 

blmblm

[ snip ]
And note that "using strstr" has its own dangers. IT FINDS OVERLAPPING
STRINGS. If you use it to construct a table of replace points you're
gonna have an interesting bug-o-rama:

replace("banana", "ana", "ono")

IF you restart one position after the find point, and not at its end.

Why would you do that, though? Only if you *wanted* to detect
overlapping strings, and -- if you did detect them, what would
you do with them? I can't think of any sensible definition of
"replace" that does anything with overlapping strings [*], so
when I wrote my first solution to this problem, I of course used
strstr and of course started scanning for the next match after
the end of the previous match.

[*] Chris Thomasson's reply to your post points out the ambiguities.
Moral: don't let the library do your thinking for you.


Mostly I'm replying to this rather old post, though, because it
seems as good a place as any to attempt a more-or-less formal
specification of the problem, which I'm not sure we have, and
which might be interesting. (Apologies if I missed one somewhere.)

Here's my proposed specification, in which "is not a substring of"
and "concat" have what I hope are obvious meanings, and names
beginning s_ denote strings:

replace(s_input, s_old, s_new) yields

if s_old is not a substring of s_input

s_input

else

concat(s_input_prefix, s_new, replace(s_input_tail, s_old, s_new))

where s_input_prefix and s_input_tail are such that

s_input = concat(s_input_prefix, s_old, s_input_tail)

and

s_old is not a substring of s_input_prefix

Somewhat more formal than the "scans left to right and ignores
overlapping strings", though whether it's actually clearer might
be a matter of opinion.

And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.

[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.
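To check that the specification above is implementable as stated, here is a
minimal C sketch of it, iterative rather than recursive, using strstr and
resuming after the end of each match so that overlapping occurrences are
ignored. The function name and two-pass structure are my own choices; it
assumes s_old is non-empty and returns a malloc'd string the caller frees:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the replace() specification above.  Assumes s_old is
   non-empty; returns a freshly malloc'd string, or NULL on failure. */
char *replace(const char *s_input, const char *s_old, const char *s_new)
{
    size_t old_len = strlen(s_old), new_len = strlen(s_new);
    size_t in_len = strlen(s_input), count = 0;
    const char *p;
    char *result, *out;

    /* First pass: count non-overlapping matches, left to right. */
    for (p = s_input; (p = strstr(p, s_old)) != NULL; p += old_len)
        count++;

    result = malloc(in_len - count * old_len + count * new_len + 1);
    if (result == NULL)
        return NULL;

    /* Second pass: copy prefix, substitute, resume after the match. */
    out = result;
    while ((p = strstr(s_input, s_old)) != NULL) {
        size_t prefix = (size_t)(p - s_input);
        memcpy(out, s_input, prefix);  out += prefix;
        memcpy(out, s_new, new_len);   out += new_len;
        s_input = p + old_len;         /* skip past the whole match */
    }
    strcpy(out, s_input);              /* unchanged tail */
    return result;
}
```

On the thread's example, replace("banana", "ana", "ono") yields "bonona",
the left-to-right, non-overlapping reading.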
 

spinoza1111

spinoza1111  said:
How does that address my question (about the claim that Seebs's
proposed approach to your challenge was "unethical behavior")?

Because he hasn't once shown us he can code to the problem statement
and not use string.h. Whilst calling people undeserved names.
[ snip ]




It contains a power of two in each element; in that scenario you
wouldn't use * you would use shift. Look, I was just having a little
fun with Seebie.

And giving the impression that perhaps you weren't quite sure what
matrix multiplication was.  Well, whatever.

No, I think I do. A matrix can be multiplied by a scalar, a vector, or
a matrix. Basic finite math. It's been years but I was awake, and I
respected my professors.
I'm having a little trouble thinking of situations in which I would
use the shift operator when the desired operation is conceptually a
multiplication (as opposed to a bit shift); to me this seems like

It's a common compiler optimization in fact.
a micro-optimization that would make the code more difficult to

Not to me. I do it all the time. Shifting n bits to the left, without
rotating, is how you multiply by 2**n; shifting n bits to the right,
without rotating, is how you multiply by 2**(-n). It even works when n
is zero.
understand without necessarily making it faster.  *Maybe* if I had
evidence that the speed of this particular operation was critical,
and that the compiler was not already producing shift instructions
(as, experiment suggests, happens at least sometimes) ....

Ah, so you know (of course). See, I was fucking with Seebach to see if
he could think for a change outside of the box. I know that "tools" do
all this stuff, but someone's got to reinvent and maintain the wheel.
Otherwise...flat tires.
I'm having even more trouble thinking of how one might express
multiplication by a matrix all of whose elements are powers of two
(would the entries be the exponents? -- i.e., if element a[0][0]
was meant to represent 16, would one encode that as 16, or as 4?),
but -- whatever.

No, I just want to multiply every element, no matter what its value,
by a power of two:

for(i = 0; i < rows; i++)
for(j = 0; j < columns; j++)
matrix[i][j] = matrix[i][j] << n;

Think that's right. Sure, the compiler MIGHT use some fairly
sophisticated analysis to know that in

matrix[j] *= 2**n;

the operation can be a shift, but it must know whether n is greater
than or less than zero. If n is unsigned, cool, but that's one more
thing for the compiler to worry about.
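The equivalence being argued over is easy to check mechanically. A tiny
sketch (the helper name is mine), valid for non-negative n where the result
fits in the type:

```c
/* Multiply x by 2**n via a left shift -- the transformation the
   compiler is expected to make.  Valid while the product fits in
   the type and n is less than the width of unsigned. */
unsigned mul_pow2(unsigned x, unsigned n)
{
    return x << n;
}
```

For example, mul_pow2(7, 3) is 56, the same as 7 * 8, and mul_pow2(x, 0)
leaves x unchanged, as noted above.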

I'm no big fan of these micro-optimizations, however, because I prefer
a nobler goal, which is the complete destruction of the very idea of
terminating a string with a Nul. Basically, I was messing with Seeb's
head.
 

Seebs

Somewhat more formal than the "scans left to right and ignores
overlapping strings", though whether it's actually clearer might
be a matter of opinion.

I don't think it is.

I would have specced it as:
echo "$STRING" | sed -e "s/$OLD/$NEW/g"

With the caveat that you have to overlook questions like "what if $OLD has
slashes in it".

But yeah, that seems to be the same spec. And really, the overlapping
strings question is just plain stupid. The output of a replacement isn't
rescanned, and the replacement of a given substring removes its content
from further consideration, so there is nothing to do about an alleged
overlapping string. Looking for "aba" in "dababa", you find "d*aba*ba",
you replace it with whatever, and what's left is "ba". Nothing special
or complicated.

-s
 

Richard Bos

Ben Pfaff said:
Funny, I see posters asking for code to do dumb things like this
regularly.

Yes, there is an astounding number of incompetent programming teachers,
and an even greater number of even worse programming students, out
there. (Many of them, apparently, in India - that country's programming
culture does not match the joyfulness of its kitchen). But that does not
mean that we should take them, or spinoza1111, seriously.

Richard
 

Richard Bos

Nick Keighley said:
The scanf function is basically a mess, and is rarely used correctly. I
am at a loss to understand why it is introduced so early in programming
texts.

a former clc regular once posted

***
The fscanf equivalent of fgets is so simple
that it can be used inline whenever needed:-
char s[NN + 1] = "", c;
int rc = fscanf(fp, "%NN[^\n]%1[\n]", s, &c);
if (rc == 1) fscanf(fp, "%*[^\n]%*c");
if (rc == 0) getc(fp);
***
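(As posted, that snippet has two bugs: the second fscanf is missing its fp
argument and closing quote, and %1[\n] writes a terminating null, so it needs
a two-byte buffer rather than a single char. A corrected sketch, with the NN
placeholder fixed at 80 for illustration and a function name of my own:)

```c
#include <stdio.h>

/* The "fscanf as fgets" idiom from the post, corrected.  Reads up to
   80 characters of one line into s and consumes the rest of the line.
   Returns 1 if a line (possibly empty) was read, 0 at end of input. */
int read_line(FILE *fp, char s[81])
{
    char nl[2];    /* %1[\n] stores the char plus a terminating null */
    int rc;

    s[0] = '\0';
    rc = fscanf(fp, "%80[^\n]%1[\n]", s, nl);
    if (rc == EOF)
        return 0;                     /* nothing left to read */
    if (rc == 1)
        fscanf(fp, "%*[^\n]%*c");     /* long line: discard the rest */
    if (rc == 0)
        getc(fp);                     /* empty line: eat the '\n' */
    return 1;
}
```

Even corrected, this is arguably harder to get right than fgets, which
rather supports the point about scanf being introduced too early.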

That sounds suspiciously like either Dan Pop, who had a bee in his
bonnet regarding scanf() (amongst others), or like someone employing
irony at said Mr. Pop.
I think scanf() is seen as a straightforward way to read simple
unvalidated input. I'm not convinced that's a good idea.

It's a very bad idea. sscanf() can be a reasonable way to read simple
_validated_ input. Unvalidated, none of that family is usable.

Richard
 

Chris M. Thomasson

In article <0.0ddf9db924489b8a3417.20100222180827GMT.87wry5b0tg.fsf@bsb.me.uk>, [...]

I've been posting total times only to save space and haven't been
looking carefully at whether code that's faster overall is also
faster for each individual test.

Results with gcc 4.4.1, as described above .... :

bacarisse (O2) 1.74 seconds
bacarisse (O3) 1.75 seconds
blmblm (O2) 2.52 seconds
blmblm (O3) 2.48 seconds
harter-1 (O2) 2.58 seconds
harter-1 (O3) 2.49 seconds
harter-2 (O2) 2.27 seconds
harter-2 (O3) 2.24 seconds
io_x (O2) 9.82 seconds
nilges (O2) 2.36 seconds
nilges (O3) 2.35 seconds
thomasson (O2) 1.69 seconds
thomasson (O3) 1.68 seconds
willem (O2) 2.77 seconds
willem (O3) 4.16 seconds [ can this be right?! ]

FWIW, I always try to "prime" the program by making several dummy runs over the
same data set. After that, I time several iterations of it on the same data
set, and report the average. Humm, when I get some time, I think I will
create another "entry" that uses a non-naive sub-string search algorithm:


http://groups.google.com/group/comp.lang.c/msg/825cf5bd46ad4a9f


I think it will be interesting to see how it affects your timing results.
 

spinoza1111

Yes, there is an astounding number of incompetent programming teachers,
and an even greater number of even worse programming students, out
there. (Many of them, apparently, in India - that country's programming
culture does not match the joyfulness of its kitchen). But that does not

In fact, I've worked at IBM and elsewhere with Indian developers, and
I've trained them in Fiji. They work harder, more intelligently and
have more intellectual curiosity than white American developers, who
are ready as we see here to defend their crap to the death rather than
learn. It is racist to praise their cooking and condemn their
programming, because they are on average better at programming than
most people here.
 

Seebs

Race appears to have little or nothing to do with programming ability,
but to say that most Indian programmers are dreadful at programming is
perfectly accurate - India is not immune from Sturgeon's Law.

People who have dealt with outsourcing are very likely to have encountered
a disproportionately bad sample of programmers from some other country.
I have not known anyone who has had particularly good experiences with
"cheap offshore coders", and I have oodles of horror stories.

On the other hand, I have a largeish number of Chinese coworkers. They're
nearly all novice-level or newbie programmers, but they're good people to
work with -- a huge step up from the outsourcing place we used to use.

I think a big part of the problem is that people are carefully set up to
fail -- I'm a pretty decent programmer by many standards, and I could not
produce adequate code from some of the specs I've seen. Giving these
tasks to novice programmers who have to do a full day's work without any
chance to ask for clarification or feedback... Well, I can imagine ways
this could work poorly.

-s
 

spinoza1111

[ snip ]
And note that "using strstr" has its own dangers. IT FINDS OVERLAPPING
STRINGS. If you use it to construct a table of replace points you're
gonna have an interesting bug-o-rama:
replace("banana", "ana", "ono")
IF you restart one position after the find point, and not at its end.

Why would you do that, though?  Only if you *wanted* to detect
overlapping strings, and -- if you did detect them, what would
you do with them?  I can't think of any sensible definition of
"replace" that does anything with overlapping strings [*], so
when I wrote my first solution to this problem, I of course used
strstr and of course started scanning for the next match after
the end of the previous match.

[*] Chris Thomasson's reply to your post points out the ambiguities.
Moral: don't let the library do your thinking for you.

Mostly I'm replying to this rather old post, though, because it
seems as good a place as any to attempt a more-or-less formal
specification of the problem, which I'm not sure we have, and
which might be interesting.  (Apologies if I missed one somewhere.)

Here's my proposed specification, in which "is not a substring of"
and "concat" have what I hope are obvious meanings, and names
beginning s_ denote strings:

replace(s_input, s_old, s_new) yields

if s_old is not a substring of s_input

  s_input

else

  concat(s_input_prefix, s_new, replace(s_input_tail, s_old, s_new))

  where s_input_prefix and s_input_tail are such that

    s_input = concat(s_input_prefix, s_old, s_input_tail)

  and

     s_old is not a substring of s_input_prefix

Somewhat more formal than the "scans left to right and ignores
overlapping strings", though whether it's actually clearer might
be a matter of opinion.

And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.

[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.
 

spinoza1111

[ snip ]
And note that "using strstr" has its own dangers. IT FINDS OVERLAPPING
STRINGS. If you use it to construct a table of replace points you're
gonna have an interesting bug-o-rama:
replace("banana", "ana", "ono")
IF you restart one position after the find point, and not at its end.

Why would you do that, though?  Only if you *wanted* to detect

Search me. But that's what the code I was discussing actually did.
overlapping strings, and -- if you did detect them, what would
you do with them?  I can't think of any sensible definition of
"replace" that does anything with overlapping strings [*], so

replace(banana, ana, ono) could equal

bonona going left to right without overlap
banono going right to left without overlap
bonono going both ways with overlap

The third case could arise in natural language processing.

Suppose that in some language, the ana sound is transformed into the
ono sound to transform present into past tense (weirder things
happen), and suppose speakers do this to ALL occurrences of the three
tones a, voiced n, a. When the sounds are adjacent they are
nonetheless distinct in speech but not in writing.

Now, the response of most garden-variety "break room" programmers is
"that's bullshit, and can never happen". But we know that in
programming, many strange things can happen, and that as Hamlet
admonished Horatio, we must "as a stranger give it welcome". Many more
strange things can happen outside programming, and programmers, even
of the Hooters or break room ilk, better realize this when programming
is used to solve problems.
when I wrote my first solution to this problem, I of course used
strstr and of course started scanning for the next match after
the end of the previous match.

[*] Chris Thomasson's reply to your post points out the ambiguities.
Moral: don't let the library do your thinking for you.

Mostly I'm replying to this rather old post, though, because it
seems as good a place as any to attempt a more-or-less formal
specification of the problem, which I'm not sure we have, and
which might be interesting.  (Apologies if I missed one somewhere.)

Here's my proposed specification, in which "is not a substring of"
and "concat" have what I hope are obvious meanings, and names
beginning s_ denote strings:

replace(s_input, s_old, s_new) yields

if s_old is not a substring of s_input

  s_input

else

  concat(s_input_prefix, s_new, replace(s_input_tail, s_old, s_new))

  where s_input_prefix and s_input_tail are such that

    s_input = concat(s_input_prefix, s_old, s_input_tail)

  and

     s_old is not a substring of s_input_prefix

This is fine as long as we understand your concat as NOT specifying
left to right or right to left direction. The bonono problem would
have to be handled by preprocessing (translating banana into banaana,
perhaps using a rule that "vowels" (in the language) can always be
duplicated because their sounds don't break).

Somewhat more formal than the "scans left to right and ignores
overlapping strings", though whether it's actually clearer might
be a matter of opinion.

And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.

How about "does not use string.h and gets rid of the Obscene
Excrescence: the use of Nul to terminate a string, thereby creating a
new C that is almost fit for use".

[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.

No. The API locks us into bad thoughts.
 

spinoza1111

People who have dealt with outsourcing are very likely to have encountered
a disproportionately bad sample of programmers from some other country.
I have not known anyone who has had particularly good experiences with
"cheap offshore coders", and I have oodles of horror stories.

So do most incompetent programmers. The reason being that they are not
only bad at coding, they also do a bad job of problem specification,
and as technically educated people, they know nothing of the culture
of the offshore programmers, apart from racist
stereotyping...including the patronizing admiration for cuisine, which
relegates the people of the land, and cultures that are normally far
older than Europe, to a feminized, subordinate and servant role.

On the other hand, I have a largeish number of Chinese coworkers.  They're
nearly all novice-level or newbie programmers, but they're good people to
work with -- a huge step up from the outsourcing place we used to use.

I think a big part of the problem is that people are carefully set up to
fail -- I'm a pretty decent programmer by many standards, and I could not

Not by mine.
 

blmblm

[ snip ]
No, I think I do. A matrix can be multiplied by a scalar, a vector, or
a matrix. Basic finite math. It's been years but I was awake, and I
respected my professors.


It's a common compiler optimization in fact.


Not to me. I do it all the time. Shifting n bits to the left, without
rotating, is how you multiply by 2**n; shifting n bits to the right,
without rotating, is how you multiply by 2**(-n). It even works when n
is zero.


Ah, so you know (of course).

You may give me too much credit. I do have advanced degrees
in CS, but my undergraduate degree is in another field, and --
well, for one reason and another there are courses that probably
would enhance my understanding of my field but that I simply have
never taken, and the list of things I wish I knew more about (and
arguably *SHOULD* know more about) is -- long.
See, I was fucking with Seebach to see if
he could think for a change outside of the box.

As best I can tell, he doesn't read anything you're posting here
unless someone else quotes it -- or at least he doesn't reply,
so your taunting .... ah well, I suppose now that I've quoted
you he may choose to reply. Or not.

I would be very surprised to hear that anyone who talks about
code as he does would *not* be aware of this particular trick.
Inventing it for oneself might require "thinking outside the box";
being aware that it exists -- not so much.
I know that "tools" do
all this stuff, but someone's got to reinvent and maintain the wheel.
Otherwise...flat tires.
I'm having even more trouble thinking of how one might express
multiplication by a matrix all of whose elements are powers of two
(would the entries be the exponents? -- i.e., if element a[0][0]
was meant to represent 16, would one encode that as 16, or as 4?),
but -- whatever.

No, I just want to multiply every element, no matter what its value,
by a power of two:

*But this is not matrix multiplication* -- or at least, not as I
define the term. Curiously enough, however, Wikipedia seems to
agree with you that "matrix multiplication" can mean multiplying
a matrix by either another matrix or a scalar. So my attempt to
nitpick here may be off-base. Of course, elsethread you don't
seem to find Wikipedia a credible source. "Whatever."
for(i = 0; i < rows; i++)
for(j = 0; j < columns; j++)
matrix[i][j] = matrix[i][j] << n;

Think that's right. Sure, the compiler MIGHT use some fairly
sophisticated analysis to know that in

matrix[j] *= 2**n;


"**" for exponentation is Fortran, not C. (As you may know.)
the operation can be a shift, but it must know whether n is greater
than or less than zero. If n is unsigned, cool, but that's one more
thing for the compiler to worry about.

gcc seems to be reasonably smart about generating shifts rather
than multiplies if n is something that's known at compile time,
even though (as far as I know) one must use the "pow" function
to express exponentiation in C.
I'm no big fan of these micro-optimizations, however, because I prefer
a nobler goal, which is the complete destruction of the very idea of
terminating a string with a Nul. Basically, I was messing with Seeb's
head.

I don't plan to assist you with reaching either goal -- not on purpose
anyway. (Hm.)
 
