what the better way to find a "string" ?

J

James Kuyper

So when i need find a "string" i use "strstr()", but this form not seems better form...

Some peoples talk me about bitap algorithm:
http://en.wikipedia.org/wiki/Bitap_algorithm

That page describes the algorithm as being designed to find approximate
string matches. strstr() is required to provide an exact match, which is
a simpler problem. If an exact match is what you're looking for,
strstr() is likely to be faster (depending the competence of the
implementors of the C standard library that you're using).

That page also describes a variant of the Bitap algorithm specialized
for exact matches. I've no idea how good that algorithm is, compared to
other available exact-match algorithms. However, if it is indeed a
superior algorithm, there's a decent chance that the strstr() function
provided by the C standard library installed on your machine might
actually already be using that algorithm.
Someone know another method to find string ?
(my priority of research is fast algorithms)

Follow the link on that Wikipedia page to "string searching" for an
article on known algorithms.
 
K

Ken Brody

So when i need find a "string" i use "strstr()", but this form not seems better form...

Some peoples talk me about bitap algorithm:
http://en.wikipedia.org/wiki/Bitap_algorithm

Someone know another method to find string ?
(my priority of research is fast algorithms)

The bitap algorithm is not equivalent to strstr(), because:

The algorithm tells whether a given text contains a substring which
is "approximately equal" to a given pattern

Note the "approximately equal" qualifier.

Perhaps you want to start here?

http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm

Or here?

https://www.google.com/search?q=fast+substring+matching
 
E

Eric Sosman

So when i need find a "string" i use "strstr()", but this form not seems better form...

Some peoples talk me about bitap algorithm:
http://en.wikipedia.org/wiki/Bitap_algorithm

The C language specifies strstr() in terms of its behavior,
not in terms of any particular implementation. It is possible
that some implementations of strstr() might use a Bitap approach.
That seems unlikely, though, because Bitap needs extra memory
proportional to the length of the string to be matched -- and
since the second argument to strstr() could be of arbitrary length,
dynamic allocation would appear to be necessary. If an internal
malloc() call returned NULL or if the pattern string's length
exceeded whatever amount had been set aside in advance, strstr()
would be unable to execute Bitap and would have to fall back on
some other algorithm.
Someone know another method to find string ?
(my priority of research is fast algorithms)

There are at least three quite different situations you should
consider when assessing how "fast" your strstr() replacement is:

- Repeated searches for the same pattern in different strings,
like looking for ".co.uk" in thousands of collected URL's.
An algorithm that spends up-front time preprocessing the
pattern might perform well here.

- Repeated searches in the same string for different patterns,
like treating all of "War and Peace" as one string and looking
through it for various names or phrases. Preprocessing the
target string might be a win for this one.

- Searching for arbitrary patterns in arbitrary strings, without
foreknowledge of either. This is the strstr() design, more or
less: You hand it a pair of strings that may have nothing or
much to do with other pairs. Preprocessing is less likely to
pay off because you don't get to amortize its cost over many
searches. You might preprocess and memoize in hopes of seeing
one or the other string again, but you risk wasting the effort.

Finally, algorithms and implementations are "fast" or "slow"
in different senses. You can talk about the "time complexity" of
an algorithm, but that's not the same as talking about the "speed"
of an actual implementation on an actual machine.
 
K

Kaz Kylheku

So when i need find a "string" i use "strstr()", but this form not seems better form...

Some peoples talk me about bitap algorithm:
http://en.wikipedia.org/wiki/Bitap_algorithm

strstr is an API, not an algorithm.

So you're kind of comparing apples and oranges.

strstr could internally do any number of things: naive scan, KMP
(Knuth-Morrisson-Pratt), Boyer-Moore ...
Someone know another method to find string ?
(my priority of research is fast algorithms)

As has been noted, this algorithm describes semantics other than strstr, which
finds exact substrings. If your program depends on strstr, a bitap function
is not suitable, even if it has the same type signature: two const char *
parameters, and const cahr * return.

If you're after a (portably, predictably) fast strstr, start by looking into the
aforementioned algorithms: KMP, Boyer-Moore.
 
M

Malcolm McLean

So when i need find a "string" i use "strstr()", but this form not seems better form...



Some peoples talk me about bitap algorithm:

http://en.wikipedia.org/wiki/Bitap_algorithm

Someone know another method to find string ?
(my priority of research is fast algorithms)
Use a suffix tree if you want to search for many sub-strings in the same
set of strings (either one long string or a big collection of short ones).
 
G

glen herrmannsfeldt

The C language specifies strstr() in terms of its behavior,
not in terms of any particular implementation. It is possible
that some implementations of strstr() might use a Bitap approach.
That seems unlikely, though, because Bitap needs extra memory
proportional to the length of the string to be matched -- and
since the second argument to strstr() could be of arbitrary length,
dynamic allocation would appear to be necessary. If an internal
malloc() call returned NULL or if the pattern string's length
exceeded whatever amount had been set aside in advance, strstr()
would be unable to execute Bitap and would have to fall back on
some other algorithm.

OK. But, unless I forgot, Boyer-Moore doesn't have memory needs
that depend on the string sizes. So one could use Boyer-Moore.
There are at least three quite different situations you should
consider when assessing how "fast" your strstr() replacement is:
- Repeated searches for the same pattern in different strings,
like looking for ".co.uk" in thousands of collected URL's.
An algorithm that spends up-front time preprocessing the
pattern might perform well here.

But if implemented in strstr(), that preprocessing would be
done each time. That is one reason against using Boyer-Moore.

Someone might have only small strings to compare, such that an
algorithm that did preprocessing would be slower.
- Repeated searches in the same string for different patterns,
like treating all of "War and Peace" as one string and looking
through it for various names or phrases. Preprocessing the
target string might be a win for this one.

But usually a special case, where a general routine like strstr()
wouldn't apply. One, for example, might search DNA strings, where
only 'A', 'C', 'T', and 'G' occur. strstr() would still work, but
might be less than optimal.
- Searching for arbitrary patterns in arbitrary strings, without
foreknowledge of either. This is the strstr() design, more or
less: You hand it a pair of strings that may have nothing or
much to do with other pairs. Preprocessing is less likely to
pay off because you don't get to amortize its cost over many
searches. You might preprocess and memoize in hopes of seeing
one or the other string again, but you risk wasting the effort.

Are there known strstr() implementations, distributed for general
use (that is, not counting one someone wrote for only their use)
that use other than the usual brute-force search?
Finally, algorithms and implementations are "fast" or "slow"
in different senses. You can talk about the "time complexity" of
an algorithm, but that's not the same as talking about the "speed"
of an actual implementation on an actual machine.

Even more, the usual complexity big-Oh notation only applies in the
large N limit. In the case of strstr(), you have two lengths to
consider. Which one is N?

-- glen
 
M

Malcolm McLean

Even more, the usual complexity big-Oh notation only applies in the
large N limit. In the case of strstr(), you have two lengths to
consider. Which one is N?
With the obvious, straightforwards implementation of strstr, you're O(N)
in the haystack. Then assuming random strings, you've got an odd O in
the needle which is effectively O(1) (you've got 1/Nchars chance of a hit
in the first position, 1/Nchars in the second etc, so basically you're
going to terminate early). However we seldom want to search for a random
string in a random string. If you're looking for "theocratic" in English
text you're going to get a lot of hits on the sequence t-h-e. In degenerate
cases, that could totally kill performance.
 
E

Eric Sosman

OK. But, unless I forgot, Boyer-Moore doesn't have memory needs
that depend on the string sizes. So one could use Boyer-Moore.

I think it requires memory proportional to the length of
the "needle" (not the "haystack") string. In any event, a
strstr() implementation must eventually be able to use a method
requiring O(1) storage, in case it's unable to obtain more.
But if implemented in strstr(), that preprocessing would be
done each time. That is one reason against using Boyer-Moore.

Seems to me this would be the poster child for Boyer-Moore
(or another method that works by preprocessing the "needle").
You'd preprocess and memoize, and if you were fed the same needle
again you'd re-use the preprocessed stuff from before.

Still, I'd have to agree that this would probably not be a
splendid strstr() implementation. Were I conducting searches in
a pattern like this, I'd probably package it rather differently.
Someone might have only small strings to compare, such that an
algorithm that did preprocessing would be slower.


But usually a special case, where a general routine like strstr()
wouldn't apply. One, for example, might search DNA strings, where
only 'A', 'C', 'T', and 'G' occur. strstr() would still work, but
might be less than optimal.

"Less than optimal" -- Sounds like a polite understatement ...
Are there known strstr() implementations, distributed for general
use (that is, not counting one someone wrote for only their use)
that use other than the usual brute-force search?

Sorry; dunno.
Even more, the usual complexity big-Oh notation only applies in the
large N limit. In the case of strstr(), you have two lengths to
consider. Which one is N?

In the Bitap page to which the O.P. referred, the "needle" is
of length M and the "haystack" of length N. The running time (for
the exact-match version) is said to be O(M*N) -- which sounds bad,
but the implementation's constant factors are small. (The sample
implementation exhibited -- in not-so-great C, by the way --
disguises the M factor by insisting on M<=31 and using bit-mask
operations to do M steps at once.)
 
K

Kaz Kylheku

I think it requires memory proportional to the length of
the "needle" (not the "haystack") string. In any event, a
strstr() implementation must eventually be able to use a method
requiring O(1) storage, in case it's unable to obtain more.

Also, C programmers will tend to expect strstr to be reentrant.

I don't expect anything bad to happen if I call strstr in an asynchronous
signal handler or whatever, even if that went off in the middle of a call to
malloc, or another call to strstr or whatever.

Lok at POSIX: it gives no admonitions against strstr with regard to signal or
thread safety, and provides no "strstr_r" like is done with strtok and some
other functions.
 
M

Melzzzzz

So when i need find a "string" i use "strstr()", but this form not
seems better form...

Some peoples talk me about bitap algorithm:
http://en.wikipedia.org/wiki/Bitap_algorithm

Someone know another method to find string ?
(my priority of research is fast algorithms)


Cheers

Well I have made small benchmark of different substring search
algorithms and in general case best performers are SSE42 pcmpistri
pcmpestri instructions with brute force search. strstr and memmem are
pretty good, too. (gcc)
referent points are memmemopt which is optimized memmemsse2 (both in
asm) and strstrasm which is naive brute force asm.
strstrsse42 memmemsse42 are also asm routines that use sse42 string
instructions.

bmaxa@maxa:~/examples/assembler$ time ./strings
haystack alice.html 202167, needle (added to end) "this has to be found" 20

strstrsse42 0x7f0b697865cb 19.3
strstr 0x7f0b697865cb 31.0
memmemsse42 0x7f0b697865cb 38.6
strcasestr 0x7f0b697865cb 43.2
BM horspool 0x7f0b697865cb 58.2
bmhorspoolasm 0x7f0b697865cb 61.0
BM 0x7f0b697865cb 82.1
BM Turbo 0x7f0b697865cb 86.7
memmemsse2 0x7f0b697865cb 97.3
memmemopt 0x7f0b697865cb 103.9
strstrsse2 0x7f0b697865cb 106.2
memmem2 0x7f0b697865cb 181.9
memmem 0x7f0b697865cb 233.9
strstrasm 0x7f0b697865cb 262.1
string find 0x7f0b697865cb 270.1
KMP 0x7f0b697865cb 483.3
bitap 0x7f0b697865cb 5745.0

haystack 1048576 random data, needle 1024 random data

bmhorspoolasm (nil) 30.7
BM horspool (nil) 31.1
memmem (nil) 33.7
BM (nil) 38.5
BM Turbo (nil) 40.9
strstrsse42 (nil) 61.2
memmemopt (nil) 80.6
memmemsse2 (nil) 89.9
strstr (nil) 90.3
strstrsse2 (nil) 102.2
memmem2 (nil) 108.9
strcasestr (nil) 141.8
memmemsse42 (nil) 152.2
string find (nil) 689.0
strstrasm (nil) 827.9
KMP (nil) 1414.8
bitap (nil) 562742.6

haystack 1048576 random data, needle 7 random data

strstrsse42 (nil) 59.6
memmemopt (nil) 80.9
memmemsse2 (nil) 90.1
memmem2 (nil) 99.8
strstrsse2 (nil) 104.5
memmemsse42 (nil) 151.3
strstr (nil) 375.3
BM horspool (nil) 464.0
strcasestr (nil) 469.0
bmhorspoolasm (nil) 477.1
memmem (nil) 586.6
BM Turbo (nil) 633.2
BM (nil) 656.5
string find (nil) 684.3
strstrasm (nil) 825.2
KMP (nil) 1410.8
bitap (nil) 9032.7

haystack 1048593 8 byte repeated pattern, needle 9 byte data (begins with pattern)

strstrsse42 0x7f0b688f1030 345.0
BM horspool 0x7f0b688f1030 396.8
bmhorspoolasm 0x7f0b688f1030 410.5
memmemsse42 0x7f0b688f1030 449.8
memmem 0x7f0b688f1030 533.8
BM Turbo 0x7f0b688f1030 543.6
memmemsse2 0x7f0b688f1030 552.5
BM 0x7f0b688f1030 564.7
memmemopt 0x7f0b688f1030 573.1
strstrsse2 0x7f0b688f1030 585.3
strstr 0x7f0b688f1030 733.7
memmem2 0x7f0b688f1030 760.7
strcasestr 0x7f0b688f1030 910.5
string find 0x7f0b688f1030 995.7
strstrasm 0x7f0b688f1030 1677.3
KMP 0x7f0b688f1030 2748.7
bitap 0x7f0b688f1030 11363.7

haystack 1048580 1 byte repeated pattern, needle 4 byte data (begins with pattern)

strstrsse42 0x7f0b688f1028 393.4
strstr 0x7f0b688f1028 462.4
memmemsse42 0x7f0b688f1028 515.5
memmem 0x7f0b688f1028 529.9
strcasestr 0x7f0b688f1028 574.4
memmemopt 0x7f0b688f1028 799.2
bmhorspoolasm 0x7f0b688f1028 1787.4
BM horspool 0x7f0b688f1028 3190.6
KMP 0x7f0b688f1028 3380.5
string find 0x7f0b688f1028 3449.8
bitap 0x7f0b688f1028 3741.0
strstrasm 0x7f0b688f1028 4052.4
memmemsse2 0x7f0b688f1028 4443.1
BM Turbo 0x7f0b688f1028 4501.4
BM 0x7f0b688f1028 4565.5
strstrsse2 0x7f0b688f1028 4681.0
memmem2 0x7f0b688f1028 6867.1

haystack 1048576 1 byte repeated pattern, needle 4 byte data (all fail)

memmem2 (nil) 30.5
memmem (nil) 32.2
memmemopt (nil) 40.6
memmemsse2 (nil) 44.8
strstrsse42 (nil) 56.7
strstrsse2 (nil) 62.4
memmemsse42 (nil) 149.4
strstr (nil) 383.7
strcasestr (nil) 471.7
string find (nil) 646.4
strstrasm (nil) 793.4
BM horspool (nil) 801.0
bmhorspoolasm (nil) 805.7
BM (nil) 1133.2
BM Turbo (nil) 1326.4
KMP (nil) 1328.3
bitap (nil) 3114.4

cumulative list by time

strstrsse42 935.2
memmemsse42 1456.8
memmemopt 1678.2
memmem 1950.2
strstr 2076.4
strcasestr 2610.6
bmhorspoolasm 3572.6
BM horspool 4941.6
memmemsse2 5317.7
strstrsse2 5641.5
string find 6735.3
BM 7040.5
BM Turbo 7132.2
memmem2 8048.9
strstrasm 8438.3
KMP 10766.4
bitap 595739.5

cumulative list by place

strstrsse42 15
memmemsse42 36
memmemopt 37
BM horspool 37
memmem 38
strstr 39
bmhorspoolasm 40
memmemsse2 44
strcasestr 52
memmem2 57
strstrsse2 58
BM Turbo 60
BM 61
string find 77
strstrasm 82
KMP 89
bitap 96
 
E

Eric Sosman

Also, C programmers will tend to expect strstr to be reentrant.

Yeah, we tend to expect a lot -- sometimes, a lot more than
is actually promised. 7.1.4p4:

"The functions in the standard library are not guaranteed
to be reentrant and may modify objects with static or thread
storage duration. 188)"
I don't expect anything bad to happen if I call strstr in an asynchronous
signal handler or whatever, even if that went off in the middle of a call to
malloc, or another call to strstr or whatever.

"188) Thus, a signal handler cannot, in general, call
standard library functions."

Nonetheless, I admit to a feeling that some library functions
"should" be reentrant. Is there a good reason for strlen() to use
static storage? No, surely not -- so even though there's no such
promise, people (self included) will feel at liberty to call strlen()
in an asynchronous context. strchr()? memcpy()? Seems like the
same should apply. fopen(), malloc(), atexit()? Unnh, not likely.
qsort()? Some are reentrant, some are not, best not to ask ...

The C language does not guarantee reentrancy, nor even thread
safety, for strstr(). What we as "practical programmers" do with
that uncertainty is up to us -- and our employers, of course.
 
C

Cooler_x0a

Masters thanks 4 attention.

Melzzzzz
Nice research, benchmark this help me...

thanks very much!



cheers
 
8

88888 Dihedral

With the obvious, straightforwards implementation of strstr, you're O(N)

in the haystack. Then assuming random strings, you've got an odd O in

the needle which is effectively O(1) (you've got 1/Nchars chance of a hit

in the first position, 1/Nchars in the second etc, so basically you're

going to terminate early). However we seldom want to search for a random

string in a random string. If you're looking for "theocratic" in English

text you're going to get a lot of hits on the sequence t-h-e. In degenerate

cases, that could totally kill performance.

Well, there are variations of this
string search problem.

I'll give a more general form :

Perform a search of M strings
in a known fixed string of length N.

If there's no exact match found,
then the program can just return
"not-found".

A harder problem is to return
the closer match results by some
given criterias.
 
S

Stefan Ram

While you fine-tune your string search, the competition is
already releasing its product.

I'd like elaborate on that a little bit.

First, do not call strstr directly, but for all different
purposes use a wrapper. For example, to find a string in a
large text buffer define:

#include <string.h>
inline char * mylargebuffstrstr( const char * s1, const char * s2 )
{ return strstr( s1, s2 ); }

Now release version 0.5 with this implementation

This way, /if/ the string search /should/ be to slow, you can
always improve mylargebuffstrstr later, e.g., for version 1.0.
 
K

Kaz Kylheku

Yeah, we tend to expect a lot -- sometimes, a lot more than
is actually promised. 7.1.4p4:

"The functions in the standard library are not guaranteed
to be reentrant and may modify objects with static or thread
storage duration. 188)"

Good thing "thread-safe" isn't the same thing as "reentrant", or that C11
threading stuff would be doomed. :)

If strstr needs to malloc something, it can lock some big internal mutex around
malloc.
 
8

88888 Dihedral

Or use TLS, or just do a malloc and save the pointer in an automatic

(aka on the stack).

Well, is the O(M) algorithm optimal ?

Anyway O(M*ln(N) ) algorithms are
also useful for small M.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,228
Members
46,817
Latest member
AdalbertoT

Latest Threads

Top