Help: what is the quickest way to find out whether a string contains another string?

T

tuweiwen

Hi all, can anyone give me a hint on what approach I should take to
this simple problem: what is the quickest way to find out whether a
string contains another string? For example,

String a = "a very long string, UP TO 100k";
String b = "a string of two or three words";

//---- now I need to write this function
boolean a_contains_b (String a, String b) {
//I don't need to worry about the location of b, just want to know
//whether a has b inside it. Returns true or false
}

This function will be called Thousands of times in an app.

What's best way to handle this to achieve best performance?

Currently I am using "if (a.indexOf(b) !=-1) " in the above
"a_contains_b()" function. But I don't like it at all. It feels dumb,
isn't it? Do I have the wrong approach at the outset?

Thanks a lot. I appreciate your help.
 
B

Brandon McCombs

Hi all, can anyone give me a hint on what approach I should take to
this simple problem: what is the quickest way to find out whether a
string contains another string? For example,

String a = "a very long string, UP TO 100k";
String b = "a string of two or three words";

//---- now I need to write this function
boolean a_contains_b (String a, String b) {
//I don't need to worry about the location of b, just want to know
//whether a has b inside it. Returns true or false
}

This function will be called Thousands of times in an app.

What's best way to handle this to achieve best performance?

Currently I am using "if (a.indexOf(b) !=-1) " in the above
"a_contains_b()" function. But I don't like it at all. It feels dumb,
isn't it? Do I have the wrong approach at the outset?

Thanks a lot. I appreciate your help.

Unless I'm missing something wouldn't the contains() method of String
class be what you want??

boolean contains(CharSequence s)
Returns true if and only if this string contains the
specified sequence of char values.
 
R

Rhino

Hi all, can anyone give me a hint on what approach I should take to
this simple problem: what is the quickest way to find out whether a
string contains another string? For example,

String a = "a very long string, UP TO 100k";
String b = "a string of two or three words";

//---- now I need to write this function
boolean a_contains_b (String a, String b) {
//I don't need to worry about the location of b, just want to know
//whether a has b inside it. Returns true or false
}

This function will be called Thousands of times in an app.

What's best way to handle this to achieve best performance?

Currently I am using "if (a.indexOf(b) !=-1) " in the above
"a_contains_b()" function. But I don't like it at all. It feels dumb,
isn't it? Do I have the wrong approach at the outset?

Thanks a lot. I appreciate your help.
One approach would be to use "regular expressions", often abbreviated
"regexp". If you look at the Pattern class in the J2SE API, you'll find an
overview describing how to use them. A full tutorial on regular expressions
can be found in the online version of the Java Tutorial at
http://java.sun.com/docs/books/tutorial/extra/regex/index.html.

Rhino
 
R

Roedy Green

Hi all, can anyone give me a hint on what approach I should take to
this simple problem: what is the quickest way to find out whether a
string contains another string? For example,

an approach is to do a quick check before you scan.

You break your big string into words with a regex split and put each
word into a HashSet.

Then you break you short string into words with that same precompiled
regex and look up each word in turn in the HashSet. If any word is
not found, there is no point in further processing. If you do find all
three, then use indexOf or contains.

If you want to get fancy, use a hashMap to track the offset of the
FIRST ocurrence of each word. Then you take the min of the three
offsets for starting off your indexof search and the max of the three
for your stopping search point.
 
T

tuweiwen

Thanks a lot, guys. Roedy, I will try that approach later.

I did a little testing on the performance and got an interesting
result. Here is the program.


import java.util.*;
/*
This is an un-scientific, probably way off-mark, testing
comparing the speed of three functions:
String.contains(), String.indexOf() and String.matches().
Note that String.matches() use regular expression approach
according to the java doc

Result: the speeds of indexOf() and contains() are about the same,
but matches() is at least ten times slower.
Is regular expression bad in terms of performance?

Problems with this tesing:
1. The target string a is not long enough. As I said, in my app, it
can be up to 100k. The performance may change if string a gets bigger.
2. In all loops, string a and string b are the same. In my app, at the
least string a is always changing, such as newly downloaded web pages.
3. Right now, it always matches inside the loop. should do a randomized
String a and string b.

Testing machine: WinXP Pro, P3 500, 256MB Ram, JDK 1.5.0_05
*/

public class Test {

public static void main(String args[]) {

int loop = 1000000;

int matches = 0;

String a = "This is a testing string. It is not very long.";
String b = "is not";

long startTime = System.currentTimeMillis();

for (int i = 0; i < loop; i++) {
if (a.contains(b)) matches++; //took about 1620 miliseconds
//if (a.matches(".*not.*")) matches++; //took about 25200
miliseconds
//if (a.indexOf(b) != -1) matches++; //took about 1520 miliseconds
}

long endTime = System.currentTimeMillis();

System.out.println("\nRunning " + loop + " times took " + (endTime -
startTime) + " miliseconds.");
}

}



right now, this part of my app (processing the downloaded web pages)
takes lots of time. so i am looking for ways to speed it up.
 
T

tuweiwen

This time I use explicit regular expression instead of
String.matches(), and pre-compile the regex. Speed gets better a little
bit.
Just replace the old main() method with this new one. Also need import
java.util.regex.*; at the top.


public static void main(String args[]) {
int loop = 1000000;
int matches = 0;
String a = "This is a testing string. It is not very long.";
String b = "is not";

Pattern pat = Pattern.compile(".*is not.*");
Matcher mat;

long startTime = System.currentTimeMillis();

for (int i = 0; i < loop; i++) {
//if (a.contains(b)) matches++; //took about 1620 miliseconds
//if (a.indexOf(b) != -1) matches++; //took about 1520
miliseconds
if (pat.matcher(a).find()) matches++; //took about 18400 milis

}
long endTime = System.currentTimeMillis();
System.out.println("\nRunning " + loop + " times took " + (endTime -
startTime) + " miliseconds.");
}
 
T

Thomas Weidenfeller

What's best way to handle this to achieve best performance?

"best" always depends on the context, the word as such is rather
meaningless. I assume you mean fast. Then, contrary to other
suggestions, regexps are not a good choice at all. We have discussed
this in the past.

Your problem is a standard problem in CS, it has been extensively
studied and many textbooks discuss it. Look for the following keywords
in books, Google, etc.:

Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.

There are for sure more such algorithms.

/Thomas
 
C

Chris Uppal

Result: the speeds of indexOf() and contains() are about the same,
but matches() is at least ten times slower.
Is regular expression bad in terms of performance?

Yes.

Regexp matching is an order of magnitude more complex than a simple scan of the
string. Although the time (for normal regexps) is linear in the length of the
String, just as for naive scanning, the constants are significantly greater --
as you have found.

Another problem, by the way, is that if you are looking for a string like "this
is *not* easy", then you'll have to pre-process the search string to escape the
special characters.

You've already been pointed at better algorithms for searching strings than a
naive scan. I just wanted to add a few warnings. One is that you should
verify that string matching is /actually/ the bottleneck in your application
before wasing too much time looking for alternatives. Another is that all
these algorithms have significant start-up time, so you should be careful to
measure using representative data -- a string literal that you can type into a
Java program is /not/ representative of a 100K String. (BTW it's distinctly
unusual to have such long strings in a program, how do they come above ?) The
last is that if you are looking at Roedy's BMP implementation, that is
artificially restricted[*] to chararacters in the range 0...255 in both the
search and target strings.

([*] Or it was when I last looked, and the restriction is -- in fact --
justifiable, but it may not rule out that implementation for your purposes.)

-- chris
 
Z

zero

Your problem is a standard problem in CS, it has been extensively
studied and many textbooks discuss it. Look for the following keywords
in books, Google, etc.:

Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.

I find my knowledge of such standard algorithms to be quite limited. Can
you suggest a good source - online or in print - to broaden my knowledge?

TIA
zero
 
C

Chris Uppal

I find my knowledge of such standard algorithms to be quite limited. Can
you suggest a good source - online or in print - to broaden my knowledge?

The wikipedia article:

http://en.wikipedia.org/wiki/String_searching_algorithm

seems like a fair starting point on string searching. It has links to the
wikipedia entries on BMP and RK algorithms, and som external links too.

If you'd rather read a book (or would like to read a book too), then I
recommend Robert Sedgewick's algorithm books as particularly readable surveys
of algorithms in common use. His 1-volumne "Algorithms in <language>" books
have several pages on string searching, and /lots/ of other stuff that you
might find interesting. He's also in the process of putting out an expanded
N-volume version of "Algorithms in <language>", unfortunately only two volumes
have come out so far, and I think his expanded coverage of string searching
won't appear until in vol.3 comes out (which Amazon seem to think will be in
next July).

Knuth is, of course, the ultimate algorithm book, but Sedgewick is also an
authoritative and knowledgable specialist -- and his books are a /lot/ easier
to read being (a) better explained and motivated, (b) aimed at practising
programmers rather than (near-)mathematicians, (c) much shorter...

-- chris
 
T

tuweiwen

Thomas thanks for the tip. I went on to download and run Roedy's Boyer
program and it works great. I used quite large strings of Chinese
characters (target string of size 7KB and pattern string of 50
character-long, to avoid it falling back to String.indexOf() . I am
running a Chinese version of WinXP) to test it, and it works too.
Thanks Roedy.

Chris, the strings are downloaded web pages in memory. What my app (a
spider) does is to find out if a page contains one of several pattern
strings, if it is, then it's saved to disk. I would like to know if my
current approach is an efficient one.
 
R

Roedy Green

Result: the speeds of indexOf() and contains() are about the same,
but matches() is at least ten times slower.
Is regular expression bad in terms of performance?

Don't you believe your own experiment?

In general the more specialised a tool is to the task at hand the
faster it will be.
 
C

Chris Smith

Rhino said:
One approach would be to use "regular expressions", often abbreviated
"regexp". If you look at the Pattern class in the J2SE API, you'll find an
overview describing how to use them.

Why would anyone want to use regular expressions to look for a plain
String? Regular expressions are a tool for deciding whether a String or
some subset thereof matches a defined regular language. Although string
searching is a trivial degeneration of the problem addressed by regular
expressions, there's no need to introduce the added complexity.

If you are set on using regular expressions, converting the arbitrary
input string into a pattern that matches only itself is the first task.
Prior to Java 1.5, this was very difficult. In 1.5, it's encapsulated
in a method called Pattern.quote(String). Once that's done, you can
search for that pattern. It's far easier, though, to use String.indexOf
(prior to 1.5) or String.contains (as of 1.5).

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
C

Chris Smith

Chris, the strings are downloaded web pages in memory. What my app (a
spider) does is to find out if a page contains one of several pattern
strings, if it is, then it's saved to disk. I would like to know if my
current approach is an efficient one.

(Realizing your addressing your comment to Chris Uppal, but I'll respond
anyway. Hope you don't mind.)

In this case, you can be practically guaranteed that String matching is
not your bottleneck. It will be much faster than retrieving those same
strings over a network connection.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
R

Rhino

Chris Smith said:
Why would anyone want to use regular expressions to look for a plain
String? Regular expressions are a tool for deciding whether a String or
some subset thereof matches a defined regular language. Although string
searching is a trivial degeneration of the problem addressed by regular
expressions, there's no need to introduce the added complexity.

If you are set on using regular expressions, converting the arbitrary
input string into a pattern that matches only itself is the first task.
Prior to Java 1.5, this was very difficult. In 1.5, it's encapsulated
in a method called Pattern.quote(String). Once that's done, you can
search for that pattern. It's far easier, though, to use String.indexOf
(prior to 1.5) or String.contains (as of 1.5).
I didn't say regular expressions were the _best_ way, just that they were
an option.

The original poster mentioned that he was searching very long strings for
other strings. I don't know if regular contains() or similar methods have
length limitations and I've never looked into whether the more traditional
methods are more or less efficient than techniques based on regular
expressions. Offhand, I suspected that regular expressions might be more
efficient on longer strings but the other replies to this thread indicate
that regular expressions are actually _less_ efficient that contains() or
its brethren.

Therefore, I learned something along the way, too.

Rhino
 
A

Alan Moore

I'd like to add a note about the inefficiency of regexes. The regex
".*is not.*" is inherently slow, and not a fair test of regex speed.
Adding the dot-stars before and after the real search string is a hack
for when you can only use String#matches(). When using the find()
method, you can get rid of them and the search will go much faster.
When I ran the OP's second test program with the regex "is not", it
took 1052ms (compared to 4486ms for ".*is not.*"). That was still
three times as long as the contains() or indexOf() approaches, so I'm
not pushing for regex use here. Just setting the record straight.
 
H

Hiran Chaudhuri

Thomas Weidenfeller said:
"best" always depends on the context, the word as such is rather
meaningless. I assume you mean fast. Then, contrary to other suggestions,
regexps are not a good choice at all. We have discussed this in the past.

Agree, best is ambiguous. But the message subject line contains 'quickest',
so the question must be runtime.

Hiran
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,226
Members
46,815
Latest member
treekmostly22

Latest Threads

Top