Fast String search algorithm

Roedy Green · Dec 16, 2011

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.
--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.

Tom Anderson · Dec 16, 2011

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?
http://en.wikipedia.org/wiki/Inverted_index

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Exactly that!

Similarly, I was wondering how you might build an index to all the files
on a computer so that you could find one just by giving a wild card or
partial filename.

There is a standard utility on unix that does just that:

http://en.wikipedia.org/wiki/Locate_(Unix)

The source is in C, but might be informative (this is for the 'mlocate'
version):

https://fedorahosted.org/mlocate/browser/src

tom

Tassilo Horn · Dec 16, 2011

Roedy Green said:
Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

Sure. See http://en.wikipedia.org/wiki/Substring_index

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.

On UNIXes, that's exactly what the locate tool does. Then you can
search with globbing wildcards

% # all files containing the substring foo
% locate *foo*

or even with regular expressions

% # all files that end with "r.c" or "r.cpp"
% locate -r '.*r\.$c\|cpp$$'

Bye,
Tassilo

Gene Wirchenko · Dec 16, 2011

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

Look up the Boyer-Moore string search algorithm.

[snip]

Sincerely,

Gene Wirchenko

glen herrmannsfeldt · Dec 16, 2011

Gene Wirchenko said:
On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green

Look up the Boyer-Moore string search algorithm.

Well, Boyer-Moore still scans linearly, but just does it faster.

Still, you could use something Boyer-Moore related if you know which
characters were in each record. If a record didn't have a character
in your substring, then it couldn't have the substring.

Still, understand Boyer-Moore before you get much farther
with string searching.

-- glen

Roedy Green · Dec 16, 2011

On UNIXes, that's exactly what the locate tool does. Then you can
search with globbing wildcards

i am trying to talk the JPSoft people into adding a locate feature
into Take Command. I wanted to be able to tell them roughly how to
implement it.

They file descriptions, and fast change directory give partial
directory names. I hoped they could come up with an integrated system
that was thread safe. (The current ones locks everybody out during
refresh).
--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.

Arne Vajhøj · Dec 17, 2011

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.

Use a database that support full text search.

Arne

Lew · Dec 17, 2011

Arne said:
Use a database that support full text search.

Perhaps
<http://www.google.com/search?q=latent+semantic+indexing>

Latent semantic indexing, or LSI, is an algorithm that maps word associations
culled from document bases into cartographic clusters in n-space, then projects
those clusters into m-space, m < n. The Wikipedia article (found in the
referenced search) describes it in terms of the underlying matrix operations,
which is great fun for those who like such things but loses the forest for the
trees. I think of it as dusty blobs floating in infospace (like cyberspacebut
doesn't need computers) with bright spots where terms (words, phrases, other
semantic molecules) live near each other, like "driver" not far from "wedge".
With lots of documents and terms, you have a lot of points of light, m times n
dimensions. The reduction brings it down to about one to three hundred
dimensions, but it's still a hyperspatial map (in the cartographic sense).

It is compute-intensive to map out all those correlations but makes for not
only a fast search, but one that seems almost magical in its ability to pull
meaningful answers out of the morass. The term "latent semantic indexing"
refers to how all that dusty matrix math ("indexing") pulls out the meaning
("semantic") lurking in the way words coalesce in meaningful documents
("latent"). The correlations capture the correlations in the content that
embody the linguistic meaning, so search results correlate quite well to
semantics in the query's meaning.

To a person fond of maths, information science, library science, linguistics,
semiotics, cartography, AI or document management LSI is a delight.

markspace · Dec 17, 2011

http://en.wikipedia.org/wiki/Inverted_index

An implementation of an inverted index is often called a search engine.

<http://en.wikipedia.org/wiki/Search_engine_(computing)>

One of the best known implementations is a web search engine. Alt
Vista, Yahoo, and Google are well known examples.

There's a Java based search engine I happen to know about, Lucene.

<http://lucene.apache.org/java/docs/index.html>

This is a roll-your-own library, but it has a lot of features. Beyond a
cursory reading of the documentation I can't really say much about it
unfortunately.

Daniel Pitts · Dec 18, 2011

An implementation of an inverted index is often called a search engine.

<http://en.wikipedia.org/wiki/Search_engine_(computing)>

One of the best known implementations is a web search engine. Alt Vista,
Yahoo, and Google are well known examples.

There's a Java based search engine I happen to know about, Lucene.

<http://lucene.apache.org/java/docs/index.html>

This is a roll-your-own library, but it has a lot of features. Beyond a
cursory reading of the documentation I can't really say much about it
unfortunately.

Solr is a webservice(ish) program which is based around Lucene.

<http://lucene.apache.org/solr/>

Travers Naran · Dec 18, 2011

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.

In addition to the other links mentioned, here are some other links I
found helpful for the same problem:

http://en.wikipedia.org/wiki/Bloom_filter
http://en.wikipedia.org/wiki/Bitap_algorithm
http://johannburkard.de/software/stringsearch/

The Bloom Filter can be used to help reduce the set of files to search.

Fast multi-string search algorithm	1	Sep 27, 2011
FOSS or Freeware, Prefferably Runs on Linux Mint: Search US Goverment Records, Legally to Find Literarary Work	8	Apr 5, 2023
two new JDKs	7	Dec 12, 2011
System.exit	11	Dec 13, 2011
ant and -bootclasspath	5	Dec 17, 2011
Thread question	14	Nov 30, 2011
Deployjava.js	2	Dec 8, 2011
Patricia trie vs binary search.	32	May 25, 2012

Fast String search algorithm

Roedy Green

Tom Anderson

Tassilo Horn

Gene Wirchenko

glen herrmannsfeldt

Roedy Green

Arne Vajhøj

Lew

markspace

Daniel Pitts

Travers Naran

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads