Fast String search algorithm

R

Roedy Green

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.
--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 
T

Tom Anderson

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?
http://en.wikipedia.org/wiki/Inverted_index

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Exactly that!
Similarly, I was wondering how you might build an index to all the files
on a computer so that you could find one just by giving a wild card or
partial filename.

There is a standard utility on unix that does just that:

http://en.wikipedia.org/wiki/Locate_(Unix)

The source is in C, but might be informative (this is for the 'mlocate'
version):

https://fedorahosted.org/mlocate/browser/src

tom
 
T

Tassilo Horn

Roedy Green said:
Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

Sure. See http://en.wikipedia.org/wiki/Substring_index
Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.

On UNIXes, that's exactly what the locate tool does. Then you can
search with globbing wildcards

% # all files containing the substring foo
% locate *foo*

or even with regular expressions

% # all files that end with "r.c" or "r.cpp"
% locate -r '.*r\.\(c\|cpp\)$'

Bye,
Tassilo
 
G

Gene Wirchenko

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

Look up the Boyer-Moore string search algorithm.

[snip]

Sincerely,

Gene Wirchenko
 
G

glen herrmannsfeldt

Gene Wirchenko said:
On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green
Look up the Boyer-Moore string search algorithm.

Well, Boyer-Moore still scans linearly, but just does it faster.

Still, you could use something Boyer-Moore related if you know which
characters were in each record. If a record didn't have a character
in your substring, then it couldn't have the substring.

Still, understand Boyer-Moore before you get much farther
with string searching.

-- glen
 
R

Roedy Green

On UNIXes, that's exactly what the locate tool does. Then you can
search with globbing wildcards

i am trying to talk the JPSoft people into adding a locate feature
into Take Command. I wanted to be able to tell them roughly how to
implement it.

They file descriptions, and fast change directory give partial
directory names. I hoped they could come up with an integrated system
that was thread safe. (The current ones locks everybody out during
refresh).
--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 
A

Arne Vajhøj

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.

Use a database that support full text search.

Arne
 
L

Lew

Arne said:
Use a database that support full text search.

Perhaps
<http://www.google.com/search?q=latent+semantic+indexing>

Latent semantic indexing, or LSI, is an algorithm that maps word associations
culled from document bases into cartographic clusters in n-space, then projects
those clusters into m-space, m < n. The Wikipedia article (found in the
referenced search) describes it in terms of the underlying matrix operations,
which is great fun for those who like such things but loses the forest for the
trees. I think of it as dusty blobs floating in infospace (like cyberspacebut
doesn't need computers) with bright spots where terms (words, phrases, other
semantic molecules) live near each other, like "driver" not far from "wedge".
With lots of documents and terms, you have a lot of points of light, m times n
dimensions. The reduction brings it down to about one to three hundred
dimensions, but it's still a hyperspatial map (in the cartographic sense).

It is compute-intensive to map out all those correlations but makes for not
only a fast search, but one that seems almost magical in its ability to pull
meaningful answers out of the morass. The term "latent semantic indexing"
refers to how all that dusty matrix math ("indexing") pulls out the meaning
("semantic") lurking in the way words coalesce in meaningful documents
("latent"). The correlations capture the correlations in the content that
embody the linguistic meaning, so search results correlate quite well to
semantics in the query's meaning.

To a person fond of maths, information science, library science, linguistics,
semiotics, cartography, AI or document management LSI is a delight.
 
M

markspace


An implementation of an inverted index is often called a search engine.

<http://en.wikipedia.org/wiki/Search_engine_(computing)>

One of the best known implementations is a web search engine. Alt
Vista, Yahoo, and Google are well known examples.

There's a Java based search engine I happen to know about, Lucene.

<http://lucene.apache.org/java/docs/index.html>

This is a roll-your-own library, but it has a lot of features. Beyond a
cursory reading of the documentation I can't really say much about it
unfortunately.
 
D

Daniel Pitts

An implementation of an inverted index is often called a search engine.

<http://en.wikipedia.org/wiki/Search_engine_(computing)>

One of the best known implementations is a web search engine. Alt Vista,
Yahoo, and Google are well known examples.

There's a Java based search engine I happen to know about, Lucene.

<http://lucene.apache.org/java/docs/index.html>

This is a roll-your-own library, but it has a lot of features. Beyond a
cursory reading of the documentation I can't really say much about it
unfortunately.
Solr is a webservice(ish) program which is based around Lucene.

<http://lucene.apache.org/solr/>
 
T

Travers Naran

Let's say I had a million records each with a text field. I wanted to
find all records that contained a given substring. Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

I imagine you could build an index of words. Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.

In addition to the other links mentioned, here are some other links I
found helpful for the same problem:

http://en.wikipedia.org/wiki/Bloom_filter
http://en.wikipedia.org/wiki/Bitap_algorithm
http://johannburkard.de/software/stringsearch/

The Bloom Filter can be used to help reduce the set of files to search.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,153
Members
46,699
Latest member
AnneRosen

Latest Threads

Top