Most efficient algorithm for matching


H. S. Lahman

Responding to Timasmith...
So I can point you to a file which is very close to what I would be
searching.

http://www.fda.gov/cder/ndc/ziptext.zip

It is a list of medications. There are 50,000 entries, which is a
reasonable representation. Look at column G in listings.txt; taken by
itself it is about 1.5 MB in size.

If I load the items into an ArrayList and then search linearly for
items matching WELLBUTRIN, then doing 1000 searches takes 9-11 seconds.
So 100 searches a second seems rather fast; maybe I am worrying about
nothing.

It depends upon the context. If you only need the search as a response
to a particular interactive user request, then that 10 ms will not be
noticeable compared to the hundreds of milliseconds for the user's
typing. (It will probably take longer just to read the data from disk.)

OTOH, if your software is in the bowels of an HMO's multi-user
application processing 1K pharmacy coverage requests a second, you
probably need to speed things up a tad. [Though I don't know why a
pharmacy would be providing partial keys; maybe they can't read the
doctor's writing. B-)] In that case you probably don't want to read the
DB at all for an individual request. You probably want to read the data
into memory at startup and answer the queries from memory. If the data
is only using 1.5 MB, you could probably have another 5-10 MB of
supporting data structures for the searches w/o getting into page
faulting trouble.

Bottom line: I think you need to look at the situation and determine
what is acceptable performance and then do any trade-offs from there.
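
For concreteness, a minimal sketch of the in-memory linear scan being
timed above (illustrative only: the file name names.txt and the
one-name-per-line loading are assumptions; extracting column G of
listings.txt is elided):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class LinearDrugSearch {
    public static void main(String[] args) throws IOException {
        // Load the ~50,000 names into memory once at startup.
        List<String> names = Files.readAllLines(Paths.get("names.txt"));

        // Linear scan: collect every item that CONTAINS the partial key.
        String key = "WELLBUTRIN";
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (name.contains(key)) {
                hits.add(name);
            }
        }
        System.out.println(hits.size() + " matches");
    }
}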


*************
There is nothing wrong with me that could
not be cured by a capful of Drano.

H. S. Lahman
(e-mail address removed)
Pathfinder Solutions
http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
(e-mail address removed) for your copy.
Pathfinder is hiring:
http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH
 

Chris Uppal

David said:
I don't think so. From the docs "In order to prevent extremely slow
WildcardQueries, a Wildcard term should not start with one of the
wildcards * or ?".

I think you must be misinterpreting the Lucene documentation. Lucene is
specifically /for/ searching large volumes of text.

I suspect that what it is trying to warn of is that if you want to look
for the needle, "needle", in a multi-GB haystack, you should ask Lucene
to look for "needle", not for "*needle".

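One way to see why, using a sorted set as a plain-Java stand-in for
Lucene's term dictionary (the terms here are made up for illustration):
a trailing wildcard can seek straight to its prefix; a leading wildcard
must test every term.

import java.util.*;

public class WildcardCost {
    public static void main(String[] args) {
        // Stand-in for a sorted term dictionary.
        TreeSet<String> terms =
                new TreeSet<>(List.of("needle", "word", "words", "work"));

        // "wor*": a cheap range scan that starts at the prefix.
        SortedSet<String> worStar =
                terms.subSet("wor", "wor" + Character.MAX_VALUE);

        // "*ords": nothing to seek on; every term must be checked.
        List<String> starOrds = new ArrayList<>();
        for (String t : terms) {
            if (t.endsWith("ords")) starOrds.add(t);
        }

        System.out.println(worStar + " / " + starOrds);
    }
}
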
-- chris
 

Daniel Pitts

I don't think so. From the docs "In order to prevent extremely slow
WildcardQueries, a Wildcard term should not start with one of the
wildcards * or ?".

You would use stemming to index the substrings, rather than wildcards
to search for them.
 

Lew

bugbear said:
For human interaction, 1/100 of a second is good.
For human interaction with multiple users (e.g. a webserver)
you might want more like 1/1000.

No reason to be faster for any one transaction if they interleave well.

Where did you come up with the .01s figure? Is there ergonomic research behind it?

- Lew
 

S Perryman

Timasmith wrote:
Efficient text searching is a /big/ subject. No, let me be more
precise: it's a /HUGE/ subject.
If your list of items is reasonably short, and the items themselves are
reasonably short, then you can do yourself a big favour by ignoring
efficiency and just searching linearly. A lot depends on the
application, but I'm not convinced that 10K items is enough to justify
heavier-weight solutions.
Next up, there are efficient text searching algorithms, such as
Boyer-Moore, and the interesting newish breed of "bit-parallel" or
"bitmap" searching algorithms.

I have heard of the "bitmap" algorithms, but don't know how they
compare to the really good standard ones like KMP. (Boyer-Moore is
subject to degenerating to O(M * N) for certain alphabets, while KMP
should have fixed overhead, except probably for massively long search
patterns - computing the "failure" function etc.)
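
For reference, a minimal KMP sketch (an illustration, not production
code): the failure function costs O(M) to build and the scan is O(N),
so the total is O(M + N) regardless of the alphabet.

public class Kmp {
    // failure[i] = length of the longest proper prefix of the pattern
    // that is also a suffix of pattern[0..i].
    static int[] failure(String pattern) {
        int[] f = new int[pattern.length()];
        for (int i = 1, k = 0; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = f[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            f[i] = k;
        }
        return f;
    }

    // Index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        if (pattern.isEmpty()) return 0;
        int[] f = failure(pattern);
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = f[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) k++;
            if (k == pattern.length()) return i - k + 1;
        }
        return -1;
    }
}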


Regards,
Steven Perryman
 

David Segall

Chris Uppal said:
I think you must be misinterpreting the Lucene documentation. Lucene is
specifically /for/ searching large volumes of text.
True, but the OP does not have large volumes of text. He does not
specify the amount but, to me, "a name" implies at most a few hundred
characters.
I suspect that what it is trying to warn of is that if you want to look
for the needle, "needle", in a multi-GB haystack, you should ask Lucene
to look for "needle", not for "*needle".
The OP emphasised that "I want to find every item that CONTAINS that
string". He asked for the ability to efficiently look for "*needle".
It is clear from the documentation that Lucene is optimised to look
for "words", that it can cope with "wor*" but that it is not the right
tool for "*ords".
 

David Segall

Daniel Pitts said:
You would use stemming to index the substrings, rather than wildcards
to search for it.
Stemming in Lucene reduces token terms to their "_linguistic_ root
form, e.g. reduces 'fishing' and 'fishes' to 'fish'". It would not
help find 'shing' in 'fishing'.
 

bugbear

Timasmith said:
No constraints on alphabet; most of the search targets are single
words of 3-30 characters.

Insertions are rare, the list does not change from day to day, and
retrievals are measured in thousands or even tens of thousands per hour.

Given the other info in the thread, I would try the following hybrid
hack.

Concatenate all the names into a single array of characters, separated
by some "impossible" character (likely NUL), with a second array
containing the offsets within the character "pool" back to your Items.

Use a fast text search to locate your target within the pool.

I would suggest:

A Fast Generic Sequence Matching Algorithm
David R. Musser, Gor V. Nishanov
http://citeseer.ist.psu.edu/586484.html

Having all the data in one memory-contiguous pool should speed things
along, and the search should be very fast.

Then use the (sorted, of course) offset array to turn the match offset
into Item references.
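
A rough sketch of that hybrid (illustrative: a String stands in for
the character pool, and String.indexOf stands in for a specialised
matcher such as Musser-Nishanov):

import java.util.*;

class PooledNameIndex {
    private final String pool;   // all names, NUL-separated
    private final int[] offsets; // offsets[i] = start of item i in pool

    PooledNameIndex(List<String> names) {
        StringBuilder sb = new StringBuilder();
        offsets = new int[names.size()];
        for (int i = 0; i < names.size(); i++) {
            offsets[i] = sb.length();
            sb.append(names.get(i)).append('\0'); // NUL never occurs in a name
        }
        pool = sb.toString();
    }

    // Every item whose name contains the key (an item can repeat if the
    // key occurs in its name more than once).
    List<Integer> search(String key) {
        List<Integer> hits = new ArrayList<>();
        for (int at = pool.indexOf(key); at >= 0;
                 at = pool.indexOf(key, at + 1)) {
            // Map the match offset back to an item via the sorted offsets.
            int idx = Arrays.binarySearch(offsets, at);
            if (idx < 0) idx = -idx - 2; // insertion point minus one
            hits.add(idx);
        }
        return hits;
    }
}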

BugBear
 

bugbear

Lew said:
No reason to be faster for any one transaction if they interleave well.

Where did you come up with the .01s figure? Is there ergonomic research
behind it?

No; I'm afraid I made it up, on the grounds that 1/100th of a second is
clearly "good enough". Slower might also be good enough, but I'm
confident that 1/100 is OK.

(hell, it's quicker than many people's monitor refresh!)

BugBear
 

Lew

David said:
The OP emphasised that "I want to find every item that CONTAINS that
string". He asked for the ability to efficiently look for "*needle".
It is clear from the documentation that Lucene is optimised to look
for "words", that it can cope with "wor*" but that it is not the right
tool for "*ords".

But the /search/ term doesn't start with a wildcard; the /search/ term
is "ords" or "needle".

Consider

String s = "Text base in which to search for words.";
boolean containsSearchTerm = (s.indexOf( "ords" ) >= 0);

Note that this determines whether the item "CONTAINS that string", even
though the /search/ term does not start with a wildcard. The OP was
asking for this capability across secondary storage.

- Lew
 

Chris Uppal

David said:
Stemming in Lucene reduces token terms to their "_linguistic_ root
form. e.g. reduces 'fishing' and 'fishes' to 'fish'". It would not
help find 'shing' in 'fishing'.

Ah, I see what you mean now. Thanks.

-- chris
 

Christian

Christian said:
Search for Ukkonen's algorithm ...
it's an algorithm for building suffix trees.

Building is O(n),
storage needed is O(n),
and searching is O(m);
that may fit your needs ...

If you are not willing to index, you could also try Boyer-Moore ...
that needs no indexing and should be, in the average case,

O(m + n/m)

with n the text length and m the length of the searched substring.


(With suffix trees, searching is not constant time, but it is
independent of n, which in most cases should be good enough.)
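
Ukkonen's construction is too long to sketch here, but a suffix array
(a simpler relative of the suffix tree) shows the same query idea:
sort all suffixes once, then answer each substring query by binary
search, independent of the text length n. This naive build is
illustrative only and far slower than Ukkonen's O(n):

import java.util.*;

class SuffixIndex {
    private final String text;
    private final Integer[] sa; // suffix start offsets, sorted

    SuffixIndex(String text) {
        this.text = text;
        sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive comparison sort of all suffixes.
        Arrays.sort(sa, (a, b) ->
                text.substring(a).compareTo(text.substring(b)));
    }

    // True if pattern occurs anywhere in text: O(m log n) comparisons.
    boolean contains(String pattern) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int end = Math.min(sa[mid] + pattern.length(), text.length());
            int cmp = text.substring(sa[mid], end).compareTo(pattern);
            if (cmp == 0) return true; // pattern prefixes this suffix
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return false;
    }
}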
 

Daniel Pitts

Stemming in Lucene reduces token terms to their "_linguistic_ root
form, e.g. reduces 'fishing' and 'fishes' to 'fish'". It would not
help find 'shing' in 'fishing'.


Linguistic roots are only ONE implementation of stemming. It's
perfectly acceptable to stem "suffixes".
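
A minimal sketch of that idea in plain Java (illustrative; Lucene
itself would instead index each suffix of a token as a term): index
every suffix of every name, so a "contains" query becomes a prefix
range lookup.

import java.util.*;

class SuffixedNames {
    // Every suffix of every name, mapped back to the item ids.
    private final TreeMap<String, List<Integer>> index = new TreeMap<>();

    SuffixedNames(List<String> names) {
        for (int id = 0; id < names.size(); id++) {
            String name = names.get(id);
            for (int i = 0; i < name.length(); i++) {
                index.computeIfAbsent(name.substring(i),
                        k -> new ArrayList<>()).add(id);
            }
        }
    }

    // Ids of all items whose name contains key: every indexed suffix
    // beginning with key lies in the range [key, key + '\uffff').
    Set<Integer> contains(String key) {
        Set<Integer> hits = new TreeSet<>();
        for (List<Integer> ids :
                index.subMap(key, key + Character.MAX_VALUE).values()) {
            hits.addAll(ids);
        }
        return hits;
    }
}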
 

David Segall

Lew said:
But the /search/ term doesn't start with a wildcard, the /search/ term is
"ords" or "needle".
True, but Lucene's /index/ only contains "words".
Consider

String s = "Text base in which to search for words.";
boolean containsSearchTerm = (s.indexOf( "ords" ) >= 0);
Exactly! For Lucene, that is a wildcard search which must be performed
on _every_ word in the index. The OP would be far better off using that
code on his 10,000 raw strings.

Anyway, our debate over using Lucene has been overtaken by Christian's
suggestion to use Ukkonen's algorithm. Apart from the thorny problem
of interpreting multi-word searches, which applies to all the
suggestions, the algorithm seems ideal. An early result of a Google
search contains a lucid explanation and some source code
<http://marknelson.us/1996/08/01/suffix-trees/>.
 

Lew

David said:
Anyway, our debate over using Lucene has been overtaken by Christians
suggestion to use Ukkonen's algorithm. Apart from the thorny problem
of interpreting multi-word searches, which applies to all the
suggestions, the algorithm seems ideal. An early result of a Google
search contains a lucid explanation and some source code
<http://marknelson.us/1996/08/01/suffix-trees/>.

- Lew
 
