Responding toTimasmith...
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.
The key will be ordering, but for just 10K items one probably doesn't
want to get too elaborate.
If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?
I assume you mean in the Name rather than the ID. I also assume that the
data gets stored once but is searched often.
Is there a limit on the lengths of the Names? What is the average
length? The answers to these questions will drive tailoring the
solutions below to trade-off speed vs. effort. Can you define the ID
(e.g., an index for the order the data items are stored)?
I happen to be using java but I am open to general solutions.
Obviously I could just loop through the list and do a match on each
string but I cant help thinking one could be more efficient thatn
O(n).
I would be inclined to make use of a couple of practical
characteristics. In searching ASCII keys most rejections will occur in
the first few characters. (It has been many, many moons since I've done
searches so I forget the actual numbers for the distribution.) Also,
string matching algorithms for a given starting point in each string are
quite efficient.
I think one "cheap" approach would be to keep an ordered list of
digraphs or trigraphs that appear anywhere in the Name. That list gets
updated whenever an item is added/removed from the data. The list would
contain multiple entries consisting of the item ID and the start
position of the digraph/trigraph in the Name. (Maybe the remaining
length if the search keys are likely to be long so that you can
eliminate embedded substrings that would extend beyond the Name length.)
The data itself would be ordered in an array and the "ID" above would be
an index into the storage position (either the original ID or one you
create when storing the data). Your algorithm would find the starting
digraph/trigraph in the list above and "walk" the entries for that
digraph/trigraph. For each one it would do a lookup to get the data and
use a conventional substring matching algorithm for the full search
string at the indicated start point.
There are even less elaborate versions. You can simply keep a list of
data items that contain the starting letter of the search string
somewhere and search just them. Then all you need is to store the data
so you can access items by an index referenced in the starting letter list.
Bottom line: you have a trade-off between how fast you want to be and
how much work you want to put into storing the data items. As a fairly
general rule the search will be in direct proportional to how much space
you have for storing supporting infrastructure like index tables.
For example, you could create 26 trees whose nodes were letters. The
root node in each tree would be one of the possible first letters of the
Name. Each child node would be a character that immediately followed the
parent as the next letter of the Name (pretty much like a B-Tree). Each
node would have a list of items that contained the substring of letters
leading to that node. Then you could create a similar set of trees where
the root nodes were the second letter of the name and so on. The search
algorithm would "walk" the trees in each set from the node with the
proper starting letter. This would be very fast but it would require
substantially more storage space than the original data. [One could get
even more carried away if one uses a digraph or trigraph as the root
node in each tree. B-)]
*************
There is nothing wrong with me that could
not be cured by a capful of Drano.
H. S. Lahman
(e-mail address removed)
Pathfinder Solutionshttp://
www.pathfindermda.com
blog:
http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
(e-mail address removed) for your copy.
Pathfinder is hiring:
http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH