Roedy said:
Is there something like HashMap but that optimised when nearly always
the thing you are looking up is not in the list, and when you can add
the list of words to look up and then freeze it.
I have to scan an entire website looking up every word.
Unless said web site is
www.archive.org, just create a HashMap
and use a Collections.unmodifiableMap() wrapper for lookup. I'd
expect memory to become a problem way earlier than processor.
Depending on how often each String is looked up, and how long the
Strings are, a data structure that does not need String.hashCode()
might be faster.
I'd consider a tree along the following lines:
Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of
'`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a
boolean and a Node[27]. The tree contains a word if there is a path
from the root to a Node whose boolean is true and that is reached via
Node[String.charAt(l)] at level l.
For example, a tree containing exactly
"a", "at", "and", "no", "nil", "not"
would be
N0 = { false, [ null, N1, 12*null, N2, 12*null ] } // ""
N1 = { true, [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a"
N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n"
N3 = { false, [ null, 13*null, N7, 12*null ] } // "an"
N4 = { true, [ null, 26*null ] } // "at"
N5 = { false, [ null, 11*null, N8, 14*null ] } // "ni"
N6 = { true, [ null, 19*null, N9, 6*null ] } // "no"
N7 = { true, [ null, 26*null ] } // "and"
N8 = { true, [ null, 26*null ] } // "nil"
N9 = { true, [ null, 26*null ] } // "not"
For a Map, replace the boolean with a field holding the value.
Possible improvements:
- Since we have three (32bit VM) or seven (64bit VM) padding bytes
in each Node, we could use two of them to hold a short indicating
the maximum length of Strings in the subtree.
- Instantiate Node[] only if actually needed to hold a child, and only
large enough to hold all currently known children. Use one of the
padding bytes to hold an index offset, such that a Node with only
a single child only holds a Node[1]:
N4 = { true, 0, null }
N6 = { true, 20, [ N9 ] }
- Use an interface and specialized concrete classes for
- leaf node (no fields at all, because boolean will always be true)
- single child node { boolean, Node }
or even "true" and "false" single child nodes
- ...
but this will considerable increase preprocessing cost.
- If many of your words contain non-[a-z] characters, provide array
positions for the most common of these, or use a linear probing
hash table for the non-[a-z] children.
- If the tree is very sparse, doing so even for the [a-z] children
might be better in the long run because more of it fits into caches.
This surely has been invented before and given a nice name...
Apache's or Google's collection libraries might even feature
high quality implementations.