Lew said: Could you explain what you mean by "the interception of these four
SortedSet"? I don't understand that part.
I assume he means 'intersection'. I mean, intersection, interception,
O(1), O(c), what's the difference, right?
Arved said: Assuming that we are talking about a fresh search every time, your
proposed algorithm (later post) is probably not all that bad, although
I'd be inclined to take the letters of interest, put them in a HashSet,
and then process each letter of a dictionary word and see if the set
contains it.
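A minimal sketch of one reading of that check (every letter of interest must appear at least once); the word list and all names here are illustrative, not anything from the thread:
<code>
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LetterSetFilter {

    // Keep the dictionary words whose letters cover every letter of interest.
    // Multiplicity (e.g. needing two 'a's) is ignored here; that would need counts.
    static List<String> wordsContainingAll(List<String> dictionary, String letters) {
        Set<Character> wanted = new HashSet<>();
        for (char c : letters.toCharArray()) {
            wanted.add(c);
        }
        List<String> hits = new ArrayList<>();
        for (String word : dictionary) {
            Set<Character> seen = new HashSet<>();
            for (char c : word.toCharArray()) {
                if (wanted.contains(c)) {   // one hash probe per letter of the word
                    seen.add(c);
                }
            }
            if (seen.size() == wanted.size()) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("aurora", "auroras", "radar", "guru");
        System.out.println(wordsContainingAll(dictionary, "aru")); // [aurora, auroras]
    }
}
</code>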
Any near-optimal method would also take into account the frequency of
letters in a language. For example, if you saw that your set of letters
contained one rare letter, you'd save time overall by first
doing a search for this letter specifically in each dictionary word, and
discarding all of them (the large majority) that do not contain that
rare letter (*). Then you can do the comprehensive search.
AHS
* If your entire dictionary was composed into a single string, and you
maintained an offsets table, you could just burn through that large
string and look for the rare letter locations, and from the table figure
out what words contain it.
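A sketch of that footnote's idea, assuming the whole dictionary fits comfortably in one String; the class and method names are hypothetical:
<code>
import java.util.ArrayList;
import java.util.List;

public class RareLetterScan {

    private final String blob;      // all dictionary words concatenated
    private final int[] offsets;    // offsets[i] = start of word i inside blob
    private final List<String> words;

    RareLetterScan(List<String> dictionary) {
        this.words = dictionary;
        StringBuilder sb = new StringBuilder();
        offsets = new int[dictionary.size() + 1];
        for (int i = 0; i < dictionary.size(); i++) {
            offsets[i] = sb.length();
            sb.append(dictionary.get(i));
        }
        offsets[dictionary.size()] = sb.length();
        blob = sb.toString();
    }

    // Burn through the big string looking for the rare letter, and map each hit
    // back to a word via the offsets table. Survivors can then be checked in full.
    List<String> candidatesContaining(char rareLetter) {
        List<String> candidates = new ArrayList<>();
        int lastWord = -1;
        for (int pos = blob.indexOf(rareLetter); pos >= 0; pos = blob.indexOf(rareLetter, pos + 1)) {
            int w = wordAt(pos);
            if (w != lastWord) {            // don't add the same word twice
                candidates.add(words.get(w));
                lastWord = w;
            }
        }
        return candidates;
    }

    // Binary search over the offsets table: which word owns this character position?
    private int wordAt(int pos) {
        int lo = 0, hi = offsets.length - 2;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (offsets[mid] <= pos) lo = mid; else hi = mid - 1;
        }
        return lo;
    }

    public static void main(String[] args) {
        RareLetterScan scan = new RareLetterScan(List.of("jazz", "aurora", "quiz", "banjo"));
        System.out.println(scan.candidatesContaining('z')); // [jazz, quiz]
    }
}
</code>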
I figured it meant "intersection", too, but that's the part I don't
understand. What does it mean to take an intersection of the result
sets?
With a HashSet approach, you take each of the n! permutations of the
search string and determine 'dictionary.contains( permutation )'. The
result you want is the union of the permutations that return 'true'.
I do not understand why there would be a test for portions of the
search string, much less taking an "intersection" of those results.
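A sketch of the permutation-and-contains approach just described; it works, but note the n! blow-up for longer search strings. All names are illustrative:
<code>
import java.util.LinkedHashSet;
import java.util.Set;

public class PermutationLookup {

    // Collect every permutation of 'letters' that is itself a dictionary word:
    // the union of the permutations for which contains() returns true.
    static Set<String> matches(Set<String> dictionary, String letters) {
        Set<String> found = new LinkedHashSet<>();
        permute("", letters, dictionary, found);
        return found;
    }

    // Classic recursive permutation generator: n! candidates for n letters.
    private static void permute(String prefix, String rest, Set<String> dictionary, Set<String> found) {
        if (rest.isEmpty()) {
            if (dictionary.contains(prefix)) {   // one hash probe per permutation
                found.add(prefix);
            }
            return;
        }
        for (int i = 0; i < rest.length(); i++) {
            permute(prefix + rest.charAt(i),
                    rest.substring(0, i) + rest.substring(i + 1),
                    dictionary, found);
        }
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("act", "cat", "tac", "dog");
        System.out.println(matches(dictionary, "cat")); // [cat, act, tac]
    }
}
</code>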
With a HashMap<String, Set<String>> approach, the entire Set of
resultant dictionary words is indexed by the search string, so one
simple 'dictionary.get( searchTerm )' yields an entire result set
directly.
So I remain puzzled, and still need the answer.
Giovanni Azua said: SortedSet1 - "aa" - this means 'a' occurs 2x in the word
SortedSet2 - "rr" - this means 'r' occurs 2x in the word
SortedSet3 - "o"
SortedSet4 - "u"
Doing the map lookup you get these four SortedSet back. Finally
calculating the interception of these four SortedSet you get the desired
result e.g. [snip]
I meant intersection.
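One possible reading of that scheme, sketched below: index each word under a key per letter run ("aa", "rr", "o", "u"), then intersect the looked-up sets with retainAll. The exact-count interpretation of a run is my assumption, and every name here is illustrative:
<code>
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class RunIndex {

    // Key: a letter run such as "aa", "rr", "o", "u"; value: words containing exactly that run.
    private final Map<String, SortedSet<String>> index = new HashMap<>();

    void add(String word) {
        for (String run : runsOf(word)) {
            index.computeIfAbsent(run, k -> new TreeSet<>()).add(word);
        }
    }

    // Look up one SortedSet per run of the query, then intersect them with retainAll.
    SortedSet<String> lookup(String query) {
        SortedSet<String> result = null;
        for (String run : runsOf(query)) {
            SortedSet<String> hits = index.getOrDefault(run, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(hits);   // copy so retainAll does not damage the index
            } else {
                result.retainAll(hits);          // set intersection
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // "aurora" -> counts a:2, o:1, r:2, u:1 -> runs "aa", "o", "rr", "u"
    private static List<String> runsOf(String word) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : word.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        List<String> runs = new ArrayList<>();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            runs.add(String.valueOf(e.getKey()).repeat(e.getValue()));
        }
        return runs;
    }

    public static void main(String[] args) {
        RunIndex idx = new RunIndex();
        for (String w : new String[] { "aurora", "auroras", "radar", "arrow" }) {
            idx.add(w);
        }
        System.out.println(idx.lookup("aurora"));  // [aurora, auroras]: both contain aa, rr, o, u
    }
}
</code>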
I want to make an algorithm that searches for all words in a
dictionary that contain some specific set of letters.
Implementing <http://en.wikipedia.org/wiki/Jumble>, your HashMap<String,
Set<String>> makes an excellent dictionary. The Map takes a little extra
time to construct, but the result is static. Once the characters of an
input word are sorted, O(n log n), the lookup is indeed O(1).
<code>... [elided - good code]
</code>
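The actual code is elided above; what follows is only a sketch of the approach as described (sort the input's letters, key a HashMap<String, Set<String>> on the sorted form), with illustrative names:
<code>
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class JumbleDictionary {

    // Key: the word's letters in sorted order; value: every word with exactly those letters.
    private final Map<String, Set<String>> anagrams = new HashMap<>();

    JumbleDictionary(List<String> words) {
        for (String word : words) {
            anagrams.computeIfAbsent(sortedLetters(word), k -> new HashSet<>()).add(word);
        }
    }

    // Sort the input's characters (O(n log n)), then a single map lookup.
    Set<String> unscramble(String jumbled) {
        return anagrams.getOrDefault(sortedLetters(jumbled), Set.of());
    }

    private static String sortedLetters(String s) {
        char[] letters = s.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    public static void main(String[] args) {
        JumbleDictionary dict = new JumbleDictionary(List.of("act", "cat", "dog", "god"));
        System.out.println(dict.unscramble("tca")); // [act, cat] (set order may vary)
    }
}
</code>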
Lew said: You mean O(m), right?
yes
Lew said: There will be a small List or similar structure at each bucket of the
Set, but generally speaking those lists will be very small relative to m.
big O is about worst-case complexity not about "relatively this or that",
hoping small rather than big data size is wishful thinking.
Lew said: That is why the Javadocs for HashSet claim constant-time
performance for HashSet#contains(). Are you saying the Javadocs are
wrong?
HashSet does not claim constant time, your interpretation does.
No, the javadocs are fine. I think your interpretation is wrong.
Lew said: It is not common to do binary searches on HashSets.
Not only uncommon but plain wrong, there is no order in HashSets. I don't
know where you got this idea from.
Lew said: HashSet lookups tend to be much faster than binary searches because
the hash lookup takes one to the correct bucket in O(1) time, then
there is an O(x) search through the list at that bucket, where x is
some very small number. The nature of the hashing algorithm should
keep x more-or-less constant with respect to m, thus the claim of
constant-time complexity for 'contains()' is not invalidated.
Rubbish. The complexity of Hash tables is worst-case O(n). TreeSet or [snip]
Lew said: Again, this is the claim that the Javadocs make. I feel very
comfortable agreeing with the Javadocs on this matter.
Javadoc can't make it for a book in algorithms and complexity, it just [snip]
Lew said: Order of the HashMap is not relevant; one finds the correct entry
directly via the search-term hash and a short linear search through
the matching bucket. The size of each bucket does not depend on m for
typical dictionaries.
Exactly. This is O(m), i.e. all elements (or many) fall under a few buckets.
Lew said: The term "constant time" means O(1). [snip]
Yes, I made a mistake here. I thought the notation was O(c) but it is O(1).
John B. Matthews said: Implementing <http://en.wikipedia.org/wiki/Jumble>, your HashMap<String,
Set<String>> makes an excellent dictionary. The Map takes a little extra
time to construct, but the result is static. Once the characters of an
input word are sorted, O(n log n), the lookup is indeed O(1).
<code>
[snip]
Set<String> words = map.get(new String(ba));
</code>
This line will return wrong matches that do not satisfy the requirement of
the OP. You wrongly assume that for every different sorted input "ba" String
there will be a different hashCode or that they will always fall under
different buckets and that is not true.
I bet you did.
Tom McGlynn said: I can take a crack at it... I understood Giovanni as saying that we
pre-compute something like [snip]
Tom McGlynn said: Set doesn't have a named intersection operator, but it does
have retainAll which basically does intersection. [Though
one needs to check that it's implemented.]
The retainAll is indeed an intersection implementation but surprisingly it [snip]
Tom McGlynn said: This is I think what Giovanni described,
Exactly.
Tom McGlynn said: but I don't think it's quite enough. I think we also have to constrain
the length of the words to be the same as the length of the original
key. (E.g., "auroras" would match [snip])
Good catch with the length, I did not see that before, good job! I am [snip]
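A sketch of the intersection plus that length constraint, assuming the per-letter result sets have already been looked up; everything named here is illustrative, not code from the thread:
<code>
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class IntersectAndFilter {

    // Intersect the per-letter result sets with retainAll, then drop any word whose
    // length differs from the original key, so "auroras" cannot match a 6-letter key.
    static Set<String> exactMatches(List<Set<String>> resultSets, String key) {
        Set<String> result = new TreeSet<>();
        if (resultSets.isEmpty()) {
            return result;
        }
        result.addAll(resultSets.get(0));            // start from a copy of the first set
        for (Set<String> s : resultSets.subList(1, resultSets.size())) {
            result.retainAll(s);                     // Set has no 'intersection', but retainAll does the job
        }
        for (Iterator<String> it = result.iterator(); it.hasNext(); ) {
            if (it.next().length() != key.length()) {
                it.remove();                         // the length constraint
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> aa = new TreeSet<>(List.of("aurora", "auroras", "radar"));
        Set<String> rr = new TreeSet<>(List.of("aurora", "auroras", "radar", "arrow"));
        Set<String> o  = new TreeSet<>(List.of("aurora", "auroras", "arrow"));
        Set<String> u  = new TreeSet<>(List.of("aurora", "auroras", "guru"));
        System.out.println(exactMatches(List.of(aa, rr, o, u), "aurora")); // [aurora]
    }
}
</code>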
"Giovanni Azua said:John B. Matthews said:Implementing <http://en.wikipedia.org/wiki/Jumble>, your
HashMap<String, Set<String>> makes an excellent dictionary. The Map
takes a little extra time to construct, but the result is static.
Once the characters of an input word are sorted, O(n log n), the
lookup is indeed O(1).
<code>
[snip]
Set<String> words = map.get(new String(ba));
This line will return wrong matches that do not satisfy the
requirement of the OP.
You wrongly assume that for every different sorted input "ba" String
there will be a different hashCode or that they will always fall
under different buckets and that is not true.
Giovanni said: Not only uncommon but plain wrong, there is no order in HashSets. I don't
know where you got this idea from.
Giovanni said: This line will return wrong matches that do not satisfy the requirement of
the OP. You wrongly assume that for every different sorted input "ba" String
there will be a different hashCode or that they will always fall under
different buckets and that is not true.
Giovanni said: big O is about worst-case complexity not about "relatively this or that",
hoping small rather than big data size is wishful thinking.
m being the size of the dictionary:
if you mean small being m/1000 this is still O(m)
if you mean small being m/1000000 this is still O(m)
"... assuming the hash function disperses the elements properly among the buckets":
http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashSet.html
e.g. you can place a hashCode implementation that always returns 1 and that
will prove the constant-time offering wrong, or simply the data being
modelled does not have a perfect hashCode, e.g. String.
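A tiny sketch of that degenerate case, just to illustrate the point being argued; BadKey is a made-up class, not anything from the thread:
<code>
import java.util.HashSet;
import java.util.Set;

public class DegenerateHash {

    // A key type whose hashCode is deliberately constant. It is legal (equal objects
    // still get equal hash codes), but every instance collides into the same bucket,
    // so HashSet can no longer spread lookups and contains() degrades toward O(n).
    static final class BadKey {
        final int value;
        BadKey(int value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).value == value;
        }
        @Override public int hashCode() { return 1; }
    }

    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        for (int i = 0; i < 10_000; i++) {
            set.add(new BadKey(i));       // every add walks the one overcrowded bucket
        }
        System.out.println(set.contains(new BadKey(9_999)));  // true, but found the slow way
    }
}
</code>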
Patricia Shanahan said: A perfect hash function is one that guarantees distinct hash codes for a
specific set of keys. There are algorithms for constructing one, given a
set of keys.
For example the Character class.
"Giovanni Azua said:For example the Character class.
John said: I think all of the primitive wrappers (except Double) have trivially
perfect integer hashes, as they wrap distinct values. Patricia was
perhaps referring to algorithms of the kind discussed here:
<http://en.wikipedia.org/wiki/Perfect_hash_function>
John said: The jumble algorithm relies on a hashed map for efficiency, but a
perfect hash is not essential. I recently implemented the algorithm
in both Java and Ada using library routines:
<http://en.wikipedia.org/wiki/Jumble>
<http://sites.google.com/site/drjohnbmatthews/jumble>
<http://home.roadrunner.com/~jbmatthews/jumble.html>
I was intrigued to compare the two languages' library implementations
of a general purpose string hash. Using a multiplicative function with
an initial value of zero, both offer excellent performance. Java uses a
multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of
eight on an unsigned (modular) value whose size is implementation
defined (e.g. mod 2**32):
<http://www.docjar.org/html/api/java/lang/String.java.html>
<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>
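For the Java side of that comparison, the multiplicative scheme described is easy to reproduce; this sketch mirrors the published behaviour of String.hashCode() (multiplier 31, initial value zero) and makes no claim about the Ada library beyond what is said above:
<code>
public class MultiplicativeHash {

    // The shape of Java's String.hashCode(): start at zero, multiply by 31, add each char.
    // Overflow in the signed 32-bit int is deliberate and harmless here.
    static int hash31(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String word = "jumble";
        System.out.println(hash31(word) == word.hashCode());   // true: same value as the library
    }
}
</code>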