find words that contains some specific letters

F

Fralentino

Hi!

I want to make an algorithm that searches for all words in a
dictionary that contains some specific set of letters. For example i
want to find all words containing the exact letters t,a,s,m,p so the
word stamp is ok.

So basically i want to compare a String with a set of letters and find
out if you can build this string with these exact letter. I have a
solution for this but i dont think it's the the most optimal one so i
would like som tips on how do to this in a optimal and rather easy
way. Is there some recommended designs in how to do this? Is there
perhaps some method or class in the java api that does this for you?

Thanks.
/Baretos
 
F

Fralentino

What is your solution?

My solution:
boolean compareLetters(String toBeCompared,ArrayList<String> letters)
{

for(int i = 0;i<toBeCompared.length();i++)
{
if(letters.contains(""+toBeCompared.charAt(i)))
{
letters.remove(""+toBeCompared.charAt(i));
}
else
return false;
}
return true;

}
 
J

John B. Matthews

Fralentino said:
My solution:
boolean compareLetters(String toBeCompared,ArrayList<String> letters)
{

for(int i = 0;i<toBeCompared.length();i++)
{
if(letters.contains(""+toBeCompared.charAt(i)))
{
letters.remove(""+toBeCompared.charAt(i));
}
else
return false;
}
return true;

}

IIUC, you then have to run this method on every word in your dictionary.

Another approach is to permute the letters and search the dictionary.
A binary search of an ordered word list is O(log n), worst case.
Here's an example in Ada:

<http://home.roadrunner.com/~jbmatthews/jumble.html>

In Java, permutations can be checked quickly with the contains() method
of the Set interface, once the dictionary is read:

private static final String NAME = "/usr/share/dict/words";
private static final Set<String> set = new TreeSet<String>();
static {
try {
File file = new File(NAME);
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(file)));
String s;
while ((s = in.readLine()) != null) {
set.add(s);
}
} catch (IOException ex) {
System.err.println(ex.getMessage());
}
}

Finding the permutations of a string in Java is straightforward:

<http://www.google.com/search?q=java+permute>
 
G

Giovanni Azua

Hi John,

John B. Matthews said:
Another approach is to permute the letters and search the dictionary.
A binary search of an ordered word list is O(log n), worst case.
Here's an example in Ada:

<http://home.roadrunner.com/~jbmatthews/jumble.html>

In Java, permutations can be checked quickly with the contains() method
of the Set interface, once the dictionary is read:

private static final String NAME = "/usr/share/dict/words";
private static final Set<String> set = new TreeSet<String>();
static {
try {
File file = new File(NAME);
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(file)));
String s;
while ((s = in.readLine()) != null) {
set.add(s);
}
} catch (IOException ex) {
System.err.println(ex.getMessage());
}
}

Finding the permutations of a string in Java is straightforward:

<http://www.google.com/search?q=java+permute>
I was wondering what the overall lookup complexity would be here ...

Assuming "n" being the number of letters in a search word and "m" the total
number of words in the dictionary then the complexity would be:

o(n! * log2 m) => O(n! * log m) isn't this NP complexity? I am looking for
the exponential function that approximates n! I remember in class we used to
evaluate complexity using the O notation with exponential function rather
than n! ...

Best regards,
Giovanni
 
G

Giovanni Azua

Giovanni Azua said:
I was wondering what the overall lookup complexity would be here ...

Assuming "n" being the number of letters in a search word and "m" the
total number of words in the dictionary then the complexity would be:

o(n! * log2 m) => O(n! * log m) isn't this NP complexity? I am looking for
the exponential function that approximates n! I remember in class we used
to evaluate complexity using the O notation with exponential function
rather than n! ...
I found it here "Stirling's approximation"
<http://en.wikipedia.org/wiki/Factorial#Rate_of_growth>

so it would be o(n! * log2 m) => O(n^n * log m) and this is NP ...

Best regards,
Giovanni
 
L

Lew

Giovanni said:
I found it here "Stirling's approximation"
<http://en.wikipedia.org/wiki/Factorial#Rate_of_growth>

so it would be o(n! * log2 m) => O(n^n * log m) and this is NP ...

You only build the Set of permuted letters once. Then, following John
Matthews's suggestion to which you replied, you look up the n! permutations
with an O(1) lookup in the Set of dictionary words. So the actual complexity
is just O(n!)

Or one build the dictionary as a Map indexed by word letters in alphabetical
order with the values being corresponding Sets of words using those letters.
Then you only do an O(1) lookup into the Map to find the single ordered
permutation of the search term, then return the matching Set directly. So now
the overall lookup complexity is that of sorting the letters in the search term.
 
A

Arved Sandstrom

Fralentino said:
Hi!

I want to make an algorithm that searches for all words in a
dictionary that contains some specific set of letters. For example i
want to find all words containing the exact letters t,a,s,m,p so the
word stamp is ok.

So basically i want to compare a String with a set of letters and find
out if you can build this string with these exact letter. I have a
solution for this but i dont think it's the the most optimal one so i
would like som tips on how do to this in a optimal and rather easy
way. Is there some recommended designs in how to do this? Is there
perhaps some method or class in the java api that does this for you?

Thanks.
/Baretos

Assuming that we are talking about a fresh search every time, your
proposed algorithm (later post) is probably not all that bad, although
I'd be inclined to take the letters of interest, put them in a HashSet,
and then process each letter of a dictionary word and see if the set
contains it.

Any near-optimal method would also take into account the frequency of
letters in a language. For example, if you saw that your set of letters
contained one rare letter, for example, you'd save time overall by first
doing a search for this letter specifically in each dictionary word, and
discarding all of them (the large majority) that do not contain that
rare letter (*). Then you can do the comprehensive search.

AHS

* If your entire dictionary was composed into a single string, and you
maintained an offsets table, you could just burn through that large
string and look for the rare letter locations, and from the table figure
out what words contain it.
 
G

Giovanni Azua

Hi Lew,

Lew said:
You only build the Set of permuted letters once. Then, following John
Once per search to be precise.
Matthews's suggestion to which you replied, you look up the n!
permutations with an O(1) lookup in the Set of dictionary words. So the
actual complexity is just O(n!)
One word lookup in the Set costs O(log m) binary search and not O(1).
Therefore the O(log m) is *for each* generated permutation, and this is why
the multiplication i.e. O(n! * log m)
Or one build the dictionary as a Map indexed by word letters in
alphabetical order with the values being corresponding Sets of words using
those letters. Then you only do an O(1) lookup into the Map to find the
single ordered permutation of the search term, then return the matching
Set directly. So now the overall lookup complexity is that of sorting the
letters in the search term.
I was writing meantime a similar algorithm to this one you explain ... you
have to watch for multiple occurrences of the same letter though and the Set
should be SortedSet so there is calculating intercept of the Sets which is
O(n) if the Sets are SortedSet.

Best regards,
Giovanni
 
T

Tom McGlynn

A somewhat different approach to the text manipulation methods
discussed elsewhere on the thread would be to map each letter to a
prime number and compute the product corresponding to the letters in
each word. You would be interested in words that had a specific value
of that product since permutations of the letters will not affect it
-- and since you are using primes factors you can't have any
interlopers. The 26th prime is 101, so you could do up to about 8
letter words easily with longs. After that you'd eventually need to
use BigIntegers. Using the small primes for common letters is an
obvious optimization (e=2,t=3,... for English). I've no idea how this
would stack up in efficiency with the other approaches.

Regards,
Tom McGlynn
 
G

Giovanni Azua

Ciao Fralentino,

Fralentino said:
Hi!

I want to make an algorithm that searches for all words in a
dictionary that contains some specific set of letters. For example i
want to find all words containing the exact letters t,a,s,m,p so the
word stamp is ok.

So basically i want to compare a String with a set of letters and find
out if you can build this string with these exact letter. I have a
solution for this but i dont think it's the the most optimal one so i
would like som tips on how do to this in a optimal and rather easy
way. Is there some recommended designs in how to do this? Is there
perhaps some method or class in the java api that does this for you?
I thought of a possible algorithm to resolve this problem without using
permutations btw I think you also have to take into account the case of
repeated letter occurrences. The idea would be to build a suitable structure
so the lookup time is the fastest.

You would build the following structure from your dictionary once:

Map<String, SortedSet<String>> this is a map of SortedSet of words in the
dictionary keyed by the letters they occur in, and the words in the
SortedSet are ordered alphabetically e.g.

map.get("a").contains("fall") == true
map.get("a").contains("aurora") == false
map.get("aa").contains("aurora") == true // resolves the issue of words
with repeated occurrences of a letter
map.get("a").contains("brown") == false

The lookup algorithm would be ("n" is the number of input letters and "m"
number of words in the dictionary):

1-. Group the input letters e.g. 'a', 'u', 'r', 'o', 'r', 'a' into 'aa',
'o', 'rr', 'u', this can be done worst case O(n)

2-. Now for each of these groups retrieve the right SortedSet, this would be
o(n * c) where c is the constant time to lookup in a Map => big O(n)

3-. At this point find the intercept of all SortedSets, intercepting two
SortedSet is O(m)

Therefore the overall complexity would be O(n * m) which is better than
O(n^n * log m). Finding the intercepts of multiple SortedSets can be easily
done in parallel. I think then computing the intercept of every two
SortedSets in parallel would make the overall complexity O(n * log m)

Best regards,
Giovanni
 
G

Giovanni Azua

Giovanni Azua said:
Therefore the overall complexity would be O(n * m) which is better than
O(n^n * log m). Finding the intercepts of multiple SortedSets can be
easily done in parallel. I think then computing the intercept of every two
SortedSets in parallel would make the overall complexity O(n * log m)
Sorry the overall using parallelization would be O(log n * m) because you
would parallelize on the number of sets which worst-case corresponds to the
number of input letters. The O(m) is what takes to calculate the intercept
of any two Sets of worst-case length m words in the dictionary.

Best regards,
Giovanni
 
L

Lew

Giovanni said:
One word lookup in the Set costs O(log m) binary search and not O(1).

That is incorrect for HashSet, assuming you mean 'm' to be the set
size.
Therefore the O(log m) is *for each* generated permutation, and this is why
the multiplication i.e. [sic] O(n! * log m)

According to Sun's documentation for HashSet:
This class offers constant time performance for the basic operations
(add, remove, contains and size), assuming the hash function disperses
the elements properly among the buckets.

The term "constant time" means O(1). Therefore the lookup time is O
(1) for each generated permutation, and this is why the multiplication
is O(n! * 1 ).

Likewise, one word lookup in a HashMap <String, Set<String>> is O(1).
If you use only a single permutation to do the lookup, i.e., the
alphabetically sorted one, then you only do a single lookup for a
HashMap, not n! lookups.
I was writing meantime a similar algorithm to this one you explain ... you
have to watch for multiple occurrences of the same letter though and the Set
should be SortedSet so there is calculating intercept of the Sets which is
O(n) if the Sets are SortedSet.

The OP asked to find "all words in a dictionary that contains some
specific set of letters. ... containing the exact letters ..." If you
implement their "set of letters" as a String containing the letters in
alphabetic order, then you can include duplicated letters as part of
the search term. You wouldn't want a SortedSet to be the dictionary;
a Map is better, specifically a HashMap<String, Set<String>>. You do
an O(1) lookup of the search term, that is, a String comprising the
search letters in order, and get back the Set of matching words in a
single get().

Wouldn't you agree that the O(1) algorithm is a better choice than an O
(n) one?
 
L

Lew

Ciao Fralentino,





I thought of a possible algorithm to resolve this problem without using
permutations btw I think you also have to take into account the case of
repeated letter occurrences. The idea would be to build a suitable structure
so the lookup time is the fastest.

You would build the following structure from your dictionary once:

Map<String, SortedSet<String>> this is a map of SortedSet of words in the
dictionary keyed by the letters they occur in, and the words in the
SortedSet are ordered alphabetically e.g.

map.get("a").contains("fall") == true
map.get("a").contains("aurora") == false
map.get("aa").contains("aurora") == true  // resolves the issue of words
with repeated occurrences of a letter

The OP asked for words comprising the exact letters of the search
term, not merely including the searched letters.
basically i [sic] want to compare a String with a set of letters
and find out if you can build this string with these exact letter.


Thus, "aurora" would not be a valid match for "aa", although it would
be a match for "aaorru".
 
G

Giovanni Azua

Lew said:
That is incorrect for HashSet, assuming you mean 'm' to
be the set size.
You are wrong here, in general HashSet can end up worst case O(n) as in this
particular case being discussed, unless you make the wrong assumption that
all words in a dictionary fall each under a separate HashMap bucket and this
is NOT possible, there is no such hash function. This is why Matthews
explicitly mentioned and chose the binary search which is always worst case
O(log n).

Another excerpt from the HashMap javadoc "This class makes no guarantees as
to the order of the map; in particular, it does not guarantee that the order
will remain constant over time."

For this very reason the binary search is the right choice and not a
HashMap.
The term "constant time" means O(1). Therefore the lookup time is O
(1) for each generated permutation, and this is why the multiplication
is O(n! * 1 ).
You are wrong again, the constant time is defined as O(c) and not as O(1)
even a HashMap lookup involves a small number of operations and that is not
1. In general constant time is denoted using a constant e.g. c
Wouldn't you agree that the O(1) algorithm is a better choice
than an O(n) one?
Generally yes, but in this particular problem you assume that searching in a
dictionary is constant time using a HashMap and you are sadly mistaken.

Best regards,
Giovanni
 
G

Giovanni Azua

Hi Lew,

Lew said:
Thus, "aurora" would not be a valid match for "aa", although it would
be a match for "aaorru".
This is not what the algorithm I described before does. Please double-check
it carefully.

The input, a set of chars e.g. 'a' 'u' 'r' 'o' 'r' 'a' will match according
to the algorithm I defined before the following buckets:

SortedSet1 - "aa" - this means 'a' occurs 2x in the word
SortedSet2 - "rr" - this means 'r' occurs 2x in the word
SortedSet3 - "o"
SortedSet4 - "u"

Doing the map lookup you get these four SortedSet back. Finally calculating
the interception of these four SortedSet you get the desired result e.g.
"aurora" and many more.

Best regards,
Giovanni
 
J

John B. Matthews

"Giovanni Azua said:
Hi Lew,



Once per search to be precise.

One word lookup in the Set costs O(log m) binary search and not O(1).
Therefore the O(log m) is *for each* generated permutation, and this is
why the multiplication i.e. O(n! * log m)

Thank you both for elucidating this. If I understand big O notation
correctly, the factorial term dominates the complexity, giving O(n!)
overall. Of course, I would welcome clarification.

Indeed, my approach is fine for small permutations (e.g. word puzzles,
password strength), but hashing on the sorted letters of the dictionary
entries beforehand gives what a appears to be an O(n log n) algorithm:

<http://en.wikipedia.org/wiki/Jumble>
 
G

Giovanni Azua

Hi John,

John B. Matthews said:
Thank you both for elucidating this. If I understand big O notation
correctly, the factorial term dominates the complexity, giving O(n!)
overall. Of course, I would welcome clarification.
I have to double check but I am actually not sure it is only O(n!) the
reason is that "n" and "m" are two different measures here and will be
wildly different, so I am not sure from the top of my head that it will be
O(n!) taking over rather than O(n! * log m). If the log was on top of the
'n' yes by all means the factorial would dominate but not for the 'm'.

Best regards,
Giovanni
 
L

Lew

Giovanni said:
This is not what the algorithm I described before does. Please double-check
it carefully.

That is my point exactly. Your algorithm does not do what the OP
asked for. Please double-check it carefully.
The input, a set of chars e.g. 'a' 'u' 'r' 'o' 'r' 'a' will match according
to the algorithm I defined before the following buckets:

SortedSet1 - "aa" - this means 'a' occurs 2x in the word
SortedSet2 - "rr" - this means 'r' occurs 2x in the word
SortedSet3 - "o"
SortedSet4 - "u"

Doing the map lookup you get these four SortedSet back. Finally calculating
the interception of these four SortedSet you get the desired result e.g.
"aurora" and many more.

The Map lookup does what the OP described, not what you described, as
I explained.
 
L

Lew

Giovanni said:
You are wrong here, in general HashSet can end up worst case O(n) as in this

You mean O(m), right?
particular case being discussed, unless you make the wrong assumption that
all words in a dictionary fall each under a separate HashMap bucket and this
is NOT possible, there is no such hash function. This is why Matthews
explicitly mentioned and chose the binary search which is always worst case
O(log n).

There will be a small List or similar structure at each bucket of the
Set, but generally speaking those lists will be very small relative to
m. That is why the Javadocs for HashSet claim constant-time
performance for HashSet#contains(). Are you saying the Javadocs are
wrong?

It is not common to do binary searches on HashSets.

HashSet lookups tend to be much faster than binary searches because
the hash lookup takes one to the correct bucket in O(1) time, then
there is an O(x) search through the list at that bucket, where x is
some very small number. The nature of the hashing alogrithm should
keep x more-or-less constant with respect to m, thus the claim of
constant-time complexity for 'contains()' is not invalidated.

Again, this is the claim that the Javadocs make. I feel very
comfortable agreeing with the Javadocs on this matter.
Another excerpt from the HashMap javadoc "This class makes no guarantees as
to the order of the map; in particular, it does not guarantee that the order
will remain constant over time."

For this very reason the binary search is the right choice and not a
HashMap.

Except that a HashMap gives O(1) performance and the complexity
measure of a binary search is much worse.

Order of the HashMap is not relevant; one finds the correct entry
directly via the search-term hash and a short linear search through
the matching bucket. The size of each bucket does not depend on m for
typical dictionaries.
You are wrong again, the constant time is defined as O(c) and not as O(1)

Wikipedia agrees with me:
<http://en.wikipedia.org/wiki/Big-O_notation>
Note the first table entry of
<http://en.wikipedia.org/wiki/Big-
O_notation#Orders_of_common_functions>

Note that one of the algorithms given as having O(1) complexity in
that table is "using a constant-size lookup table or hash table".
even a HashMap lookup involves a small number of operations and that is not
1. In general constant time is denoted using a constant e.g. c

Not according to my math professors or any source I've read on big-O
notation. They all use "O(1)". See the Wikipedia reference that I
cited.
Generally yes, but in this particular problem you assume that searching in a
dictionary is constant time using a HashMap and you are sadly mistaken.

I am not mistaken, nor happily nor sadly, if Wikipedia and Sun's
Javadocs are to be believed. I've quoted Wikipedia's assertion that
hash table lookups are O(1). The Javadocs for HashMap state
explicitly, "This implementation provides constant-time performance
for the basic operations (get and put) ...".

I think I will believe the Javadocs. This belief is supported by
understanding the algorithm at the heart of the HashMap#get()
operation.

I agree that their analysis does not account for the time it takes to
sort the 'n' characters of the search term and the O(n) calculation of
the hash code for the search term. Since n is far less than m,
typically no more than ten and nearly never above a hundred for most
human languages, we can consider that the search term length is not as
severe a factor.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,982
Messages
2,570,190
Members
46,736
Latest member
zacharyharris

Latest Threads

Top