In comp.lang.javascript message said:
Dr J R Stockton said the following on 11/26/2007 11:27 AM:
I have wanted a customized personal dictionary of my own for a while
now. The biggest problem I have had was trying to find a text file that
had a word list in it that I could trust.
Try Google for a combination of two or three entirely unrelated unusual
words, and you'll start finding possible lists. Taghairm Octothorpe
seems a bit too obscure a pair; but maybe you don't do taghairm in the
USA (it's not in my Websters).
^^^^^^^^^^^
That bit applies to an earlier version of the code :-(.
appears to take no time at all, meaning under about 15 ms. That's good
enough (caveat - building the dictionary is slower!). P4/3G, XP.
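var Dic = {}, J, X, T, Y
// Build a dictionary of 100,000 string keys (not timed)
for (J = 0; J < 1e5; J++) Dic[String(Math.sin(J))] = J
T = new Date()
// Timed part: 10,000 lookups of keys that are all present
J = 1e4 ; while (J--) X = Dic[String(Math.sin(98765 - 3*J))]
Y = [new Date() - T, X]   // [elapsed ms, value of the last lookup]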
Y becomes [172, 98765] or [156, 98765]; each lookup takes about 16
microseconds in a dictionary of 100,000 entries. Or is the code not
testing it correctly?
It isn't doing a comparison. It creates the dictionary, then sets the
variable X to the value of a possible entry 10,000 times. It is looking
up 10,000 entries, which is 9,999 more than it has to look up.
The code is designed to do lookups for timing, without bothering with
the trivial matter of reporting success in an appropriate form.
I'd not thought it necessary to explain that doing 10,000 different
lookups was in order to take a measurable total time. X is the value of
the last lookup, as a check. BTW, changing the 1e4 did change the time
proportionately (as was confidently expected) and changing the 1e5
changed it much less (as was less confidently expected).
In the timed part, words that should be present are found. For those
who don't make too many errors, the time for a failed lookup is less
important. New test: insert +0.5 after 3*J. Virtually all lookups
now fail. Time taken is unchanged.
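For anyone wanting to repeat the hit and miss comparison, here is a minimal sketch. It is not the code that was actually run: timeLookups is a hypothetical helper added for illustration, and it counts how many keys were found so that the two cases can be told apart. The timings will of course depend on the machine and the browser.

```javascript
// Build the same kind of dictionary as in the test: 100,000 string keys.
var Dic = {};
for (var J = 0; J < 1e5; J++) Dic[String(Math.sin(J))] = J;

// timeLookups is a hypothetical helper: it performs `count` lookups and
// reports the elapsed milliseconds and the number of keys that were found.
function timeLookups(offset, count) {
  var found = 0, start = new Date();
  for (var K = count; K--; ) {
    if (Dic[String(Math.sin(98765 - 3 * K + offset))] !== undefined) found++;
  }
  return { ms: new Date() - start, found: found };
}

var hits = timeLookups(0, 1e4);     // every key should be present
var misses = timeLookups(0.5, 1e4); // virtually every key should be absent
```

Comparing hits.ms with misses.ms shows whether a failed lookup costs more than a successful one.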
The flaw in the test is what made me realize how to do a dictionary
and make it simple and fast. Instead of setting the Dic entry to J, set
it to 1. Then, to find out if a word exists in the dictionary or not
you simply test for it:
I merely found it more convincing for the lookup to find the position.
Your code fragment only does the lookup, and does it in the same way as
my code. One can set the entries to true, and return either true or
undefined.
If the entry is there, the lookup returns 1, which converts to true.
If the entry doesn't exist, the lookup returns undefined, which
converts to false. Let the browser do the lookup.
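A minimal sketch of that kind of existence test follows. It is not the fragment from the earlier post; the word list and the name isWord are made up for illustration. The hasOwnProperty guard is an addition of mine: with a plain object, a bare Dic[word] would also report inherited names such as "constructor" as present.

```javascript
// Hypothetical word list; a real one would be loaded from a file.
var words = ["apple", "banana", "cherry"];

// Store each word as a key with the value 1.
var Dic = {};
for (var i = 0; i < words.length; i++) Dic[words[i]] = 1;

// isWord is an illustrative name. The lookup yields 1 or undefined,
// and !! converts that to true or false. hasOwnProperty keeps inherited
// properties of Object.prototype from being mistaken for entries.
function isWord(word) {
  return Object.prototype.hasOwnProperty.call(Dic, word) && !!Dic[word];
}
```

With this list, isWord("apple") gives true, while isWord("durian") and isWord("constructor") give false.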
To correct myself, and admit I was thinking about it wrong, I don't
think the lookup is a problem. A 214,000 word dictionary is roughly 4.5
MB, so a 25,000 word dictionary should, guessing, be around 500 KB or
so. Not bad on a broadband connection but murder on a dial-up
connection.
A dictionary should compress automatically over modern dial-up, if in
alphabetical order; and one can write algorithms to compress this
special case better. For example, if the first N letters of a word are
the same as the first N of the previous word (including the 26 instances
of N=0), replace them by N encoded in base-36. So I think 500kB, if you
mean that, is an over-estimate. It's still a lot for an arbitrary Web
page; but not unreasonable even on dial-up if fetched on knowing demand
and cached.
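The shared-prefix idea can be sketched as below. This is only one way to do it, under my own assumptions: the names compressWords and expandWords are illustrative, the list must already be sorted, and N is capped at 35 so that it always fits in a single base-36 digit.

```javascript
// Replace the first N letters shared with the previous word by N in base 36.
function compressWords(sorted) {
  var out = [], prev = "";
  for (var i = 0; i < sorted.length; i++) {
    var word = sorted[i], n = 0;
    // Cap at 35 so that n is always exactly one base-36 digit.
    var max = Math.min(35, word.length, prev.length);
    while (n < max && word.charAt(n) === prev.charAt(n)) n++;
    out.push(n.toString(36) + word.slice(n));
    prev = word;
  }
  return out;
}

// Reverse the process: take N letters from the previous word, add the rest.
function expandWords(packed) {
  var out = [], prev = "";
  for (var i = 0; i < packed.length; i++) {
    var n = parseInt(packed[i].charAt(0), 36);
    prev = prev.slice(0, n) + packed[i].slice(1);
    out.push(prev);
  }
  return out;
}

var sample = ["cat", "catalog", "cater", "dog", "dogma"];
var packed = compressWords(sample); // ["0cat", "3alog", "3er", "0dog", "3ma"]
```

Here 19 letters shrink to 16 characters including the digits; the saving should grow with longer lists, where neighbouring words share longer prefixes.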
Any idea where to find a reliable 25,000 word list?
If you are prepared to consider the spell-checker in a word processor
reliable, then just grab large quantities of plain text off the Net
(Project Gutenberg should have largely correct spellings, as should the
reports of your legislature), sort, deduplicate, and edit in the word
processor. If you take only lower-case words, you'll miss most proper
names.
I don't know how many items DOS sort or JavaScript sort will do in a
reasonable time, but there's always overnight. Via sig line 3, DEDUPE
is a DOS file line-deduplicator.
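The extract, sort and deduplicate steps could also be done in script. A rough sketch, assuming the plain text is already in a string: buildWordList is an illustrative name, and the pattern only handles unaccented ASCII letters, so a word such as "don't" is split at the apostrophe.

```javascript
// Keep only words written entirely in lower case, drop duplicates, sort.
function buildWordList(text) {
  var tokens = text.split(/[^A-Za-z]+/); // split on anything that is not a letter
  var seen = {}, list = [];
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    // Capitalised tokens (mostly proper names and sentence openers) are skipped.
    if (/^[a-z]+$/.test(t) && !Object.prototype.hasOwnProperty.call(seen, t)) {
      seen[t] = 1;
      list.push(t);
    }
  }
  return list.sort();
}

var sample = "The cat sat on the mat. Alice saw the cat.";
var list = buildWordList(sample); // ["cat", "mat", "on", "sat", "saw", "the"]
```

The result would still need the pass through a spell-checker, since the source text may contain misspellings.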
Actually, you don't *need* a word list. Any spelling checker should be
able to be told that the word it is currently complaining about is in
fact good, and to remember that either in the current document or
permanently. Start with an empty list, and after a few paragraphs it'll
know the words you commonly use, with your preferred spelling. You'll
just need one Webster lookup for each new word that you're not certain
how to spell.
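The start-with-an-empty-list approach might look something like this. It is a sketch only: the names known, learn and unknownWords are hypothetical, and remembering the list permanently (rather than for the current page) is left out.

```javascript
// The list of accepted words starts empty.
var known = {};

// What an "accept this word" action would call.
function learn(word) {
  known[word.toLowerCase()] = true;
}

// Returns the distinct words in text that have not been learned yet.
function unknownWords(text) {
  var has = Object.prototype.hasOwnProperty;
  var tokens = text.toLowerCase().split(/[^a-z]+/);
  var seen = {}, out = [];
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    if (t && !has.call(known, t) && !has.call(seen, t)) {
      seen[t] = true;
      out.push(t);
    }
  }
  return out;
}

var first = unknownWords("the cat sat");  // every word is new
learn("the"); learn("cat"); learn("sat");
var second = unknownWords("the cat sat on the mat"); // only "on" and "mat"
```

After the three learn calls, only the words not yet accepted are queried.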