Re: Q: Programming spelling checker
- From: "Arthur J. O'Dwyer" <ajonospam@xxxxxxxxxxxxxx>
- Date: Fri, 17 Mar 2006 16:23:52 -0500 (EST)
On Fri, 17 Mar 2006, Mok-Kong Shen wrote:
I like to write myself a program in a high-level language to
do spelling checking. This surely need not be comparable to any
well-established software in efficiciency but nonetheless do quite
the same algorithmically, including employing the same format of
organizing the words of the dictionary in terms of characters and
using the same method of retrieval from it and doing comparisons,
etc. etc. Could someone kindly give pointers to good relevant
literatures?
I can think of three elementary ways to do spellchecking, off the
top of my head. I'm sure I've posted about them before; check the
archives (Google "spell check" in comp.programming). I've /emphasized/
key terms below that you might want to pay attention to in your
searching.
(1) The clever heuristic method, as detailed in "Software Tools."
Take a big /corpus/ of text in your desired field (medical? Shakespearean?
office memos?), and run a short program over the corpus to tally up
how many times each /N-gram/ appears, for N=3 or maybe N=4. For example:
AAA 0 times
AAB 0 times
...
THE 41869 times
THF 1 time
THG 0 times
THH 0 times
THI 12105 times
THJ 0 times
...
ZZZ 0 times
Now, store that data in a static data file (or a hard-coded array), and
run a similar short program over your target text --- whatever you're
trying to spell-check. If you see an N-gram in the target text whose
frequency in the corpus is very low (for some definition of "very low"),
print out the surrounding context and let the user decide what word was meant. For example, if the user types "throguh" instead of "through,"
your system will detect that "guh" is not a frequent 3-gram, and print
out the offending line for the user to correct.
This method won't by itself suggest "through" as a replacement, though
you could conceivably add another heuristic based on /edit distance/
that would know to prefer "through" to "throuhg" or "throthe."
This method has the advantage that it works for any language, any corpus, with a small static data file and no intrinsic need to know
anything about English in particular. It also accepts words based on
how "English" they look; for example, "chortle" would be accepted even
if it wasn't in the corpus, but "xyzzy" wouldn't. This is good and bad.
The major advantage of this method, though, and the only one that
really matters, is that it requires only a very small static data file.
(2) The naive dictionary search. Run a short program over your corpus
to extract a list of all the words in the corpus. Store that humongous
dictionary in a static data file. Compare each word in the target text
to the dictionary; if it's not in the dictionary, print out the offending
line for the user to correct.
You can use /edit distance/ or /Soundex/ to produce a list of words
similar to the offending word that are present in the dictionary, and
rank them by similarity, very easily.
This method has the advantage that it is easy to train --- if it doesn't accept "chortle" the first time, you can add "chortle" to the dictionary so it will know better in the future. (But it still won't accept the obvious "chortles, chortled, chortling.") You also don't really need a
large corpus; you can download free dictionaries from the Internet, or
even bootstrap the dictionary from an empty file by adding all the words
from a known-accurate document.
This method has the disadvantage of still requiring some guesswork.
Will your dictionary include the word "foe," or should you suspect that
most people who type "foe" really intended to type "for" instead?
For storing the dictionary, look up the /trie/ data structure, as
Richard suggested. It removes a lot of the storage requirements of
such a large dictionary, by exploiting the fact that, e.g., "chortle"
and "chortling" share their first six letters.
(3) The heuristic language-based approach. Some dictionaries include
the word "paper," as in "to paper a wall," but not "repaper" or its conjugations ("repapers, repapered, repapering"). A smarter English
dictionary would recognize these patterns in word formation and accept
words obviously built up from root words in the dictionary plus affixes
such as "re-", "un-", "-ing", and "-able". This is probably very hard
to get right --- for example, "inducement" is a word but "producement"
isn't --- and I don't know of any papers in this area (but that's not
because there aren't any; it's because I don't read enough papers). ;)
This method would require a dictionary --- though not quite as big
as the pure dictionary-based approach --- and a lot of thought about
the way English forms words.
The best approach would probably combine a few of these ideas; possibly all of them. Of course, any decent spell-checker that was actually trying to help the user correct running English text would have to involve some amount of grammar-checking, too, to deal with all the people who mix up "their" and "they're," or "form" and "from." No pure spell-checking algorithm will catch those mistakes, and they're the ones that get most
on a lot of people's nerves, I think.
HTH,
-Arthur
.
- Follow-Ups:
- Re: Q: Programming spelling checker
- From: James Dow Allen
- Re: Q: Programming spelling checker
- From: [jongware]
- Re: Q: Programming spelling checker
- References:
- Q: Programming spelling checker
- From: Mok-Kong Shen
- Q: Programming spelling checker
- Prev by Date: Re: serialization class and endian
- Next by Date: Re: Q: Programming spelling checker
- Previous by thread: Re: Q: Programming spelling checker
- Next by thread: Re: Q: Programming spelling checker
- Index(es):
Relevant Pages
|
|