Thank you, Michael Jørgensen (Was: Crossword generator)
From: Mensanator (mensanator_at_aol.compost)
Date: 11/15/04
Date: 14 Nov 2004 23:25:15 GMT
For posting this link:
> I used a dictionary consisting of 115879 words based on the public domain
> portion of "The Project Gutenberg Etext of Webster's Unabridged Dictionary"
> downloaded from
> <http://www.translatum.gr/dictionaries/download-english.htm>
I went there and found there are two sets of files, the original Project
Gutenberg files and a set of html files. The originals, although simple
ASCII text files, are in some dictionary markup language unsuitable for
adding to my word list database. So I downloaded the html files.
I had to convert them by inserting delimiters so I could import them into
MS-Access. This turned out to be harder than it should have been
because these html files are riddled with errors, and a couple of corrupt
entries were giving my regular expression parser fits.
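
A minimal sketch of the delimiter step, not the program actually used here. It assumes each entry sits on one line in the shape "Word (pos.) definition", as in the entries quoted in this post, and that any html tags have already been stripped. The names ENTRY_RE, parse_entry and to_delimited, and the choice of a tab as delimiter, are all hypothetical.

```python
import re

# Assumed entry shape, inferred from the examples quoted in this post:
#   Word (pos.) definition text
ENTRY_RE = re.compile(
    r"^(?P<word>[^()]+?)\s+\((?P<pos>[^)]+)\)\s+(?P<definition>.+)$"
)

def parse_entry(line):
    """Return (word, pos, definition), or None if the line does not fit."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return m.group("word"), m.group("pos"), m.group("definition")

def to_delimited(line, delimiter="\t"):
    """Join the three fields with a delimiter, or return None for a bad line."""
    fields = parse_entry(line)
    return None if fields is None else delimiter.join(fields)
```

Lines that return None are the ones worth inspecting by hand. A definition fragment that happens to contain a parenthesized phrase would still match, so this check alone does not catch every corrupt record.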
So I changed my program to search for corrupt entries. The corruption
appeared to be definition fragments missing the word and part of speech
fields. I had it scan for lines beginning with a lower case letter since,
for some reason, every entry is in initial caps (making the proper nouns
indistinguishable).
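
The lowercase scan described above could look roughly like this. It is a sketch under the assumption that the converted file is read as one entry per line; the function name find_suspect_lines is hypothetical.

```python
def find_suspect_lines(lines):
    """Return (line_number, text) for non-empty lines starting with a lowercase letter."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Every well-formed entry is assumed to begin with a capital letter,
        # so a lowercase first character marks a likely fragment.
        if text and text[0].islower():
            suspects.append((number, text))
    return suspects
```

Line numbers are 1-based so they can be matched against what a text editor shows.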
In addition to the corrupt records, I found numerous cases where the
entries were not capitalized. But that's not all! In the "Q" file,
we find the following entry:
Oueen-post (n.) one of two suspending posts in a roof truss...see King-post
Corrupt records are one thing, but spelling errors? ... in a DICTIONARY?!
I only found it because the wrong letter was the first letter in the word.
My program didn't even notice that "Semious" wasn't a word. But a quick
look at the definition gives you a clue:
Semious (a.) Of or pertaining to the Sim/; monkeylike.
Monkeylike? And what's with the slash? Is that slashed word supposed to
be "Simian"? Yes, and the entry should be "Simious", not "Semious".
There are dozens of these slashes scattered throughout the html files,
apparently marking places where errors occurred (while scanning?). Many of
them mark places where there was no equivalent ASCII character, others
are inexplicable.
Many of them are easily correctable from context:
Stay ... to hinde/
Squirm ... to twist about briskly with contor/ions
Stuck-up ... /onceited; vain
Others are a little harder. I still haven't figured this one out:
Utilitarian ... /iming at utility as distinguished from beauty
Although the slashes are relatively easy to locate and correct,
the apparent lack of quality control makes this dictionary of dubious
value. I only found "Semious" because the definition had a slashed
word. How many misspelled entries don't have slashes?
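
Locating the slashed words can be sketched as below. This assumes the html tags have already been removed, since a closing tag would otherwise be reported as well, and it will also flag legitimate slashes such as "and/or". The names SLASHED_WORD_RE and find_slashed_words are hypothetical.

```python
import re

# A run of letters or hyphens on either side of a slash; either side may be empty.
SLASHED_WORD_RE = re.compile(r"[A-Za-z-]*/[A-Za-z-]*")

def find_slashed_words(text):
    """Return every word fragment in text that contains a slash."""
    return SLASHED_WORD_RE.findall(text)
```

Running it over each definition gives a short list to correct by hand. It says nothing about misspelled entries that carry no slash.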
So why am I thanking Michael Jørgensen for pointing me to this funky
word list?
To check the bad entries, I dug up an old Webster's Unabridged
Dictionary that belonged to my father. It was a two-volume set
published in 1937. While thumbing through it (looking up "Semious")
I found a slip of paper tucked inside.
It was a receipt for the rental of a post office box. It was stamped
Jan 1, 1940 and bore my grandfather's signature. I never knew my
grandfather; he died before I was born.
Thanks again, Michael. Finding that slip of paper really made my day.
-- Mensanator Ace of Clubs