Thank you, Michael Jørgensen (Was: Crossword generator)

From: Mensanator (mensanator_at_aol.compost)
Date: 11/15/04


Date: 14 Nov 2004 23:25:15 GMT

For posting this link:

> I used a dictionary consisting of 115879 words based on the public domain
> portion of "The Project Gutenberg Etext of Webster's Unabridged Dictionary"
> downloaded from
> <http://www.translatum.gr/dictionaries/download-english.htm>

I went there and found there are two sets of files, the original Project
Gutenberg files and a set of html files. The originals, although simple
ASCII text files, are in some dictionary markup language unsuitable for
adding to my word list database. So I downloaded the html files.

I had to convert them by inserting delimiters so I could import into
MS-Access. This turned out to be harder than it should have been
because these html files are riddled with errors and a couple corrupt
entries were giving my regular expression parser fits.
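
For anyone attempting the same conversion, here is a minimal Python sketch of the delimiter step. It is not the program described above. It assumes the html tags have already been stripped and that each entry has the plain shape "Word (pos.) definition", as in the entries quoted in this post. The names ENTRY_RE and to_delimited are hypothetical.

```python
import re

# Assumed entry shape after tag stripping: "Word (pos.) definition".
ENTRY_RE = re.compile(
    r"^(?P<word>[^()]+?)\s+\((?P<pos>[^()]*)\)\s+(?P<definition>.+)$"
)

def to_delimited(line, sep="\t"):
    """Return the entry as sep-delimited fields, or None if it does not parse."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        # No word / part-of-speech prefix: likely a corrupt fragment.
        return None
    return sep.join(match.group("word", "pos", "definition"))
```

A line that comes back as None is a candidate for manual inspection rather than import.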

So I changed my program to search for corrupt entries. The corruption
appeared to be definition fragments missing the word and part of speech
fields. I had it scan for lines beginning with a lower case letter since,
for some reason, every entry is in initial caps (making the proper nouns
indistinguishable).
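
A scan of that kind might look roughly like the following Python sketch. It assumes one entry per line of already-extracted text; find_suspect_lines is a hypothetical name, not the original program.

```python
def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines starting with a lower-case letter."""
    suspects = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        # Every good entry is in initial caps, so a lower-case first
        # character suggests a fragment or an uncapitalized entry.
        if text and text[0].islower():
            suspects.append((number, text))
    return suspects
```

Note that this flags uncapitalized but otherwise valid entries along with the true fragments, so the results still need a look by eye.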

In addition to the corrupt records, I found numerous cases where the
entries were not capitalized. But that's not all! In the "Q" file,
we find the following entry:

Oueen-post (n.) one of two suspending posts in a roof truss...see King-post

Corrupt records are one thing, but spelling errors? ... in a DICTIONARY?!

I only found it because the wrong letter was the first letter in the word.
My program didn't even notice that "Semious" wasn't a word. But a quick
look at the definition gives you a clue:

Semious (a.) Of or pertaining to the Sim/; monkeylike.

Monkeylike? And what's with the slash? Is that slashed word supposed to
be "Simian"? Yes, and the entry should be "Simious", not "Semious".
There are dozens of these slashes scattered throughout the html files
apparently marking places where errors occurred (while scanning?). Many of
them mark places where there was no equivalent ASCII character, others
are inexplicable.

Many of them are easily correctable from context:

Stay ... to hinde/
Squirm ... to twist about briskly with contor/ions
Stuck-up ... /onceited; vain

Others are a little harder. I still haven't figured this one out:

Utilitarian ... /iming at utility as distinguished from beauty

Although the slashes are relatively easy to locate and correct,
the apparent lack of quality control makes this dictionary of dubious
value. I only found "Semious" because the definition had a slashed
word. How many misspelled entries don't have slashes?
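
Locating the slashes can be sketched in a few lines of Python. This assumes the html tags have been stripped first, since closing tags contain slashes too, and find_slashed_tokens is a hypothetical name. It will also flag any legitimate slash in a definition, so it only produces a list to review.

```python
import re

# Any whitespace-delimited token that contains at least one slash.
SLASHED_RE = re.compile(r"\S*/\S*")

def find_slashed_tokens(text):
    """Return every token in text that contains a slash."""
    return SLASHED_RE.findall(text)
```

It does nothing for the harder problem of misspelled entries with no slash; those would need a comparison against a second word list.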

So why am I thanking Michael Jørgensen for pointing me to this funky
word list?

To check the bad entries, I dug up an old Webster's Unabridged
Dictionary that belonged to my father. It was a two-volume set
published in 1937. While thumbing through it (looking up "Semious")
I found a slip of paper tucked inside.

It was a receipt for the rental of a post office box. It was stamped
Jan 1, 1940 and bore my grandfather's signature. I never knew my
grandfather; he died before I was born.

Thanks again Michael, finding that slip of paper really made my day.

-- 
Mensanator
Ace of Clubs

