Re: fast dictionary search algorithm

From: Gerry Quinn (gerryq_at_indigo.ie)
Date: 02/09/04


Date: Mon, 09 Feb 2004 12:20:26 GMT

In article <l5JVb.23739$P32.9918@news-binary.blueyonder.co.uk>, xbunny <xbunny@eidosnet.co.uk> wrote:
>Prashanth Ellina wrote:
>
>> Hi,
>> I have a text file with words and meanings. The record count is
>> approximately 100,000. I need to develop an algorithm that will find a
>> requested word's meaning in the shortest time possible. The algorithm
>> may be a combination of existing search algorithms or an
>> improvisation.
>>
>> I have considered hashing and multiple level hashing as an option
>> but am not sure about the hash function. People tell me it is
>> difficult to avoid collisions. Have tried binary search, however need
>> something faster.
>>
>> any ideas?
>
>Im no expert but I would have thought a binary search for a hash from
>the source word would have been amply fast enough for a 100,000 word
>dictionary. So long as your compare function is simple then it should
>fly. I would try dealing with collisions the simplest way ie (chained
>buckets in the hash table) and see its good enough but you can always
>try double hashing if you like or something more estoric I suppose.
>Theres are lots of string hashing functions on the net, quick google
>reveals some really fancy ones!

If the words come as ASCII chars and have 26 letters plus hyphen and
null character, I suppose one might in principle use a cunning sequence
of arrays based on converting one or more letters into an integer, then
using it as an array offset. The offset would contain a code which
represents either a pointer to another array which takes an offset based
on the next few letters, or to a found meaning, or to one or more more
compact search methods that are used when multiplying further arrays
would use up too much memory.

Say we have "vertiginous":
Make an integer offset from "ver"
Now there are maybe 500 words (guessing)
Code says take the next three letters and offset from a second array
that has entries for "teb" and "tig" among others
Calling second array with "tig" offset
Only five words start with "vertig", code points to an array of strings
in which a binary search is conducted by text comparison on the
remainder of the string ("inous\0").

That might be faster than a hash and 17 integer comparisons, followed by
collision testing.

Of course the program that sets up the tables would have to be quite
sophisticated in order to optimise memory use.

Gerry Quinn

-- 
http://bindweed.com 
Screensavers, Games, Kaleidoscopes
Download free trial versions


Relevant Pages

  • Re: Offset code for dynamic row# & multiple columns in LINEST func
    ... Hey Biff, ... Can the same be done to columns, meaning if the row is always 2 but the ... array coming accross the columns will vary? ... How would I offset the following. ...
    (microsoft.public.excel.worksheet.functions)
  • Re: What is the best way to pull out a range of values?
    ... Then you can use binary search to find the first element above the lower ... That's exactly what binary search does, in log n time for lists of ... I don't know for sure if the "data" array is static or it is being ... int chkndx = pos; ...
    (comp.lang.perl.misc)
  • Re: Buggy BinarySearch implementation, real life and a curious soul...
    ... typical daily array size? ... So either roll your own binary search function, ... So my question's audience are people who had both CL and Java ... the Modula-2 language specifies that overflow errors are detected: ...
    (comp.lang.lisp)
  • Re: Array descriptors
    ... There seem to be two ways of implementing the array descriptors ("dope ... where the base_address plus the offset defines the first memory ... Sun Fortran uses a third option. ... origin, your base address, and a virtual origin, the address ...
    (comp.lang.fortran)
  • Re: Offset/match returns #value error
    ... Yes the formula in the original post works fine using INDEX instead of ... OFFSET. ... The other formulas come from elsewhere in the workbook and were ... What I was doing in my last post was building an array using INDEX as follows: ...
    (microsoft.public.excel.worksheet.functions)

Loading