Re: Hash
- From: rem642b@xxxxxxxxx (Robert Maas, see http://tinyurl.com/uh3t)
- Date: Fri, 29 Dec 2006 12:34:29 -0800
From: Jon Harrop <j...@xxxxxxxxxxxxxxxxx>
Hash functions (in my experience) typically terminate after a
finite number of atomic operations, so the complexity is not
related to the size of the key but, rather, just to the bounded
length of this computation, i.e. O(m).
You have to be very careful about such hash functions. For example,
suppose m=8, and suppose you've collected a database containing all
Jeopardy (TV show) clues and corresponding responses, and you want
to build a hash table that maps responses to clues. For example:
Category: Astronomy
Clue: It has four large moons which were discovered by Galileo.
Response: What is Jupiter?
Nearly all such responses would have the first 8 characters
identical (a few would start with "Who is " instead), so would hash
to the same bucket.
You'd have a similar problem hashing names of organic chemicals,
many of which start identically:
1,3 methyl 2 hydroxy toluene
1,3 methyl 2 hydroxy benzene
1,3 methyl 2 hydroxy glucose
1,3 methyl 2 hydroxy whatever
Deeper match, so even hashing first 21 characters gives
worst-possible hash within each such group, although not as many
collisions in each group because groups are smaller to begin with.
One idea to avoid being bitten by such prefix-collisions, yet still
not have to hash the whole key if it's rather long, is to fix the
total number of characters included in the hash computation (for
example always include 8 characters, except for strings shorter
than 8 characters), but to distribute them per uniform fraction
down the string:
1,3 methyl 2 hydroxy toluene
1,3 methyl 2 hydroxy benzene
1,3 methyl 2 hydroxy glucose
* * * * * * * *
= = = = = = / / (1-2 chars vary among members of the small group)
1,3 methyl iso-mumbleic acid
= = = / / / / / (5 chars different from sister group)
1,3 methyl 2 hydroxy very long name here
* * * * * * * *
/ / = / / / / / (totally different hash from above,
except if you're just adding the characters together rather than
computing CRC then 1 character matches so only 7 different from 1st g.)
By the way, I thought of this new algorithm just now in reading
this thread. I've never see it mentionned before. Is this a new
invention of mine, or did somebody else think of it before me?
.
- References:
- Prev by Date: Re: Shuffling a linked list
- Next by Date: Re: Shuffling a linked list
- Previous by thread: Re: Hash
- Next by thread: Re: Hash
- Index(es):
Relevant Pages
|
|