Re: Hash Function problem



CBFalconer wrote
(in article <4405BFB6.B3E7E877@xxxxxxxxx>):

Hugo wrote:

I need to generate a hash key for (at max) 50 millions different
entries (An entry is a 200 characters length string). Since this
key will be used as an unique id, I want as few collisions as
possible (ideally none).

For design purpose, I wish to keep my id on 8 bytes (64 bit) if
possible.

I request your help on this 2 questions:

- Do you think a hash function with a 64 bit output will be
enough to guarantee almost no collision on 50 millions entries ?

- Which hash function do you suggest me to use ?

I currently using MD5, truncating the output from 128 to 64 bit
which is quite a waste of CPU time.

Until now I get no collision on tiny test corpus (500k entries),
but I need some advice to get further.

You will always get collisions, unless you know all your data in
advance and generate a perfect hash for it. That is an impossible
task for this sort of volume. So you need an appropriate strategy
for dealing with collisions, and for minimizing them to an
acceptable degree.

Following up on that, you rarely need an ironclad guarantee of
no collisions, especially with that number of entries. What you
typically *do* need is a good hash function that doesn't do
weird things if you have a lot of similar (almost identical)
values, such as string data with a lot of commonality. Some
"generic" hash functions bite badly with that type of data.
There are lot of good functions around though, and CBF's package
makes it relatively easy to test several with a given data set
and see how it works out.

--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw





.



Relevant Pages

  • Re: Suggestions for double-hashing scheme
    ... chain style and reprobe style are basically a wash. ... will be a smaller chance of encountering deleted entries before it. ... Once you sufficiently optimize a hash table, ... by computing of the hash function). ...
    (comp.programming)
  • Re: Hash Function problem
    ... I need to generate a hash key for 50 millions different ... entries. ... key will be used as an unique id, I want as few collisions as ...
    (comp.programming)
  • Re: Suggestions for double-hashing scheme
    ... >> Once you sufficiently optimize a hash table, the big cost in hashing ... >> by computing of the hash function). ... any good hash function basically always has a minimal ... >> What you want is for each of these entries to have a different reprobe ...
    (comp.programming)
  • Re: Simple Algorithm (Perfect Hashing?)
    ... a hash algorithm does something similar but in a one-way fashion and ... Collisions are naturally not permitted! ... In fixed hash tables you use a table whose length is a prime number. ... the hash function looks like ...
    (comp.programming)
  • Re: Hashing
    ... A few of the texts indicate that certain hash ... > capacity to avoid hashing collisions within the application domain. ... an entry as "empty" because it might be along a rehash path for some ... for a creation of a "general hash function". ...
    (alt.lang.asm)