Re: Hash Function problem
 From: websnarf@xxxxxxxxx
 Date: 1 Mar 2006 20:57:41 0800
Hugo wrote:
I need to generate a hash key for (at max) 50 millions different entries (An
entry is a 200 characters length string). Since this key will be used as an
unique id, I want as few collisions as possible (ideally none).
For design purpose, I wish to keep my id on 8 bytes (64 bit) if possible.
I request your help on this 2 questions:
 Do you think a hash function with a 64 bit output will be enough to
guarantee almost no collision on 50 millions entries ?
If its a good hash function, it needs to output to at least about 52
bits. So 64 bits is fine.
 Which hash function do you suggest me to use ?
For 64 bits? I am not aware of a really good one taylored for 64 bits,
but obviously you can use SHA2 (or even MD5 or SHA1 if you don't
actually want security.) You might be able to modify other functions
like RC4.
I currently using MD5, truncating the output from 128 to 64 bit which is
quite a waste of CPU time.
160 to 64 you mean.
Until now I get no collision on tiny test corpus (500k entries), but I need
some advice to get further.
The Birthday paradox tells you that you need about twice as many bits
as the bits in the quantity of your entries in order to make the
expected number of collisions in the whole system to be 1. I.e., one
collision in a universe of 50 million entries is not bad.
Thanks in advance,
PS: (Feel free to point me out a more appropriate newsgroup, if any)
Even though its not actually crypto related, the folks in sci.crypt can
set you straight on this.

Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
.
 FollowUps:
 Re: Hash Function problem
 From: websnarf
 Re: Hash Function problem
 References:
 Hash Function problem
 From: Hugo
 Hash Function problem
 Prev by Date: Re: Best copying stable sort algorithm?
 Next by Date: Re: Best copying stable sort algorithm?
 Previous by thread: Re: Hash Function problem
 Next by thread: Re: Hash Function problem
 Index(es):
Relevant Pages
