Re: Hash Function problem

Hugo wrote:
I need to generate a hash key for (at max) 50 millions different entries (An
entry is a 200 characters length string). Since this key will be used as an
unique id, I want as few collisions as possible (ideally none).

For design purpose, I wish to keep my id on 8 bytes (64 bit) if possible.

I request your help on this 2 questions:

- Do you think a hash function with a 64 bit output will be enough to
guarantee almost no collision on 50 millions entries ?

If its a good hash function, it needs to output to at least about 52
bits. So 64 bits is fine.

- Which hash function do you suggest me to use ?

For 64 bits? I am not aware of a really good one taylored for 64 bits,
but obviously you can use SHA-2 (or even MD5 or SHA-1 if you don't
actually want security.) You might be able to modify other functions
like RC4.

I currently using MD5, truncating the output from 128 to 64 bit which is
quite a waste of CPU time.

160 to 64 you mean.

Until now I get no collision on tiny test corpus (500k entries), but I need
some advice to get further.

The Birthday paradox tells you that you need about twice as many bits
as the bits in the quantity of your entries in order to make the
expected number of collisions in the whole system to be 1. I.e., one
collision in a universe of 50 million entries is not bad.

Thanks in advance,

PS: (Feel free to point me out a more appropriate newsgroup, if any)

Even though its not actually crypto related, the folks in sci.crypt can
set you straight on this.

Paul Hsieh