Re: Hash Function problem



Hugo wrote:
I need to generate a hash key for (at max) 50 millions different entries (An
entry is a 200 characters length string). Since this key will be used as an
unique id, I want as few collisions as possible (ideally none).

For design purpose, I wish to keep my id on 8 bytes (64 bit) if possible.

I request your help on this 2 questions:

- Do you think a hash function with a 64 bit output will be enough to
guarantee almost no collision on 50 millions entries ?

If its a good hash function, it needs to output to at least about 52
bits. So 64 bits is fine.

- Which hash function do you suggest me to use ?

For 64 bits? I am not aware of a really good one taylored for 64 bits,
but obviously you can use SHA-2 (or even MD5 or SHA-1 if you don't
actually want security.) You might be able to modify other functions
like RC4.

I currently using MD5, truncating the output from 128 to 64 bit which is
quite a waste of CPU time.

160 to 64 you mean.

Until now I get no collision on tiny test corpus (500k entries), but I need
some advice to get further.

The Birthday paradox tells you that you need about twice as many bits
as the bits in the quantity of your entries in order to make the
expected number of collisions in the whole system to be 1. I.e., one
collision in a universe of 50 million entries is not bad.

Thanks in advance,

PS: (Feel free to point me out a more appropriate newsgroup, if any)

Even though its not actually crypto related, the folks in sci.crypt can
set you straight on this.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.



Relevant Pages

  • Re: Hash Function problem
    ... I need to generate a hash key for 50 millions different ... entries. ... key will be used as an unique id, I want as few collisions as ... Do you think a hash function with a 64 bit output will be ...
    (comp.programming)
  • Re: Hash Function problem
    ... I need to generate a hash key for 50 millions different ... entries. ... key will be used as an unique id, I want as few collisions as ...
    (comp.programming)
  • Re: Cryptographic Hash function
    ... always produces a fixed length string as output. ... mean that the output string will always be longer than the ... The hash function generally does not care about the length ... So yes, collisions are bound to be a mathematical possibility, ...
    (sci.crypt)
  • Re: Re-rolled Salsa20 function
    ... it seems difficult to find collisions in the 64-byte string ... I'll note one can find collisions in this hash function in approximately ... 2^87 time and space by using the generalized birthday attack. ...
    (sci.crypt)
  • Re: Hash Function problem
    ... entry is a 200 characters length string). ... I want as few collisions as possible. ... the number of entries approaches the square-root of the key-space. ... FWIW, I would probably huffman encode the string, and then compute a CRC-64 of the result. ...
    (comp.programming)