Re: About Hsieh's hash: initial value? 64 bit?



schoedl@xxxxxxxxx wrote:
are there any news out there about which value to use as initial one
when implementing Paul Hsieh's hash for incremental updates?

You can initialize it with *any* value, however a non-zero one usually
has slightly better start up properties (if you start with 0, and you
prefix the data with any number of 4-byte blocks of 0, then you will
"0-sink" and will collide on any common tails after any number of such
4 byte blocks of 0s). If you know the length in bytes of the whole
thing a priori, then using the length itself is a good choice because
the 0-sinking scenarios will decrease with different lengthed objects.

[...] Also, has anyone come up with a 64-bit version?

If anyone has, let me know. :) One thing you can do to create a
64-bit hash function is to use two different 32-bit hash functions and
simply concatenate the two functions. Bob Jenkins' function and mine
are certainly independent, so you could do that.

Unfortunately, my function very densely uses all the functional units
of an Athlon CPU, and therefore trying to loop jam my function with,
say, Bob Jenkins' one-at-a-time function will not actually lead to
improved performance versus just running one after the other -- however
I have not actually tested this (I could be wrong.)

[...] I want to use the hash as a
checksum (rather than a hash-table index) and want to avoid collisions
as much as I can.

Well if you just want to use it as a checksum, it is not optimal. In
particular look at the final "tempering code". It is totally
irrelevant to the purposes of being a checksum -- it further mixes a
static 32 bit value, to increase the avalanching, but not increasing
the input data content. This is clearly not needed for a checksum.
OTOH, experimentally, I have *seen* that my hash function has
checksum-like properties on ordinary text (I have not found a pair of
english words in ASCII which will hash to the same value with my
function).

CRCs of course, have a much longer history as checksums. My function
was only "discovered" two years ago. The 0-sinking issue in my
function is kind of obvious. On the other hand we *know* that CRCs
have an analogue to the 0-sink issue; its just harder to find. The
main advantage of using my function, of course, is that its much faster
(so is Bob Jenkins' function, BTW.) So you can try to use it
experimentally to see how it goes.

If you expect to be actively attacked on the checksum (meaning that
input might be carefully constructed to create false positives)
actually neither CRCs nor my function are satisfactory, and you should
use something like SHA-2.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.



Relevant Pages

  • Re: Any known fast hash algorithms ?
    ... different from hash and checksum has allways been a little bit abstract. ... Looking at your reply I would say that you call a function a hash when it ... that was the thinking behind the original LFSR based CRCs. ... CRCs are both weaker in terms of distribution and slower. ...
    (borland.public.delphi.language.basm)
  • Re: Really fast checksum?
    ... Using JDBC, I select records from a table, roll through them, calculate a checksum on the text of all of the fields, and then check it against a stored checksum to see if the record has changed. ... In the past I've used CRC32, but that only operates on bytes and I'd rather not convert all the strings I'll get from JDBC into bytes. ... No hash function, of any kind, is capable of giving you that guarantee. ...
    (comp.lang.java.programmer)
  • Re: Byte to byte compare, duplicate file finder/killer
    ... cryptographically strong hash, simply because the CRC ... By using a secure hash instead of CRC, the actual byte-to-byte compare ... checksum (files with identical CRCs or checksums are trivial to ...
    (comp.programming)
  • Re: application whitelisting (was RE: Active Directory Question)
    ... granular control of what user can and can not do. ... > hooks kernel calls to load a process, then compare a checksum ... An md5-like hash of the application is saved ... If you have a malicious program ...
    (Focus-Microsoft)
  • Re: Any known fast hash algorithms ?
    ... What I meant with that the principle is not that different from CRC, ... different from hash and checksum has allways been a little bit abstract. ... Looking at your reply I would say that you call a function a hash when it ... still end up with the best possible distribution of those bits. ...
    (borland.public.delphi.language.basm)