Re: Some comments on "super fast hash"
websnarf_at_gmail.com
Date: 03/17/05
- Next message: Randy: "Re: job hunting etiquette: tell headhunter about others?"
- Previous message: Pasacco: "Re: assembly,subroutine call vs conditional branch"
- In reply to: Tim Rentsch: "Some comments on "super fast hash""
- Next in thread: Tim Rentsch: "Re: Some comments on "super fast hash""
- Reply: Tim Rentsch: "Re: Some comments on "super fast hash""
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 17 Mar 2005 11:12:47 -0800
Tim Rentsch wrote:
> websnarf@gmail.com writes:
> > [snip] the conversation from where it started over
> > in sci.crypt. I've implemented a hash function here:
> >
> > http://www.pobox.com/~qed/hash.html
> >
> > Its very generic C code, that almost certain compiles on just
> > about any C compiler in existence. Its also has very good
> > properties, like complete bit avalanching and superior
> > performance. The problem is that its internal mechanism does not
> > work correctly unless it can use integers that are *exactly* 32
> > bits. Not less, and not more. The C standard before C99 doesn't
> > provide for this, so I just did whatever makes sense for the vast
> > majority of platforms out there.
>
> Forgive me, I'm not exactly responding to the comments above, but I
> wanted to give some comments on the "super fast hash" discussion
> that's appeared in several newsgroups (but now seems to have
> disappeared off my news server in all groups but this one).
>
> First comment is, SFH seems reasonably good and certainly is fast.
>
> It's nice that the author took the trouble to make the comparison
> test available. The performance figures (such as those shown on
> the web page indicated above) correspond to what I got when I ran
> the tests locally. Despite that however I think the figures are
> misleading, because there are several ways that they are an "apples
> and oranges" type of comparison:
>
> 1. The platform specific code of SHF is compared against
> platform non-specific code of everything else. A better
> test would have been to disable the platform specific code
> in SFH (basically, use the generic definition of 'getshort').
My intention was to present the code as it should be used in the real
world. Leveraging the x86's ability to deal with unaligned memory
accesses is something that should be there whenever possible. So the
right answer is to leverage this feature for Bob Jenkins (which my
latest update does) not to remove it for SFH.
The purpose is not to be an x86 versus RISC benchmark. Nevertheless
there are PPC numbers shown which definately uses the generic path
(and this is where SFH has the largest advantage).
> 2. Several of the hash functions have the capability to "restart"
> a hash, and SFH does not. These functions have an extra
> layer of function call around them, which obviously hurts
> their performance.
No it does not. The test hashes relatively long sequences, so the
only thing being measured is inner loop performance.
> [...] A better test would be to either wrap
> the SFH hash in an extra function call to have a similar
> capability, or modify the other hash functions so they
> don't need the extra function call layer - again, in the
> interest of a more apples and apples comparison.
This would do nothing to the performance, and ignores the fact that
there are other hash functions in the test which have the same
"problem".
> 3. Apparently no effort was made to apply the same platform
> specific code techniques to other hash functions that
> are used in SFH. In the Bob Jenkins hash, for example,
> the main loop
>
> while( len >= 12 ){
> a += (k[0] + ((ub4) k[1] << 8) + ((ub4) k[2] << 16) + ((ub4)
k[3] << 24));
> b += (k[4] + ((ub4) k[5] << 8) + ((ub4) k[6] << 16) + ((ub4)
k[7] << 24));
> c += (k[8] + ((ub4) k[9] << 8) + ((ub4) k[ 10 ] << 16) +
((ub4) k[ 11 ] << 24));
> mix( a, b, c );
> k += 12; len -= 12;
> }
>
> could be written (on a platform specific basis) as
>
> while( len >= 12 ){
> a += *(ub4*) &k[0];
> b += *(ub4*) &k[4];
> c += *(ub4*) &k[8];
> mix( a, b, c );
> k += 12; len -= 12;
> }
I was in email contact with Bob Jenkins. He tried precisely this same
change, and claimed he saw a performance *DROP* (on a Pentium), so my
omitting this seemed justified. I've since updated it with the
correct platform specific acceleration macro, and indeed the
performance of Bob Jenkin's hash goes up (on an Athlon).
> which makes the speed of Bob Jenkins hash* very nearly the
> same as SFH (specifically, SFH runs about 15% faster).
> {*: unwrapped as mentioned in (2)}
The latest versions of each hash function which leverages this
platform specific unaligned access trick, as well as some slight
pipeline improvements for SFH, shows that the performance difference
is between 23% and 66%.
> [Note: C experts will see that these two loops are not quite
> identical in their effect, even on "little endian" machines. The
> quality of the hash function is not affected by the difference as
> far as I can tell.]
The loop is semantically portable in my latest version of the code.
> There wasn't much said about the quality of the hash function,
> except for some not-very-detailed comments (as I remember them,
> anyway) about collisions.
SFH was designed/tested to have good "bit avalanching" and "bit
independence" as well as "injective" with small changes. Good
collision properties are a side effect of this.
> [...] It may seem paradoxical, but an experiment reporting
> fewer collisions doesn't necessarily mean a hash function is better;
> a good hash function is not one that behaves spectacularly well on
> some key sets (assuming of course that it's intended to be used on
> arbitrary sets of keys), but one that never behaves poorly on any
> key set.
Well, I would only add that it should be any *realistic* key set. Of
course, since we all output no more than 32 bits, we all fall to
pigeon hole patterns sooner or later. So what matters is just what
the nature of these patterns are, and how easily they manifest
themselves.
> [...] If a hash function behaves better than "random numbers" on
> some key set, it must behave worse on other key sets.
But its like compression -- you prefer to perform well on data that
shows up in the real world, at a very slight cost to unusual data.
> [...] The closer a hash function is to "random" behavior, the better
> we should expect it to behave.
Well, yes, but most of this "better behavior" will be for data we
never see or encounter.
> Example: if we hash 1000 "random" keys into a 1000 slot array
> (assuming bucket chaining), how many buckets should we expect? The
> answer is approximately 632 (or 63.2% of 1000). Ideally we would
> get about this many no matter which sets of 1000 keys we tried,
> assuming of course that the keys weren't chosen deliberately to
> cluster. Experiments with FNV hash and one-at-a-time hash show this
> behavior, as long as the table size isn't too large; when the
> table size gets about about 2**24, though, they don't do as well -
> the upper bits of these functions aren't as random as they should
> be.
>
> The Bob Jenkins hash does very well in this regard - very consistent
> at 60-64.5% across a broad range of table sizes (up to 2**32).
>
> The SFH also does well, except above 2**29 the rate starts going up
> - 68% at 2**30, 74% at 2**31, 97% at 2**32. The high order bits
> aren't as random as they should be. In case someone thinks this is
> a good thing, remember that the identity function hits every
> bucket, at 100%, but it's a very poor hash function.
Ok, this is a major misunderstanding of what is desired from a good
hash function. One of the key steps in and hash table (even if some
people are oblivious to it) is that key check prior to entry
duplication testing. This means that you want the probability of two
full keys matching to be absolutely minimzed, while lower bit
extractions should have a more random distribution since they will be
used for table indexes.
So in fact SFH is extremely well suited in this regard -- the lower
bits have the nice distribution that spreads the hash indexes as you
expect, but when using all 32 bits, the value is essentially
injective with up to 32 bits of input.
Another point is that the identity function actually is an *excellent*
hash function so long as the input size is lower than the size of the
hash output. The reason, quite simply, is that it defeats the
Birthday Paradox (which is an asymptotic weakness for all hash
functions) for small inputs. So the point is that SFH behaves
basically like a permuter (which is the same as the identity function
for primary hashing) for inputs of less than 32 bits. But SFH is
also appropriate for secondary hash bits, and precheck verifies.
You should find this also to be the case for the "One-at-a-time" and
CRC32 hashes. I also believe that all these hash functions can be
"fixed" to have the property you desire by adding the (up to) last 4
bytes as if it were a single 32 bit number to the hash value itself (I
haven't checked this, it just seems intuitive to me that this is what
would happen). That ought to remove the permutation property, and give
it mere random properties.
> Another test of hash function quality is how do output bits depend
> on input bits. The ideal is, if an input bit is changed, each
> output bit changes independently with 50% probability. Running
> this test on the one-at-a-time hash shows extremely good behavior
> on the low order 16 output bits - if any input bit is changed, each
> of the low order 16 output bits changes with probability 50% with
> 0.1% or so tolerance. Output bits 17-24 aren't quite as good - here
> the tolerance is about +/- 1% - and bits 25-32 are pretty bad -
> median tolerance is +/- 1%, and roughly 1/5 of all
> input-bit/output-bit combinations off by 10-26%. And that matches
> what we say with the previous test.
>
> By contrast, the Bob Jenkins hash is more well behaved on all bits.
> There is one outlier (out of 32*32 choices) at 15% tolerance; 8 in
> the 5-9% range; 46 in the 1-5% range; and the remaining 969
> tolerances all below 1%.
>
> The good bit combinations of SFH are comparable to those of the Bob
> Jenkins hash; the problem is there are more bad bit combinations in
> SFH. 95 in the 1-5% range; 74 in the 5-10% range; and 10 spread
> approximately evenly in the 10-37% range.
Ok first of all there is something wrong with your test. *BY DESIGN*
SFH never deviates outside of 5/12 and 7/12. Which is +-8.3%. So if
you saw anything in the 10-37% range, then you've made a mistake.
Secondly, this is a high bar for hash quality, and is generally not
relevant for practical usage. Literally, what this measurement means
is that SFH at its very worst will end up shifting the spread of
single bit input differences with a slight eccentricity of 8% more
into one half of a hash table than another. But since there can only
be at most two input data that differ by only one bit, and exactly
which half of the table is still primarily dictated by the other
bits, this just isn't a big deal.
By keeping the probability reasonably close to 50% (within 8% at
worst, and well under 1% for most) and more importantly making sure
that each input bit affected the output bits *independently*,
provides the real desired feature, which is to have any subset of
bits to evenly distribute all possible inputs to all possible outputs.
Finally, if the is that big a deal you can add the following line at
the end:
hash += (hash << 16) | (hash >> 16);
and the probabilities all end up being between .485 and .515.
> For a general hash function (where there is no special knowledge
> of the input key sets or hash table moduli), the Bob Jenkins
> hash looks better to me than SFH; the better quality of the
> hashing seems well worth a potential 10-20% performance penalty.
Ok, firstly, the performance penality is 23%-66%, and the quality
differences you've seen are first not accurately measured, and/or
misunderstood. This is what you want from a hash function, in terms
of collisions:
1. Collision patterns should not be simplistic or formulaic. I.e.,
some effort should need to be expended by an attacker to launch a
very carefully constructed attack sequence to force the hash
function to collide.
2. Minimize collisions on real world data. This means that birthday
paradox attacks with real world data should not be more successful
than, say, random data.
3. Hash functions should be appropriate for use as both a primary and
secondary hash function by simply splitting the bits. (Otherwise a
differently seeded rerun of the hash function is necessary, or
using a different hash function.)
4. Hash functions should function as well as possible for precheck
conditions. That is to say, if input1 != input2, then the
probability of fullhash(input1) == fullhash(intput2) is at most
1/(output range).
SFH, One-at-a-time, and Bob Jenkins do very well on these conditions,
while FNV, and CRC32 don't do quite as well, and the *31 and *37 are
basically in the same category as a straight checksum.
The conditions you are concentrating on are interesting things that
match a certain expectation you have about the you think hash
functions should work. But they don't correspond directly to anything
useful when used in practice. And the conditions you want are not
difficult to achieve with slight modifications to SFH that do not
affect its performance. They more correspond to crypto-hash
requirements (but are not sufficient) which none of these hash
functions are appropriate as anyhow.
-- Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/
- Next message: Randy: "Re: job hunting etiquette: tell headhunter about others?"
- Previous message: Pasacco: "Re: assembly,subroutine call vs conditional branch"
- In reply to: Tim Rentsch: "Some comments on "super fast hash""
- Next in thread: Tim Rentsch: "Re: Some comments on "super fast hash""
- Reply: Tim Rentsch: "Re: Some comments on "super fast hash""
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|