Re: type of hash functions
- From: websnarf@xxxxxxxxx
- Date: 21 Mar 2006 14:44:26 -0800
Adrian Wong wrote:
I am looking for two hash functions on strings. For different strings
that collide on one hash function it is guaranteed not to collide with
the other hash function. Do such complementary functions exist?
Not in general. Alex Fraser's post explains it quite succinctly. Its
a pigeon hole thing (you can't ram all the patterns from an arbitrary
number of bits into a fixed number of bit positions without any
repetition of the destination pattern).
That said, certainly you can do that in one hash function in a
probabilistic sense of "guarantee" with a hash function like SHA-2.
This is usually an acceptable candidate because SHA-2 is a strong
enough hash that there is no known collision that has ever been
computed on it, and it is purported that none will be found through any
exercise of computation for the next 100 years or so (there is reason
to believe that this may actually be false due to recent discoveries
about SHA-0 and SHA-1 (which are related functions) in the last couple
years, however its not clear, since SHA-2 has a lot more margin of
safety).
Your follow-up post seems to suggest that you are interested in speed
of hashing. For practical hash functions the two main requirements are
that 1) the hash function itself be as fast as possible, and 2) the
hash function minimizes the probability of collision as much as
possible. The point being that you don't actually avoid the call to
strcmp(), but rather reduce the probability that you *need* to as much
as possible.
There are two main "practical" hash functions that you should consider
for this: Bob Jenkin's Hash function, and mine. You can find therm
here:
http://burtleburtle.net/bob/hash/doobs.html
http://www.azillionmonkeys.com/qed/hash.html
Both functions are demonstrably much faster than alternatives, as well
as having much higher quality hashing properties. Bob's webpage goes
through sufficient analysis of other functions, that they should
generally be dismissed for *both* being too slow and of poor quality.
The main problem with SHA-2 kind of functions is that they are somewhat
slower and output way more than 32 bits as their hash, which can make
the mere process of storing and comparing hashes not worth the effort.
Bob Jenkin's function is slightly better in terms of collisions because
of a near 50% bit flipping property (my function achieves 45%-55% in
the worst case, but is near 50% for most input->output instance
correlations), and because it would be fairly hard to analytically
attack his function (but not impossible). My function is faster, and
in most practical conditions (where you don't expect to be deliberately
attacked) will behave as well Bob's function.
This can still be a problem if the strings you are comparing are very
commonly equal. In many cases comparing the container handles (the
pointer to the string) first can help (since many libraries omit this
acceleration). However, if you, in the end, are just forced to perform
string comparisons on mostly (or at least with a very significant
probability) equal strings, then you are going to want to look at the
*representation* of your underlying string.
Ordinary representations of C strings are problematic, because they
omit the length. If you knew the length before hand, then you could
use memcmp(), instead of strcmp() (its usually a lot faster to do this,
because the system can perform block based comparisons.) There are
platform specific ways of trying to implement strcmp() with block-like
performance, however they make very rare appearances in real world
compiler libraries (Solaris' compiler is an exception) are not portable
and doesn't quite equal the performance of memcmp(). More practically
what you can do is store your strings as <int length, char content[]>
tuples, and this will deliver you, given that you are forced to perform
some kind of IsEqual() function on strings that are probably equal, the
best performance alternative.
The punchline here, as regular readers here can see coming a mile away
by now, is that you can use "The Better String Library" to make this
string representation a no-brainer. You will find the URL to this at
the end of the post. The point being that you can use this as a
*complete* string library that just generally outperforms char * string
functions, and which happens to be very appropriate for the hashing
problem with high probability of equal strings.
--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
.
- References:
- type of hash functions
- From: Adrian Wong
- type of hash functions
- Prev by Date: Re: How to find the maximum sum of atleast L consecutive integersgiven a sequence of N integers.
- Next by Date: Re: How to find the maximum sum of atleast L consecutive integersgiven a sequence of N integers.
- Previous by thread: Re: type of hash functions
- Next by thread: Connecting C To a Sql Database
- Index(es):
Relevant Pages
|
Loading