Re: Suggestions for double-hashing scheme



Hi Clint. Still fighting the good fight in the different thinking
universe? :)

Clint Olsen wrote:
> I'm interested in coding up a double-hash implementation (rather than
> chaining). Why? Well, because it is an interesting problem, and I'm
> trying to see if an implmentation can work just as good as chaining to
> handle collisions.

Yes, its quite a doable problem. The considerations are actually quite
interesting. My own analysis indicates that, when optimally
implemented, chain style and reprobe style are basically a wash. The
reprobe style has the slight advantage in that it can automatically
self-scale to the size of the proble, whereas the chain style has to be
designed with an expectation about total data size up front.

> From the usual textbooks and papers, the issue with double hashing is
> handling deletion of elements. Deleting an element means we still have to
> mark that slot as a possible path on a probe sequence to another element -
> one that could have been bumped because this element is already in the
> table.

Well, this is a consequence of quadratic or other forms of non-constant
reprobe hashing.

> [...] This works ok, but then you have the issue of a series of deletes
> causing unusually long probe sequences to determine the non-existence of a
> particular key. Some possible solutions were an occasional rehash or
> re-insert of all the items in a fresh new table or even to move bumped
> elements into that vacant slot.
>
> Both of these mean that delete becomes more expensive. The latter idea
> sounds interesting, but I don't know how one could easily track bumped
> elements without visiting every item in the hash table to see if it was
> bumped and it was actually bumped from this slot.

If you are concerned with deleted entry overhead, one thing you can do
is improve "early match" characteristics by performing "sort to front"
upon any successful scan. That is to say, after you pay the price of
scanning a reprobe list with at least one included delete, what you do
is you swap the found entry with the earliest deleted entry in the
table. The idea is that the next time you search for that entry there
will be a smaller chance of encountering deleted entries before it.

But generally the best approach is to keep the reprobe lengths very
low. The theory says that once your hash table > 70% full (including
entries marked "delete") you should rebuild it with a larger size
(obviously you don't need to copy over "deleted" entries when you do
this.) Doubling the size is one sensible approach, but given the high
overhead of rebuilding, you might consider quadrupling (reducing the
rebuild overhead from 100% to 33%). You might also add other criteria
for rehashing, such as if the table is, say > 35%, but one of the
search chains ended up being longer than some predefined extreme length
(or something like that.)

Once you sufficiently optimize a hash table, the big cost in hashing
turns out to be the cost of *COMPARING* the entries (*copying* entries
(if you aren't using just refs) is the next secondary effect, followed
by computing of the hash function). So don't obsess too much over
"clever" techniques for reducing delete impact. You should think of
"Delete" as adding design complications, not performance problems.

Anyhow, coming back to what *IS* important. Since comparing is where
the real bottleneck is, you have to do anything possible to reduce this
impact. How can you do this? First of all, the "swap to front"
heuristic can help, but more importantly you need to do hash-value
*PRE-SCREENING*. That is to say, entries should retain their full
hash-value along with them in the table, and as you scan, you first
check that the hashes match (between the one in the table and the entry
you are searching for) before incurring the expensive cost of comparing
entries (if hash(e1) != hash(e2), which is fast to compute, then you
already know that e1 != e2). In this way, so long as your hash
function is reasonable, the comparison cost is absolutely minimized
(you basically only perform comparisons when its practically a forgone
conclusion that the entry is a match). (Remember, you keep the entire
*original* hash function along with each entry, not just the index for
h0() or h1().)

> In double hashing, the overall hash is calculated via:
>
> H(k) = h0(k) + n * h1(k)
>
> Where h0 and h1 are the hash functions and n is the probe attempt (0, 1, 2,
> ...).

Yes. But its important to note that the total number of bits required
to produce h0 and h1 is almost certainly < 32. This requires some
analysis:

Computing h1() is interesting because on the one hand it must not be 0
and each possibility must be coprime with the current table size and
other possible values for it. And on the other hand the range of
possible values of h1() does not need to be all that great.

First of all, if you don't already know about it, familliarize yourself
with the "Birthday Paradox":
http://en.wikipedia.org/wiki/Birthday_Paradox . Understanding this is
key to understanding how to properly design your reprobe strategy. If
you don't understand the Birthday Paradox, then you are going to have a
hard time designing a truly sound and efficient hash table.

Basically in a hash table that's starting to fill up, the worst
colliding initial h0() positions might have around 10 entries aliasing.
What you want is for each of these entries to have a different reprobe
skip value. The Birthday Paradox basically says that if you pick
random choices amoungst a palette of about 100 entries, then you can
expect about "1" additional collision in skip values. So this is quite
tolerable.

I also said above that the possible h1() choices should be coprime with
each other. Hopefully this is easy to understand -- if h1() were to
output 3 then 6, for example, then chains that followed those probe
values from the same h0() would intersect quit a bit! Whereas, if it
were to output 5 and 7, then their common intersections are a lot less
minimal. So being coprime with each other makes a big difference.

Ok, so what does this all mean? It means that h1() essentially doesn't
need more than about 7 bits worth of degrees of freedom, and should be
taken from a table of numbers which are non-zero and coprime which each
other, as well as the size of the hash table itself. An array of the
first 128 primes starting from, say 11 (to set a lowest expected
interesection length that is greater than the greatest expected reprobe
list), indexed from a uniform 7 bit hash function is sufficient. There
are other choices, but this is clearly the most obvious one that fits
the above described criteria.

So why go into all that much detail just on h1()? The key thing is to
realize that it uses so few bits (only 7) from a hash function which is
likely to output so many more bits. Furthermore the h0() function
itself is generally going to use far less than 32 bits. So hopefully
this makes it clear that the right design is to use a *SINGLE* hash
function that outputs, say, 32 bits, and split off as many bits as is
required for h0(), and use the remaining bits (up to 7 of them) as the
index for h1(). If you look at Bob Jenkin's page, or my page on hash
functions you will see that basically all high performance hash table
hash functions output 32 bits (or more.) So this seems to be the
precise sweet spot for hash function/table design.

Although this seems to limit the hash table to 33 million entries,
actually, the "7 bits for h1()" is overspecified. That is to say, as
your hash table grows beyond 33 million entries, you can simply
cannibalize "h1 bits" for use in h0. My own testing indicates that
h1() is still effective with as low as 3(!) bits for indexing (for half
a billion entries) and of course it still technically functions even if
there no bits available for h1() (at which point you should be thinking
about 64 bit solutions anyways -- this is the one redeeming aspect of
the FNV hash function, since 64 a bit version of it exists.)

Another important note is about the size of the hash table itself. A
common mistake in the literature and elsewhere is to pick the table
size as a prime number. They do this in order to relax the constraints
on h1() in order to guarantee that it traverses the table with a mere
!= 0 criteria. Hopefully my explanation above shows how naive and
pedestrian such reasoning is. Worse yet is that the -> index =
hashfunction(entry) % tablesize; line all of a sudden has a new
bottleneck, namely the "%" function. As you well know, "%" (or
equivalently division) by a non-constant number is basically one of the
slowest operations you can do on any CPU outside of transcendentals or
IO.

The more sensible approach, of course, is to pick a table size that is
a power of 2. Then you have index = hashfunction(entry) &
(tablesize-1); which removes the "%" bottleneck ("&", as you know,
basically costs nothing.) The earlier analysis of h1() above makes it
clear that traversing the table will be a non-issue, if you simply
choose them from a small table of primes.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.



Relevant Pages

  • Re: Some comments on "super fast hash"
    ... SFH seems reasonably good and certainly is fast. ... > a hash, and SFH does not. ... The latest versions of each hash function which leverages this ... it must behave worse on other key sets. ...
    (comp.programming)
  • Re: Suggestions for double-hashing scheme
    ... >> Once you sufficiently optimize a hash table, the big cost in hashing ... >> by computing of the hash function). ... any good hash function basically always has a minimal ... >> What you want is for each of these entries to have a different reprobe ...
    (comp.programming)
  • Some comments on "super fast hash"
    ... I've implemented a hash function here: ... SFH seems reasonably good and certainly is fast. ... quality of the hash function is not affected by the difference as far ... it must behave worse on other key sets. ...
    (comp.programming)
  • Re: Maximum String size in Java?
    ... >> compilation on any new target platform that does not already have ... Do you have a version of SFH posted with changes to use this file ... If they intend to use a hash ... benefit of 31/33 will sway me into using more than one hash function. ...
    (comp.programming)
  • Re: Algorimic Complexity Attacks
    ... > keyed hash. ... If the secret itself is not leaked in the attack (and it ... this does have its difficulty: maintaining existing entries. ... This means the attack will be thwarted if the secret hash function (e.g. ...
    (Bugtraq)