Re: Two-stage hashing (pre-hash big integer -> hash-array-index)

RobertMaas_at_YahooGroups.Com
Date: 03/07/04


Date: Sun, 07 Mar 2004 10:25:49 -0800


> From: "Joe \"Nuke Me Xemu\" Foster" <joe@bftsi0.UUCP>
> Date: Wed, 25 Feb 2004 17:51:02 -0800
> This seems much like double-hashing, except that in the case of a
> collision, the second hash function is applied to the first full
> hash value instead of the key to generate the probe sequence.
> URL:http://www.cogs.susx.ac.uk/local/teach/dats/notes/html/node108.html

   Double hashing is another alternative to linear probing. In this case
   we explore a sequence of probes
   h, h+k, h+2k, h+3k, h+4k, ...
   where k lies between 1 and table.length-1. The point of this approach
   is that not only h, but also k, should depend on the item that we have
   to insert. The idea is that even if two items hash to the same value
   of h, they will have different values of k, so that different probe
   sequences will be followed.

Well although the details are very different, indeed the basic idea (to
avoid all hashes with same home index following same collision chain,
and to avoid all intersecting collision chains linking up to make
double-long collision chains) is the same.

Still, by having k constant for a given key, any two keys whose h and k
are the same will follow exactly the same collision chain. I suppose
with a sufficiently large hash table, this is a sufficiently good
improvement over any collision-resolution method where k is a constant
(usually 1) or is a deterministic function of h. So with ten thousand
buckets, given that h is the same or close enough that a collision
chain intersects, there's only one chance in ten thousand that the
intersection is actually a concatenation of collision chains (if the
k's are by chance equal).

   Given the restriction on the range of k,
   the simplest choice for k is
   1 + (h(n) % (table.length-1))

Um, what's that h(n)?? Looking at the parent node, I see h is the
hashing function that computes a large integer from the key, but what
is n? d is the data object (presumably actually the key, not the whole
object associated with that key), and normally we compute h(d), then
h = hd (i.e. home index, author keeps changing notation)
  = h(d) % table.length (add 1 if using pseudocode).
Oh I see, d is the first data, e is the second data, ... n is the data
where a collision first occurs, specifically colliding with m.

So anyway, Quadratic probing has the problem that any two keys with the
same home index will follow exactly the same collision chain, although
it has the advantage (over linear probing) that different collision
chains which happen to intersect at one point will then immediately
diverge rather than append. So double hashing doesn't have either
problem hence is better.

But this whole question begs the question as to efficiency of comparing
keys. With my method, storing the 32-bit hash value within the hash
table, and comparing *that* instead of the actual key when searching
the table, searching is faster than with double-hashing (as given)
simply because you don't have to compare the entire key several times
when doing collision resolution.

So anyway, I'm still looking for anybody else's nicely written
description of any hashing method that includes *all* of the features
of what I was programming circa 1975:
- Like all the open-addressing methods in that Web site, it first
computes what I call the pre-hash, the large unsigned integer, from the
key, and only then uses that to compute the home index.
- Like double hashing on that Web page, it uses a probe sequence which
is not a constant offset (usually +1) per step, nor computed solely
from the home index, but instead is computed from the whole pre-hash
number in a way that makes the subsequent probe sequence
pseudo-randomly independent from the original probe (home index).
- Unlike any of the methods on that Web site, stores the pre-hash in
the hash table, which achieves (1) avoiding having to compare the whole
key for each probe, just compare the pre-hashes, and (2) avoiding
having to re-compute the hashs of the keys when resizing the hash table
due to growth, just do newHomeIndex = pre-hash % newTableSize using the
already computed and stored pre-hashes, and likewise newOffset also
computed from pre-computed pre-stored pre-hash.
- And actually more like what I was doing back then, there's no actual
hash table at all, no need to compute remainder modulo table size.
Instead the hashing computation stops at the pre-hash, and that number
is directly entered into a binary tree or radix tree or
array-of-mini-trees, which is balanced simply because the hash function
is pseudo-random. (How array of mini-trees works: You extract b
high-order bits from the pre-hash, which gives an index into an array
of size 2**b, each element of which is a small binary tree, so access
time is 1 + log2(n / 2**b) on the average, and when that logarithm
starts to get large you double the size of the array, splitting each
binary tree into two parts attached at different array elements.)

They key idea is that once you do the pseudo-random pre-hash
calculation, you don't then have to take remainder by table size and
fall into the traditional hash table methods, you can use that pre-hash
number for all sorts of other more-or-less different store&lookup
methods. Also, for example for my ProxHash algorithm, such pre-hash
values can be used for totally different purposes unrelated to direct
lookup on the "key" on which the pre-hash was computed.

If nothing turns up closer to what I was doing, maybe I'll write up my method using the same
notation as in that Web site and submit it for inclusion there, or just
post it here and wait for that Web author to notice it.

So anyway, Joe, thanks for finding that Web page that covers the part I
included in the Subject field and is close to the total of what I was
looking for.

By the way, does anybody use CRC32 as a hashing (pre-hash) function?
(Back in 1975 I used a linear congruential method, partly because I
didn't yet know about CRC, when was CRC invented anyway??, and partly
because the DEC PDP-10 had fast 35-bit unsigned multiplication
built-in, as part of 36-bit signed multiplication of course.)



Relevant Pages

  • Re: If not readdir() then what?
    ... 64 bits of hash, plus a 32-bit count value in the hash ... collision chain". ... you can get a hash collision with two entries. ...
    (Linux-Kernel)
  • Re: Panama hash collision question
    ... > No hash is literally collision free. ... We synchronize database systems by forming a checksum for each record ...
    (sci.crypt)
  • Re: keys and counters
    ... how many times can the counter be incremented before there is a collision in the hash, that is what i am asking. ... A hash function operated in such a counter mode as you suggest does not have this property - if I can guess or discover the input to the first block then I will know all the other blocks. ... You might think that some attacks are unreasonable/infeasible but do you really know what is possible to the world's largest employer of mathematicians, who have had for many years the world's largest computer budget and unlimited access to 60 plus years of classified research or what is possible for any of the other multi-billion dollar "smaller" big brothers?. ...
    (sci.crypt)
  • Re: Determining the encryption used
    ... impression that if a password verification system is checking passwords ... against a hash table, all you needed was a collision (as this would hash ... They involve generating two seperate hashes which have a collision. ... The collision attacks found can break the security of cryptographic ...
    (Pen-Test)
  • Re: "Collision for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD"
    ... this was the Year of Doom for cryptographic hash functions. ... These go into great detail on the SHA-0 and MD5 collisions ... Difficulty in the former is called "collision resistance", ... you probably meant to say was "I can find a *different* string whose ...
    (comp.os.linux.security)