Re: Suggestions for double-hashing scheme



websnarf@xxxxxxxxx wrote:
>
.... snip ...
>
> Another important note is about the size of the hash table itself.
> A common mistake in the literature and elsewhere is to pick the
> table size as a prime number. They do this in order to relax the
> constraints on h1() in order to guarantee that it traverses the
> table with a mere != 0 criteria. Hopefully my explanation above
> shows how naive and pedestrian such reasoning is. Worse yet is
> that the -> index = hashfunction(entry) % tablesize; line all of
> a sudden has a new bottleneck, namely the "%" function. As you
> well know, "%" (or equivalently division) by a non-constant
> number is basically one of the slowest operations you can do on
> any CPU outside of transcendentals or IO.
>
> The more sensible approach, of course, is to pick a table size
> that is a power of 2. Then you have index = hashfunction(entry)
> & (tablesize-1); which removes the "%" bottleneck ("&", as you
> know, basically costs nothing.) The earlier analysis of h1()
> above makes it clear that traversing the table will be a non-
> issue, if you simply choose them from a small table of primes.

Your argument against the use of primes for table size revolves
purely around the expense of a modulo operation, as I read it. The
substitutes require some sort of table search, possibly the small
primes table. Even if that is limited to the first 20 primes, say,
I maintain that that effort is going to be at least comparable, if
not greater, than the modulo calculation. Nothing except profiling
on an actual installation can answer that.

A further point is whether or not such modulo effort is significant
in the overall probe operation, which can only be answered after
the significance of a probe effort in the overall application is
evaluated.

All of these considerations probably pale in comparison with the
effect of cache hits on modern machinery. Again, profiling is the
only possible answer, and not very likely to give a definative
result. In practice the effectiveness of the cache will be highly
affected by the job mix on a particular machine.

Meanwhile the use of a prime table size has the advantage of
simplicity and robustness.

--
Some useful references about C:
<http://www.ungerhu.com/jxh/clc.welcome.txt>
<http://www.eskimo.com/~scs/C-faq/top.html>
<http://benpfaff.org/writings/clc/off-topic.html>
<http://anubis.dkuug.dk/jtc1/sc22/wg14/www/docs/n869/> (C99)
<http://www.dinkumware.com/refxc.html> (C-library}
<http://gcc.gnu.org/onlinedocs/> (GNU docs)


.



Relevant Pages

  • Re: Mod 2011
    ... and work in the ring of integers modulo q. ... The sequence always starts: ... What's the difference between primes in sequences and above? ... for those primes congruent 1 mod 3. ...
    (sci.math)
  • Re: Mod 2011
    ... and work in the ring of integers modulo q. ... The sequence always starts: ... What's the difference between primes in sequences and above? ... for those primes congruent 1 mod 3. ...
    (sci.math)
  • Re: About random, primes and statistics
    ... explain a simple area where mathematicians routinely ... modulo 3--perfect regularity: ... researchers as you have primes and you have ... composites and composites ...
    (sci.math)
  • Re: Quadratic residue method for finding primes
    ... Note that we know exactly which primes have 2 as a quadratic residue: ... those which are congruent to 1 or -1 modulo 8. ...
    (sci.crypt)
  • Re: Requesting some help
    ... invertibility modulo e or modulo n. ... to get the RSA-decryptions of smooth primes, ... "Smooth primes" here means small primes below some smoothness bound, ... keep the same number of columns of this; if you can invert ...
    (sci.crypt)