Minimum Memory Requirements




Suppose we wish to store K elements in a table.
For simplicity suppose that we need to produce
only a single bit (Yes/No for existence); the
table entries have no stored attributes.

Let L denote the inverse density of the elements
in their keying space. For example, when storing
200,000 9-digit social security numbers
L = 10^9 / 200000 = 5000
We assume L is large.

Let S denote the inverse density of false positives.
For all techniques except Bloom filter, we assume
a reliable table, ie. S = infinity.

How much memory is needed for the table?

For a Bloom filter memory
1.5 K * (log S) bits
For a hash table
1.2 K * (log K + log L + O(1)) bits
For a bit map
K * L bits
Theoretical minimum
K * (log L + 1.44) bits
Hashtable/Bitmap combination
1.1 * K * (log L + 3) bits

In this accounting, logs are to base 2, constants
which appear are just good approximations. I show
a O(1) term with hash tables: this can be made
quite close to zero but will be 32 bits or more for
most off-the-shelf hash routines.

The theoretical minimum is for elements randomly
distributed in their keying space. This minimum
can be approached with various methods, or even
exceeded when the distribution is non-random.

Is this all well known? I've not seen this summary
elsewhere, but I've not looked much.

Many in this newsgroup should be able to reproduce
the formulae above, but I'll elucidate further
if there's interest.

James Dow Allen (email jamesdowallen at gmail.com)

.



Relevant Pages

  • Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
    ... Further, for those sets of data, an XOR hash produces a sufficiently ... random distribution for hashes to work just fine. ... realm of developing a good generic algorithm; ... Saying that XOR produces collisions is not very useful. ...
    (microsoft.public.dotnet.framework)
  • Re: How hash tables work
    ... > memory or initiated memory on start up. ... if there is no match to your hash then you ... latter strategy can make it rather difficult to retrieve a known ... requires knowledge of the actual distribution. ...
    (alt.comp.lang.borland-delphi)
  • Re: why in class Boolean, hashcode() of "true" is 1231 and of "false" is 1237?
    ... The best hash functions require careful tuning for the distribution of inputs. ... If another input distribution actually happens, the same hash functions might present horrendous results. ... If it ever turns out that the hash function is particularly bad for your distribution, you could always write a wrapper class that picks a different hash code. ...
    (comp.lang.java.programmer)
  • Re: Question about tcp hash function tcp_hashfn()
    ... It was done to simulate socket code which uses the same folding. ... Since hash table size is definitely less than returned hash value, ... same distribution. ...
    (Linux-Kernel)
  • Re: why in class Boolean, hashcode() of "true" is 1231 and of "false" is 1237?
    ... That will have a pretty poor distribution of hash values, because the integer values that arise in actual running code don't tend to be completely randomly distributed. ... Integer as a key to a hash table isn't a ridiculously unlikely case, either -- it can be more efficient than an array if most of the array's cells are going to be nulls or zeros, and it's going to be large, and it can arise if you have heterogeneous keys some of which turn out to be integers. ... A sparse vector is one possible case -- for a hash table to be more efficient than an array for this job, the vector may need to have hundreds of dimensions, but even then the integer keys are distributed towards very small numbers in comparison to Integer.MAX_VALUE. ...
    (comp.lang.java.programmer)