Re: Hashtable performance metric



On Wed, 29 Oct 2008 10:03:28 +0100, Michael wrote:

Hi Moi,

Moi wrote:
A problem that has been on my mind for a long time: how to express the
performance of a hashtable, preferably in one number.

Currently I seem to prefer the "number of hops" needed to access all
elements (I mostly use separate chaining) :

SUM(chainlength-1)/number_of_entries

, and I keep an eye on the longest chain. example (this table is not
very populated) :

### Slots: Total=512 Empty=426 Used=86 (0.167969); ### Nodes=100
Hops=16 (0.160000) Longest chain=3

The 0.16 number means: on average it takes 1.16 accesses to find one
element.

Sometimes I do use histograms of chain lengths , but in fact I tend to
only look at the (chainlength >1 vs chainlength=1) ratio.

you may interpret the pairs (slot_i, length of coll.list in slot_i) as
entries in a histogram. Your hash table / hash function works optimal if
this histograms shows a uniform distribution.

You may check this hypothesis (e.g.) by computing the variances of the
collision list lengths -- however, these variances may not be easy to
interpret. A probably better measure can be computed by using a chi^2
test. This should work as follows (see your favorite statistics textbook
for details, no theory here):

Assume your hash table A has k slots and you are hashing k*E values into
it. Then you're expecting a collision list of length E in each slot.
Your chi^2 value chisq is computed by (^ is NOT the bitwise xor here;) )

chisq = 0
for (i=0; i<k; i++)
chisq += (A[i].length - E)^2 / E

Since your A[i].length are random numbers the value chisq is a random
number, as well. The value chisq follows a chi^2 distribution with k-1
degrees of freedom and you may use the CDF of this distribution in order
to describe how probable it is to observe the value chisq for a uniform
distribution:

p = 1-chi2cdf(chisq,k-1)

This "p-value" is between 0 and 1. The larger this value, the better is
your hash function. For example, values >0.1 indicate that there is
probably a uniform distribution and your hash function works fine while
smaller values indicate a performance problem.

the chi2cdf function is available in "Octave" (http://www.octave.org) or
as a C implementation from http://www.netlib.org/cephes/ (in "cprob" /
"chdtr.c").


Thanks. That was the first on-topic reply.

If I understood correctly, your 'E' constant is what programmers would
call the 'load-factor': the expected (mean) chainlength.

My gut-feeling is that the chisquare test is too picky: it will test
whether the observed distribution (the length histogram) is within the
bounds of what could be expected, while all I only care about is the total
pathlen. (the number of 'hops' to retrieve all elements)
Also: in real life it may not be possible to create a full histogram,
because there are too many possible length (or it makes the code heavy)

The difference will probably be small; in real life the histograms 'look'
normal. (at least: decreasing counts with increasing chainlength).
I'll consult my statistician...


I'll try to incorporate the chisquare into my statistics, and repost here.
The example I gave previously uses E < 1.0; which is more or less
pathological (it is a special case because of the (known) limited size. It
is a zobrist-hash table to detect superko in my go program. )
Next time I'll use my diskhash program which has a fixed E=1. That will
make the math easier ;-)

Thanks,
AvK


AvK

.



Relevant Pages

  • Re: Hashtable performance metric
    ... this histograms shows a uniform distribution. ... You may check this hypothesis by computing the variances of the ... Your chi^2 value chisq is computed by ) ... your hash function. ...
    (comp.programming)
  • Re: Hashtable performance metric
    ... this histograms shows a uniform distribution. ... You may check this hypothesis by computing the variances of the ... Your chi^2 value chisq is computed by ) ... your hash function. ...
    (comp.programming)
  • Re: RandomDirichlet?
    ... discovering an inverse function. ... for any discrete distribution), ... can give an estimate of how likely it is that a particular histogram ... of CACM there's an article by Raj Jain and Imrich Chlamtac describing ...
    (comp.lang.java.programmer)
  • Re: SHA1 digest output distribution
    ... plus secret) with input length typically less than 50 bytes and then ... (or any other good cryptographic hash function) ... adversary doesn't know much about the distribution... ... indistinguishable from the uniform distribution, ...
    (sci.crypt)
  • Re: Image noise estimation and filtering
    ... I also run the time series through ImageJ and made a background subtraction: ... the same histogram from the background subtracted images looks like this: ... do you see laser speckle? ... , then the distribution takes ...
    (comp.soft-sys.matlab)