Re: Hashtable performance metric
- From: Moi <root@xxxxxxxxxxxxxxxxxxx>
- Date: Wed, 29 Oct 2008 12:15:59 +0100
On Wed, 29 Oct 2008 10:03:28 +0100, Michael wrote:
Hi Moi,
Moi wrote:
A problem that has been on my mind for a long time: how to express the
performance of a hashtable, preferably in one number.
Currently I seem to prefer the "number of hops" needed to access all
elements (I mostly use separate chaining) :
SUM(chainlength-1)/number_of_entries
, and I keep an eye on the longest chain. example (this table is not
very populated) :
### Slots: Total=512 Empty=426 Used=86 (0.167969); ### Nodes=100
Hops=16 (0.160000) Longest chain=3
The 0.16 number means: on average it takes 1.16 accesses to find one
element.
Sometimes I do use histograms of chain lengths , but in fact I tend to
only look at the (chainlength >1 vs chainlength=1) ratio.
you may interpret the pairs (slot_i, length of coll.list in slot_i) as
entries in a histogram. Your hash table / hash function works optimal if
this histograms shows a uniform distribution.
You may check this hypothesis (e.g.) by computing the variances of the
collision list lengths -- however, these variances may not be easy to
interpret. A probably better measure can be computed by using a chi^2
test. This should work as follows (see your favorite statistics textbook
for details, no theory here):
Assume your hash table A has k slots and you are hashing k*E values into
it. Then you're expecting a collision list of length E in each slot.
Your chi^2 value chisq is computed by (^ is NOT the bitwise xor here;) )
chisq = 0
for (i=0; i<k; i++)
chisq += (A[i].length - E)^2 / E
Since your A[i].length are random numbers the value chisq is a random
number, as well. The value chisq follows a chi^2 distribution with k-1
degrees of freedom and you may use the CDF of this distribution in order
to describe how probable it is to observe the value chisq for a uniform
distribution:
p = 1-chi2cdf(chisq,k-1)
This "p-value" is between 0 and 1. The larger this value, the better is
your hash function. For example, values >0.1 indicate that there is
probably a uniform distribution and your hash function works fine while
smaller values indicate a performance problem.
the chi2cdf function is available in "Octave" (http://www.octave.org) or
as a C implementation from http://www.netlib.org/cephes/ (in "cprob" /
"chdtr.c").
Thanks. That was the first on-topic reply.
If I understood correctly, your 'E' constant is what programmers would
call the 'load-factor': the expected (mean) chainlength.
My gut-feeling is that the chisquare test is too picky: it will test
whether the observed distribution (the length histogram) is within the
bounds of what could be expected, while all I only care about is the total
pathlen. (the number of 'hops' to retrieve all elements)
Also: in real life it may not be possible to create a full histogram,
because there are too many possible length (or it makes the code heavy)
The difference will probably be small; in real life the histograms 'look'
normal. (at least: decreasing counts with increasing chainlength).
I'll consult my statistician...
I'll try to incorporate the chisquare into my statistics, and repost here.
The example I gave previously uses E < 1.0; which is more or less
pathological (it is a special case because of the (known) limited size. It
is a zobrist-hash table to detect superko in my go program. )
Next time I'll use my diskhash program which has a fixed E=1. That will
make the math easier ;-)
Thanks,
AvK
AvK
.
- Follow-Ups:
- Re: Hashtable performance metric
- From: CBFalconer
- Re: Hashtable performance metric
- References:
- Hashtable performance metric
- From: Moi
- Re: Hashtable performance metric
- From: Michael
- Hashtable performance metric
- Prev by Date: Re: Hashtable performance metric
- Next by Date: Re: Programming aspects of the Acorn case in Ohio
- Previous by thread: Re: Hashtable performance metric
- Next by thread: Re: Hashtable performance metric
- Index(es):
Relevant Pages
|