Re: Hashing

From: hutch-- (hutch_at_movsd.com)
Date: 03/03/05


Date: 2 Mar 2005 17:55:53 -0800


 Darran,

What you have to keep in mind is that a hash table for words in by
necessity an imperfect device and it is for a very simple reason. When
you work with decimal numbers, you have 10 variations where when you
work with characters you have with single byte types, 256 possibles and
that accounts for the compromise required in hash table design.

To make a perfect hash using the direct table method would involve
options to the power of 256 which is more memory than you can expect in
computers for a long time to come. The existing "perfect hashes" are
usually tree based and while they are collission free, they are slower
than a normal table based hash.

Most table based hashes have a seperate algo to generate the number
within a specified range from the original word so the result will fit
into the table size and they generally use the remainder of a division
which tends to give a reasonably well behaved distribution for a wide
variety of different words but none of these techniques are collission
free and this is why you need a collision handling technique.

The simplest ones increment the location if it is already used but this
leads to what is called clustering, a phenomenon where a long sequence
of adjoining "buckets" are filled and the technique slows dow with
linear testing of buckets. You can get around this by using variable
length increments and the word length is often useful to do this.

By spacing the change of bucket location on the word length you can
improve the distribution of filled buckets so there is a lot lower
chance of repeatedly testing filled buckets.

You need to a couple of basic things with the algo, it must wrap around
so you can go off the end and back ito the begining again. It is
important with this tecnique that you use a PRIME number for the bucket
count so the wrap around does not keep reading the same set of buckets
and you need to keep track of the number of filled buckets so you know
when the table is starting to fill up.

It is normal that a hash table gets slower as it approaches saturation
so the idea is to use a large enough one that does not get saturated
this way. You still see code in 32 bit that has DOS style assumptions
and tiny hash tables but there is no reason why you cannot make them
with a member count over 100000 or even larger if necessary.

What is important is to ensure that you have an effective collision
handling technique as testing over a long time shows that if the table
has a wide enough distribution without the known defects, the collision
handling code is very fast and the general performance is very good.

There are a few tricks with the method that you use to allocate the
memory for the table, you need two arrays, one is the pointers to the
other which is the main data storage array. If you allocate the memory
as zero filed, it makes testing buckets very fast as you only have to
test the 1st byte to tell if its used or not. Yopu would normally write
the word at the start of the bucket and know where the data strored
with it is located.

It can be in the format something like the following.

If the bucket is for example 32 bytes long and it is advantageous to
have the buckets as intervals of 4 bytes for memory alignment purposes,
you set your own layout for what part holds the word and what part
holds the data.

With the bucket at 32 bytes, if you reserve the 1st 16 bytes for the
word and bytes 16 to the end for the data, if the word matches, you
step another 16 bytes to get the info related to the word. The data can
be whatever you like and this includes being a pointer to another
memory location. Just for example, if you have the API functions as
prototypes, you locate the API while loading the hash, store the word
at the start of the bucket then store a pointer to the actual prototype
in the second half of the bucket.

Regards,

hutch at movsd dot com



Relevant Pages

  • Re: Perl hashing speedup ?
    ... I feel that disbalanced bucket usage / and less number of hash buckets ... Now I have some queries regarding way perl hashing mechanism maps keys ... about how perl hashes work internally; but I guess we will be needing ...
    (comp.lang.perl.moderated)
  • Re: Hash table in C++? STL?
    ... > a tree-like structure instead of lists for the buckets). ... > the standard will require equality or a relational comparator. ... the hashing function is where all the complexity lies and I'd be ... But we need hash only the first maxchars characters. ...
    (comp.lang.cpp)
  • Re: SGI hash_map
    ... >> studied hash functions you'll find on the web. ... > And before inserting into the map, each hash value is calculated with: ... a uniform distribution. ... a program to count the number of buckets that contain multiple items. ...
    (comp.lang.cpp)
  • Re: Hash table in C++? STL?
    ... I wouldn't dare claim to be either, without doing some research first. ... by the number of items that hash to that index. ... perhaps you meant "the size measured in number of buckets" and I was ... second hash function is used to determine the sequence of slots to try. ...
    (comp.lang.cpp)
  • Re: BSD license compatible hash algorithm?
    ... Hash: SHA1 ... a old wives tale that the number of buckets should be prime ... It very much depends on what is used for a rehash (collision step) ...
    (freebsd-hackers)