Re: Suggestions for double-hashing scheme



Clint Olsen <clint@xxxxxxxxx> writes:

> On 2005-06-21, Tim Rentsch <txr@xxxxxxxxxxxxxxxxxxx> wrote:
> > Sorry for not responding to this earlier.
>
> No problem!
>
> > The items that are being moved are the items in the hash table itself,
> > which are of fixed size (they are in an array, after all). If you want
> > to hash variable-sized data, they need to be stored separately in any
> > case. The hash table array would store, eg, a pointer to the memory
> > holding the element being hashed.
> >
> > Make sense?
>
> Well, just to make sure we're talking about the same thing, I am planning
> on allocating enough space to store data elements of any size. So, my
> struct is going to look something like:
>
> typedef struct {
> char *data; /* = malloc(length * width) */
> unsigned hi; /* watermark for upsize */
> unsigned lo; /* watermark for downsize */
> unsigned size; /* number of elements in the hash */
> unsigned width; /* size per object */
> unsigned length; /* the current hash capacity */
> struct {
> unsigned status; /* { <probes 31:1>, <valid-bit 1:1> } */
> unsigned hash_val0; /* calculated initial hash value */
> unsigned hash_val1; /* calculated increment hash value */
> } *entries; /* = malloc(length * sizeof(entry)) */
> } hash_t;
>
> [snip]

I see what you're trying to do, or more accurately why you're doing
what you're doing.

Having the space for the stored data elements be parallel to the hash
table is costly of space, especially if the size of a stored data
element is large. (You already mentioned the cost of moving them
around when rebuilding or growing the hash table.) The reason for
this is that the space is paid even for hash table slots that are
empty, and that's anywhere from 10% to 50% of the table.

Another idea, and what I might suggest, is having a tightly packed
array to hold the stored data elements, and have the hash table hold
an index into that array. Something like this:

typedef struct {
Count slots_used; /* # entries installed in hash table */
Count slots_low; /* threshold for downsizing */
Count slots_high; /* threshold for upsizing */
Index slots_limit; /* number of slots in the hash table */
Index *table; /* hash table; holds indices into elements */
uchar *probe_counts; /* || to hash table; holds probe counts */

size_t element_size; /* size of one stored element */
char *elements; /* malloc( elements_limit * elements_size ) */
Index elements_next; /* index of next free space */
Index elements_limit; /* number of items now allocated */
Hashes *hashes; /* malloc( elements_limit * sizeof *hashes ) */

Index free_list; /* to track deleted element holes */
Index *deleted; /* alternative method to above */
Index deleted_next; /* place to put next index for deleted */
Index deleted_limit; /* limit of deleted */
} hash_t;

Some further explanation - I've used type names 'Index' to mean an
something that holds an index, and 'Count' to mean something that
counts the number (of something). I find these names convenient for
documentation; they are both simply typedef'ed to unsigned or
unsigned long.

The 'probe_counts' array is parallel to the hash table array to save
on space. One "uchar" aka 'unsigned char' is plenty to hold a probe
count - they will never get as high even as 100 unless things are
terribly wrong or the hash table is severely in need of rebuilding.
(See also below for a way of piggybacking the probe counts into
the hash table directly.)

The 'Hashes' type holds the hash/rehash values. The observation is
that you only need to hold on to these for items that are actually in
the table; not for hash table slots that are empty or deleted. It
could be defined as

typedef struct { unsigned hash; unsigned rehash; } Hashes;

The 'hashes' space is exactly parallel to the 'elements' space:
the same index values are used for both, and when one grows or
shrinks the other must also (notices they are both tied to the
variable 'elements_limit').

Having the elements data storage space be separately maintained in
this way means the hash table array can grow or shrink without having
to affect the 'elements' array, since the hash table holds indices
into the element array, which don't change. When the 'elements' array
needs to grow or shrink, that can be done with 'realloc()', which
saves you the trouble of copying the data.

Now for the complications, which I think are minor:

First, the main external entry point installs a data item. However,
you will find it useful for internal purposes to have an internal
routine that takes an 'Index' into elements as the item to be hashed.
The external entry point will call the internal one when an item that
has already been installed is bumped, for example. The internal one
is also used when rebuilding or growing.

Second, when removing an element, something needs to be done so
that the space in the 'elements' array is reclaimed. There are
several ways this might be done, here are three:

1. If the element being removed is the last element in the
'elements' array (so its index == elements_next - 1),
then simply decrement elements_next; otherwise, put
the element that is at index elements_next-1 in place of
the element being deleted (remembering also to move the
hash/rehash values in the parallel array), and change
the entry in the hash table to the index of the element
being deleted; the changing can be done without reference
to the data of the actual element, since we know what
index to look for, and we know its hash/rehash values;
or,

2. If element_size % sizeof(Index) == 0, then we can make
a free list of element indices that have had their
data removed, storing the link in the storage of the
removed element; the head of the list is 'free_list'
in the struct above (with suitable casting to store
or access the next index in the list); or,

3. Simply have a separate array 'deleted' that holds
indices of removed elements (with the usual "next"
and "limit" pointers to add/remove and to know when
to grow, respectively.

These methods can be separately or in combination.

Incidentally, some terminology. I use '..._next' to mean the index of
the next free slot, and '..._limit' to mean the index limit of what
has been allocated. So the invariant

0 <= XXX_next && XXX_next <= XXX_limit

holds for "array" XXX, and also

0 <= i && i < XXX_next

holds for any index i of any item in the array. (The hash table isn't
allocated sequentially, so it doesn't have an '..._next' variable.)

To indicate the empty/deleted slots, either use an index of minus one,
or use an index of zero and don't use the zero'th slot of the elements
and hashes arrays. You don't need a separate bit to differentiate
empty and removed - if the probe count is zero it's empty, if the
probe count is greater than zero it was removed. As I said before
it's important to maintain the probe counts for removed slots so
things work correctly.

There's an encoding trick to put the probe counts into the hash table
along with the index. Basically, the idea is to store

(index << 4) + probe_count

I won't elaborate further on that, I expect you can work out the
details.

Ok, I'm sure that's plenty. Let me know if this helps your thinking.
.



Relevant Pages

  • Duck Typed Concepts for Ruby (was Re: A use case for an ordered hash)
    ... An Sequencable mixin can be defined that implements all sorts of operations such as append, concat, splice, sort, etc. ... extending an instance of Array with Sorted if the array is known to be sorted. ... Now returning to the thread at hand we can see that the difference between the associative array and hash hierarchies is that the hash hierarchy depends upon Hashable keys. ...
    (comp.lang.ruby)
  • [SUMMARY] Index and Query (#54)
    ... This was a fun quiz for me because I really don't know anything about indexing ... We see in initializethat the index is just a Hash. ... an Array of symbolic document names where the word can be found ). ...
    (comp.lang.ruby)
  • Re: Comment on how to uniquely track your objects in C# / hash table / get hash code
    ... Array, correct? ... This is largely irrelevant to the issue of performance, since hash ... where both insertions and lookups happen frequently, ... about fast lookups for balanced red/black binary trees. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: problem with hash & sort array
    ... Then I put the hash into an array with the sort ... The twist is I have to identify the duplicate ... a real data structure, and in the next place I make the exact opposite ...
    (comp.lang.perl.misc)
  • Re: How to implement a Hash Table in C
    ... with second hash value used for probe increment) ... works just fine if the probe increment is odd. ... choice is between pointer/hash and pointer. ...
    (comp.lang.c)