Re: hash table idea good or no good?

From: Christian Gollwitzer (auriocus_at_btcips73x1.cip.uni-bayreuth.de)
Date: 11/20/03


Date: Thu, 20 Nov 2003 13:58:11 +0100

Michael B Allen wrote:
> On Wed, 19 Nov 2003 12:06:47 -0500, Christian Gollwitzer wrote:
>
>>OK, I'm a bit confused what your actual key is. I'll try to describe
>
>
> The key is a void * that will be passed to a user defined hash function
> for generating the hash value.
That does it make clear to me, together with your code below, so Case a)
is what you do.

>>Case a)
>>You want to use strings as the key. Your hash=hash_fn(key) returns some
>>hash (say hash_fn(key)=MD5(key)) which than serves as the primary key.
>>To compute the index for table[i], you use
>>
>>table[i][hash % prime[i]]
>>
>>The flaw here is, that different keys may hash to the same hash_fn(key).
>>If you take 17 strings that have the same hash=hash_fn(key), then you
>>will end up with the table full with 17 entries. That's what I ment with
>> my first message
>
>
> Sure, but the likelyhood of the first 17 entries generating the same
> hash value is very very small. Still I acknoledge that case will need
> to be dealt with to prevent someone from deliberately trying to exploit
> the problem.
>
>
>>Case b)
>>You want to store strings, but you use the entire string as one big
>
> <snip>
>
>>Case c)
>>Your keys are really only 32 bit integers. In this case the series above
>
>
> I don't really understand what you're saying here. If the hash function
> is poor that's a different issue.
Forget abaout cases b) and c), it arose because I was not aware what
your key is. Now it is clear. Two comments to your code:

1. You claim it is unlikely that the first 17 keys hash to the same
value. As this is probably true, it is much more likely that (given the
user chooses a bad hash function) e.g. from 2000 entries inserted some
16 hash to the same value. In this case, two million entries are alloced
causing a storage ratio 1:1000. You cannot safely assume the user has a
very good hash function. The standard technique requires O(n) time in
this case, but also O(n) space to store all. Your technique requires
O(n) time but O(exp(n)) space in this worst case, which is clearly
inferior.

2. In the *best case*, your storage is O(n). But the time required for
storing *one entry* is not O(1), but *O(log(n))*, where n is the number
of entries.

Proof: Keys are evenly distributed, your table sizes follow
approximately the scheme c*2^k, k is the number of the table-level.
(This is a reasonable choice)
Now in the best case, the first c*2^0 items fill up table1.
The next c*2^1 entries fill table2. The next c*2^2 fill table3.
Now consider I want to store an item, and the tables 1..k are already
full. Then the for-loop runs n times to find an empty cell. Since the
items already contained in your map are

n=sum(c*2^i, i=0..k-1)=c*(2^k-1),

you need k~log(n) steps to find an empty cell, if the table contains n
items. This means storing a new element is logarithmic in the number of
items already contained, not O(1) as expected from the hash map. The
reason is simply, that you do not exploit the fact that table1 is full
and won't except further items.

> int
> hashmap_put(struct hashmap *h, void *key, void *data)
> {
> unsigned long hash;
> int level;
>
> hash = h->hash ? h->hash(key) : hash_ptr(key);
>
> for (level = 0; level < 16; level++) {
> int table_size = table_sizes[level];
> struct entry *table = h->tables[level];
> struct entry *e;
>
> if (table == NULL) {
> table = calloc(table_size, sizeof *e);
> if (table == NULL) {
> return -1;
> }
> h->tables[level] = table;
> }
>
> e = table + hash % table_size;
>
> if (e->key == NULL) {
> e->key = key;
> e->data = data;
> h->size++;
> return 0;
> }
>
> if (h->cmp(key, e->key) == 0) {
> errno = EEXIST;
> return -1;
> }
> }
>
> errno = ENOSPC;
>
> return -1;
> }
>

Another try:

int
hashmap_put(struct hashmap *h, void *key, void *data)
{
     const double loadfactor=0.75;
     unsigned long hash;
     // level is an item of struct hashmap, initially = 0
     // also items_in_this_level, initially = 0

     int table_size = table_sizes[h->level];
     struct entry *table;
     struct entry *e;

     hash = h->hash ? h->hash(key) : hash_ptr(key);

     // resize only if loadfactor reached
     if (table_size*loadfactor < h->items_in_this_level) {
       h->level++;
       if (h->level>16) return -1;

       h->items_in_this_level=0;
       table_size = table_sizes[level];
       table = calloc(table_size, sizeof *e);
       if (table == NULL) {
            return -1;
       }
       h->tables[h->level] = table;
       // now insert all items from the overflow list into the table,
       // perhaps getting new collisions
       for (int i=0; i<h->overflow_list_size; i++) {
         entry = h->overflow_list[i];
        hash = hash_fn (entry->key);
        e = table + hash % table_size;
        
        if (e->key == NULL) {
          *e = entry;
          h->overflow_delete(i);
        }
       }

     } else {
        table= h->tables[h->level];
     }

     e = table + hash % table_size;

     if (e->key != NULL) {
        //insert into overflow buffer
        h-> overflow_insert(e);
        h->items_in_this_level++;
        return 0;
     }

     e->key = key;
     e->data = data;
     h->size++;
     h->items_in_this_level++;
     return 0;
}

Note: this is pseudocode, I've not tried to compile.
It has another problem: It does not notice when one key is inserted
twice. However, the read performance is really bad. You'll get O(log n)
again for searching one item. However, it does not suffer from the
exponential storage problem: If you insert the same key many times, the
overflow list simply grows, but it does not alloc exponentially
increasing memory, which is your biggest problem. But searching an item
isn't linear as in hashing with one table.

Have fun,

-- 
Vale !
		Christianus Auriocus


Relevant Pages