Re: hash table idea good or no good?

From: Michael B Allen (mba2000_at_ioplex.com)
Date: 11/20/03


Date: Thu, 20 Nov 2003 16:18:34 -0500

On Thu, 20 Nov 2003 07:58:11 -0500, Christian Gollwitzer wrote:

> 2. In the *best case*, your storage is O(n). But the time required for
> storing *one entry* is not O(1), but *O(log(n))*, where n is the number
> of entries.
>
> Proof: Keys are evenly distributed, your table sizes follow
> approximately the scheme c*2^k, k is the number of the table-level.
> (This is a reasonable choice)
> Now in the best case, the first c*2^0 items fill up table1. The next
> c*2^1 entries fill table2. The next c*2^2 fill table3. Now consider I
> want to store an item, and the tables 1..k are already full. Then the
> for-loop runs n times to find an empty cell. Since the items already
> contained in your map are
>
> n=sum(c*2^i, i=0..k-1)=c*(2^k-1),
>
> you need k~log(n) steps to find an empty cell, if the table contains n
> items. This means storing a new element is logarithmic in the number of
> items already contained, not O(1) as expected from the hash map. The
> reason is simply, that you do not exploit the fact that table1 is full
> and won't except further items.

I don't understand what you're saying. Clearly the time required to store
one entry is better than O(log(n)). A linked list would be O(log(n)). The
technique I described would be the number of levels (nl) plus the log
of the number of elements in the collision list (nc). The size of the
collision list is the number of collisions occuring as the load factor
grows from ~0.37 to 0.75 before the next level is allocated and the
load factor drops to 0.37 (1/2 of 0.75 because the total cumulative
size of the map (n) doubles). I don't know what that size is yet but
it's certainly not n.

> Another try:
>
>
> int
> hashmap_put(struct hashmap *h, void *key, void *data) {
> const double loadfactor=0.75;
> unsigned long hash;
> // level is an item of struct hashmap, initially = 0 // also
> items_in_this_level, initially = 0

I don't think this is a very good idea either. It will only fill 75%
of the table at each level which is a pretty bad waste of space in itself.

> Note: this is pseudocode, I've not tried to compile. It has another
> problem: It does not notice when one key is inserted twice. However, the

If the same key is inserted twice I will return an error.

> read performance is really bad. You'll get O(log n) again for searching
> one item.

Again, I don't see where you get linked list asymtotic growth from this.

> However, it does not suffer from the exponential storage
> problem: If you insert the same key many times, the overflow list simply
> grows, but it does not alloc exponentially increasing memory, which is
> your biggest problem. But searching an item isn't linear as in hashing
> with one table.

I was thinking more along the following:

int
hashmap_put(struct hashmap *h, void *key, void *data)
{
    unsigned long hash;
    int level;

    hash = h->hash ? h->hash(key) : hash_ptr(key);

    for (level = 0; level < 16; level++) {
        int table_size = table_sizes[level];
        struct entry *table = h->tables[level];
        struct entry *e;

        if (table == NULL) {
            iter_t iter;
            struct entry *e;
            struct linkedlist *col;

            table = allocator_alloc(h->al, table_size * sizeof *e, 1);
            if (table == NULL) {
                return -1;
            }
            h->tables[level] = table;

            col = (struct linkedlist *)&h->col[h->whichcol];
            h->whichcol ^= 1; /* alternate collision list */
                                        /* reinsert all elements */
            if (linkedlist_clear(col, (del_fn)entry_put, h) == -1) {
                return -1;
            }
        }

        e = table + hash % table_size;

        if (e->key == NULL) {
            e->hash = hash;
            e->key = key;
            e->data = data;
            h->size++;

            return 0;
        }

        if (hash == e->hash && h->cmp(key, e->key) == 0) {
            errno = EEXIST;
            return -1;
        }

        if (h->tables[level + 1] == NULL) {
            unsigned int lf = h->size * 100 / total_sizes[level];

            if (lf < h->load_factor) { /* add to collision list */
                struct linkedlist *col = (struct linkedlist *)&h->col[h->whichcol];
                struct entry *e = allocator_alloc(h->al, sizeof *e, 1);

                if (e == NULL || linkedlist_add(col, e)) {
                    return -1;
                }
                e->hash = hash;
                e->key = key;
                e->data = data;
                h->size++;

                return 0;
            }
        }
    }

    errno = ENOSPC;

    return -1;
}
void *
hashmap_get(const struct hashmap *h, const void *key)
{
    unsigned long hash;
    int level;

    if (h->tables[0] == NULL) {
        return NULL;
    }

    hash = h->hash ? h->hash(key) : hash_ptr(key);

    for (level = 0; level < 16; level++) {
        int table_size = table_sizes[level];
        struct entry *table = h->tables[level];
        struct entry *e;

        e = table + hash % table_size;

        if (e->key && hash == e->hash && h->cmp(key, e->key) == 0) {
            return e->data;
        }

        if (h->tables[level + 1] == NULL) {
            iter_t iter;
            struct entry *e;
            struct linkedlist *col = (struct linkedlist *)&h->col[h->whichcol];

                                    /* search the collision list */
            linkedlist_iterate(col, &iter);
            while ((e = linkedlist_next(col, &iter))) {
                if (hash == e->hash && h->cmp(key, e->key) == 0) {
                    return e->data;
                }
            }

            break;
        }
    }

    return NULL;
}



Relevant Pages