Re: hash table idea good or no good?
From: Christian Gollwitzer (auriocus_at_btcips73x1.cip.uni-bayreuth.de)
Date: 11/20/03
- Next message: Darrell Grainger: "Re: Find the series"
- Previous message: Michael Littlejohn: "Re: hours programming, max"
- In reply to: Michael B Allen: "Re: hash table idea good or no good?"
- Next in thread: Michael B Allen: "Re: hash table idea good or no good?"
- Reply: Michael B Allen: "Re: hash table idea good or no good?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 20 Nov 2003 13:58:11 +0100
Michael B Allen wrote:
> On Wed, 19 Nov 2003 12:06:47 -0500, Christian Gollwitzer wrote:
>
>>OK, I'm a bit confused what your actual key is. I'll try to describe
>
>
> The key is a void * that will be passed to a user defined hash function
> for generating the hash value.
That does it make clear to me, together with your code below, so Case a)
is what you do.
>>Case a)
>>You want to use strings as the key. Your hash=hash_fn(key) returns some
>>hash (say hash_fn(key)=MD5(key)) which than serves as the primary key.
>>To compute the index for table[i], you use
>>
>>table[i][hash % prime[i]]
>>
>>The flaw here is, that different keys may hash to the same hash_fn(key).
>>If you take 17 strings that have the same hash=hash_fn(key), then you
>>will end up with the table full with 17 entries. That's what I ment with
>> my first message
>
>
> Sure, but the likelyhood of the first 17 entries generating the same
> hash value is very very small. Still I acknoledge that case will need
> to be dealt with to prevent someone from deliberately trying to exploit
> the problem.
>
>
>>Case b)
>>You want to store strings, but you use the entire string as one big
>
> <snip>
>
>>Case c)
>>Your keys are really only 32 bit integers. In this case the series above
>
>
> I don't really understand what you're saying here. If the hash function
> is poor that's a different issue.
Forget abaout cases b) and c), it arose because I was not aware what
your key is. Now it is clear. Two comments to your code:
1. You claim it is unlikely that the first 17 keys hash to the same
value. As this is probably true, it is much more likely that (given the
user chooses a bad hash function) e.g. from 2000 entries inserted some
16 hash to the same value. In this case, two million entries are alloced
causing a storage ratio 1:1000. You cannot safely assume the user has a
very good hash function. The standard technique requires O(n) time in
this case, but also O(n) space to store all. Your technique requires
O(n) time but O(exp(n)) space in this worst case, which is clearly
inferior.
2. In the *best case*, your storage is O(n). But the time required for
storing *one entry* is not O(1), but *O(log(n))*, where n is the number
of entries.
Proof: Keys are evenly distributed, your table sizes follow
approximately the scheme c*2^k, k is the number of the table-level.
(This is a reasonable choice)
Now in the best case, the first c*2^0 items fill up table1.
The next c*2^1 entries fill table2. The next c*2^2 fill table3.
Now consider I want to store an item, and the tables 1..k are already
full. Then the for-loop runs n times to find an empty cell. Since the
items already contained in your map are
n=sum(c*2^i, i=0..k-1)=c*(2^k-1),
you need k~log(n) steps to find an empty cell, if the table contains n
items. This means storing a new element is logarithmic in the number of
items already contained, not O(1) as expected from the hash map. The
reason is simply, that you do not exploit the fact that table1 is full
and won't except further items.
> int
> hashmap_put(struct hashmap *h, void *key, void *data)
> {
> unsigned long hash;
> int level;
>
> hash = h->hash ? h->hash(key) : hash_ptr(key);
>
> for (level = 0; level < 16; level++) {
> int table_size = table_sizes[level];
> struct entry *table = h->tables[level];
> struct entry *e;
>
> if (table == NULL) {
> table = calloc(table_size, sizeof *e);
> if (table == NULL) {
> return -1;
> }
> h->tables[level] = table;
> }
>
> e = table + hash % table_size;
>
> if (e->key == NULL) {
> e->key = key;
> e->data = data;
> h->size++;
> return 0;
> }
>
> if (h->cmp(key, e->key) == 0) {
> errno = EEXIST;
> return -1;
> }
> }
>
> errno = ENOSPC;
>
> return -1;
> }
>
Another try:
int
hashmap_put(struct hashmap *h, void *key, void *data)
{
const double loadfactor=0.75;
unsigned long hash;
// level is an item of struct hashmap, initially = 0
// also items_in_this_level, initially = 0
int table_size = table_sizes[h->level];
struct entry *table;
struct entry *e;
hash = h->hash ? h->hash(key) : hash_ptr(key);
// resize only if loadfactor reached
if (table_size*loadfactor < h->items_in_this_level) {
h->level++;
if (h->level>16) return -1;
h->items_in_this_level=0;
table_size = table_sizes[level];
table = calloc(table_size, sizeof *e);
if (table == NULL) {
return -1;
}
h->tables[h->level] = table;
// now insert all items from the overflow list into the table,
// perhaps getting new collisions
for (int i=0; i<h->overflow_list_size; i++) {
entry = h->overflow_list[i];
hash = hash_fn (entry->key);
e = table + hash % table_size;
if (e->key == NULL) {
*e = entry;
h->overflow_delete(i);
}
}
} else {
table= h->tables[h->level];
}
e = table + hash % table_size;
if (e->key != NULL) {
//insert into overflow buffer
h-> overflow_insert(e);
h->items_in_this_level++;
return 0;
}
e->key = key;
e->data = data;
h->size++;
h->items_in_this_level++;
return 0;
}
Note: this is pseudocode, I've not tried to compile.
It has another problem: It does not notice when one key is inserted
twice. However, the read performance is really bad. You'll get O(log n)
again for searching one item. However, it does not suffer from the
exponential storage problem: If you insert the same key many times, the
overflow list simply grows, but it does not alloc exponentially
increasing memory, which is your biggest problem. But searching an item
isn't linear as in hashing with one table.
Have fun,
-- Vale ! Christianus Auriocus
- Next message: Darrell Grainger: "Re: Find the series"
- Previous message: Michael Littlejohn: "Re: hours programming, max"
- In reply to: Michael B Allen: "Re: hash table idea good or no good?"
- Next in thread: Michael B Allen: "Re: hash table idea good or no good?"
- Reply: Michael B Allen: "Re: hash table idea good or no good?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
- Re: Suggestions for double-hashing scheme
... >> Once you sufficiently optimize a hash table, the big cost in hashing ...
>> by computing of the hash function). ... any good hash function basically
always has a minimal ... >> What you want is for each of these entries to have a
different reprobe ... (comp.programming) - Re: If not readdir() then what?
... All we have to do is cache the open files. ... We keep these in an LRU list
and a hash table. ... hash LRU chain, but rather a number of LRU chains. ...
Have a rbtree for storing directory entries. ... (Linux-Kernel) - Re: Algorimic Complexity Attacks
... > keyed hash. ... If the secret itself is not leaked in the attack
(and it ... this does have its difficulty: maintaining existing entries. ... This
means the attack will be thwarted if the secret hash function (e.g. ... (Bugtraq) - Re: Hash Function problem
... I need to generate a hash key for 50 millions different ... entries.
... key will be used as an unique id, I want as few collisions as ... Do you think
a hash function with a 64 bit output will be ... (comp.programming) - Re: Suggestions for double-hashing scheme
... chain style and reprobe style are basically a wash. ... will be a smaller chance
of encountering deleted entries before it. ... Once you sufficiently optimize a hash
table, ... by computing of the hash function). ... (comp.programming)