Re: hash table idea good or no good?
From: Christian Gollwitzer (auriocus_at_btcips73x1.cip.uni-bayreuth.de)
Date: 11/19/03
- Next message: Programmer Dude: "Re: Selected text coordinates"
- Previous message: Alan Balmer: "Re: Programming Language Productivity: The Stupidity of Programmers"
- In reply to: Michael B Allen: "Re: hash table idea good or no good?"
- Next in thread: Michael B Allen: "Re: hash table idea good or no good?"
- Reply: Michael B Allen: "Re: hash table idea good or no good?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 19 Nov 2003 18:06:47 +0100
>>If your key-space is larger than the available cells
>
>
> If? It unsigned long is 32 bits and the primes I listed are much smaller
> than that then indeed the key-space is clearly much larger than the
> number of cells in each level.
OK, I'm a bit confused what your actual key is. I'll try to describe
three different uses of your function, for which I'll construct the
worst case storage.
Case a)
You want to use strings as the key. Your hash=hash_fn(key) returns some
hash (say hash_fn(key)=MD5(key)) which than serves as the primary key.
To compute the index for table[i], you use
table[i][hash % prime[i]]
The flaw here is, that different keys may hash to the same hash_fn(key).
If you take 17 strings that have the same hash=hash_fn(key), then you
will end up with the table full with 17 entries. That's what I ment with
my first message
Case b)
You want to store strings, but you use the entire string as one big
number = hash. Then your function accesses the values as
table[i][key % prime[i]]
Because in this case the key is unlimited in size, it is easy to find a
series with the same property as above:
key1=prime1 --> hashes to 0 in table 1
key2=prime1*prime2 --> hashes to 0 in table1, than to 0 in table2
.
.
.
key[n]=product(prime[k], k=1..n)
If the keys are inserted in this order (Note: The performance of your
scheme depends on the insertion order - this asymmetry is one of its
flaws), then again you will have obsessed each cell 0 in each table
causing O(exp(n)) storage
Case c)
Your keys are really only 32 bit integers. In this case the series above
won't work, because if the products are truncated, this will corrupt the
least common multiple scheme used above. An alternate scheme I tried was
much better in storage (O(2^n)), but may be it can be improved to
something polynomial:
Store 0.
Store p1. This will alloc table2, since it is hashed to zero in table 1
store p2%p1 //goes to table1
store p2 // goes to table3, because p2%p1 in t1 and 0 in t2 are used
store p3%p2%p1 //goes to table 1
store p3%p2 // goes to table 2
store p3%p1 // goes to table 1
store p3 // to table 4 , because hashes to zero in t3
As can be derived easily, to alloc level n, I need 2^(n-1) values at
maximum. There is a quite good chance that I don't need them all. This
scheme can be improved with the idea from b) given that the product of
two primes always hashes to zero for these two primes. May be with
number theory magic one can find even worse cases.
So in all cases you gave bad storage performance guaranteed for one
particular insertion order of keys. (OK, to be honest, I could not proof
it in the last case, but I suspect one can get it with O(n^2)).
>>(that's the basics
>>of hashing), you will always have collisions (simple set theory). Maybe
>>you can avoid this one extreme case, but I think it is basically no good
>>idea.
>
>
> Well your right; by itself this is not a good idea. I think it the worst
> case can be avoided however if it is combined with another technique. For
> example you could just maintain a list to catch all collisions until
> the list reaches some fraction of the size of the current level at
> which point you allocate the next level and re-insert all items in the
> collision list. That way you remove the chance for pathologically poor
> memory utilization but when you resize you only reinsert a fraction
> (e.g. 1/8th) of the number of items in the current level.
If you want to write optimize the hash table, *and* combine with another
technique, it may be successful. However, for read optimization it is no
good idea because you have approx. O(log n) steps for reading if the
table is full. (Considering your primes are approx. c^n, which is sensible)
-- Vale ! Christianus Auriocus
- Next message: Programmer Dude: "Re: Selected text coordinates"
- Previous message: Alan Balmer: "Re: Programming Language Productivity: The Stupidity of Programmers"
- In reply to: Michael B Allen: "Re: hash table idea good or no good?"
- Next in thread: Michael B Allen: "Re: hash table idea good or no good?"
- Reply: Michael B Allen: "Re: hash table idea good or no good?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|