Re: hash table idea good or no good?

From: Michael B Allen (mba2000_at_ioplex.com)
Date: 11/23/03


Date: Sun, 23 Nov 2003 05:23:33 -0500

On Fri, 21 Nov 2003 05:59:44 -0500, Christian Gollwitzer wrote:
>> I don't understand what you're saying. Clearly the time required to
>> store one entry is better than O(log(n)). A linked list would be
>> O(log(n)).
> Nope. A linked list has O(n) because you cannot perform a binary search.
> A binary tree has O(log(n)) in the average case and O(n) in the worst
> case.

You're right. I was thinking of average time.

> Assume I use the hashmap once for inserting a dictionary and then for
> looking up values many times. Assume I've inserted roughly two million
> items and there were no or only few collision. Then about half of the
> items, i.e. one million, is in table[15]. Because you look first in
> table[0], then table[1] etc. you need 16 accesses until the item is
> found for 50% of the items. 15 accesses for 25% and so on. So the
> average number of accesses for an item is:
>
> sum(k*2^(k-17), k=1..16) = 15.00001525....

Ok, I understand this.

>
> So the average number of levels consulted is approximately fifteen. If
> you'd used rehashing, than all elements not in the collision list would
> be found after only one access, which is clearly faster. So for this

This I do not understand. It may be necessary to search ever position
with rehashing whereas with the levels scheme you're guaranteed to
eliminate half of the table positions with each test of a level.

> kind of use your approach is not suitable. It maybe for others. Try
> running it on a large example (canonical one is counting the frequency
> of all words in a large textfile) and compare with other
> implementations, like the hashmap from STL or CBFalconers
> implementation. Hard to say from theory, which one is the best in these
> cases. Of course you need to use the same hash function for such a test.
> Measure time and storage used. And try with one particularly bad "hash
> function" that simply always returns one to see what happens in case of
> many collisions.

Well numbers are king of course. I tried to run some tests like
you described but unfortunately I could not get my levels scheme to
actually work. It was dropping elements and leaking like a sieve. I have
implemented rehashing with resizing and it does work. It is consideribly
simpler too. Even if the levels scheme turned out to be faster (and
I'm still not convinced that it's not) I greatly prefer simplicity so
I would probably stick with the rehashing implementation anyway.

You win!

Mike