Re: hash table idea good or no good?

From: Christian Gollwitzer (auriocus_at_btcips73x1.cip.uni-bayreuth.de)
Date: 11/21/03

  • Next message: Ravi Kumar Munnangi: "understanding C code"
    Date: Fri, 21 Nov 2003 11:59:44 +0100
    
    

    Michael B Allen wrote:
    > On Thu, 20 Nov 2003 07:58:11 -0500, Christian Gollwitzer wrote:
    >
    >
    >>2. In the *best case*, your storage is O(n). But the time required for
    >>storing *one entry* is not O(1), but *O(log(n))*, where n is the number
    >>of entries.
    >
    > I don't understand what you're saying. Clearly the time required to store
    > one entry is better than O(log(n)). A linked list would be O(log(n)).
    Nope. A linked list has O(n) because you cannot perform a binary search.
    A binary tree has O(log(n)) in the average case and O(n) in the worst case.
    > The
    > technique I described would be the number of levels (nl) plus the log
    > of the number of elements in the collision list (nc).

    And the number of levels is approximately c*log(n), as I worked out
    above. The constant c may be very small, in this case you can forget
    about it for only few levels.

    > The size of the
    > collision list is the number of collisions occuring as the load factor
    > grows from ~0.37 to 0.75 before the next level is allocated and the
    > load factor drops to 0.37 (1/2 of 0.75 because the total cumulative
    > size of the map (n) doubles). I don't know what that size is yet but
    > it's certainly not n.
    >
    >
    >>Another try:
    >>
    >>
    >>int
    >>hashmap_put(struct hashmap *h, void *key, void *data) {
    >> const double loadfactor=0.75;
    >> unsigned long hash;
    >> // level is an item of struct hashmap, initially = 0 // also
    >> items_in_this_level, initially = 0
    >
    >
    > I don't think this is a very good idea either. It will only fill 75%
    > of the table at each level which is a pretty bad waste of space in itself.
    Yes, you are right with it. It is no good idea.

    >>read performance is really bad. You'll get O(log n) again for searching
    >>one item.
    >
    >
    > Again, I don't see where you get linked list asymtotic growth from this.

    > void *
    > hashmap_get(const struct hashmap *h, const void *key)
    > {
    > unsigned long hash;
    > int level;
    >
    > if (h->tables[0] == NULL) {
    > return NULL;
    > }
    >
    > hash = h->hash ? h->hash(key) : hash_ptr(key);
    >
    > for (level = 0; level < 16; level++) {
    > int table_size = table_sizes[level];
    > struct entry *table = h->tables[level];
    > struct entry *e;
    >
    > e = table + hash % table_size;
    >
    > if (e->key && hash == e->hash && h->cmp(key, e->key) == 0) {
    > return e->data;
    > }
    >
    > if (h->tables[level + 1] == NULL) {
    > iter_t iter;
    > struct entry *e;
    > struct linkedlist *col = (struct linkedlist *)&h->col[h->whichcol];
    >
    > /* search the collision list */
    > linkedlist_iterate(col, &iter);
    > while ((e = linkedlist_next(col, &iter))) {
    > if (hash == e->hash && h->cmp(key, e->key) == 0) {
    > return e->data;
    > }
    > }
    >
    > break;
    > }
    > }
    >
    > return NULL;
    > }

    Assume I use the hashmap once for inserting a dictionary and then for
    looking up values many times. Assume I've inserted roughly two million
    items and there were no or only few collision. Then about half of the
    items, i.e. one million, is in table[15]. Because you look first in
    table[0], then table[1] etc. you need 16 accesses until the item is
    found for 50% of the items. 15 accesses for 25% and so on.
    So the average number of accesses for an item is:

    sum(k*2^(k-17), k=1..16) = 15.00001525....

    So the average number of levels consulted is approximately fifteen. If
    you'd used rehashing, than all elements not in the collision list would
    be found after only one access, which is clearly faster. So for this
    kind of use your approach is not suitable. It maybe for others. Try
    running it on a large example (canonical one is counting the frequency
    of all words in a large textfile) and compare with other
    implementations, like the hashmap from STL or CBFalconers
    implementation. Hard to say from theory, which one is the best in these
    cases. Of course you need to use the same hash function for such a test.
      Measure time and storage used. And try with one particularly bad "hash
    function" that simply always returns one to see what happens in case of
    many collisions.

    -- 
    Vale !
    		Christianus Auriocus
    

  • Next message: Ravi Kumar Munnangi: "understanding C code"

    Relevant Pages

    • Re: Hashed stringlist or binary searchtree or ????
      ... > If you have got a good hash function (that is, ... But the hash ... but using trees is fast enough for most problems. ... implementations only need a few bytes for empty items). ...
      (borland.public.delphi.language.objectpascal)
    • Re: Some comments on "super fast hash"
      ... SFH seems reasonably good and certainly is fast. ... > a hash, and SFH does not. ... The latest versions of each hash function which leverages this ... it must behave worse on other key sets. ...
      (comp.programming)
    • Some comments on "super fast hash"
      ... I've implemented a hash function here: ... SFH seems reasonably good and certainly is fast. ... quality of the hash function is not affected by the difference as far ... it must behave worse on other key sets. ...
      (comp.programming)
    • Re: Maximum String size in Java?
      ... >> compilation on any new target platform that does not already have ... Do you have a version of SFH posted with changes to use this file ... If they intend to use a hash ... benefit of 31/33 will sway me into using more than one hash function. ...
      (comp.programming)
    • Re: Suggestions for double-hashing scheme
      ... chain style and reprobe style are basically a wash. ... will be a smaller chance of encountering deleted entries before it. ... Once you sufficiently optimize a hash table, ... by computing of the hash function). ...
      (comp.programming)