Re: Hash table in C++? STL?

From: Kevin Goodsell (usenet2.spamfree.fusion_at_neverbox.com)
Date: 04/15/04


Date: Thu, 15 Apr 2004 06:57:42 GMT

Siemel Naran wrote:

>
> I don't think I'm in that 1%.

I wouldn't dare claim to be either, without doing some research first.

> A naive guess is to add up all the letters
> and modulo by the bucket size,

I think you mean table size. Each bucket has a separate size determined
by the number of items that hash to that index.

Or, perhaps you meant "the size measured in number of buckets" and I was
just reading it wrong. I could see that.

> which they say should be prime, but why is
> this?

I'm not sure I ever completely understood that. The one obvious reason
doesn't apply to most implementations. In "probing" collision-resolution
schemes it makes sense, but most use "chaining" or buckets. What I'm
calling "probing" is the situation where a collision (i.e., a hash to a
location that's already taken) is handled by trying other locations. For
example, a naive approach is to try the next slot, until an empty one is
found. A better technique (I think) would be double hashing, where a
second hash function is used to determine the sequence of slots to try.
So if h1() is the primary hash function and h2() is the secondary,
suppose h1(x) = n, but n is already used. Then you would try n + h2(x),
n + 2*h2(x), n + 3*h2(x), etc. In all cases, modding by the table size.

In this case, having the table size prime ensures that you will
eventually try every table slot -- if there's an open one, you'll find it.

But nobody seems to use schemes like this very often, and it seems
obvious why. Why would you want to take up additional slots in the
table, increasing the chances of future collisions, when you can just
chain things outside the table (in buckets)? Lookup complexity for the
element is no worse (may even be better), and the table is less cluttered.

So why the prime table size? I can't see an obvious reason. If hash
values are approximately uniformly distributed between 0 and N, I'd
think that ideal table sizes would be numbers that evenly divide N,
because any remainder would split the table into two areas with
different probabilities of any particular value hashing into them.

>
> size_t hash(const char * s, int buckets) {
> size_t h = 0;
> for ( ; *s; ++s) h += *s;
> return h%bucket.
> }

Kernighan & Pike's _The Practice of Programming_ mentioned a similar
hash function, but it included (if I recall correctly) multiplying the
total by a constant before adding in the next character value. This
constant has been experimentally determined to be effective for ASCII
text. I'm not sure if anyone really knows why it works. I don't recall
what the value was, but I'm sure someone can tell us.

>
> But we need hash only the first maxchars characters.

Actually, I think K&P also talked about this, and suggested it was a bad
idea. This is from memory again, so I may be wrong, but I believe they
mentioned Java using a scheme like this originally. It turned out to not
work well, and was changed to use the entire string.

-Kevin

-- 
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.


Relevant Pages

  • Re: Perl hashing speedup ?
    ... I feel that disbalanced bucket usage / and less number of hash buckets ... Now I have some queries regarding way perl hashing mechanism maps keys ... about how perl hashes work internally; but I guess we will be needing ...
    (comp.lang.perl.moderated)
  • Re: [PATCH] dentry and inode cache hash algorithm performance changes.
    ... Did u try the new hashing algorithm with dcache bench? ... number of clock cycles spent by the hash and lookup. ... that most buckets are ... With the new hash function, ...
    (Linux-Kernel)
  • Re: Hashing
    ... What you have to keep in mind is that a hash table for words in by ... linear testing of buckets. ... What is important is to ensure that you have an effective collision ... memory for the table, you need two arrays, one is the pointers to the ...
    (alt.lang.asm)
  • Re: maximum size of a hash table
    ... no matter what the hash function. ... that "the number of buckets must be larger than ... the number of keys" to make collisions rare. ...
    (comp.lang.perl.misc)
  • Re: Hash table in C++? STL?
    ... > a tree-like structure instead of lists for the buckets). ... > the standard will require equality or a relational comparator. ... the hashing function is where all the complexity lies and I'd be ... But we need hash only the first maxchars characters. ...
    (comp.lang.cpp)