Re: Efficient Text File Copy

From: Chris Torek (nospam_at_torek.net)
Date: 01/31/04


Date: 31 Jan 2004 21:12:07 GMT


>> Richard Heathfield wrote:
>>> For a hashing algorithm, either use K&R's (p144!) or Chris Torek's
>>> (with the 33 multiplier rather than 31).

Note, it is not really "my" hash (I got it it from James Gosling
many years ago -- this is the same James Gosling who is behind
Java, incidentally, but he was at CMU at the time). It uses the
remarkably simple recurrence:

    unsigned int h;
    ...
    h = 0;
    for (all bytes in the string)
        h = h * 33 + this_byte;

or more concretely, for a C-style string pointed to by "cp":

    for (h = 0; (c = *cp++) != 0;)
        h = h * 33 + c;

>CBFalconer wrote:
>> Why do you recommend this?

In article <news:bvg898$38j$1@hercules.btinternet.com>
Richard Heathfield <binary@eton.powernet.co.uk> writes:
>Why do I recommend Chris Torek's hash? Because Chris says it
>works well and I had no reason to disbelieve him; in my experience,
>he's right - it does work well.

I think he meant "why 33" (instead of a prime number, like the
"more obvious" 31 and 37). I wonder the same thing. Gosling used
33, and I tried 31 and 37, and when others were doing larger-scale
experiments (for what eventually became the "Berkeley DB" library)
I suggested trying even more variations. Different datasets had
different outcomes, but on average 33 worked better than 31.

Note that this simple hash is less effective than either a CRC or
a strong cryptographic hash (both of those distribute the input
bits much better), but this one is very fast to compute. As it
turns out, for in-memory hashing, the distribution of the hash is
often less critical than the time required to compute it.
Multiplication by 31 and 33 are both quite quick on most CPUs, and
the additional bucket-search effort that occurs after this "less
effective" hash tends to use up less than the "saved time" as
compared to a more effective hash function.

-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: forget about it   http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.