Re: Efficient Text File Copy
From: CBFalconer (cbfalconer_at_yahoo.com)
Date: 02/01/04
- Previous message: Gianni Mariani: "Re: How to get better with C++?"
- In reply to: Chris Torek: "Re: Efficient Text File Copy"
- Next in thread: Keith Bostic: "Re: Efficient Text File Copy"
- Reply: Keith Bostic: "Re: Efficient Text File Copy"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sun, 01 Feb 2004 01:29:38 GMT
Chris Torek wrote:
> Richard Heathfield <binary@eton.powernet.co.uk> writes:
> >CBFalconer wrote:
> >> Richard Heathfield wrote:
>
> >>> For a hashing algorithm, either use K&R's (p144!) or Chris
> >>> Torek's (with the 33 multiplier rather than 31).
>
> Note, it is not really "my" hash (I got it it from James Gosling
> many years ago -- this is the same James Gosling who is behind
> Java, incidentally, but he was at CMU at the time). It uses the
> remarkably simple recurrence:
>
> unsigned int h;
> ...
> h = 0;
> for (all bytes in the string)
> h = h * 33 + this_byte;
>
> or more concretely, for a C-style string pointed to by "cp":
>
> for (h = 0; (c = *cp++) != 0;)
> h = h * 33 + c;
>
> >> Why do you recommend this?
>
> > Why do I recommend Chris Torek's hash? Because Chris says it
> > works well and I had no reason to disbelieve him; in my
> > experience, he's right - it does work well.
>
> I think he meant "why 33" (instead of a prime number, like the
> "more obvious" 31 and 37). I wonder the same thing. Gosling used
> 33, and I tried 31 and 37, and when others were doing larger-scale
> experiments (for what eventually became the "Berkeley DB" library)
> I suggested trying even more variations. Different datasets had
> different outcomes, but on average 33 worked better than 31.
>
> Note that this simple hash is less effective than either a CRC or
> a strong cryptographic hash (both of those distribute the input
> bits much better), but this one is very fast to compute. As it
> turns out, for in-memory hashing, the distribution of the hash is
> often less critical than the time required to compute it.
> Multiplication by 31 and 33 are both quite quick on most CPUs, and
> the additional bucket-search effort that occurs after this "less
> effective" hash tends to use up less than the "saved time" as
> compared to a more effective hash function.
All of 31, 33, and 37 make intuitive sense to me. 31 and 33 will
obviously be faster on machines without multiplication.
I have the obvious testbed in hashlib, which was originally built
to test some other hash functions, and then expanded. I shall
have to build a driver to run through those and list efficiencies
Real Soon Now. I could also implement the multiplications with
shift and add, to make a total of 5 useful hashes. I don't expect
the differences to be earth shaking. The present hashlib
verification suite include the awful hash function 1. That's one.
-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. <http://cbfalconer.home.att.net> USE worldnet address!
- Previous message: Gianni Mariani: "Re: How to get better with C++?"
- In reply to: Chris Torek: "Re: Efficient Text File Copy"
- Next in thread: Keith Bostic: "Re: Efficient Text File Copy"
- Reply: Keith Bostic: "Re: Efficient Text File Copy"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|
|