Re: Efficient Text File Copy

From: CBFalconer (cbfalconer_at_yahoo.com)
Date: 02/01/04

  • Next message: Jerry Coffin: "Re: stl vector find."
    Date: Sun, 01 Feb 2004 01:29:38 GMT
    
    

    Chris Torek wrote:
    > Richard Heathfield <binary@eton.powernet.co.uk> writes:
    > >CBFalconer wrote:
    > >> Richard Heathfield wrote:
    >
    > >>> For a hashing algorithm, either use K&R's (p144!) or Chris
    > >>> Torek's (with the 33 multiplier rather than 31).
    >
    > Note, it is not really "my" hash (I got it it from James Gosling
    > many years ago -- this is the same James Gosling who is behind
    > Java, incidentally, but he was at CMU at the time). It uses the
    > remarkably simple recurrence:
    >
    > unsigned int h;
    > ...
    > h = 0;
    > for (all bytes in the string)
    > h = h * 33 + this_byte;
    >
    > or more concretely, for a C-style string pointed to by "cp":
    >
    > for (h = 0; (c = *cp++) != 0;)
    > h = h * 33 + c;
    >
    > >> Why do you recommend this?
    >
    > > Why do I recommend Chris Torek's hash? Because Chris says it
    > > works well and I had no reason to disbelieve him; in my
    > > experience, he's right - it does work well.
    >
    > I think he meant "why 33" (instead of a prime number, like the
    > "more obvious" 31 and 37). I wonder the same thing. Gosling used
    > 33, and I tried 31 and 37, and when others were doing larger-scale
    > experiments (for what eventually became the "Berkeley DB" library)
    > I suggested trying even more variations. Different datasets had
    > different outcomes, but on average 33 worked better than 31.
    >
    > Note that this simple hash is less effective than either a CRC or
    > a strong cryptographic hash (both of those distribute the input
    > bits much better), but this one is very fast to compute. As it
    > turns out, for in-memory hashing, the distribution of the hash is
    > often less critical than the time required to compute it.
    > Multiplication by 31 and 33 are both quite quick on most CPUs, and
    > the additional bucket-search effort that occurs after this "less
    > effective" hash tends to use up less than the "saved time" as
    > compared to a more effective hash function.

    All of 31, 33, and 37 make intuitive sense to me. 31 and 33 will
    obviously be faster on machines without multiplication.

    I have the obvious testbed in hashlib, which was originally built
    to test some other hash functions, and then expanded. I shall
    have to build a driver to run through those and list efficiencies
    Real Soon Now. I could also implement the multiplications with
    shift and add, to make a total of 5 useful hashes. I don't expect
    the differences to be earth shaking. The present hashlib
    verification suite include the awful hash function 1. That's one.

    -- 
    Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
       Available for consulting/temporary embedded and systems.
       <http://cbfalconer.home.att.net>  USE worldnet address!
    

  • Next message: Jerry Coffin: "Re: stl vector find."

    Relevant Pages

    • Re: Encryption -- Blowfish limited to 8 byte passowrds?
      ... Chris wrote: ... Why won't Java allow other length ... > some kind of hash function. ... > algorithm itself is only acting on an 8-byte key. ...
      (comp.lang.java)
    • Re: Efficient Text File Copy
      ... >> Why do I recommend Chris Torek's hash? ... I have the obvious testbed in hashlib, ... verification suite include the awful hash function 1. ...
      (comp.lang.c)
    • Re: ISO vanilla class
      ... > Personally I'd use a hash not a hash reference: ... > both required and supposed to be a scalar. ... I wrote my own base class module that does most of what ... Chris Olive ...
      (comp.lang.perl.misc)
    • Re: Sending HASH over TCP
      ... > I have been working on this for a few days now, and still feel stuck. ... > I have a large HASH which I need to send via TCP to a server. ... XML-RPC. ... Chris Olive ...
      (comp.lang.perl.misc)
    • Re: Hash
      ... Oinsertion compared to Ofor hash tables. ... hash function is really to make the odds of a worst case ... hashlib includes ...
      (comp.programming)