Re: Hashing

websnarf_at_gmail.com
Date: 03/15/05

  • Next message: \\\\o//annabee: "Re: So what does everyone look like?"
    Date: 14 Mar 2005 17:34:44 -0800
    
    

    Octavio wrote:
    > websnarf@gmail.com wrote:
    > > Yes, but speed for a hash table is related to quality (because
    > > that's what determines the number of collisions, and rehashes you
    > > will need to do) as well as flat out performance. A good hash
    > > function, I would suggest, needs to do well on both.
    >
    > I would add that "the hash table size must be a prime number and that
    > greater is it better performance due to less collisions".
    >
    > And now i will explain why i totally disagree with those topics. Its
    > true that quality hash codes reduces the collisions, thats good, but
    > not so important, if one algorithm has 10% of collisions, and another

    > has the double, then is only 10% better, but if the second algorithm
    > is 4x faster then it is 4x better and betwen the various hash
    > algorithms there is no big differencies in the number of collisions,
    > but there are important differencies on speed.

    Ok, but I have a different point of view on this. I am trying to
    present a hash function that can be used for any purpose, now or later,
    on any data, by me or anyone else who want to use the hash function.

    Its basically a notion of "reusable code". I.e., its only useful to me
    if I am not constantly maintaining it, or adjusting it for a similar,
    yet not identical purpose as I am using it for today. So by ensuring
    that my function is as collisions-free as possible under *any*
    conditions it means that in 10 years from now when I grab it for use
    for whatever unknown purpose, I won't be screwed because my "real world
    data" has changed.

    All this being said, I recognize the value of your function and your
    approach even if others seem blind to it. Of course I am skeptical of
    its resiliancy to more rigorous stress testing, but that should not
    discourage you. More to the point, it is very possible, in fact most
    likely that even if your function ends up showing flaws in its present
    form today, that it can be fixed with a final fixed cost avalanche
    mixer at the end of your code. Look back at the end of my function and
    you will see what I am talking about. I have the code at the end which
    continues to grind on the hash value even after it has stopped
    consuming input data.

    It is very likely that, when I get some time, I will set up the much
    more powerful "bit avalanche" test into my hash test, and more than
    likely this initial version of your function will fail. But if, as I
    suspect, your function truly is a lot faster than my function, then I
    will pursue the strategy of a "fix up" for your function, and in the
    end it will more than likely be the fastest thing out there and at
    least as robust as anything else.

    So I hope you understand, my criticism of your function is just me
    kicking the tires. I'm actually quite intrigued by the design -- using
    a 64 bit accumulator seems like a very straightforward method for
    capturing a lot of performance, and with the jarring simplicity of your
    mixing function I was quite stunned to see it work very effectively on
    typical sample text.

    > Rehashes are used in closed hash algorithms, that only on a few cases
    > are better than the open method. But they are a waste of time, a
    > simple pointer increment has the same probabilities of collisions and
    > is faster.

    Yeah but it has a clustering problem. You actually don't want to do it
    this way. If your closed hash table is < 70% full (which you want)
    then the probability that you need a rehash on any scan is something
    quite reasonable like 30% or so. So concentrating your efforts on
    optimizing the rehashing for brute for performance isn't the right way
    to go -- you instead want to minimize the probability that you have to
    do it, or decrease the large costs coming from walking long rehash
    chains.

    > [... explanation of closed/open hash tables snipped ...] On a given
    > language for example spanish we can have 1000000 of words but 95% of
    > times we are using a small subset of about 1000 words ,if those words

    > are placed at the begining of the lists (lru), with a hash table of
    > 2048 we have a lot of probabilities of finding the word at the first
    > try, even with 99.9 of collisions.

    Yes, yes, yes. This is all fine, but by doing this you are leaning
    very heavily on the characteristics and usage of the input data. In
    general, many of the mechanisms beyond the hash function itself will
    contribute to the effectiveness of a hash table, especially if you can
    characterize the data you are hashing. But this brings me back to my
    earlier point, that I am interesting in writing a function and
    mechanism that I don't need to keep readjusting or tweaking just
    because my input data changes in characteristics.

    > [...] And since the hash table is small, the cache will make things
    > to run faster. And finally the prime number size for hash table, was
    > used in the era of thermo-ionic valves, but now, at least good
    > programmers use powers of two.

    :) Of course. With a closed hash table, in fact, you can make the
    table size a power of two and the "rehash linear skip values" odd
    numbers. This allows the rehash to still be able to walk the entire
    table. You can in fact improve upon this by making the skip value a
    prime (this reduces clustering).

    But you should not rush to dismiss all theory. Your aging professor
    might not know much about hand optimized assembly, but may still say
    something useful in the lectures. The key to deriving value from
    "theory", just like programming, is understanding the core material at
    its most basic level. This is why I know that power of 2 sized hash
    tables, with prime skip values are a good idea. Not because some
    professor told me that, but rather because he explained why prime sized
    hash tables were a good idea. In understanding the idea, I am able to
    find a better way to achieve the same thing.

    > [...] There are also some others algorithms that other people call
    > 'hash method' simply because they use hash codes, like binary trees
    > that use a hash code instead of the word itself to build the tree, it

    > has the advantage that the tree usually is better balanced than
    > another that uses the words, for this purpose yes, a good hash code
    > is important.

    Right, a very good point. My goal, however, is to have a function
    which is both good *and* fast. Then I don't find myself making a
    different choice for different purposes. You see?

    There is a natural intuition people have that you sometimes sacrifice
    speed for quality. My tendency is to want to have both at the same
    time.

    > > you are hashing Pascal-style strings. Is this correct?
    >
    > if Pascal-style is something like db 6,'string' then is correct, I
    > think this is more eficient than asciiz strings that can be longer
    > but waste time with calls to strlen,and are limited to text data.

    I perfectly agree. See: http://bstring.sf.net/ :)

    -- 
    Paul Hsieh
    http://www.pobox.com/~qed/
    

  • Next message: \\\\o//annabee: "Re: So what does everyone look like?"

    Relevant Pages

    • Re: When will md5crk complete?
      ... and in that case birthday attack ... > His core message is correct however: you shouldn't be using MD5. ... Collisions DO exist for every hash algorithm... ...
      (sci.crypt)
    • Re: Hashing
      ... Computing the hash function, which is handled by the instructions: ... reserved word/identifier when searching through the reserved words ... collisions and four slots that have four collisions. ...
      (alt.lang.asm)
    • Re: String hashing (was: Thou shalt have no other gods before the ANSI C standard)
      ... >> a rehash in resolving hash table collisions. ... the broken "Reply" link at the bottom of the article. ...
      (sci.crypt)
    • Re: CHECKSUM() question
      ... Let's assume for a moment that you use CHECKSUM and that there are hash collisions. ... If you stage the answer in a temporary table, you can use that to join to your source table, then filter out the few collisions. ... Remember absolutely no hash that reduces the size of data for searching can be guaranteed to avoid a collision. ... However, from a private discussion, Steve Kass pointed out that HASHBYTES with MD5 for 300 million rows probably has a lower chance of collision than the the possibility that some bit will get randomly changed by some other influence. ...
      (microsoft.public.sqlserver.programming)
    • Re: Collision in SHA-0
      ... The entity requesting the certificate can often ... >able to find collisions in the underlying hash function, ... There are collisions and then there are collisions. ... same length as a cert and the same hash. ...
      (sci.crypt)

    Loading