Re: Hashing
websnarf_at_gmail.com
Date: 03/15/05
- Previous message: Evenbit: "Re: So what does everyone look like?"
- Maybe in reply to: Chewy509: "Hashing"
- Next in thread: Betov: "Re: Hashing"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 14 Mar 2005 17:34:44 -0800
Octavio wrote:
> websnarf@gmail.com wrote:
> > Yes, but speed for a hash table is related to quality (because
> > that's what determines the number of collisions, and rehashes you
> > will need to do) as well as flat out performance. A good hash
> > function, I would suggest, needs to do well on both.
>
> I would add that "the hash table size must be a prime number and that
> greater is it better performance due to less collisions".
>
> And now i will explain why i totally disagree with those topics. Its
> true that quality hash codes reduces the collisions, thats good, but
> not so important, if one algorithm has 10% of collisions, and another
> has the double, then is only 10% better, but if the second algorithm
> is 4x faster then it is 4x better and betwen the various hash
> algorithms there is no big differencies in the number of collisions,
> but there are important differencies on speed.
Ok, but I have a different point of view on this. I am trying to
present a hash function that can be used for any purpose, now or later,
on any data, by me or anyone else who want to use the hash function.
Its basically a notion of "reusable code". I.e., its only useful to me
if I am not constantly maintaining it, or adjusting it for a similar,
yet not identical purpose as I am using it for today. So by ensuring
that my function is as collisions-free as possible under *any*
conditions it means that in 10 years from now when I grab it for use
for whatever unknown purpose, I won't be screwed because my "real world
data" has changed.
All this being said, I recognize the value of your function and your
approach even if others seem blind to it. Of course I am skeptical of
its resiliancy to more rigorous stress testing, but that should not
discourage you. More to the point, it is very possible, in fact most
likely that even if your function ends up showing flaws in its present
form today, that it can be fixed with a final fixed cost avalanche
mixer at the end of your code. Look back at the end of my function and
you will see what I am talking about. I have the code at the end which
continues to grind on the hash value even after it has stopped
consuming input data.
It is very likely that, when I get some time, I will set up the much
more powerful "bit avalanche" test into my hash test, and more than
likely this initial version of your function will fail. But if, as I
suspect, your function truly is a lot faster than my function, then I
will pursue the strategy of a "fix up" for your function, and in the
end it will more than likely be the fastest thing out there and at
least as robust as anything else.
So I hope you understand, my criticism of your function is just me
kicking the tires. I'm actually quite intrigued by the design -- using
a 64 bit accumulator seems like a very straightforward method for
capturing a lot of performance, and with the jarring simplicity of your
mixing function I was quite stunned to see it work very effectively on
typical sample text.
> Rehashes are used in closed hash algorithms, that only on a few cases
> are better than the open method. But they are a waste of time, a
> simple pointer increment has the same probabilities of collisions and
> is faster.
Yeah but it has a clustering problem. You actually don't want to do it
this way. If your closed hash table is < 70% full (which you want)
then the probability that you need a rehash on any scan is something
quite reasonable like 30% or so. So concentrating your efforts on
optimizing the rehashing for brute for performance isn't the right way
to go -- you instead want to minimize the probability that you have to
do it, or decrease the large costs coming from walking long rehash
chains.
> [... explanation of closed/open hash tables snipped ...] On a given
> language for example spanish we can have 1000000 of words but 95% of
> times we are using a small subset of about 1000 words ,if those words
> are placed at the begining of the lists (lru), with a hash table of
> 2048 we have a lot of probabilities of finding the word at the first
> try, even with 99.9 of collisions.
Yes, yes, yes. This is all fine, but by doing this you are leaning
very heavily on the characteristics and usage of the input data. In
general, many of the mechanisms beyond the hash function itself will
contribute to the effectiveness of a hash table, especially if you can
characterize the data you are hashing. But this brings me back to my
earlier point, that I am interesting in writing a function and
mechanism that I don't need to keep readjusting or tweaking just
because my input data changes in characteristics.
> [...] And since the hash table is small, the cache will make things
> to run faster. And finally the prime number size for hash table, was
> used in the era of thermo-ionic valves, but now, at least good
> programmers use powers of two.
:) Of course. With a closed hash table, in fact, you can make the
table size a power of two and the "rehash linear skip values" odd
numbers. This allows the rehash to still be able to walk the entire
table. You can in fact improve upon this by making the skip value a
prime (this reduces clustering).
But you should not rush to dismiss all theory. Your aging professor
might not know much about hand optimized assembly, but may still say
something useful in the lectures. The key to deriving value from
"theory", just like programming, is understanding the core material at
its most basic level. This is why I know that power of 2 sized hash
tables, with prime skip values are a good idea. Not because some
professor told me that, but rather because he explained why prime sized
hash tables were a good idea. In understanding the idea, I am able to
find a better way to achieve the same thing.
> [...] There are also some others algorithms that other people call
> 'hash method' simply because they use hash codes, like binary trees
> that use a hash code instead of the word itself to build the tree, it
> has the advantage that the tree usually is better balanced than
> another that uses the words, for this purpose yes, a good hash code
> is important.
Right, a very good point. My goal, however, is to have a function
which is both good *and* fast. Then I don't find myself making a
different choice for different purposes. You see?
There is a natural intuition people have that you sometimes sacrifice
speed for quality. My tendency is to want to have both at the same
time.
> > you are hashing Pascal-style strings. Is this correct?
>
> if Pascal-style is something like db 6,'string' then is correct, I
> think this is more eficient than asciiz strings that can be longer
> but waste time with calls to strlen,and are limited to text data.
I perfectly agree. See: http://bstring.sf.net/ :)
-- Paul Hsieh http://www.pobox.com/~qed/
- Previous message: Evenbit: "Re: So what does everyone look like?"
- Maybe in reply to: Chewy509: "Hashing"
- Next in thread: Betov: "Re: Hashing"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|