Re: Hash Function problem
- From: websnarf@xxxxxxxxx
- Date: 8 Mar 2006 12:49:08 -0800
Arthur J. O'Dwyer wrote:
On Fri, 2 Mar 2006 websnarf@xxxxxxxxx wrote:
Arthur J. O'Dwyer wrote:[...]
On Fri, 2 Mar 2006 websnarf@xxxxxxxxx wrote:
bob_jenkins@xxxxxxxxxxxxxxxx wrote:
I've seen quite a few hashing applications that don't know the length
of the string. One is when the hash is computed at the same time the
original string is read.
Well hang on. Think about this for a minute. How are you reading the
string? Think about the inner loop of the string reading function.
How far away is it from know the length of the string at any time? The
function reading the string itself cannot be well defined unless it
implicitely knows the length of the string it has read -- at least I
cannot imagine such a function.
/* Read a text file and compute its MD5 hash. */
[...]while (fgets(buf, sizeof buf, stdin) != NULL)
md5_update_nullterminated(&h, buf);
The hash function is really done as an initializing step, zero or more
update steps, and a finalizing step. At no time does the hash function
need to know the length of the /whole/ bitstring it's hashing --- it
just needs to know a few bytes at a time.
Well my thinking about this is actually that fgets() has screwed you in
terms of performance because it doesn't deliver the length of what it
read to you. See my alternative here:
http://www.pobox.com/~qed/input.html
You see, if this is a performance bottleneck, then your real problem is
that the length that is implicitely calculated in the inner loop of
fgets is being hidden from you.
At the assembly level, "process until i==n" and "process until a[i]==0"
can be made to look remarkably similar.
The problem is that a semantic byte per byte dependency is created.
The "until i == n" can be massively unrolled and have a "i += 4" in it,
whereas the second cannot.
[...] The only time you should lose
performance /no matter what/ is if your loop doesn't otherwise have to
load a[i] into a register, and that's obviously not the case here.
Well, with my solution a[i], a[i+1], a[i+2] and a[i+3] can be
simultaneously loaded in a single register all at once (I actually
split it into two pairs in my hash function, since its better suited
for my hash function.) With the "until a[i] == 0" solution, you cannot
get rid of the 4 comparisons unless you pull non-portable tricks like
the block-based strlen() implementation that I referred to.
(In the real world there is only one compiler in the worlds that I am
aware of that uses my strlen recommendation, and that's because I
*personally* pointed it out to someone working at a compiler vendor
company at the time -- this is why Bob Jenkins found that my code beat
*ALL* the compilers he tested.)
I'm not saying all compilers will bother to optimize until the two cases
look remarkably similar, but I'd expect a compiler that passed your test
for "eliminating all other sources of bottlenecking" to optimize the
"a[i]==0" case too.
There's nothing to optimize -- the compiler is unable to eliminate the
fundamental character per character comparison operation required.
This isn't a matter of instruction scheduling, or picking clever
instructions -- its a semantics problem.
[...] I don't think fgets() needs to return us the length
of the string; we're going to traverse the whole thing in the hash
function anyway.
And we are going to do so necessarily character by character.
I am (naively?) assuming that whatever your hash function is, it's going
to access every byte of the input. If your hash function is something like
"take every Nth byte and XOR them together," where N>1, then I agree that
knowing the length of the string will speed things up.
Its called "loop unrolling". Its an extremely standard and very potent
performance technique. Look it up.
You can still be clever, however. Lets say you break the fgets() down
into smaller fgets() blocks of size k. You can throw a canary into
buff[k-1] and if after return buff[k-1] is filled with something other
than '\0' then you know that you didn't encounter an interceding '\n'
and therefore can assume the string is larger than n without having to
scan over those bytes. So the final strlen call in the case where you
don't know how much you read, is going to be O(k), not O(n).
Huh? In your example, k == n, if a "normal" strlen(buff) is meant to
take O(n). Or else "normal" strlens don't take O(n), and then it's
irrelevant to mention that O(k) < O(n). (In other words, what's your
'n' here?)
n is the length of the total input. k is assumed to be much much
smaller than n asymptotically. The O() that I am measuring is just the
cost of the strlen() that you actually call, not the whole read (its
still obviously O(n).) Its just to make the point that the cost of
final strlen call itself is basically nothing.
So the
length-prefix based, incremental hash functions start looking more
attractive. See?
No. To me, "incremental" and "length-based" are opposites.
What? Think of it as "Increments of a specific length".
[...] If you need
to know the length of the input, then you need to read all of the input
first --- and then your hash function isn't incremental.
You need to know the length of each increment. That's a different
thing.
[...] If your hash
function is incremental, then you ought to be able to take the hash (or
hash-function "state") of 'abc...xyz' and turn it into the hash of
'abc...xyzfoo' in constant time, in which case, at no time did it ever
need to know the length of 'abc...xyzfoo'.
I don't follow your reasoning. If you hash 'abc...xyz' you will
implicitely know that its 26 letters no matter what function you use
(you may choose not to compute it in some algorithms, but in the
examples given thus far, doing so makes it substantially slower.) In
the case of my hash function, assuming the block break down is
4-characters at a time, you would compute:
h <- hashincr (0, 'abc...vwx', 24)
buff <- 'yz'
then at some point you would add in the 'foo':
buff <- buff + 'foo'
h <- hashincr (h, buff, buff.length())
then h would be your final output.
--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
.
- References:
- Hash Function problem
- From: Hugo
- Re: Hash Function problem
- From: websnarf
- Re: Hash Function problem
- From: websnarf
- Re: Hash Function problem
- From: CBFalconer
- Re: Hash Function problem
- From: websnarf
- Re: Hash Function problem
- From: bob_jenkins
- Re: Hash Function problem
- From: websnarf
- Re: Hash Function problem
- From: Arthur J. O'Dwyer
- Re: Hash Function problem
- From: websnarf
- Re: Hash Function problem
- From: Arthur J. O'Dwyer
- Hash Function problem
- Prev by Date: Re: c++ - book
- Next by Date: re: advice needed
- Previous by thread: Re: Hash Function problem
- Next by thread: Re: Hash Function problem
- Index(es):
Relevant Pages
|
Loading