Re: Hash Function problem



Arthur J. O'Dwyer wrote:
On Fri, 2 Mar 2006 websnarf@xxxxxxxxx wrote:
bob_jenkins@xxxxxxxxxxxxxxxx wrote:
I've seen quite a few hashing applications that don't know the length
of the string. One is when the hash is computed at the same time the
original string is read.

Well hang on. Think about this for a minute. How are you reading the
string? Think about the inner loop of the string reading function.
How far away is it from know the length of the string at any time? The
function reading the string itself cannot be well defined unless it
implicitely knows the length of the string it has read -- at least I
cannot imagine such a function.

/* Read a text file and compute its MD5 hash. */
int main(void)
{
md5_data h;
char buf[100];
md5_init(&h);
while (fgets(buf, sizeof buf, stdin) != NULL)
md5_update_nullterminated(&h, buf);
md5_finalize(&h);
printf("MD5 hash of the text file was: %x\n", h.a);
return 0;
}

The hash function is really done as an initializing step, zero or more
update steps, and a finalizing step. At no time does the hash function
need to know the length of the /whole/ bitstring it's hashing --- it
just needs to know a few bytes at a time.

You knew this already, of course --- maybe you just thought of it in
terms of a lot of little hash functions, instead of in terms of a "/whole/
bitstring" taken one piece at a time.

Well my thinking about this is actually that fgets() has screwed you in
terms of performance because it doesn't deliver the length of what it
read to you. See my alternative here:

http://www.pobox.com/~qed/input.html

You see, if this is a performance bottleneck, then your real problem is
that the length that is implicitely calculated in the inner loop of
fgets is being hidden from you.

You can still be clever, however. Lets say you break the fgets() down
into smaller fgets() blocks of size k. You can throw a canary into
buff[k-1] and if after return buff[k-1] is filled with something other
than '\0' then you know that you didn't encounter an interceding '\n'
and therefore can assume the string is larger than n without having to
scan over those bytes. So the final strlen call in the case where you
don't know how much you read, is going to be O(k), not O(n). So the
length-prefix based, incremental hash functions start looking more
attractive. See?

[...] But that's presumably the kind of
hash function Bob is talking about.

I'm sure it is too. But I had a response in between what he wrote and
what you've written here.

[...]
going there anyways. But why wouldn't you, as an alternative, do block
based updates with implicit scanning via length availability anyways?

I have no idea what that means. Is it like what I wrote above?

No, I am talking about the alternative I suggested in response.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.