# Re: Hash Function problem

*From*: websnarf@xxxxxxxxx*Date*: 2 Mar 2006 23:22:58 -0800

Arthur J. O'Dwyer wrote:

On Fri, 2 Mar 2006 websnarf@xxxxxxxxx wrote:

bob_jenkins@xxxxxxxxxxxxxxxx wrote:

I've seen quite a few hashing applications that don't know the length

of the string. One is when the hash is computed at the same time the

original string is read.

Well hang on. Think about this for a minute. How are you reading the

string? Think about the inner loop of the string reading function.

How far away is it from know the length of the string at any time? The

function reading the string itself cannot be well defined unless it

implicitely knows the length of the string it has read -- at least I

cannot imagine such a function.

/* Read a text file and compute its MD5 hash. */

int main(void)

{

md5_data h;

char buf[100];

md5_init(&h);

while (fgets(buf, sizeof buf, stdin) != NULL)

md5_update_nullterminated(&h, buf);

md5_finalize(&h);

printf("MD5 hash of the text file was: %x\n", h.a);

return 0;

}

The hash function is really done as an initializing step, zero or more

update steps, and a finalizing step. At no time does the hash function

need to know the length of the /whole/ bitstring it's hashing --- it

just needs to know a few bytes at a time.

You knew this already, of course --- maybe you just thought of it in

terms of a lot of little hash functions, instead of in terms of a "/whole/

bitstring" taken one piece at a time.

Well my thinking about this is actually that fgets() has screwed you in

terms of performance because it doesn't deliver the length of what it

read to you. See my alternative here:

http://www.pobox.com/~qed/input.html

You see, if this is a performance bottleneck, then your real problem is

that the length that is implicitely calculated in the inner loop of

fgets is being hidden from you.

You can still be clever, however. Lets say you break the fgets() down

into smaller fgets() blocks of size k. You can throw a canary into

buff[k-1] and if after return buff[k-1] is filled with something other

than '\0' then you know that you didn't encounter an interceding '\n'

and therefore can assume the string is larger than n without having to

scan over those bytes. So the final strlen call in the case where you

don't know how much you read, is going to be O(k), not O(n). So the

length-prefix based, incremental hash functions start looking more

attractive. See?

[...] But that's presumably the kind of

hash function Bob is talking about.

I'm sure it is too. But I had a response in between what he wrote and

what you've written here.

[...]

going there anyways. But why wouldn't you, as an alternative, do block

based updates with implicit scanning via length availability anyways?

I have no idea what that means. Is it like what I wrote above?

No, I am talking about the alternative I suggested in response.

--

Paul Hsieh

http://www.pobox.com/~qed/

http://bstring.sf.net/

.

**Follow-Ups**:**Re: Hash Function problem***From:*Arthur J. O'Dwyer

**References**:**Hash Function problem***From:*Hugo

**Re: Hash Function problem***From:*websnarf

**Re: Hash Function problem***From:*websnarf

**Re: Hash Function problem***From:*CBFalconer

**Re: Hash Function problem***From:*websnarf

**Re: Hash Function problem***From:*bob_jenkins

**Re: Hash Function problem***From:*websnarf

**Re: Hash Function problem***From:*Arthur J. O'Dwyer

- Prev by Date:
**Re: Best copying stable sort algorithm?** - Next by Date:
**Re: Hash Function problem** - Previous by thread:
**Re: Hash Function problem** - Next by thread:
**Re: Hash Function problem** - Index(es):