Re: Hash functions (was: Maximum String size in Java?)
From: CBFalconer (cbfalconer_at_yahoo.com)
Date: 03/13/05
- Next message: websnarf_at_gmail.com: "Re: Variation of bit-counting"
- Previous message: Willem: "Re: Variation of bit-counting"
- In reply to: websnarf_at_gmail.com: "Re: Hash functions (was: Maximum String size in Java?)"
- Next in thread: websnarf_at_gmail.com: "Re: Hash functions (was: Maximum String size in Java?)"
- Reply: websnarf_at_gmail.com: "Re: Hash functions (was: Maximum String size in Java?)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sun, 13 Mar 2005 11:31:34 GMT
websnarf@gmail.com wrote:
> CBFalconer wrote:
>> websnarf@gmail.com wrote:
>
... snip ...
>
>> [...] and the possible (not at all sure here) reliance on
>> the input strings being 32 bit aligned.
>
> How dense are you? SFH is *INSENSITIVE* to alignment. It makes
> no difference. This is just something you made up that has no
> relation to anything.
You do ramble on. On closer inspection (which should not be
necessary in an alleged portable routine) I see that you have
forced byte by byte access for anything not known to be a 386
derived system, by cleverly concealing the real action beneath
customization macros. That means that the fault will only show up
via timing tests. What a mess of worms.
>
>> [...] In addition the code performed shifting operations
>> on signed integers,
>
> Where does it do this?
It doesn't in the code I used, because I corrected it.
Addendum: I may be mistaken here. I distinctly remember
correcting ints to longs, as the original tried to shoehorn 32 bit
quantities into possibly 16 bit buckets.
>
>> [...] which is very very evil.
>
> Not outside of comp.lang.c it isn't. Signed right shift is
> something real programmers use as necessary.
Now here is something that really makes every claim of yours
suspect. Real programmers do not cavalierly invoke undefined or
implementation defined behaviour.
>
>> [...] The version in my test code has all but the (possible)
>> alignment problems more or less handled,
>
> This is nonsense. On systems that have 64 bit unsigned longs (of
> which there are definately a few) the hash calculations will be
> clearly wrong in your embodiment.
Where does 64 appear? This is typical of your attitude to correct
code. You have hidden assumptions all over that an int consists of
32 functional bits. There is no such guarantee in the C standard.
Thus your code is inherently suspect and untrustworthy, as you just
stated. It won't even survive a transition to 64 bit machinery.
>
... snip ...
>
> Yes, but strlen is *NOT* an acceptable source for such length.
> Certainly not in performance sensitive situations.
>
>> [...] Feel free to suggest another source.
>
> My hash test mechanism never calls strlen, or any equivalent.
In the real world it is necessary to connect existing software and
data to new software and data. In that real world 99% of the
strings in C code are nul terminated, and do not carry a built in
length. Don't even suggest your bstrlib, which is suspect from
your own comments about non-ASCII chars below.
... snip ...
>
> Because you're not measuring the right things, and you are crippling
> SFH for some reason.
>
>> [...] This leaves simplicity, clarity and correctness as the
>> major selection criteria.
... snip ...
>
>> Very few lines differ only in their last characters. This is
>> clearly shown by the profile.
>
> I.e., you picked data which works well for your hash.
On the contrary, I picked a suitably large text file that was
already present on my system, and measured the results. If you
would prefer a different data file here is one that is only about
240 kB:
[1] c:\c\hshftst>hshftst < ..\hashlib\psalms.txt
Time for some elementary operations:
Input 4922 lines in 0.604 secs
A null hash function call nullfn 0.009 secs
Hashing with sx31hsh in 0.073 secs
Hashing with ssfh_ph in 0.082 secs
Statistics for storing full lines:
2460 duplicated lines
Entries = 2462 Probes = 13892 Misses = 5747
Inserted with sx31hsh and ssfh_ph in 0.385 secs
2460 duplicated lines
Entries = 2462 Probes = 14029 Misses = 5884
Inserted with ssfh_ph and sx31hsh in 0.330 secs
2460 duplicated lines
Entries = 2462 Probes = 14005 Misses = 5860
Inserted with sx37hsh and sx31hsh in 0.330 secs
Time to parse words from the source:
Found 45146 words
Parsing words in 1.099 secs
Parsing and hashing words with sx31hsh in 1.209 secs
Parsing and hashing words with ssfh_h in 1.374 secs
Parsing and hashing words with sx37hsh in 1.264 secs
Statistics for parsing, hashing and storing those words:
37447 duplicated words
Entries = 7699 Probes = 94889 Misses = 35846
Inserted with sx31hsh and ssfh_ph in 2.363 secs
37447 duplicated words
Entries = 7699 Probes = 95639 Misses = 36596
Inserted with ssfh_ph and sx31hsh in 2.418 secs
37447 duplicated words
Entries = 7699 Probes = 93503 Misses = 34460
Inserted with sx37hsh and sx31hsh in 2.253 secs
>
>> You seem to fail to appreciate that one of the primary performance
>> factors for a hash table is the loading level.
>
> This affects a *CLOSED* hash table, and I am very aware of this.
That statement by itself shows your lack of awareness. Possibly
just muddy thinking or failure to express yourself clearly.
... snip ...
>>>
>>> So your test can't possibly measure anything of relevance. And
>>> BTW, there are non-ASCII characters in n869_txt.
>>
>> So what? How does that affect its use as test source?
>
> I spent like 15 minutes tracking down what I thought was the first
> ever data corruption bug in Bstrlib. It would have been nice to
> know that the source data itself was corrupted.
The non-ASCII characters are hex b1 and d7, I believe. They fall
neatly into the definitions of unsigned chars in C. The fact that
your string library has problems with them speaks for itself. I am
sure the Scandinaves, to mention only one group, will love those
failings. It sounds like the worst possible string library for
them. BTW, the standard library string routines seem to have no
problems with them.
... snip ...
>
> You would, of course, like to cite a recent instance where I have
> presented non-portable code?
You mentioned your bstrlib and its failing just above. You
published your SFH code with the shifted and undersized ints. Is
the past few weeks too constricting?
... snip ...
>>
>> What is not fair?
>
> 1. You don't account for the fact that SFH needs only to be called
> once for both primary and secondary hashes.
> 2. You call strlen for every call to SFH.
> 3. You don't perform hash precheck, an important performance
> consideration for hash tables in general, where SFH is just clearly
> superior to x37 or x31.
> 4. You are dilluting the performance advantage of SFH by using it in a
> mechanism which has other glaring performance problems.
And all of these do not matter. The performance considerations are
buried in other factors, such as i/o. Most importantly, there
appears to be no guarantee of correctness allowed for you to
consider the tests fair.
>>From this viewpoint you are digging yourself in deeper and deeper.
The only thing you seem to be muttering about is the use and
accuracy of SFH. You have even admitted that it won't survive any
system without a 32 bit int, and your original code wouldn't even
survive that due to overflows.
-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. <http://cbfalconer.home.att.net> USE worldnet address!
- Next message: websnarf_at_gmail.com: "Re: Variation of bit-counting"
- Previous message: Willem: "Re: Variation of bit-counting"
- In reply to: websnarf_at_gmail.com: "Re: Hash functions (was: Maximum String size in Java?)"
- Next in thread: websnarf_at_gmail.com: "Re: Hash functions (was: Maximum String size in Java?)"
- Reply: websnarf_at_gmail.com: "Re: Hash functions (was: Maximum String size in Java?)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|