Re: Suggestions for double-hashing scheme



CBFalconer wrote:
> websnarf@xxxxxxxxx wrote:
> > CBFalconer wrote:
> ... snip ...
> > > Meanwhile the use of a prime table size has the advantage of
> > > simplicity and robustness.
> >
> > Ok, first of all its not robust. As I recally, your solution, for
> ... snip ...
> >
> > And as to simplicity, remember you have to pick your hash table
> > size from a prime list (which you subtract from powers of 2, for
> > some reason) -- and for some bizarre reason, that escapes me, you
> > decided 8 million was as high as you were going to go. Furthermore
> > you are, for additionally unexplained reasons, scaling your
> > secondary hash function to 1/4 of the table size, and you have to
> > ensure that its non-zero. With my method, you pick power of two
> > table sizes and you just pick the second probe skip out of a small
> > table. No scaling issues, smaller code, and simply sounder.
>
> All of which shows Hsieh has no conception of what is going on. I
> shall no longer refrain, in this thread, from pointing out that
> Hsieh is considered a laughing stock in many circles, that Hsieh
> has published and claimed portability for sadly lacking code (such
> as assuming that all ints are 32 bits, and that shift operations
> can be portably performed on signed entities); and that Hsiehs
> bstring package is unusable for human beings, in that it contains
> something over 70 access functions, myriad typedefs, etc., and is
> almost certainly subject to the foul portability problems already
> seen, and who knows what else. He appears to become uncontrollable
> when such failings are pointed out.

Besides basically being a misrepresentation, what has this got to do
with hash tables?

1. Something over 70 access function

It didn't start out this way, I've have had numerous requests to expand
the functionality. Its very usable as is evidenced by the fact that
its *VERY USED*. Read what Zed A. Shaw wrote here:

http://www.zedshaw.com/projects/myriad/introduction/node28.html

And he's one of the people who *didn't* email me.

I also don't know where "over 70" comes from. The main module is less
than 70 functions, but taken as a whole, the library is well over 100
functions. The naming of the functions is very logical, its reasonably
documented, and there are numerous examples that explain how to use the
functions.

2. myriad typedefs

I count three. Two of them are just assistants to make the syntax for
two function pointers cleaner (from my experience, getting the syntax
right for function pointers seems to be one of the biggest challenges
for average C programmers.) The other one declares a bstring.

3. subject to the foul portability problems already seen

Seen where? There was one bug report related to this, basically within
about a month of me releasing the library (2.5 years ago or so). About
the only other compiler issue has been a misdetection of Borland C as
Turbo C at one point (Borland C also defines __TURBOC__ for some
reason; Turbo C is not an ANSI compliant compiler and needs special
support). As of this moment there are at least 7 different compilers
that have no problems (not even a warning or notation at the highest
strictness settings of all these compilers) with it that I know of
first hand. People on other OSes that I don't currently have access to
such as FreeBSD, Mac OS X, Irix, etc seem to have no problems with it.

> For the benefit of the onlookers, my hashlib package requires that
> a user supply several customization functions, whose exact types
> are specified in hashlib.h. These are:
>
> 2 functions of type hshfn
> 1 each of types hshcmpfn, hshdupfn, hshfreefn
> 1 (optional) of type hshexecfn to implement hshwalk (below)

Thats more typedefs than in all of bstrlib already.

> These are passed to the system once at execution of hshinit, which
> returns a pointer to an anonymous structure. This is similor to
> the FILE* returned from fopen, and is used in a similar manner to
> identify the table in use. Table use is performed by the following
> functions (all with a prefix 'hsh'):
>
> find, insert, delete
>
> and there are auxiliary functions
>
> walk, status
>
> for scanning the table content, returning statistics, etc. There
> is a further type defined for the struct (record) returned by the
> status call.

Ok, that's another typedef ...

> [...] This makes evaluating the efficacy of hash functions
> a trivial exercise, since counts are kept of probes and misses.
>
> The table is closed, and all storage retclaimed by a call to
> hshkill.
>
> The system is designed to handle storage of up to about 8e6
> different items with no special care.

So is appending the characters: [8000000] to the end of a variable
declaration.

BTW, what does your library do on a 16 bit system? You wouldn't allow
a wrap around in the argument to malloc to silently underallocate would
you?

> [...] The source specifies how to
> expand the upper limit. I consider the limit more than adequate
> for in-memory tables, especially since that limit will probably tie
> up over 128 MB of actual memory (at only 32 bytes per dataitem).

So you've never heard of "Moore's Law" have you? How about "640K ought
to be enough for anyone"? If this is just about what *you* consider
adequate, then why are *you* suggesting that anyone use what's been
custom fit just for *you*? (Compare this to Bstrlib, whose influence
from others is attested to in the acknowledgements.)

> The code is entirely portable re-entrant standard C, and has
> received no bug reports for years (apart from documentation).

I have no idea how comparable to Bstrlib this is, as I don't know what
kind of size of audience you are supporting. bstrib.sf.net is
currently processing about 9K hits a month. I've recieved 4 actual bug
reports from users in total that were serious enough to warrant an
immediate fix and patch release. (The vast majority of feedback has
been feature requests, optimization requests, or in the case of really
amateur users, just basic programming help wrt using it.)

Bstrlib also clearly covers vastly more functionality that your hash
library. It looks like you support 7 functions (init/deinit + the five
you've listed) which by even your bizarre counting is less than one
tenth of what Bstrlib covers.

> By contrast, when I downloaded Hsiehs bstring system a couple of
> months ago it appears to have been revised roughly weekly,
> indicating some shortcomings in design or implementation.

Its called feature requests, performance tuning, documentation updates
and so on. Every version for the last year is essentially stable and
usable, which I know because lots of people have been using it.

I also clearly have *MUCH* higher standards than you. Even after
pointing out the 8 million entry flaw (or whatever the exact limit was)
your response was simply to claim that its not a bug or a problem at
all. After explaining to you multiple times that "%" is a horrendously
slow operation (which you apparently still don't believe) you just
ignore this.

By comparison, a guy working for SGI emailed me to tell me that my
usage of vsnprintf() had *suboptimal* performance on IRIX workstations.
So I recoded a solution to be simultaneously optimal on all the
platforms I knew about and the IRIX at the same time. I also deal with
the fact that Turbo C doesn't support strftime() (which is really a
Turbo C problem). People have asked me for whitespace trimming
functions, replace functions, case changing functions,
overwrite-with-sub-string functions, etc. These are the kinds of
things *I* deal with, instead of ignoring them.

> Hashlib operates on data items through the supplied functions, thus
> the actual data format is entirely under the control of the user.
> The library does not care. Similarly, the library does not care
> how good or bad the hash functions are - they only affect the
> overall efficiency, and not the accuracy.

And in all of this, have you benchmarked the code and compared it
against, say, James Dow Allen's code? Have you profiled it? Have you
performed some reasonable mathematical analysis to characterize your
design. Have you directly compared it to open/chain-style hash tables?
And what is the state of the documentation for your hash library?

In view of your answers, how can you seriously convince anyone to
consider using your library instead of rolling their own custom built
solution? The advantage of a custom solution is that it can be type
safe instead of using void * generic pointers (so you can't mix up hash
tables on two different types without the compiler yelling at you).

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.



Relevant Pages

  • Re: Suggestions for double-hashing scheme
    ... >> secondary hash function to 1/4 of the table size, ... Hsieh *obviously* has far more of a clue on this ... > and that Hsiehs bstring package is unusable for human beings, ... "I don't really care about being right you know, ...
    (comp.programming)
  • Re: Detecting Brute-Force and Dictionary attacks
    ... usually modern systems doesn't compare the password you ... password attempt with the saved hash of your current password. ... It is my personal opinion that evaluating the passwords so closely, such as mentioned in the previous email to Greg's, would open yourself up to a far more complicated world than you might be thinking. ... In the past, when I've had people attempting to attack my systems, the easiest way to tell is the number of login attempts against the frequency of the attempts. ...
    (Focus-Linux)
  • Re: How good are checksums?
    ... >checksum and compare file lengths. ... >duplicate very low. ... I use a fast Adlerian checksum. ... Your problem is similar to what to do with hash table collisions. ...
    (comp.lang.java.programmer)
  • Re: Byte to byte compare, duplicate file finder/killer
    ... cryptographically strong hash, simply because the CRC ... By using a secure hash instead of CRC, the actual byte-to-byte compare ... checksum (files with identical CRCs or checksums are trivial to ...
    (comp.programming)
  • Compare datasets with hash values or possibly another way?
    ... hash values of the datasets and compare those. ... ' Create a new instance of memory stream ... Dim formatter As New BinaryFormatter ...
    (microsoft.public.vstudio.general)