Re: "index" efficiency... any help or ideas?

From: xby (x_b_y_at_my-deja.com)
Date: 12/23/03


Date: 23 Dec 2003 07:40:12 -0800


"Alex McDonald" <alex_mcd@btopenworld.com> wrote in message news:<bs4tbo$i1v$1@titan.btinternet.com>...
> "xby" <x_b_y@my-deja.com> wrote in message
> news:7d9d52e1.0312210819.4a0ff42d@posting.google.com...

[...snip...]

>
> > but it is iffy because "everything
> > could hash to the same place even in the best hashing method."
>
> That reference to "the best hashing method" I still don't understand... If
> it's that bad, it's not the best, surely.
>

Since I never met Prof. Sedgewick, only read his book, I can only
guess what he meant. My interpretation of this statement of his is -
no general purpose hashing method, no matter how good, is good enough
to prevent the possibility that everything hash into the same place.
Case in point: if collisions are at all possible, as is the case in
any general purpose hashing method, then there is nothing to stop the
end user from selecting keys that lead to the same collision.
Granted, unless deliberately done, the probability of this happening
is extremely small, but it is not zero. It is small enough to justify
your ignoring it as irrelevant, if you so wish, but it is there
nonetheless.

 
> > Therefore, while hashing is an approach worth considering, it has its
> > own caveat.
>
> Yes. But as pointed out, for certain (quite large) classes of data, hashing
> does not have this downside. There are several advantages to hashing for
> symbol tables used by interpreters and compilers. Searching a hash table is
> O(1), whereas a b-tree is O(logM(n)). A classic binary tree is O(log2(n)); M
> is equal to the number of entries stored on each leaf. Binary trees are very
> poor by comparison with hashes or b-trees, and binary searching should be
> avoided as sub-optimal in all except specific cases. Even then it's often
> beaten hands down for small sizes (orders of 10's of entries) by simply
> searching a table linearly, especially if the table is in frequency of
> reference order.
>
> Secondly, adding to hash tables is next to trivial. B-trees are far more
> complex. In the symbol table of a compiler, we need to search, but we also
> need to add entries quickly. (Windows C/C++ compilers regularly swallow 10s
> of thousands of #define statements for instance.)
>
> Thirdly, programmers tend to refer to the same variables regularly within
> the same small stretches of code. Even with nearly full hash tables that use
> lists pointed from the slot, bringing recently refered variables to the
> front of the hash entry list gets you the next reference in a single hit;
> the names regularly used percolate to the front.
>
> Downside of hashes; if the data is stored externally, say on disk. In this
> case hashes are not to be recommended; b-trees have a better locality of
> reference when searching, and hence better performance on disk as there's
> less random I/O. Size of the hash table; if collisions are to be avoided,
> then the hash table may need to be quite large in comparison to the data
> stored (although there are methods that can reduce that).
>
> >
> >
> >
> > > In the worst case, if the binary tree was built in alpha sort order,
> > > the entire list would be down one side of the tree. That's exactly the
> same
> > > outcome; a long list.
> >
> > True. But this problem is correctable by means of B-tree techniques.
> > If you have equivalent corrective techniques applicable to the hashing
> > method, I would appreciate it greatly if you would familiarize me with
> > them.
>
> In no particular order;
>
> 1. Selection of the correct hash. Take a look at an implementation of
> Pearson's string hash for instance. Simple, fast and generates an excellent
> hash.
>
> 2. Perfect hashing, minimal perfect hashing, order preserving minimal
> perfect hashing are techniques that require analysis of the data in advance,
> and produce perfect hashes; that is, hashes with no collisions. There's no
> search technique that's faster. Not practical for symbol tables where the
> data isn't known in advance. These techniques are sometimes used by
> compilers for optimising large case statements.
>
> 3. Re-hashing the hash; if the slot's occupied, rehash the hash. The slots
> in this technique aren't chained, but contain only one entry. It's common to
> have two hash algorithms for this technique. The first hashes the alpha
> string, and is optimised for that task alone (ala Pearson). The second
> re-hashes a hash, and is optimised to generate a sequence of non-repeating
> hashes from an original hash value. Requires a large hash table with
> relatively low occupancy to reduce the hit rate. Downside; it may fail on
> large sets if the hash table is too small (all slots full, for instance), or
> a badly designed rehash that doesn't rehash over every slot. May perfrm
> badly even if the table is nearly empty; depends on the input data.
>
> 4. If using a list pointed from a slot table, move the found entry to the
> list head. Works well in compilers & interpreters. Simple to implement.
>
> For these reasons and more I would never consider anything other than a hash
> symbol table in a compiler. B-trees are just too complex, and (for this
> application) slower to boot. Hashing is more than "an approach worth
> considering"; it's the first method of choice for in-memory searching of
> symbol tables, which is where the OP started.

I agree. But all of this had to be summarized when hashing was
suggested, as a possible solution, in the response to the original
poster query. I reacted to the claim, without any reservations, that
hashing is the best solution. Period. No, it is not. The caveat is
necessary.

As far as I can see, this debate has already provided the original
poster with the information he needs to make his decision.



Relevant Pages

  • Re: Why unhashing is not possible?
    ... then it is not a hash. ... (which is where perfect hashing becomes useful) ... Types of hashes and their properties exist apart from why ... ice axes in ice, or digging axes when dealing with a forest fire); ...
    (comp.security.misc)
  • Re: A question on an article dealing with pass phrase and keys
    ... In the section Keys vs. Passphrases He mentions using a hashing ... first and it goes through the hash function first. ... Hashing the passphrase to produce a key will not increase ... This can be useful for various crypto applications, ...
    (sci.crypt)
  • Re: "index" efficiency... any help or ideas?
    ... > dealing with different searching algorithms. ... > cons of hashing vs. binary searching. ... That's the killer reason for using hashes in this application. ... Searching a hash table is ...
    (alt.lang.asm)
  • Re: Hashing
    ... using elfhash() for the hash function, and a closed hash table 30,241 in ... tokens, but for a first alpha version system is acceptable. ... However, I've isolated the whole hashing functionality, so that swapping it ... (just a little slow when getting above 20,000+ labels). ...
    (alt.lang.asm)
  • Re: Hashing
    ... I hope people understand that the second hash function is used to ... a variation of the open hashing scheme is fine. ... it takes Otries and space to find a collision for any ... of MD5 collision has been found, and a theoretical weakness has ...
    (alt.lang.asm)