Re: "index" efficiency... any help or ideas?

From: Alex McDonald (alex_mcd_at_btopenworld.com)
Date: 12/21/03


Date: Sun, 21 Dec 2003 19:47:05 +0000 (UTC)


"xby" <x_b_y@my-deja.com> wrote in message
news:7d9d52e1.0312210819.4a0ff42d@posting.google.com...
> "Alex McDonald" <alex_mcd@btopenworld.com> wrote in message
news:<brqcpn$igm$1@hercules.btinternet.com>...
>
> [...snip...]
> >
> > ... But I think the point is made; a well chosen hash
> > doesn't perform badly. I don't know the context of the statement by the
Prof
> > that "everything could hash to the same place even in the _best_ hashing
> > method". It could; that would just make it the worst hashing method, not
the
> > best.
>
> The book "Algorithms" by Prof. Robert Sedgewick, has a 5-chapter part
> dealing with different searching algorithms. One of these chapters
> thoroughly and clearly explains hashing, with lots and lots of good
> examples. At the end of this chapter the author compares the pros and
> cons of hashing vs. binary searching. I referred to this part of the
> book, in my post dated December 14, 2003, when I wrote as follows:
>
> As for hashing, Sedgewick writes that "it is simpler and can provide
> very fast (constant) search time"

Yes. That's the killer reason for using hashes in this application.

> but it is iffy because "everything
> could hash to the same place even in the best hashing method."

That reference to "the best hashing method" I still don't understand... If
it's that bad, it's not the best, surely.

> Therefore, while hashing is an approach worth considering, it has its
> own caveat.

Yes. But as pointed out, for certain (quite large) classes of data, hashing
does not have this downside. There are several advantages to hashing for
symbol tables used by interpreters and compilers. Searching a hash table is
O(1), whereas a b-tree is O(logM(n)). A classic binary tree is O(log2(n)); M
is equal to the number of entries stored on each leaf. Binary trees are very
poor by comparison with hashes or b-trees, and binary searching should be
avoided as sub-optimal in all except specific cases. Even then it's often
beaten hands down for small sizes (orders of 10's of entries) by simply
searching a table linearly, especially if the table is in frequency of
reference order.

Secondly, adding to hash tables is next to trivial. B-trees are far more
complex. In the symbol table of a compiler, we need to search, but we also
need to add entries quickly. (Windows C/C++ compilers regularly swallow 10s
of thousands of #define statements for instance.)

Thirdly, programmers tend to refer to the same variables regularly within
the same small stretches of code. Even with nearly full hash tables that use
lists pointed from the slot, bringing recently refered variables to the
front of the hash entry list gets you the next reference in a single hit;
the names regularly used percolate to the front.

Downside of hashes; if the data is stored externally, say on disk. In this
case hashes are not to be recommended; b-trees have a better locality of
reference when searching, and hence better performance on disk as there's
less random I/O. Size of the hash table; if collisions are to be avoided,
then the hash table may need to be quite large in comparison to the data
stored (although there are methods that can reduce that).

>
>
>
> > In the worst case, if the binary tree was built in alpha sort order,
> > the entire list would be down one side of the tree. That's exactly the
same
> > outcome; a long list.
>
> True. But this problem is correctable by means of B-tree techniques.
> If you have equivalent corrective techniques applicable to the hashing
> method, I would appreciate it greatly if you would familiarize me with
> them.

In no particular order;

1. Selection of the correct hash. Take a look at an implementation of
Pearson's string hash for instance. Simple, fast and generates an excellent
hash.

2. Perfect hashing, minimal perfect hashing, order preserving minimal
perfect hashing are techniques that require analysis of the data in advance,
and produce perfect hashes; that is, hashes with no collisions. There's no
search technique that's faster. Not practical for symbol tables where the
data isn't known in advance. These techniques are sometimes used by
compilers for optimising large case statements.

3. Re-hashing the hash; if the slot's occupied, rehash the hash. The slots
in this technique aren't chained, but contain only one entry. It's common to
have two hash algorithms for this technique. The first hashes the alpha
string, and is optimised for that task alone (ala Pearson). The second
re-hashes a hash, and is optimised to generate a sequence of non-repeating
hashes from an original hash value. Requires a large hash table with
relatively low occupancy to reduce the hit rate. Downside; it may fail on
large sets if the hash table is too small (all slots full, for instance), or
a badly designed rehash that doesn't rehash over every slot. May perfrm
badly even if the table is nearly empty; depends on the input data.

4. If using a list pointed from a slot table, move the found entry to the
list head. Works well in compilers & interpreters. Simple to implement.

For these reasons and more I would never consider anything other than a hash
symbol table in a compiler. B-trees are just too complex, and (for this
application) slower to boot. Hashing is more than "an approach worth
considering"; it's the first method of choice for in-memory searching of
symbol tables, which is where the OP started.

-- 
Regards
Alex McDonald


Relevant Pages

  • Re: "index" efficiency... any help or ideas?
    ... no general purpose hashing method, no matter how good, is good enough ... to prevent the possibility that everything hash into the same place. ... > Downside of hashes; if the data is stored externally, ... > perfect hashing are techniques that require analysis of the data in advance, ...
    (alt.lang.asm)
  • Re: Why unhashing is not possible?
    ... then it is not a hash. ... (which is where perfect hashing becomes useful) ... Types of hashes and their properties exist apart from why ... ice axes in ice, or digging axes when dealing with a forest fire); ...
    (comp.security.misc)
  • Re: disabling password request when connecting from the LAN
    ... > Seems the hashing is actually applied to each 7-byte half ... That's true only for LM "hashes" (I put that in quotes because it really ... the way the cracking programs divide the hash in two. ... >> Steve Riley ...
    (microsoft.public.win2000.networking)
  • Re: Expanding hash function bit width
    ... >I have an idea to expand the bit width of any hash function H. ... start the hashes by hashing different input ...
    (sci.crypt)
  • Re: How to add a string to a big file in csharp !
    ... Can you just split the names in the files into 26 separate files ... why don't you use a hash table to store the hash of ... when your searching for the ... ignore chunks of data using hashes). ...
    (microsoft.public.dotnet.languages.csharp)