Re: Extreamly large Hashtable
- From: "Chris Uppal" <chris.uppal@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Tue, 10 May 2005 11:21:02 +0100
David McDivitt wrote:
> A binary search is the fastest and most easily implemented scheme, when it
> can be used.
I don't know if you really mean this as it sounds, but taken literally it is
quite simply false.
> This is because binary searches are at the core of any other
> lookup scheme, including databases going through an index.
The same goes for this.
It /is/ true that tree-based indexing is the norm in high-performance DB
indexing, but note that:
a) the "binary" search in question isn't a binary chop over a sorted array --
it'll be a complex implementation of some sophisticated balanced tree
algorithm.
b) the reason for using a balanced tree for the index rather than hashing has
more to do with the efficiency with which elements can be added and removed,
than lookup efficiency. In fact a hashtable (with a good hash) will beat a
balanced tree implementation for lookup.
> By laying out a
> sorted flat file and doing a low level binary search directly on that flat
> file, much overhead is eliminated.
Very inefficient. Comparing just the case of a hastable laid out as a large
flat file, vs the same data laid out in sort order and accessed via a binary
chop, the hashtable will be /far/ faster. The problem is that the sorted
version will require O(log n) disk seeks to find the wanted element, whereas
the hashtable will require 1 disk seek (typically) to do the same. Disk seeks
are /slow/, doing just 10 of them consumes a substantial fraction of a second.
OTOH, the hashtable would have to be larger, in order to keep hash collisions
to an acceptable minimum. But it's surprising how little extra space is needed
in this particular case. A hashtable with a load as high as 90% works fine for
this, which is not what you'd normally expect. It's because the access time is
dominated by disk-seek time, and one seek followed by a read of a block of data
from the disk will normally pull in quite a few keys (depending on key-size, of
course) and provided that the hash collision can be resolved without needing to
read another block, the cost of the extra comparisons is essentially zero.
Note that both implementations would suffer from the problem that inserting or
deleting records could be /very/ expensive. The sorted table would suffer
rather worse from this problem than the hashtable, and the jiggery-pokery
required to reduce the problems would be harder to implement for the sorted
table, but neither is really suitable for anything except very special
purposes.
-- chris
.
- References:
- Extreamly large Hashtable
- From: anthony
- Re: Extreamly large Hashtable
- From: Eric Sosman
- Re: Extreamly large Hashtable
- From: anthony
- Re: Extreamly large Hashtable
- From: Eric Sosman
- Re: Extreamly large Hashtable
- From: anthony
- Re: Extreamly large Hashtable
- From: David McDivitt
- Re: Extreamly large Hashtable
- From: Chris Uppal
- Re: Extreamly large Hashtable
- From: David McDivitt
- Extreamly large Hashtable
- Prev by Date: Re: Generics are cool
- Next by Date: Re: JVM optimization
- Previous by thread: Re: Extreamly large Hashtable
- Next by thread: Re: Extreamly large Hashtable
- Index(es):
Relevant Pages
|
|