Re: using a tree to delete duplicate lines in a text
- From: Ben Bacarisse <ben.usenet@xxxxxxxxx>
- Date: Tue, 21 Apr 2009 23:58:54 +0100
cri@xxxxxxxx (Richard Harter) writes:
On Tue, 21 Apr 2009 19:05:18 +0100, Ben Bacarisse
<ben.usenet@xxxxxxxxx> wrote:
cri@xxxxxxxx (Richard Harter) writes:
On Mon, 20 Apr 2009 19:34:52 -0700, Ben Pfaff
<blp@xxxxxxxxxxxxxxx> wrote:
Jon Harrop <jon@xxxxxxxxxxxxxxxxx> writes:
Ben Pfaff wrote:
Jon Harrop <jon@xxxxxxxxxxxxxxxxx> writes:
A single hash table is great when the key is a datum but when your keys
are sequences much longer than 1 word, like these strings, then you
really want a trie.
This is not common wisdom in the programming world, as far as I
can tell.
On the contrary, that is the sole purpose of the trie data structure and,
consequently, they are extremely widely known and used (see LZW encoders,
for example).
I can't see how a trie would be faster than a hash table for
random access of long string keys. In a trie, you have to
recursively solve the problem of searching a table at each step,
taking either a time or memory hit, and you have to follow a
pointer to arrive at the next node, taking a cache miss penalty
each time (in a large trie). In a hash table, you hash the key
once, then do O(1) comparisons, taking only as many cache miss
penalties as you do comparisons.
The fundamental potential advantage of the tries is that the
process does not have to scan the entire key with a hashing
algorithm. For distinct keys key_length >= log(table_size),
i.e., the tree depth is O(log N) whereas the hash cost is O(L).
In general, yes, but the claim that originally caused Ben Pfaff
surprise was that a trie would obviously beat a hash in the specific
case in question -- filtering duplicate lines.
In this use case, to detect a duplicate the whole string must be
scanned by the trie. And when a string is not found it has to be
added which involves scanning it -- at least from the point when the
difference was detected. This may be cheaper than hashing it, but it
is not obviously so.
Now, it may be that a trie will be faster, but your argument does not
obviously settle the matter for the particular program being
considered.
The same is actually true of binary comparison trees; however the
tries are broad and shallow. For example, consider a table with
a million entries. A binary tree requires at least 20 levels; an
octet based trie may get by with as few as 3 if the number of
entries per octet is relatively uniform.
Again, in general, yes. In the problem that started the debate
(duplicate lines in a large text) the trie will be quite deep.
We have separate issues here. First there is Harrop's original
claim which is overly broad. Then there is Ben Pfaff's
questioning of how this could be, given certain considerations.
My response to Ben should be taken as an explanation of how it
could be, sans a claim that a trie is always better. Beyond that
we have to look at some implementation factors.
In a hash table we have two O(L) costs - hashing the key (the
line in this case) and a verification cost. Both are memory
cheap. Hashing has a relatively expensive computational cost;
verification (checking that the claimed match is real) has a
cheap computational cost. There is also a cost associated with
storing the key or a reference to it and retrieving it.
In a simple minded implementation of a trie there would be L
levels in the trie. However nobody in their right minds would
write such code - the storage requirements are enormous. Some
device or another has to be used to avoid combinatinatorial
explosion.
In a Patricia implementation the edges (the paths from one node
to another) are annotated. The depth is O(N) rather than L. A
nice feature of Patricia trees is that the entire line is
embedded in the tree (as string fragments rather than characters)
so that verification is automatic.
The Patricia scheme is a compression of the structure -- it does not
alter my point which was that the fundamental potential advantage of a
trie might not be realised in an application where scanning the whole
key is inevitable.
(Look it up on wikipedia.)
Hmm...
--
Ben.
.
- Follow-Ups:
- Re: using a tree to delete duplicate lines in a text
- From: Richard Harter
- Re: using a tree to delete duplicate lines in a text
- References:
- using a tree to delete duplicate lines in a text
- From: Franken Sense
- Re: using a tree to delete duplicate lines in a text
- From: Ben Pfaff
- Re: using a tree to delete duplicate lines in a text
- From: Jon Harrop
- Re: using a tree to delete duplicate lines in a text
- From: Ben Pfaff
- Re: using a tree to delete duplicate lines in a text
- From: Jon Harrop
- Re: using a tree to delete duplicate lines in a text
- From: Ben Pfaff
- Re: using a tree to delete duplicate lines in a text
- From: Richard Harter
- Re: using a tree to delete duplicate lines in a text
- From: Ben Bacarisse
- Re: using a tree to delete duplicate lines in a text
- From: Richard Harter
- using a tree to delete duplicate lines in a text
- Prev by Date: Re: Hash tables
- Next by Date: Re: using a tree to delete duplicate lines in a text
- Previous by thread: Re: using a tree to delete duplicate lines in a text
- Next by thread: Re: using a tree to delete duplicate lines in a text
- Index(es):
Relevant Pages
|
Loading