Re: Get reference to object in Set



On Wed, 5671 Sep 1993 08:33:14 -0700, Mike Schilling wrote:

blue indigo wrote:
Actually, the big problem with implementing anything like this in
Java is that a Java char is 16 bits wide. Storing 65536 child
pointers (even when they'll mostly be nulls) in every node is not
going to be very efficient. Nor is converting every string to UTF-8
on the way into its API.

Patricia trees don't require that, any more than they require 256
pointers per node on ASCII-based systems. The decision tree is
per-bit, so each node has at most two children, one for 0 and one for
1.

The Wikipedia article (and my own past experience with radix trees)
strongly implies that it is typical to use a small array (but bigger than
two).

I thought of splitting the char into nybbles and having arrays of 16
children; I don't know if you'd get any better speed performance than a
HashMap<String,Foo> though when you account for the overhead of doing a
bunch of bit-twiddling (masking and shifting, mainly) on the chars and
traversing four pointers per char rather than one, and accounting for Java
Strings memoizing their hashes.

Figuring out an optimization based on the likelihood that, in a typical
application, the high bytes will tend to mainly have one or a very few
values seemed nontrivial when I tried it. Stripping information (such as
the high byte) creates the possibility of collisions and then you need
hash buckets.

Certainly with strings or anything else representable as a long bit stream
whose length is divisible by n, any radix 2^k with k dividing n can be
used as the tree radix though. In this case, 2, 4, 8, and 16 are possible
(since Java's char is 16 bits), but I'd decided that 8 and 16 looked
unwieldy due to array size and 2 due to the depth of the resulting trees.
That's why I was leaning toward 4.

As for the Set<String> version, one application I thought of involves
anything that resembles web spidering or similar forms of search
technology where resources' handles are long strings that often share
common prefixes. URLs in particular have this trait. Storing the set
of seen-but-not-visited-yet URLs could be vastly more efficient using a
patricia tree than using a HashMap or similar structure. You eliminate
duplicates and when you hit the same URL from multiple places, you can
efficiently find and update an associated value (say, to boost its page
rank or whatever). Moreover though, you don't take up dozens of bytes
for every occurrence of
"http://some.really.com/long/url/prefix/with/bells/on/"; in a site with
deeply nested directory structures, nor the overhead of a hashtable
(typically 33% space overhead for the slack space and then some more for
the hashes themselves -- each Java String will memoize its hash when its
hash is first used -- and some time overhead for sometimes growing the
array, for hashing, and for equals() tests; the tree effectively only does
the latter).

Another case where it clearly wins over HashFoo seems to be if you want to
find by prefix. Incremental search, tab-completion, and similar UI
features can use this, the former in conjunction with an index of some
sort (e.g. a patricia tree of document words associating them with some
information about their locations) and the latter with a history of some
sort (stored similarly).

A possible application then is an html browser optimized for local
documentation viewing, which might want all of:
* A tab-completing history in the address box
* Incremental search in page
* The ability to spider a local documentation tree and create an index;
local filesystem URLs are especially likely to have very long common
prefixes (and javadocs are particularly guilty in my experience);
I'd love to have the ability to "google" my local javadocs! Especially
since googling Sun's web site for javadocs not only requires more
typing (adding "site:sun.com" to the query) but tends to find random
versions of everything (JCheckBox: 1.4.2; HashMap: 1.4.2; ImageIO:
1.4.2; SuppressWarnings: 5.0; etc.; using "I'm Feeling Lucky"; 6.0 is
current whereas 1.4.x is being sunsetted) and can't be done offline.
Plus I can only search one documentation tree at a time; I need a
different "site:" bit to search, say, the Apache Commons docs instead of
Sun's. (On the plus side, a URL found by Google can be posted or mailed
and it will work for other people.)

--
blue indigo
UA Telecom since 1987
.



Relevant Pages

  • Re: Three Kinds of Logical Trees
    ... >> That strikes me as a nonstardard definition of the use of metadata, ... I would be surprised if all you really care about is strings. ... Having the tree as the only possible structure is worse than having ... > SQL does not handle well. ...
    (comp.databases.theory)
  • Re: NEED HELP WITH A PROGRAM....ANYONE PLEASE HELP!
    ... The first plateau is to construct a tree ... There are three parts to reaching this first plateau. ... char* get_word ... int createBinaryTree() ...
    (comp.lang.c)
  • Re: NEED HELP WITH A PROGRAM....ANYONE PLEASE HELP!
    ... The first plateau is to construct a tree ... There are three parts to reaching this first plateau. ... reading a file, isolating the words, and adding the words to a tree. ... char* get_word ...
    (comp.lang.c)
  • Re: adapting getline
    ... I'm pretty sure you meant tree here, ... int getline ... int getline(FILE *, char *, int); ... void print_tree(struct tnode *); ...
    (comp.lang.c)
  • Re: XMas tree lights? Jeez
    ... >lights built into the tree are out. ... >Now it did say when ONE goes out, they dont ALL go out in the strings. ... >Tree is going STRAIGHT BACK to fortunoff this weekend, ... Series-wired lights can be tested easily if you know how they're ...
    (alt.home.repair)