optimized string storage for phrases/dictionary
- From: dagny_orearden@xxxxxxxxx
- Date: 26 Apr 2005 09:30:09 -0700
I'm investigating what the optimal data structure for storage of
phrases that recur multiple times but with small variances. The goal is
to use up as little memory as possible, be able to look up the original
string, return it and allow the caller to use it without regard to how
it's stored internally.
Some of the sample data might look like this:
hello_there/here_is_very_long_string/abcdefghijklmnopq
hello_there/here_is_very_long_string/abcdefghijklmnopr
hello_there/here_is_very_long_string/abcdefghijklmnost
hello_there/here_is_very_long_string/abcdefghijklmnvsd
hello_there_2/here_is_very_long_string/abcdefghijklmnopq
hello_there_2/here_is_very_long_string/abcdefghijklmnopr
hello_there_2/here_is_very_long_string/abcdefghijklmnost
hello_there_3/here_is_very_long_string/abcdefghijklmnvsd
I'll have millions of strings like this, some of them are much longer
than what's listed above whereas others maybe short.
Typically, if you have the same string occuring multiple times, you can
use reference counted strings. But here, it's slight difference since
the strings aren't exactly indentical but just slightly variant enough
at different places.
I can split up the strings using "/" as a delineator but in terms of
the optimal data structure for storage, I'm not sure if a tree would be
ideal or something else.. I'm sure there's been some sort of research
done on this problem and probably a name for it.
Anyone have any suggestions or ideas?
Thanks,
Dagny
.
- Follow-Ups:
- Re: optimized string storage for phrases/dictionary
- From: Jamie Andrews; real address @ bottom of message
- Re: optimized string storage for phrases/dictionary
- From: Nameless
- Re: optimized string storage for phrases/dictionary
- From: Paul E. Black
- Re: optimized string storage for phrases/dictionary
- Prev by Date: Re: P=NP: Linear Programming Formulation of the TSP
- Next by Date: Re: discuss dancing links
- Previous by thread: Re: discuss dancing links
- Next by thread: Re: optimized string storage for phrases/dictionary
- Index(es):