Re: optimized string storage for phrases/dictionary



On Tue, 26 Apr 2005 09:30:09 -0700, dagny_orearde wrote:

> I'm investigating what the optimal data structure for storage of phrases
> that recur multiple times but with small variances. The goal is to use up
> as little memory as possible, be able to look up the original string,
> return it and allow the caller to use it without regard to how it's stored
> internally.
>
> Some of the sample data might look like this:
> hello_there/here_is_very_long_string/abcdefghijklmnopq
> hello_there/here_is_very_long_string/abcdefghijklmnopr
> hello_there/here_is_very_long_string/abcdefghijklmnost
> hello_there/here_is_very_long_string/abcdefghijklmnvsd
> hello_there_2/here_is_very_long_string/abcdefghijklmnopq
> hello_there_2/here_is_very_long_string/abcdefghijklmnopr
> hello_there_2/here_is_very_long_string/abcdefghijklmnost
> hello_there_3/here_is_very_long_string/abcdefghijklmnvsd
>
> I'll have millions of strings like this, some of them are much longer than
> what's listed above whereas others maybe short.
> ...
> Anyone have any suggestions or ideas?

If the set of phrases is pretty static, you can just treat it as a
text compression problem. An exceeding simple-minded approach is to
create a Huffman code and represent each string with its encoding.
They are unique, so the user can compare two phrases. To process it
textually the user would have to ask for the actual string, which you
provide by decoding.

In this case you really want to do something like LZW coding followed
by Huffman coding, etc.
http://www.nist.gov/dads/HTML/lempelZivWelch.html

Minor changes could be accommodated with e.g., adaptive Huffman coding.
http://www.nist.gov/dads/HTML/adaptiveHuffman.html

Best of luck.

-paul-
--
Paul E. Black (p.black@xxxxxxx)

.