Re: optimized string storage for phrases/dictionary



On Tue, 26 Apr 2005 09:30:09 -0700, dagny_orearde wrote:

> I'm investigating what the optimal data structure for storage of phrases
> that recur multiple times but with small variances. The goal is to use up
> as little memory as possible, be able to look up the original string,
> return it and allow the caller to use it without regard to how it's stored
> internally.
>
> Some of the sample data might look like this:
> hello_there/here_is_very_long_string/abcdefghijklmnopq
> hello_there/here_is_very_long_string/abcdefghijklmnopr
> hello_there/here_is_very_long_string/abcdefghijklmnost
> hello_there/here_is_very_long_string/abcdefghijklmnvsd
> hello_there_2/here_is_very_long_string/abcdefghijklmnopq
> hello_there_2/here_is_very_long_string/abcdefghijklmnopr
> hello_there_2/here_is_very_long_string/abcdefghijklmnost
> hello_there_3/here_is_very_long_string/abcdefghijklmnvsd
>
> I'll have millions of strings like this, some of them are much longer than
> what's listed above whereas others maybe short.
> ...
> Anyone have any suggestions or ideas?

If the set of phrases is pretty static, you can just treat it as a
text compression problem. An exceeding simple-minded approach is to
create a Huffman code and represent each string with its encoding.
They are unique, so the user can compare two phrases. To process it
textually the user would have to ask for the actual string, which you
provide by decoding.

In this case you really want to do something like LZW coding followed
by Huffman coding, etc.
http://www.nist.gov/dads/HTML/lempelZivWelch.html

Minor changes could be accommodated with e.g., adaptive Huffman coding.
http://www.nist.gov/dads/HTML/adaptiveHuffman.html

Best of luck.

-paul-
--
Paul E. Black (p.black@xxxxxxx)

.



Relevant Pages

  • Re: how do I do hex to single precision conversion??
    ... Sorry for mix-up on terminology, i was using the phrases from the ... babbage website =) ... When comparing with the babbage URL, this string is ...
    (comp.lang.tcl)
  • Re: Im a korean. I wanna play stonesoup to korean lang
    ... When you have a string ... you look it up in a message catalog. ... display the text: "An orc." ... Not when you have more complex phrases like: ...
    (rec.games.roguelike.misc)
  • Re: referring to increasing VBA variable name in loop
    ... Dim iCtr as long ... I am using VBA to put together some phrases to be spoken with the text ... What I've got is a list of string variables, ...
    (microsoft.public.excel.programming)
  • Re: optimized string storage for phrases/dictionary
    ... The set of phrases is static in the sense that one particular phrase ... the string and it's not limited to the end of the string. ... > create a Huffman code and represent each string with its encoding. ... > by Huffman coding, etc. ...
    (comp.theory)
  • Re: Confused regarding text example
    ... So basically, WordIndex's obj could be a file, an array of strings, etc. ... Phrases is basically a series of phrases (my mind just went blank on the ... the object is pushed into the Hash that had the wanted string. ...
    (comp.lang.ruby)