Re: optimized string storage for phrases/dictionary



<dagny_orearden@xxxxxxxxx> wrote in message
news:1114533009.892165.299080@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> I'm investigating what the optimal data structure for storage
> of phrases that recur multiple times but with small variances.
> The goal is to use up as little memory as possible, be able
> to look up the original string, return it and allow the caller
> to use it without regard to how it's stored internally.
>
> Some of the sample data might look like this:
> hello_there/here_is_very_long_string/abcdefghijklmnopq
> hello_there/here_is_very_long_string/abcdefghijklmnopr
> hello_there/here_is_very_long_string/abcdefghijklmnost
> hello_there/here_is_very_long_string/abcdefghijklmnvsd
> hello_there_2/here_is_very_long_string/abcdefghijklmnopq
> hello_there_2/here_is_very_long_string/abcdefghijklmnopr
> hello_there_2/here_is_very_long_string/abcdefghijklmnost
> hello_there_3/here_is_very_long_string/abcdefghijklmnvsd
>
> I'll have millions of strings like this, some of them are much
> longer than what's listed above whereas others maybe short.
>
> Typically, if you have the same string occuring multiple times,
> you can use reference counted strings. But here, it's slight
> difference since the strings aren't exactly indentical but just
> slightly variant enough at different places.
> I can split up the strings using "/" as a delineator but in
> terms of the optimal data structure for storage, I'm not sure
> if a tree would be ideal or something else.. I'm sure there's
> been some sort of research done on this problem and probably a
> name for it.
>
> Anyone have any suggestions or ideas?

Ternary tree, perhaps?

---
Mail sent to this email address is deleted unread
on the server. Please send replies to the newsgroup.


.



Relevant Pages

  • optimized string storage for phrases/dictionary
    ... phrases that recur multiple times but with small variances. ... I'll have millions of strings like this, some of them are much longer ... Typically, if you have the same string occuring multiple times, you can ... the optimal data structure for storage, I'm not sure if a tree would be ...
    (comp.theory)
  • Re: optimized string storage for phrases/dictionary
    ... > phrases that recur multiple times but with small variances. ... I would suggest a "trie", ...
    (comp.theory)