Iterating & Comparing String Data Sets Efficiently?



I have data consisting of lists of strings, which I need to compare
versus each other, to find which strings each set has in common.

As an example, say I have this set, called my-data, composed of 5 lists
of strings:

my-data

(("fancy" "ourselves" "creative" "DIY" "types" "mention" "cheap"
"furniture"
"pile" "pillows" "roof" "tarp")
("shelling" "grade" "stuff" "folks" "dates" "breakfast" "sponsoring"
"tasty"
"consisting" "shots")
("photographers" "saving" "pennies" "roof" "burns" "knees" "starting"
"awfully")
("painful" "Tiny" "Dispatch" "Couches" "Valley" "Italian" "Chair"
"Pillows")
("Ladies" "breakfast" "posh" "lunches" "breed" "female" "charity"
"stuff"))

What I want to do is compare the first set -- (nth 0 my-data) -- versus
each of the other four sets and find out which string tokens they have
in common.

In this example, the answer for the first set -- (nth 0 my-data) -- is:

(NIL ; ignore itself
NIL
("roof") ; "roof" is common to both (nth 0) and (nth 2)
("Pillows") ; "pillows" '' ''
(nth 3)
NIL)

And so on for (nth 1 my-data), (nth 2 my-data), etc.

I've been using the (intersection) function wrapped inside a mapcar
iterator for this -- the real data, though different in content from
the example above, is structured exactly the same way: a list of lists
of strings.

Although it's logically correct, it's slow (the real data is quite
large), and so I've been thinking about ways of improving its
efficiency.

For example, instead of lists of strings, using hash tables, i.e.
iterate each hash, and look for where the keys (string) match, since
hash lookups are fast.

Also, I was wondering about more effective ways of storing large
amounts of string data in memory in the first place, e.g. perhaps as
lower-level arrays of chars or bits, but I'm not quite sure how to go
about that efficiently.

Any suggestions?

.



Relevant Pages

  • Re: How come Ada isnt more popular?
    ... are praising is, because what about trees of strings, trees of lists etc. ... a language also shouldn't have strings ... but there are no generics instantiated. ...
    (comp.lang.ada)
  • Re: How come Ada isnt more popular?
    ... are praising is, because what about trees of strings, trees of lists etc. ... the language should not have either X or Y built-in. ... but there are no generics instantiated. ...
    (comp.lang.ada)
  • Re: Newb looking for data binding help
    ... lists to store the strings, ... Two strings, one for a system name and one for equipment name ... When the user clicks on an 'equipment' node, I'd like to display just ...
    (microsoft.public.dotnet.languages.vb)
  • Bottleneck rule
    ... groups similar strings together. ... prints out an intermediate store file to view intermediate results. ... (setf (aref store i 0) ... ;;Print output the lists of grouped strings and counts. ...
    (comp.lang.lisp)
  • Re: comments on my design of a new language?
    ... > exist, plus strings and the function type, and that is it. ... indexed by a string or a regular expression, ... > dictionaries, that terminate with literals, or (possibly ... and lists. ...
    (comp.lang.misc)