Re: Searching 40 Million short strings fast



M.Barren@xxxxxxxxx wrote:
Hi,

I am writing a program that queries a website and extracts certain
information from result pages and writes them in a file. These
informations are short strings of ascii characters (between 10 to 12
chars) and number of them will eventually be around 40 million. There
is a little problem though. Since the searches don't neccersily return
unique results, I might get certain results over and over again and if
I add them to my file, I will have duplicates which due to the nature
of search mechanism of that certain website, will double or even triple
the amount of collected data.

I want to be able to check if I already have a certain string in my
file very quickly. Indexing strings happens to be one of the only (if
not the only) way of doing this. But I don't want to use a whole
database server for a temporary data collection stage of this program.
SqLite is an interesting choice but I still rather use even a simpler
method that I can myself implement that is light-wight,  fast and
specilized *just* to do what i want.

If you have any ideas, I'll be more than happy to hear it.
Michael


Take a look at tries. A trie has the advantage of very fast lookups (linear in the length of the search string) and a very compact representation, which is often smaller than the data it represents. I've got a text file of 173,840 words that's 1.75 MB; my trie representation of the same data is 1.15 MB.


-Peter

--
Pull out a splinter to reply.
.



Relevant Pages

  • Re: Changing Unicode object to Tuple Type
    ... Now I wanted to use pyRXPU to get all ... I don't know pyRXP and pyRXPU, and especially not how you use them. ... No specialized Python classes are used in the representation -- just ... tuples, dicts, lists, and strings. ...
    (comp.lang.python)
  • Re: question on structs and memory blocks
    ... >>MyStruct and in fact it is better to do so since sizeofincludes any ... >>see a strlen in there your structure does not contain the strings ... >>but merely pointers to them, it is the cost of these pointers that ... the two machines must agree on a common external representation ...
    (comp.lang.c)
  • Re: infinity
    ... there is provably no bijection between A and P ... Just consider bit strings consisting of one bit in each ... >> The same objection ... denominator has only factors of the radix has dual representation. ...
    (sci.math)
  • Re: Maintainable compiler design?
    ... solidly organize the compiler to ensure that it is easy to maintain. ... code generation is done directly by emitting strings by walking the ... consider defining a set of conventions for things like types representation, ... wodering why the hell said project's JIT internals (or, ...
    (comp.compilers)
  • Re: Terminology question: are s-exps the text or the data or both?
    ... Those "special characters" and "strings of capital Latin letters and digits ... alternate representation is the Lisp objects produced by the reader. ...
    (comp.lang.lisp)