Re: Searching 40 Million short strings fast



M.Barren@xxxxxxxxx wrote:
> I am writing a program that queries a website and extracts certain
> information from result pages and writes them in a file. These
> informations are short strings of ascii characters (between 10 to 12
> chars) and number of them will eventually be around 40 million. There
> is a little problem though. Since the searches don't neccersily return
> unique results, I might get certain results over and over again and if
> I add them to my file, I will have duplicates which due to the nature
> of search mechanism of that certain website, will double or even triple
> the amount of collected data.
>
> I want to be able to check if I already have a certain string in my
> file very quickly. Indexing strings happens to be one of the only (if
> not the only) way of doing this. But I don't want to use a whole
> database server for a temporary data collection stage of this program.
> SqLite is an interesting choice but I still rather use even a simpler
> method that I can myself implement that is light-wight, fast and
> specilized *just* to do what i want.
>
> If you have any ideas, I'll be more than happy to hear it.

Well, obviously you want a hash table; one that can hold 40 million
entries of course. James Dow Allen has something that is likely to
work for this purpose:

http://freepages.genealogy.rootsweb.com/~jamesdow/Tech/mhash.htm

Though the code is a bit messy. As I recall, that solution is not
thread safe, and can only exist as one instance. I don't know if that
matters to you.

Then there is the question of hashing a string, versus hashing an
encoding of the strings. For example, if the strings themselves
average greater than 20 characters in length, then you could try
encoding them using SHA-1, and in fact use a prexfix of the SHA-1 value
as the hash key (this is sound), and just store the remaining bytes in
the hash entries. This may have the advantage of being better for
memory consumption. (It has the theoretical problem of false mappings,
but to date, there are no known aliased entries with respect to SHA-1.)

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.



Relevant Pages

  • Re: Problems with writing HMAC-SHA1
    ... <The SHA routine puts 5 32-bit values into CryptoMessageas Strings. ... The key must be padded to an entire SHA-1 ... translation from 32-bit ints to bytes. ... Greg Rose ...
    (sci.crypt)
  • Re: Poker hand evaluator
    ... Objects with strings as access keys, ... It is no longer ~49k entries, by the way, rather 55 - I have added the ... suit, and O for any rank without a card in the suit. ...
    (comp.lang.javascript)
  • Re: Holub on getters/setters again
    ... >> Note that even if we return a clone of an array list, if the entries in ... allows overloading of the equality operator, ... don't think the Add method would allow two strings with the same state. ... Daniel Parker ...
    (comp.object)
  • Re: Searching 40 Million short strings fast
    ... I am writing a program that queries a website and extracts certain ... Indexing strings happens to be one of the only (if ... database server for a temporary data collection stage of this program. ... representation, which is often smaller than the data it represents. ...
    (comp.programming)
  • Re: Why does it take up so much memory
    ... Next do you actually know the total length of the strings you are reading into ... lot of heap fragmentation and other problems with memory allocation if you ... The data from this will occupy 25 bytes on disk and the whole thing will only ... Even with a data file of 30,000 entries your string data will occupy exactly ...
    (comp.lang.pascal.delphi.misc)