Re: Searching 40 Million short strings fast
- From: Peter Ammon <gershwin@xxxxxxxxxxxxxxx>
- Date: Mon, 05 Sep 2005 22:56:45 -0700
M.Barren@xxxxxxxxx wrote:
Hi,
I am writing a program that queries a website and extracts certain information from result pages and writes them in a file. These informations are short strings of ascii characters (between 10 to 12 chars) and number of them will eventually be around 40 million. There is a little problem though. Since the searches don't neccersily return unique results, I might get certain results over and over again and if I add them to my file, I will have duplicates which due to the nature of search mechanism of that certain website, will double or even triple the amount of collected data.
I want to be able to check if I already have a certain string in my file very quickly. Indexing strings happens to be one of the only (if not the only) way of doing this. But I don't want to use a whole database server for a temporary data collection stage of this program. SqLite is an interesting choice but I still rather use even a simpler method that I can myself implement that is light-wight, fast and specilized *just* to do what i want.
If you have any ideas, I'll be more than happy to hear it. Michael
Take a look at tries. A trie has the advantage of very fast lookups (linear in the length of the search string) and a very compact representation, which is often smaller than the data it represents. I've got a text file of 173,840 words that's 1.75 MB; my trie representation of the same data is 1.15 MB.
-Peter
-- Pull out a splinter to reply. .
- Follow-Ups:
- Re: Searching 40 Million short strings fast
- From: M . Barren
- Re: Searching 40 Million short strings fast
- References:
- Searching 40 Million short strings fast
- From: M . Barren
- Searching 40 Million short strings fast
- Prev by Date: Re: i need help
- Next by Date: Perl and other programming language
- Previous by thread: Re: Searching 40 Million short strings fast
- Next by thread: Re: Searching 40 Million short strings fast
- Index(es):
Relevant Pages
|