Large-scale full-text search: how to get the intersection list fast?
- From: yaoziyuan@xxxxxxxxx
- Date: 13 Feb 2006 16:05:59 -0800
The tale says full-text search engines precompute a list of pages which
contain a certain keyword. So, it's very easy to return a list of
results if only one keyword is submitted.
However, if more than one keyword are submitted for results containing
all of these keywords, how do you quickly get the intersection list of
several very long page-lists each led by one keyword?
In Google:
cat: 423,000,000
dog: 334,000,000
cat dog: 56,200,000
Information theory says the intersection time is linear to the length
of the shorter of the two lists to intersect.
My guess is that there is a hash array PagesForTwoWords to store
page-lists which contain a certain word pair. For example, given a
search query [cat dog], a hash value H is generated based on the string
"cat*dog". Then we go to PagesForTwoWords[H] for a pointer to a list of
pages which contain both "cat" and "dog".
This guess is based on the claim that "time is precious but storage is
cheap". Does anyone know the real or a better approach?
Regards,
Yao Ziyuan
.
- Follow-Ups:
- Prev by Date: Make $$Thousands$$ using PayPal and $6
- Next by Date: Re: Large-scale full-text search: how to get the intersection list fast?
- Previous by thread: Make $$Thousands$$ using PayPal and $6
- Next by thread: Re: Large-scale full-text search: how to get the intersection list fast?
- Index(es):
Relevant Pages
|