Re: Boolean Query Algorithm
- From: Chris <spam_me_not@xxxxxxxxxx>
- Date: Fri, 14 Jul 2006 11:46:39 -0500
Sherrie Laraurens wrote:
Hi all,
I have a question relating to how search engines and (in fact
anything else that supports boolean queries) manage to do such
things so efficiently.
my question involves a query were you would like to retrieve the all
the documents in your corpus that have the words "Cat" and "Bat" in
them and that they not only contain both words but that they must be
consecutive for example "Bat Cat"
I can think of a very crude way of doing this which involves hashing
every word in a document into a hash table and storing an index for
said document , then in the query stage to hash both words (hash
join) get the intersection vector of the resulting vectors from the
hashing process, then one by one examine each document from the
intersection vector find the word "Bat" and see if the word "Cat" is
the next word if it is place said document into the final result
vector and once finished pass back to user.
I believe this will work for AND and OR type queries but I can't
imagine systems like google or yahoo using such a CRUDE method nor
can i see them caching such results, just because the sheer amount
of combinations things they would have to be caching.
Does anyone here have any idea on how these things are done? any
keywords I could use to google etc..
Sherrie
Believe it or not, your algorithm isn't far from the truth. Words are stored in btrees, not hash tables. Postings (doc ids and word positions) are stored in a highly compressed form. The magic is all in the query evaluation strategy, and in how the engine avoids having to look at most of the postings list in order to find the top 10 documents.
.
- References:
- Boolean Query Algorithm
- From: Sherrie Laraurens
- Boolean Query Algorithm
- Prev by Date: Re: Military logistic problem
- Next by Date: Re: Mapping rationals to binary strings while preserving order
- Previous by thread: Re: Boolean Query Algorithm
- Next by thread: yet another problem
- Index(es):
Relevant Pages
|