Re: Boolean Query Algorithm
- From: "Sherrie Laraurens" <sherrielaraurens@xxxxxxxxxxx>
- Date: 12 Jul 2006 02:48:03 -0700
Hi Herbert,
I can't imagine anyone placing information such as document id and
position in document of words for each unique word encountered in the
initial pre-processing of ones corpus of data - which could literally
contain 100's of gig of data - it simply just sounds like a very
stupid way of going about it as is my suggestion too. hence assume
you store the document id which is a 4 byte unsigned int and a
position value which will be a 4 byte unsigned int, thats a total of
8 bytes per occurance of a word (text pattern), and this is not even
taking into account partial matches etc..
As for the GFS I don't believe it is responsible for the querying but
rather for providing infrastructure (distributed access etc) for the
higher level algorithms to do whatever it is that they do, which
really was my question. also as far as how google does its business
ie: map&reduce, "shards" based distribution etc, google is very very
sketchy on the details, anything that they put out can be found
easily in any simple algorithms book.
Sherrie
Herbert Glarner wrote:
Hashes are no good idea in this context. If you look for whole words
only, you want to use "Patricia Tries". If you want to find also partial
strings, look at Suffix Trees in general.
Store positions and lengths with the nodes and/or leaves, and when
searching for combinations, check if P1+L1 is close enough to P2,
allowing for delimiters like whitespace etc.
As for Google: They developed an own Google File System, it's pretty
well documented somewhere on the net. The search algorithms themselves
are, as far as I know, business secret.
Regards
//Herbert
--
http://herbert.wikispaces.com
.
- Follow-Ups:
- Re: Boolean Query Algorithm
- From: Joachim Pimiskern
- Re: Boolean Query Algorithm
- References:
- Boolean Query Algorithm
- From: Sherrie Laraurens
- Re: Boolean Query Algorithm
- From: Herbert Glarner
- Boolean Query Algorithm
- Prev by Date: Re: Boolean Query Algorithm
- Next by Date: Re: Two-dimensional pattern matching/compression
- Previous by thread: Re: Boolean Query Algorithm
- Next by thread: Re: Boolean Query Algorithm
- Index(es):
Relevant Pages
|