Re: Boolean Query Algorithm



Hi Herbert,

I can't imagine anyone placing information such as document id and
position in document of words for each unique word encountered in the
initial pre-processing of ones corpus of data - which could literally
contain 100's of gig of data - it simply just sounds like a very
stupid way of going about it as is my suggestion too. hence assume
you store the document id which is a 4 byte unsigned int and a
position value which will be a 4 byte unsigned int, thats a total of
8 bytes per occurance of a word (text pattern), and this is not even
taking into account partial matches etc..

As for the GFS I don't believe it is responsible for the querying but
rather for providing infrastructure (distributed access etc) for the
higher level algorithms to do whatever it is that they do, which
really was my question. also as far as how google does its business
ie: map&reduce, "shards" based distribution etc, google is very very
sketchy on the details, anything that they put out can be found
easily in any simple algorithms book.


Sherrie



Herbert Glarner wrote:
Hashes are no good idea in this context. If you look for whole words
only, you want to use "Patricia Tries". If you want to find also partial
strings, look at Suffix Trees in general.

Store positions and lengths with the nodes and/or leaves, and when
searching for combinations, check if P1+L1 is close enough to P2,
allowing for delimiters like whitespace etc.

As for Google: They developed an own Google File System, it's pretty
well documented somewhere on the net. The search algorithms themselves
are, as far as I know, business secret.

Regards
//Herbert

--
http://herbert.wikispaces.com

.



Relevant Pages

  • Re: Is there a known algorithm for this?
    ... I have the Y-axis pixel locations of lines of ... lines of text on a page (which are approximations [I.e., ... description well enough to find it by Google. ... provide a pointer to suitable algorithms. ...
    (comp.lang.java.programmer)
  • Re: Wrote a little encryption program. How can you tell how good it is?
    ... I am sure that you mean no harm, but folks here ... > your algo, and it is much stronger than your algo. ... > tends to be much easier to find weaknesses in the weakest algorithms. ... > and start looking into it with Google. ...
    (sci.crypt)
  • Re: Deczky all-pass filter design Matlab
    ... Like Google is doing? ... Minister of Algorithms ... distribution of cataloged documents may be practical. ...
    (comp.dsp)
  • Re: quick newbie question
    ... > Artie Gold wrote: ... it sure as hell cannot compare with the direction possible ... By the way, searching the web (i.e., Google :-), with various queries ... other algorithms that are also O) (and even a link on the O ...
    (comp.lang.lisp)
  • Re: How to dissect algorithms.
    ... I know that we cover basic sort and ... search algorithms later on, but from glancing ahead in the text, it's ... > Maybe you should take some time off from watching 61A webcasts to watch ... > it and he's big on formal analysis of algorithms. ...
    (comp.lang.scheme)