Re: Data structure behind Google Suggest



On Monday 24 September 2007 22:38, Chris wrote:

Can anyone think of an efficient data structure that could handle Google
Suggest?

Google Suggest, if you haven't seen it, shows the top 10 searches that
relate to the letters that you type into the search box. It operates as
you type, so if you type "a", it will give you the top 10 searches that
start with "a", "ab", the top ten that start with that, etc.

It goes down to an amazing level of detail, and shows even extremely
obscure searches after you've typed two dozen characters. There has got
to be a billion or more search phrases in their database.

How could you do this using a reasonable amount of hardware? ...

First Google *has* a huge amount of hardware, cpu and disk, to answer
queries, so that's not an issue.

For anything like this, it is a well-engineered solution, not just one
data structure. They tried some things, found the bottlenecks, and
improved it. Here's where I'd start.

Dedicate a few dozen (or more) machines to Suggest. Load each machine
with the "answer" to all 3 letter/number combinations:
[a-z0-9][a-z0-9 ][a-z0-9 ] = 35964 * 10 match strings * 20
characters/string = 7.2 million characters. Simple compression
(e.g. Patricia tree) cuts it to about a million. Heavens, that fits
in processors' first level *cache*, let allow their main memory. So
now you can respond to the first 3 characters instantly.

Each of these can point to a larger database specific to each prefix,
as you suggest.

While you're casually typing in two or three characters, it begins to
retrieve the roots of trees or tables corresponding to those. (lessee,
1/4 second = 250,000 microseconds - TONS of time to fire off a query
to main memory and even enough to begin getting a response from disk.)
So it can service say all 5 or 6 characters inquiries without a
sweat. If you keep the unusual strings, like "x 5tq", in the dusty
attic, its probably not even too many.

Now the fun begins. It has 6 or 7 characters. It can check with the
query look aside buffers for detailed results.

Sicnerely,
-paul-
Paul E. Black (p.black@xxxxxxx)
.



Relevant Pages

  • Re: Concatinate a filename
    ... characters, this won't retrieve all the text. ... >> But google is adding extra characters in the code and screwing it up. ... >> Function pull(xref As String) As Variant ... >> CLR wrote: ...
    (microsoft.public.excel.misc)
  • Re: [Full-disclosure] What Lexical Analysis Became in The Web-Slave New World
    ... Well, check this out, there's this Google ... stating that their BETA releases represent a new web-based BETA ... Two of their free BETA services, Google Calendar and Orkut, are ... services' handling for those characters can cause users' data to be ...
    (Full-Disclosure)
  • Re: [Full-disclosure] What Lexical Analysis Became in The Web-Slave New World
    ... you are a Google Calendar, ... programming laziness, also known as XP or Agile methodology. ... Two of their free BETA services, Google Calendar and Orkut, are ... services' handling for those characters can cause users' data to be ...
    (Full-Disclosure)
  • [Full-disclosure] What Lexical Analysis Became in The Web-Slave New World
    ... Well, check this out, there's this Google corporation ... stating that their BETA releases represent a new web-based BETA ... Two of their free BETA services, Google Calendar and Orkut, are going ... services' handling for those characters can cause users' data to be ...
    (Full-Disclosure)
  • Re: Oyakata = Okata?
    ... contain the necessary characters for displaying Japanese. ... Ideally Google would default to one of the Unicode ... encodings. ...
    (sci.lang.japan)