Re: Data structure behind Google Suggest
- From: "Paul E. Black" <p.black@xxxxxxx>
- Date: Wed, 26 Sep 2007 14:02:23 -0400
On Monday 24 September 2007 22:38, Chris wrote:
Can anyone think of an efficient data structure that could handle Google
Suggest?
Google Suggest, if you haven't seen it, shows the top 10 searches that
relate to the letters that you type into the search box. It operates as
you type, so if you type "a", it will give you the top 10 searches that
start with "a", "ab", the top ten that start with that, etc.
It goes down to an amazing level of detail, and shows even extremely
obscure searches after you've typed two dozen characters. There has got
to be a billion or more search phrases in their database.
How could you do this using a reasonable amount of hardware? ...
First Google *has* a huge amount of hardware, cpu and disk, to answer
queries, so that's not an issue.
For anything like this, it is a well-engineered solution, not just one
data structure. They tried some things, found the bottlenecks, and
improved it. Here's where I'd start.
Dedicate a few dozen (or more) machines to Suggest. Load each machine
with the "answer" to all 3 letter/number combinations:
[a-z0-9][a-z0-9 ][a-z0-9 ] = 35964 * 10 match strings * 20
characters/string = 7.2 million characters. Simple compression
(e.g. Patricia tree) cuts it to about a million. Heavens, that fits
in processors' first level *cache*, let allow their main memory. So
now you can respond to the first 3 characters instantly.
Each of these can point to a larger database specific to each prefix,
as you suggest.
While you're casually typing in two or three characters, it begins to
retrieve the roots of trees or tables corresponding to those. (lessee,
1/4 second = 250,000 microseconds - TONS of time to fire off a query
to main memory and even enough to begin getting a response from disk.)
So it can service say all 5 or 6 characters inquiries without a
sweat. If you keep the unusual strings, like "x 5tq", in the dusty
attic, its probably not even too many.
Now the fun begins. It has 6 or 7 characters. It can check with the
query look aside buffers for detailed results.
Sicnerely,
-paul-
Paul E. Black (p.black@xxxxxxx)
.
- References:
- Data structure behind Google Suggest
- From: Chris
- Data structure behind Google Suggest
- Prev by Date: Mathematical models for loop calculations
- Next by Date: Re: Mathematical models for loop calculations
- Previous by thread: Data structure behind Google Suggest
- Next by thread: Re: Data structure behind Google Suggest
- Index(es):
Relevant Pages
|