Re: search engine challenge

From: Philipp Lenssen (info_at_outer-court.com)
Date: 01/26/04


Date: 26 Jan 2004 13:55:14 GMT

Frank wrote:

>
> I'm running a site with more than 20,000 articles. The articles (HTML
> files) are saved on the server as txt files. All other data (author,
> date, category and so on) is in a MySQL db. Previously we stored the
> articles in the db as well and ran SQL queries for the search engine.
> But this is no longer feasible since there are too many articles and
> the db has gotten too big. The search engine scans the whole db and
> the server CPU maxes out. I'm looking for a PHP search engine that
> automatically indexes the txt files and produces one index file with
> all indexed words plus the ids of the articles containing those words.
> That way the search script no longer has to query all the articles
> (the whole db), just this one index file. It would also be nice to
> have a blacklist of words (the, a, ...) and other admin features.
>

If the site is public, have you thought about letting Google do the
hard work, and then using either the Google site search or the Google
Web API to display results? Google is getting _very_ fast at indexing
large amounts of data on a site. They picked up thousands of my pages
recently while I was playing around with my .htaccess file... too fast
for my taste, actually, since I changed it again the next day...
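
If you'd rather keep search on your own server, the one-file index you
describe is usually called an inverted index. Below is a minimal sketch
of the idea, not a finished engine. It is in Python only to keep it
short (the same structure maps onto PHP arrays), and it assumes things
your post doesn't state: that each article is stored as <id>.txt in one
directory, that the index is saved as JSON, and that a query should
match articles containing all of its words. The names STOPWORDS,
build_index, index_directory and search are made up for illustration.

```python
import json
import re
from pathlib import Path

# Hypothetical blacklist; extend it with whatever words should not be indexed.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}


def tokenize(text):
    """Strip HTML tags, lowercase, split into words, drop blacklisted words."""
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_index(articles):
    """Map each word to the sorted list of ids of the articles containing it.

    `articles` is a dict of {article_id: article_text}.
    """
    index = {}
    for article_id, text in articles.items():
        # set() so an article is listed once per word, however often it occurs
        for word in set(tokenize(text)):
            index.setdefault(word, []).append(article_id)
    return {word: sorted(ids) for word, ids in index.items()}


def index_directory(article_dir, index_file):
    """Index every .txt file in article_dir and write one JSON index file.

    Assumption: the file name without ".txt" is the article id.
    """
    articles = {
        path.stem: path.read_text(encoding="utf-8", errors="ignore")
        for path in Path(article_dir).glob("*.txt")
    }
    index = build_index(articles)
    Path(index_file).write_text(json.dumps(index), encoding="utf-8")
    return index


def search(index, query):
    """Return the ids of articles that contain every non-blacklisted query word."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)
```

For example, with index = build_index({"1": "<p>The quick brown fox</p>",
"2": "A lazy brown dog"}), the call search(index, "the brown fox")
returns ["1"]: "the" is blacklisted, both articles contain "brown", and
only article 1 contains "fox". The search script then only needs to load
the index file and look up a few keys, and the ids it gets back can be
used to fetch author, date and category from MySQL.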

-- 
Google Blogoscoped
http://blog.outer-court.com

