Re: Good search theory

nospam_at_geniegate.com
Date: 03/16/05


Date: Wed, 16 Mar 2005 05:54:49 GMT

In: <1110915682.132517.134550@l41g2000cwc.googlegroups.com>, "AaronV" <aaron.vanderpoel@gmail.com> wrote:
>Hello,
>
>I'm a webmaster for a college newspaper and I'm implementing an article
>search. I'm running PHP with a MySQL database to store the weekly
>stories. Does anyone know of an article that could offer good search
>theory.

If it's an option for you, have a look at Swish-e:

http://swish-e.org/index.html

I don't know whether there is a PHP interface, though. It's semi-difficult to
set up, but the folks who wrote it really did a good job. Swish-e can be
configured in all kinds of ways, for META tags and the like.

Proximity and phrase searches are tricky to implement, but Swish-e handles
them.
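
To give a rough idea of why phrases and proximity are harder than single
words: the index has to remember where each word occurs, not just which
documents contain it. Below is a minimal Python sketch of that idea. It is
not how Swish-e works internally, and the function names (tokenize,
build_index, phrase_search, near_search) are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a positional index: word -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents where the phrase's words appear consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    hits = set()
    # Try every position of the first word, then check that each following
    # word sits exactly one position further along in the same document.
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, ())
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


def near_search(index, a, b, max_gap):
    """Return ids of documents where a and b occur within max_gap words."""
    hits = set()
    for doc_id, pos_a in index.get(a.lower(), {}).items():
        pos_b = index.get(b.lower(), {}).get(doc_id, [])
        if any(abs(x - y) <= max_gap for x in pos_a for y in pos_b):
            hits.add(doc_id)
    return hits
```

With an index built from a dict of {article_id: text}, something like
phrase_search(index, "student senate") returns only the articles in which
those two words are adjacent and in that order. A real engine also has to
deal with stemming, stop words, ranking and keeping the index on disk, which
is where most of the difficulty lies.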

If Swish-e won't work, another option might be Lucene:

http://lucene.apache.org/java/docs/

It's been a few years, but when I checked into it Lucene was quite good as
well. It's written in Java, which may be an issue if you're not already
running servlets. It's surprisingly fast, especially considering it's Java.

Another option is ht://Dig:

http://htdig.org/

Last I checked, it didn't do phrase matching, but it's quite mature. It has
been around a long time and several people are using it. It's the easiest one
I've seen where setup is concerned. If you don't require phrase matching, it's
pretty decent.

All of the ones I've listed use an index and scale pretty well. I wouldn't
try to use them in place of teoma.com (with the possible exception of several
Lucene instances), but I bet they would work well for your application.

One could probably fill a small library (or at least a full section of a
library) with books on the subject of searching full text. 'tis not an easy
task.

>Seems like there are a lot of choices in how to set up a good search
>system and I'd like to get started on the right foot to reduce my work
>load.

Maybe I'm prejudiced, but in my opinion SQL databases are not really designed
for full-text searching. (It's been a while, but I've been burned by them for
full-text search in the past.) I suppose that for a few hundred articles
and/or highly customized search tools, an SQL database would work. (If your
articles are in XML, then such a database would be OK for searching in titles
or maybe within predetermined XML containers like <var>..</var>.)

The "issue" I take with them is that you are effectively using a database
AS an index. A database's primary goal is (or should be) data storage. Fulltext
indices are a different beast altogether.

SQL databases are excellent for setting up a prototype or "proof of concept",
but they quickly break down on larger quantities of data. (This opinion is
based on a context-aware search tool built in 1999; six years is a long time
and things may have changed.)

They do make good storage for URLs, last index times, and things like that.
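
As a rough sketch of that division of labour, the database can keep the
bookkeeping (which URL was indexed when) while the search engine keeps the
full-text index. The example below uses Python's built-in sqlite3 module as a
stand-in for MySQL; the table name "pages", its columns, and the helper
functions are assumptions made up for illustration.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the bookkeeping database with one row per URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " last_indexed REAL NOT NULL)"
    )
    return conn


def mark_indexed(conn, url, when=None):
    """Record that `url` was (re)indexed at Unix time `when` (default: now)."""
    when = time.time() if when is None else when
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, last_indexed) VALUES (?, ?)",
        (url, when),
    )
    conn.commit()


def stale_urls(conn, older_than):
    """Return URLs last indexed before the Unix time `older_than`."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE last_indexed < ? ORDER BY url",
        (older_than,),
    )
    return [url for (url,) in rows]
```

A nightly job could call stale_urls() to decide which articles to hand back
to the indexer, then mark_indexed() once each one has been processed. The
text searching itself never touches the database.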

Jamie

-- 
http://www.geniegate.com                    Custom web programming
guhzo_42@lnubb.pbz (rot13)                User Management Solutions

