Re: related pages ...




"Animesh K" <animesh1978@xxxxxxxxx> wrote in message
news:fh0c8g$1u1s$1@xxxxxxxxxxxxxxxxxxxxx
Jerry Stuckle wrote:
Animesh K wrote:
Jerry Stuckle wrote:



That's not easy. Are you keeping the articles in a database or text
files? If the former, you can search the database.


To keep the problem simpler, let's assume that each article has
tags/authors/topic and that it is stored in a database.

One can view this as a graph-theoretic problem (with the graph being
computed by PHP and stored in a database). But doing it in an efficient
way would be interesting.

Scanning the article for probable keywords is the next (and much harder)
step :)

why not approach it on a curve rather than a linear graph, without
(manually) defining the specific words that should identify the page? that
may be what you're talking about anyway. in that case, it should be less of
a big deal than you think.

if you parse the text of the pages, exclude common words (like adjectives,
articles, and verbs), and essentially reduce the page content to nouns, you
can then rank each one based on occurrence. you could also assist
yourself in this process by creating a mapping table. in that table, you
could define certain jargon that will be found in, or is unique to, your site.
that would sharpen the ranking that i just described. you could
also define the rank in other ways, like the 'common-ness' of the words left
in the reduced content. 'theory' is not a very common term in most settings,
so it may need to be treated as a more predominant descriptor of what the page
is about. make sense?
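
here is a rough sketch of that ranking, in python just to keep it short (the
same idea ports to php). everything named here is made up for illustration:
STOP_WORDS and JARGON_WEIGHT stand in for the common-word list and the mapping
table, and the rarity weight is a plain inverse-document-frequency formula i
picked, not anything from a particular library. it only strips stop words;
actually reducing the text to nouns would need a part-of-speech tagger.

```python
import math
import re
from collections import Counter

# hypothetical stop-word list; a real one would be much longer
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
              "it", "that", "this", "for", "on", "with", "as", "be", "by"}

# hypothetical mapping table: site-specific jargon -> extra weight
JARGON_WEIGHT = {"graph": 2.0, "theory": 2.0}


def tokenize(text):
    """lowercase the text, split into words, and drop the stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def rank_terms(pages):
    """pages: dict of page id -> text. returns page id -> {term: score}."""
    counts = {pid: Counter(tokenize(text)) for pid, text in pages.items()}
    total = len(pages)

    # how many pages each term appears in (its 'common-ness')
    doc_freq = Counter()
    for page_counts in counts.values():
        doc_freq.update(page_counts.keys())

    scores = {}
    for pid, page_counts in counts.items():
        scores[pid] = {}
        for term, freq in page_counts.items():
            # rarer terms get a larger weight; assumed smoothing formula
            rarity = math.log((1 + total) / (1 + doc_freq[term])) + 1.0
            scores[pid][term] = freq * rarity * JARGON_WEIGHT.get(term, 1.0)
    return scores


def similarity(a, b):
    """cosine similarity between two {term: score} dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

calling rank_terms() on all the pages and then similarity() on each pair gives
a number between 0 and 1; the highest-scoring pairs are the candidates for
"related pages".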

that's a content based way to rank similarities between pages. as for tags,
authors, and topics? well, that's pretty specific and less guessing has to
be done.
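
for the tags/authors/topic side, a minimal sketch could just measure overlap.
this assumes each article has already been loaded from the database into a
dict with "tags", "authors" and "topic" keys (hypothetical names), and the
weights 2.0 / 1.0 / 0.5 are arbitrary guesses to tune, not recommendations.

```python
def jaccard(a, b):
    """overlap between two collections: shared items / all items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related(article_id, articles, limit=5):
    """return ids of the articles most related to article_id, best first."""
    target = articles[article_id]
    scored = []
    for other_id, other in articles.items():
        if other_id == article_id:
            continue
        score = (2.0 * jaccard(target["tags"], other["tags"])
                 + 1.0 * (target["topic"] == other["topic"])
                 + 0.5 * jaccard(target["authors"], other["authors"]))
        if score > 0:
            scored.append((score, other_id))
    # highest score first; ties broken by id so the order is stable
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:limit]]
```

the scores could be computed once and stored in a table of (article,
related article, score) rows, which is the graph-in-a-database idea from
earlier in the thread.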

anyway, that's just an initial theoretical approach that stays abstract,
without having to know what any one page is about - that is, without having
to read each page and create the relationships manually.

it would also be helpful to look at case studies of
web crawlers and search engines. an awful lot has been written
about what google is doing that makes them so successful compared to others.
i mean the specific tactics and algorithms they use...not just conceptual stuff.
ironically, you can find these by googling google. :)

hth,

me

