Re: read all available pages on a Website

From: Carlos Ribeiro (carribeiro_at_gmail.com)
Date: 09/13/04


Date: Mon, 13 Sep 2004 10:24:11 -0300
To: bradtilley@usa.net

Brad,

Just to clarify something other posters have said. Automatic crawling
of websites is not welcome primarily because of performance concerns.
Some webmasters may also regard it as a kind of abuse, because the
crawler is generating 'hits' and copying material for unknown reasons,
but is not seeing any ads or generating revenue. Some sites even go to
the extent of blocking access from your IP, or even from your entire
IP range, when they detect this type of behavior. Because of this,
there is a very simple protocol involving a file called "robots.txt".
Whenever your robot first enters a site, it must check this file and
follow the instructions there. It tells you what you are allowed to do
on that website.
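
As a rough illustration of that check, here is a minimal sketch using
the standard library's urllib.robotparser module (Python 3 naming).
The robot name "MyBot", the example.com URLs and the sample rules are
placeholders I made up for the example, not anything from a real site.
The rules are parsed from a string so the sketch runs without network
access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(SAMPLE_ROBOTS_TXT)

# Ask before every fetch; "MyBot" is an assumed user-agent name.
blocked = parser.can_fetch("MyBot", "http://example.com/private/x.html")
allowed = parser.can_fetch("MyBot", "http://example.com/index.html")

# Seconds to wait between requests, or None if the site does not say.
delay = parser.crawl_delay("MyBot")
```

Against a live site you would instead call parser.set_url() with the
site's robots.txt address and then parser.read() to download it, and
sleep for the crawl delay between requests when one is given.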

There are also a few other catches that you need to be aware of. First,
some sites don't have links pointing to all their pages, so it's never
possible to be completely sure of having read *all* pages. Also,
some sites have links embedded in scripts. It's not a recommended
practice, but it's common on some sites, and it may cause you
problems. And finally, there are situations where your robot may get
stuck in an "infinite site"; that's because some sites generate
pages dynamically, and your robot may end up fetching page after page
and never get out of the site. So, if you want a generic solution to
crawl any site you desire, you have to take these issues into account.
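
One common way to guard against the "infinite site" problem is to cap
both the number of pages and the link depth, and to remember which
URLs were already queued. The sketch below is only one possible
approach, not a complete crawler: the names crawl, LinkExtractor,
max_pages and max_depth are my own, and fetch is a function you supply
that returns the HTML text for a URL (so the robots.txt check and the
actual download stay under your control). Note that it only sees
plain <a href> links, so links built by scripts are still missed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100, max_depth=5):
    """Breadth-first crawl of one host, bounded by page count and depth.

    fetch is a caller-supplied function mapping a URL to HTML text.
    Returns the list of URLs fetched, in order.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}                 # URLs already queued
    queue = deque([(start_url, 0)])    # (url, depth) pairs
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue                   # do not follow links any deeper
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With these limits the robot always stops, even on a site that keeps
generating new pages, at the cost of possibly missing pages that lie
beyond the limits.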

Best regards,

-- 
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: carribeiro@gmail.com
mail: carribeiro@yahoo.com

