Re: read all available pages on a Website
From: Carlos Ribeiro (carribeiro_at_gmail.com)
Date: 09/13/04
- Next message: Thorsten Kampe: "Re: Python or 4NT? With a question or two about popen()"
- Previous message: Marc Jeurissen: "Re: Property with parameter..."
- In reply to: Brad Tilley: "read all available pages on a Website"
- Next in thread: Michael Foord: "Re: read all available pages on a Website"
Date: Mon, 13 Sep 2004 10:24:11 -0300
To: bradtilley@usa.net
Brad,
Just to clarify something other posters have said: automatic crawling
of websites is unwelcome primarily because of performance concerns.
Some webmasters may also regard it as a kind of abuse, because
the crawler is generating 'hits' and copying material for unknown
reasons, but is not seeing any ads or generating revenue. Some sites
even go to the extent of blocking access from your IP, or even from
your entire IP range, when they detect this type of behavior. Because
of this, there is a very simple protocol involving a file called
"robots.txt". Whenever your robot first enters a site, it must check
this file and follow the instructions there. It will tell you what
you are allowed to do on that website.
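As a rough illustration of the robots.txt check, here is a minimal
sketch using the standard library's urllib.robotparser (Python 3).
The user agent name "example-bot", the helper make_checker and the
sample rules are made up for the example. A real robot would fetch
the site's own /robots.txt (for instance with set_url() and read())
rather than parse a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_lines, user_agent="example-bot"):
    """Return a function that says whether a URL may be fetched.

    robots_lines is the content of a robots.txt file as a list of
    lines; user_agent is a hypothetical name for our robot.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Sample rules, invented for this example: everything is allowed
# except paths under /private/.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()

allowed = make_checker(SAMPLE_ROBOTS)
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

The robot would call allowed(url) before every request and skip any
URL for which it returns False.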
There are also a few other catches that you need to be aware of.
First, some sites don't have links pointing to all their pages, so
it's never possible to be completely sure of having read *all* pages.
Also, some sites have links embedded in scripts. It's not a
recommended practice, but it's common at some sites, and it may cause
you problems. And finally, there are situations where your robot may
get stuck in an "infinite site"; that's because some sites generate
pages dynamically, and your robot may end up fetching page after page
and never get out of the site. So, if you want a generic solution to
crawl any site you desire, you have to deal with these issues.
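One possible way to guard against the "infinite site" problem is to
put hard limits on the crawl. The sketch below (Python 3) is only an
illustration: crawl, LinkCollector, fake_fetch and the limits
max_pages and max_depth are names invented here. The fetch function
is passed in as a parameter, so the example runs against a simulated
site instead of the network. Note that it only sees plain <a href>
links, so links embedded in scripts are still missed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags (script links are missed)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50, max_depth=3):
    """Breadth-first crawl of one host, bounded in pages and depth.

    fetch(url) must return the page's HTML as a string, or None if
    the page should be skipped (error, disallowed by robots.txt...).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def fake_fetch(url):
    """Simulated endless site: page n always links to page n + 1."""
    n = int(url.rsplit("=", 1)[1])
    return '<a href="/page?n=%d">next</a>' % (n + 1)


pages = crawl("http://example.com/page?n=0", fake_fetch,
              max_pages=5, max_depth=100)
print(len(pages))
```

Without the limits, the crawl of the simulated site would never end.
A real robot would also want a delay between requests, given the
performance concerns mentioned earlier.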
Best regards,
--
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: carribeiro@gmail.com
mail: carribeiro@yahoo.com