Re: How to get all the files under the webdir?



On 2/13/07, Jm lists <practicalperl@xxxxxxxxx> wrote:

I want to get all the files on some a webdir. For example:

What do you mean by "some a webdir"?

http://www.foo.com/bar/

But that dir has a default page "index.htm". So when I accessed the url
I only got the default page.

Can you tell me is there a way to fetch all the files in that dir? Thanks a lot.

Are you asking, is there a way to ask a remote webserver for
something? That would be a question about webservers, not about Perl,
wouldn't it? And the answer, probably, is that webservers don't offer
an easy way to let remote users download their entire contents, for
much the same reason that the all-you-can-eat restaurant doesn't
deliver the kitchen's entire output to your table when you arrive.

If you're looking to slurp down an entire remote site, or even a
sizable portion of one, well, that's just rude. You're abusing the
hospitality of the information provider. Unless you have a good
reason, of course; you might be the next Google, for all I know. Let's
say you are.

But are you, like Google, dealing with more than a dozen other sites?
It's not practical for Google to contact the owners of the information
in millions of separate cases; but if you've only got a few sites (or
just one?) on your list, there's no way around it: The only polite way
to get the information is to ask for it.

Why is it polite? Remember, you're consuming some of the site's
outgoing bandwidth, and they pay for that. If you ask nicely, the
information's owner may send you a CD and save you *both* time and
trouble. (Or, maybe, the information's owner may not want you to have
the entire fileset; in which case taking it anyway is even more rude.)

For the sake of argument, then, let's say you've gotten this far and
you still need a program that will fetch things for you. Let's say
you've even read the Web Robots FAQ, so you know that a good robot
won't overload the server, for example:

http://www.robotstxt.org/wc/faq.html

Sure; Perl can do web robots. Have you looked on CPAN?

http://search.cpan.org/search?query=RobotRules&mode=all
http://search.cpan.org
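
To give a rough idea of what a well-behaved robot checks before it
fetches anything (the job that CPAN modules such as WWW::RobotRules do
for you in Perl), here is a minimal sketch. It is written in Python
rather than Perl, using only the standard library's urllib.robotparser,
so that it runs without installing anything. The agent name
"examplebot", the sample robots.txt text, the file names under /bar/,
and the helpers build_rules and polite_fetch_plan are all made up for
illustration; none of them come from the site in question.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rules object."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch_plan(rules, agent, urls, default_delay=10.0):
    """Return (urls the robot may fetch, seconds to wait between requests).

    default_delay is an assumed fallback, used when the site states no
    Crawl-delay for this agent (or states a delay of zero).
    """
    allowed = [url for url in urls if rules.can_fetch(agent, url)]
    delay = rules.crawl_delay(agent) or default_delay
    return allowed, delay


if __name__ == "__main__":
    # Hypothetical robots.txt; a real robot would download the site's own.
    sample = (
        "User-agent: examplebot\n"
        "Crawl-delay: 5\n"
        "Disallow: /bar/private/\n"
    )
    rules = build_rules(sample)
    wanted = [
        "http://www.foo.com/bar/index.htm",
        "http://www.foo.com/bar/private/notes.htm",
    ]
    allowed, delay = polite_fetch_plan(rules, "examplebot", wanted)
    print(allowed, delay)
```

A real robot would then fetch only the allowed URLs, sleeping for the
returned delay between requests. Note that this only tells the robot
what it may fetch; it still has to discover the URLs by following
links, since the server won't hand over a directory listing.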

Hope this helps!

--Tom Phoenix
Stonehenge Perl Training