FAQ 9.5 How do I extract URLs?



This is an excerpt from the latest version of perlfaq9.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

9.5: How do I extract URLs?

You can easily extract all sorts of URLs from HTML with
"HTML::SimpleLinkExtor" which handles anchors, images, objects, frames,
and many other tags that can contain a URL. If you need anything more
complex, you can create your own subclass of "HTML::LinkExtor" or
"HTML::Parser". You might even use "HTML::SimpleLinkExtor" as an example
for something specifically suited to your needs.
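
As a minimal sketch of the module approach, the following assumes
"HTML::SimpleLinkExtor" is installed from CPAN. The helper name
"all_links" and the sample HTML are made up for illustration; "parse",
"links", and "img" are the methods shown in the module's synopsis.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical helper: return every link the module finds in an HTML string.
sub all_links {
    my ($html) = @_;
    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse($html);
    return $extor->links;    # links from all of the tags the module handles
}

# Sample input, invented for this example.
my $html = <<'HTML';
<a href="http://www.example.com/">home</a>
<img src="http://www.example.com/logo.png" alt="logo">
HTML

print "$_\n" for all_links($html);
```

If you only want one kind of link, the same object also offers per-tag
methods such as "$extor->a" and "$extor->img".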

You can use "URI::Find" to extract URLs from an arbitrary text document.
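
A short sketch of that approach, assuming "URI::Find" is installed from
CPAN. The helper name "find_urls" and the sample text are hypothetical.
The callback receives a URI object and the text as it appeared, and
whatever it returns replaces the original text, so returning the
original leaves the document unchanged.

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: collect every URL that URI::Find recognizes in $text.
sub find_urls {
    my ($text) = @_;
    my @found;
    my $finder = URI::Find->new(
        sub {
            my ( $uri, $original ) = @_;    # URI object, text as it appeared
            push @found, "$uri";            # stringify the URI object
            return $original;               # leave the text unchanged
        }
    );
    $finder->find( \$text );                # find() takes a reference to the text
    return @found;
}

# Sample input, invented for this example.
my $text = 'See http://www.example.com/ or ftp://ftp.example.org/pub/ for details.';
print "$_\n" for find_urls($text);
```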

Less complete solutions involving regular expressions can save you a lot
of processing time if you know that the input is simple. One solution
from Tom Christiansen runs 100 times faster than most module-based
approaches, but it only extracts URLs from anchors where the first
attribute is HREF and there are no other attributes.

    #!/usr/bin/perl -n00
    # qxurl - tchrist@xxxxxxxx
    print "$2\n" while m{
        < \s*
          A \s+ HREF \s* = \s* (["']) (.*?) \g1
        \s* >
    }gsix;
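
The same pattern can also be used inside a larger program rather than
as a "-n00" filter. The sketch below uses only core Perl; the helper
name "simple_hrefs" and the sample HTML are invented for illustration.
It shows the limitation described above: the second anchor is skipped
because HREF is not its first attribute.

```perl
use strict;
use warnings;
use 5.010;    # the \g1 backreference syntax needs Perl 5.10 or later

# Hypothetical helper: return the HREF values matched by the pattern above.
sub simple_hrefs {
    my ($html) = @_;
    my @urls;
    while (
        $html =~ m{
            < \s*
              A \s+ HREF \s* = \s* (["']) (.*?) \g1
            \s* >
        }gsix
      )
    {
        push @urls, $2;    # $1 is the quote character, $2 is the URL
    }
    return @urls;
}

# Sample input, invented for this example.
my $html = <<'HTML';
<a href="http://www.example.com/">one</a>
<a class="x" href="http://www.example.org/">two</a>
HTML

print "$_\n" for simple_hrefs($html);    # prints only http://www.example.com/
```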



--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.