Script Help

From: Kev (karigna_at_verizon.net)
Date: 10/30/03


Date: Thu, 30 Oct 2003 21:22:35 GMT

I'm writing a script that is part of a larger script to index a defined list
of websites. The portion I'm working on finds all pages ending in .htm /
.html so that I can search those pages and index them. I got the script to
map out all the links. Can anyone help me eliminate the non-.htm / .html
links it collects?

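#!/usr/bin/perl

use strict;
use warnings;

use HTML::LinkExtor;
use LWP::Simple;

my $base_url = "http://www.cnn.com";

# get() returns undef on failure, so check before parsing.
my $html = get($base_url);
die "Could not fetch $base_url\n" unless defined $html;

# No callback (undef): links are collected and returned by links().
# Passing $base_url makes relative links absolute.
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($html);
$parser->eof;
my @links = $parser->links;

# Each entry is [ tag, attr_name => attr_value, ... ].
my %seen;
foreach my $linkarray (@links)
{
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element)
    {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}

for (sort keys %seen)
{
    print $_, "\n";
}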

K.
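
One possible way to drop the non-.htm / .html links, as a minimal sketch
rather than a definitive answer: test each collected URL with a regex before
keeping it. The helper name is_html_link and the sample @links list are
made up for illustration (the list stands in for the keys of %seen), and the
sketch assumes that stripping the query string and fragment and then
checking the ending is good enough for this indexer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): returns 1 when the URL,
# ignoring any query string or fragment, ends in .htm or .html.
sub is_html_link {
    my ($url) = @_;
    return 0 unless defined $url;
    # Copy the URL and drop everything from the first '?' or '#' onward.
    (my $path = $url) =~ s/[?#].*\z//s;
    # Case-insensitive match on a trailing .htm or .html.
    return $path =~ /\.html?\z/i ? 1 : 0;
}

# Sample data standing in for the keys of %seen (assumed values).
my @links = (
    'http://www.cnn.com/index.html',
    'http://www.cnn.com/images/logo.gif',
    'http://www.cnn.com/WORLD/page.htm?ref=top',
    'http://www.cnn.com/',
);

# Keep only the links that look like .htm / .html pages.
my @html_links = grep { is_html_link($_) } @links;
print "$_\n" for sort @html_links;
```

In the script above, the same test could go in the final loop, e.g.
"for (grep { is_html_link($_) } sort keys %seen)". Note that this also
skips directory-style URLs such as http://www.cnn.com/ even though they may
serve HTML, so whether that is acceptable depends on what should be indexed.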