HTML::TokeParser and Matching
- From: Sharif Islam <mislam@xxxxxxxxxxxxx>
- Date: Tue, 28 Nov 2006 15:42:39 -0600
I have a webpage where I need to look for the string 'Matches: n' (where n is any number). The HTML is not well structured, so I am having difficulty parsing the right part. If the match count is not zero, I want to grab the remaining text (HTML below).
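use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

# Download the page with LWP::Simple's get(), which returns undef on failure
# (it does not set $!, so die with an explicit message instead).
my $url     = "http://www.somewebpage.com/id=396";
my $content = get($url) or die "Could not fetch $url\n";
my $stream  = HTML::TokeParser->new( \$content ) or die "Could not create parser\n";

# Walk every <td> cell and print the ones whose text mentions "Matches".
# In the sample below the number sits in the next cell, so only the label is printed.
while ( $stream->get_tag("td") ) {
    my $text = $stream->get_trimmed_text("/td");
    print "$text\n" if $text =~ /Matches/;
}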
____HTML___
<tr><td>Matches:</td><td>3</td></table>
<hr size=1>
<table width="100%" cellpadding=0 cellspacing=0>
<tr><td nowrap><br>
I want to GRAB THIS PART <br>
<img hspace=2 src="/gif/dot.gif" alt=" "><A HREF="http://www.somepage.com">Link1</A><br>
</td><td nowrap><br><br> 01/31/2007<br></td>
<tr>
<td nowrap><br>
I want to GRAB THIS ALSO<br>
<img hspace=2 src="/gif/dot.gif" alt=" ">I want to GRAB THIS ALSO<br>
<img hspace=4 src="/gif/dot.gif" alt=" ">I want to GRAB THIS ALSO<br>
<img hspace=6 src="/gif/dot.gif" alt=" "><A HREF="www.www.com">Link2</A><br>
</td><td nowrap><br><br><br><br> 02/01/2007+<br></td>
<tr>
<td nowrap><br>
.......
</table>
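One possible approach, offered only as a sketch: since the label and the number sit in separate cells in the sample, read the cell after 'Matches:' to get the count, and keep collecting the text of the remaining cells only when that count is not zero. The helper name grab_after_matches and the inline $html string (a shortened, properly closed copy of the sample above) are made up for illustration; a real script would pass in the content returned by get().

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened copy of the sample markup (an assumption for illustration;
# the real page would be fetched with LWP::Simple's get()).
my $html = <<'HTML';
<table><tr><td>Matches:</td><td>3</td></table>
<hr size=1>
<table width="100%" cellpadding=0 cellspacing=0>
<tr><td nowrap><br>
I want to GRAB THIS PART <br>
<img hspace=2 src="/gif/dot.gif" alt=" "><A HREF="http://www.somepage.com">Link1</A><br>
</td><td nowrap><br><br> 01/31/2007<br></td>
</table>
HTML

# Hypothetical helper: returns (count, cell texts...) or an empty list
# when the label is missing or the count is zero.
sub grab_after_matches {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Could not create parser\n";

    my $count;
    while ($stream->get_tag('td')) {
        my $text = $stream->get_trimmed_text('/td');
        next unless $text =~ /Matches:\s*(\d+)?/;
        $count = $1;
        # If the number is not in the same cell, look in the next one.
        if (!defined $count && $stream->get_tag('td')) {
            ($count) = $stream->get_trimmed_text('/td') =~ /(\d+)/;
        }
        last;
    }
    return unless $count;

    # Collect the text of every remaining non-empty cell.
    my @cells;
    while ($stream->get_tag('td')) {
        my $text = $stream->get_trimmed_text('/td');
        push @cells, $text if length $text;
    }
    return ($count, @cells);
}

my ($count, @cells) = grab_after_matches($html);
if ($count) {
    print "Matches: $count\n";
    print "  $_\n" for @cells;
} else {
    print "No matches\n";
}
```

Note that get_trimmed_text("/td") reads up to the next </td>, so a cell that is never closed (like the last one in the sample) would swallow everything up to the next </td> or the end of the document. The link text is kept but the HREF values are not; those would need get_tag("a") and a look at the tag's attributes.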