Parsing HTML
- From: scott@xxxxxxxxxxxxxxx (Scott Taylor)
- Date: Mon, 29 Aug 2005 10:33:16 -0700 (PDT)
Hi,
I suck at regex, but getting better. :)
I'm probably reinventing the wheel here, but I tried to get along with
HTML::Parser and just couldn't get it to do anything. Too confusing, I
think.
I simply want to get a list of real words from an HTML string, minus all
the HTML stuff. For example:
$a = 'This is a line of HTML:people write strange things here<br>
and hardly ever follow proper<p>
syntax A&amp;B suck at spelling as well<br>
So I need to clean it up and strip out all<br>
words less then 3 characters in length.<p>
Later the words will go into an indexer for<br>
searching a database';
$a =~ s/<[^>]*>//gs;
$a =~ s/&amp;/&/gs; # probably need to add more like this
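@data = split (/ /,$a);
foreach $b (@data) {
    foreach $b (split (/\n/,$b)){
        foreach $b (split (/:/,$b)){
            $b =~ s/^\s+//;           # trim leading whitespace
            $b =~ s/\s+$//;           # trim trailing whitespace
            $b =~ s/\n//g;
            $b =~ s/[[:cntrl:]]//g;   # strip control characters
            $b =~ s/[,.;?-]//g;       # hyphen goes last so it is literal, not a range
            if ($b and (length($b) >= 3)){
                print "D$b\n";
            }
        }
    }
}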
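For comparison, the same steps (strip tags, decode entities, split into words, drop short words) can probably be collapsed into a single split and grep. This is only a rough sketch using core Perl: the helper name extract_words and its $min parameter are made up for illustration, dropping unknown entities is an assumption, and a regex that strips tags is not a real HTML parser, so it will trip over things like a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are illustrative only):
# returns the words of at least $min characters found in an HTML string.
sub extract_words {
    my ($html, $min) = @_;
    $min = 3 unless defined $min;

    (my $text = $html) =~ s/<[^>]*>/ /g;   # replace each tag with a space
    $text =~ s/&amp;/&/g;                   # decode the one entity handled so far
    $text =~ s/&[#\w]+;/ /g;                # assumption: drop any other entity

    # Split on every run of characters that is not a letter, digit, or
    # apostrophe, then keep only the words that are long enough.
    return grep { length($_) >= $min } split /[^A-Za-z0-9']+/, $text;
}

my @words = extract_words('A&amp;B suck at<br>spelling: as well');
print "$_\n" for @words;   # prints suck, spelling, well (one per line)
```

Replacing tags with a space instead of nothing keeps words on either side of a tag from running together, and splitting on anything that is not a word character takes care of spaces, newlines, colons, and punctuation in one pass.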
Is there a better, maybe more elegant, way to do this? I don't mind
using HTML::Parser if I could only figure out how.
Cheers.
--
Scott