Parsing HTML




Hi,

I suck at regex, but I'm getting better. :)

I'm probably reinventing the wheel here, but I tried to get along with
HTML::Parser and just couldn't get it to do anything. Too confusing, I
think.

I simply want to get a list of real words from an HTML string, minus all
the HTML stuff. For example:

$a = 'This is a line of HTML:people write strange things here<br>
and hardly ever follow proper<p>
syntax A&amp;B suck at spelling as well<br>
So I need to clean it up and strip out all<br>

words less then 3 characters in length.<p>

Later the words will go into an indexer for<br>
searching a database';

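$a =~ s/<[^>]*>//gs;               # strip anything that looks like a tag
$a =~ s/&amp;/&/gs; # probably need to add more like this
@data = split (/ /,$a);
foreach $b (@data) {
    foreach $b (split (/\n/,$b)){
        foreach $b (split (/:/,$b)){
            $b =~ s/^\s+//;        # trim leading whitespace
            $b =~ s/\s+$//;        # trim trailing whitespace
            $b =~ s/\n//g;
            $b =~ s/\r//g;         # strip carriage returns
            $b =~ s/[,.;?-]//gs;   # hyphen last, so it is not read as a range
            # keep only words of 3 or more characters
            if ($b and (length($b) >= 3)){
                print "D$b\n";
            }
        }
    }
}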

Is there a better, maybe more elegant, way to do this? I don't mind
using HTML::Parser if I could only figure out how.
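
To show the kind of thing I have in mind, here is a rough single-pass
sketch of the same idea using only core Perl. It is an illustration
under assumptions, not a tested solution: the name extract_words, the
small entity table, and the rule that a "word" is a run of letters,
digits and apostrophes are all made up for this example. It still
relies on the same naive tag-stripping regex, so comments, scripts and
malformed tags would not be handled properly.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): returns the words of at
# least $min_len characters found in an HTML fragment.
sub extract_words {
    my ($html, $min_len) = @_;
    $min_len = 3 unless defined $min_len;

    # Small, incomplete entity table (an assumption); extend as needed.
    my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');

    # Replace tags with a space so that "one<br>two" stays two words.
    (my $text = $html) =~ s/<[^>]*>/ /gs;

    # Decode the named entities we know; unknown ones become a space.
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : ' '/ge;

    # Every run of letters, digits and apostrophes counts as one word,
    # so spaces, newlines, colons and punctuation all act as separators.
    return grep { length($_) >= $min_len } $text =~ /([A-Za-z0-9']+)/g;
}

my @words = extract_words("This is a line of HTML:people write<br>\nstrange things A&amp;B here<p>");
print "$_\n" for @words;
```

Matching the words directly, rather than splitting three times and
then trimming, removes the nested loops. The minimum length is a
parameter, so the indexer can pick its own cutoff.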

Cheers.

--
Scott