need help reading source code: HTML::Parser

ioneabu_at_yahoo.com
Date: 31 Dec 2004 08:41:54 -0800

I am curious why using regexes to parse HTML is considered so terrible,
at least in simple cases. I can see why line breaks can complicate
things, but given the relatively small size of most HTML files and the
power of today's computers, it should not be a big deal to load the whole
file into a string and remove the line breaks first.
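
To make the idea concrete, here is a rough sketch of what I mean. The
sample HTML and the variable names are made up for illustration, and I
replace line breaks with spaces rather than deleting them outright so
that words do not run together. It only covers the simple case of one
well-formed tag pair:

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would come from a file.
my $raw = "<html>\n<head><title>My\nPage</title></head>\n"
        . "<body>\n<a href=\"a.html\">one</a>\n</body>\n</html>\n";

# Slurp everything into one string (here via an in-memory filehandle).
open my $fh, '<', \$raw or die "cannot open: $!";
my $html = do { local $/; <$fh> };
close $fh;

# Replace line breaks with spaces so text split across lines is joined.
(my $flat = $html) =~ s/\r?\n/ /g;

# Naive pattern for the simple case: the text between <title> and </title>.
my ($title) = $flat =~ m{<title>(.*?)</title>}i;
print "title: $title\n";
```

This handles the easy case, but it says nothing about comments,
attributes containing ">", or unclosed tags.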

In doing a little searching through the newsgroup, I found a lot of
people saying HTML parsing with regex is always a bad idea but not
explaining clearly why.

My next thought was to read through the code of HTML::Parser and get a
general idea of how they do it or at least how complicated the process
really is.

I used IE 6 to look at the source at cpan.org and the Ctrl-F find
command to search through the document. It seems that all of the work
is done in a sub named parse, which is called like this:

$p->parse();

I have searched up and down the source for HTML::Parser and I cannot
find a sub parse. There is a sub parse_file which calls parse.

I searched for any use, require, or do statements and found:

require HTML::Entities;

which I thought might be useful, but was not what I was looking for.

So where is this parse sub? If it is not in HTML::Parser, where is it
and how is HTML::Parser importing it?
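
One way to narrow this down without reading more source might be to ask
Perl itself at run time. The following is only a sketch: describe_sub is
a made-up helper name, and it assumes HTML::Parser is installed (it
prints a notice otherwise). It uses can() to see whether the method
exists and the core B module to report which package the sub belongs to
and whether it is ordinary Perl or compiled (XS) code:

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report where a method comes from.
sub describe_sub {
    my ($class, $name) = @_;
    my $code = $class->can($name)
        or return "$class has no method '$name'";
    my $cv  = B::svref_2object($code);
    my $pkg = $cv->GV->STASH->NAME;    # package the sub was compiled into
    my $how = $cv->XSUB ? 'compiled (XS) code' : 'Perl code';
    return "$name is defined in $pkg as $how";
}

# Only try HTML::Parser if it is installed on this machine.
if (eval { require HTML::Parser; 1 }) {
    print describe_sub('HTML::Parser', $_), "\n" for qw(parse parse_file);
}
else {
    print "HTML::Parser is not installed here\n";
}
```

If a sub is reported as compiled code, that would explain why searching
the .pm file for "sub parse" finds nothing.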

Thanks!

wana


