Re: Clipping ALL Occurrences of a Regex in an HTML File?

From: Chris Devers (cdevers_at_pobox.com)
Date: 02/09/05


Date: Wed, 9 Feb 2005 15:03:28 -0500 (EST)
To: Dan Armstrong <ddarmstrong@gmail.com>

On Wed, 9 Feb 2005, Dan Armstrong wrote:

> I'm using a regular expression to extract text from an html file.

Why?

Regular expressions are really bad at analyzing complex, frequently
malformed data like HTML. Your request is an example of that: you're
matching on a very specific <font> tag, but what if the tag is written
differently? Legit HTML can vary the attribute order, the letter case,
and the quoting, so that tags like these are all functionally identical:

    <FONT SIZE=2 COLOR="#0000FF">
    <FONT COLOR="#0000FF" SIZE=2>
    <font size="2" color="#0000FF">
    <font size="2" color="#00F">

These would all need separate expressions, or an over-complex expression
to capture them all at once. It's painful and there's a vast number of
such quirks to account for.
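
To make that concrete, here is a rough sketch. It is written in Python
with the standard-library html.parser module, not in Perl, purely so it
runs with nothing extra installed. The names STRICT, AttrCollector, and
parse_tag are made up for this illustration. It checks the four tags
above against a pattern written for the first spelling only, then reads
the same tags with a parser:

```python
import re
from html.parser import HTMLParser

TAGS = [
    '<FONT SIZE=2 COLOR="#0000FF">',
    '<FONT COLOR="#0000FF" SIZE=2>',
    '<font size="2" color="#0000FF">',
    '<font size="2" color="#00F">',
]

# Hypothetical pattern, written against the first spelling only.
STRICT = re.compile(r'<FONT SIZE=2 COLOR="#0000FF">')


class AttrCollector(HTMLParser):
    """Record each start tag as (tag_name, {attribute: value})."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and strips
        # the quotes from attribute values before calling this.
        self.seen.append((tag, dict(attrs)))


def parse_tag(text):
    """Return the first start tag found in text."""
    collector = AttrCollector()
    collector.feed(text)
    collector.close()
    return collector.seen[0]


regex_hits = [bool(STRICT.search(t)) for t in TAGS]
parsed = [parse_tag(t) for t in TAGS]
```

The pattern matches only the first tag. The parser reports the first
three as the same tag with the same attributes, and the fourth differs
only in its color value ("#00F"), which is then a simple string
comparison to handle.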

Why bother fighting it this way?

You're *much* better off if you attack the problem with a proper parser,
such as HTML::Parser, HTML::SimpleParse, or HTML::TokeParser::Simple:

    <http://cpan.uwinnipeg.ca/dist/HTML-Parser>
    <http://cpan.uwinnipeg.ca/dist/HTML-SimpleParse>
    <http://cpan.uwinnipeg.ca/dist/HTML-TokeParser-Simple>

Each of these has a small learning curve, but once you get going with
one of them, analyzing data like HTML gets *much* easier to do.
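
For a feel of what the parser approach looks like, here is a minimal
sketch. It uses Python's standard-library html.parser rather than any of
the Perl modules listed above, only so that it runs without installing
anything; the Perl modules have their own interfaces, which this does
not reproduce. The class FontTextExtractor, the function
extract_blue_text, the SAMPLE markup, and the choice of "blue" as the
color to match are all assumptions for the example:

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collect the text inside every <font> element colored blue."""

    BLUE = {"#0000ff", "#00f"}  # spellings treated as the same color

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching <font>
        self.chunks = []    # text pieces of the current match
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            return
        if self.depth:
            self.depth += 1  # nested <font> inside a match
        elif (dict(attrs).get("color") or "").lower() in self.BLUE:
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_blue_text(html):
    """Return the text of every blue <font> element, in document order."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """
<p><FONT SIZE=2 COLOR="#0000FF">first</FONT> filler
<font color="#00F" size="2">second</font>
<font size="2" color="red">skipped</font></p>
"""
```

Calling extract_blue_text(SAMPLE) picks up every matching occurrence in
one pass, whatever the attribute order, case, or quoting, which is the
part that gets painful with a hand-written expression.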

The path you're on now really isn't worth bothering with. Use a parser.

-- 
Chris Devers

