Re: Clipping ALL Occurrences of a Regex in an HTML File?
From: Chris Devers (cdevers_at_pobox.com)
Date: 02/09/05
- Next message: Philip Tham: "RE: Hopefully simple build question!"
- Previous message: Jeff Eggen: "Re: Clipping ALL Occurrences of a Regex in an HTML File?"
- In reply to: Dan Armstrong: "Clipping ALL Occurrences of a Regex in an HTML File?"
- Next in thread: John W. Krahn: "Re: Clipping ALL Occurrences of a Regex in an HTML File?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 9 Feb 2005 15:03:28 -0500 (EST)
To: Dan Armstrong <ddarmstrong@gmail.com>
On Wed, 9 Feb 2005, Dan Armstrong wrote:
> I'm using a regular expression to extract text from an html file.
Why?
Regular expressions are really bad at analyzing complex, frequently
malformed data like HTML. Your request is an example of that: you're
matching on a very specific <font> tag, but what if the tag is
different? Legit HTML can vary the attribute order, letter case,
quoting, and color notation, so that tags like these are all
functionally identical:
<FONT SIZE=2 COLOR="#0000FF">
<FONT COLOR="#0000FF" SIZE=2>
<font size="2" color="#0000FF">
<font size="2" color="#00F">
These would all need separate expressions, or an over-complex expression
to capture them all at once. It's painful, and there are a vast number
of such quirks to account for.
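To make that concrete, here is a small sketch. It is in Python rather
than Perl, purely so it runs with nothing but a standard library, and
the pattern is a hypothetical one written against the first tag form
only (it is not the expression from your message). It checks how many
of the four equivalent tags a literal pattern actually catches:

```python
import re

# The four functionally identical tags listed above.
tags = [
    '<FONT SIZE=2 COLOR="#0000FF">',
    '<FONT COLOR="#0000FF" SIZE=2>',
    '<font size="2" color="#0000FF">',
    '<font size="2" color="#00F">',
]

# Hypothetical pattern, written with only the first form in mind.
pattern = re.compile(r'<FONT SIZE=2 COLOR="#0000FF">')

# Keep only the variants the pattern recognizes.
matched = [tag for tag in tags if pattern.search(tag)]
print(len(matched), "of", len(tags), "variants matched")
```

Only the first variant matches. Making the pattern case-insensitive
would still leave attribute order, quoting, and the short color form to
deal with.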
Why bother fighting it this way?
You're *much* better off if you attack the problem with a proper parser,
such as HTML::Parser, HTML::SimpleParse, or HTML::TokeParser::Simple:
<http://cpan.uwinnipeg.ca/dist/HTML-Parser>
<http://cpan.uwinnipeg.ca/dist/HTML-SimpleParse>
<http://cpan.uwinnipeg.ca/dist/HTML-TokeParser-Simple>
Each of these may have some small learning curve, but once you get going
with it, analyzing data like HTML gets *much* easier to do.
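For a feel of what the parser approach looks like, here is a minimal
sketch. It uses Python's standard html.parser module as a stand-in, not
any of the Perl modules above, so take it as an illustration of the
idea rather than their API. The class name FontTextExtractor and the
helper normalize_color are made up for this example, and the target
(size 2, color #0000FF) is assumed from the tags shown earlier:

```python
from html.parser import HTMLParser


def normalize_color(value):
    """Lower-case a hex color and expand #RGB shorthand to #RRGGBB."""
    value = (value or "").strip().lower()
    if len(value) == 4 and value.startswith("#"):
        value = "#" + "".join(ch * 2 for ch in value[1:])
    return value


class FontTextExtractor(HTMLParser):
    """Collect the text inside each <font size=2 color=#0000FF> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # > 0 while inside a matching <font> element
        self.results = []  # one string per matching element

    def handle_starttag(self, tag, attrs):
        # The parser hands us lower-cased tag and attribute names,
        # with any quotes around the values already removed.
        if tag != "font":
            return
        if self.depth > 0:
            self.depth += 1  # nested <font> inside a match
            return
        attrs = dict(attrs)
        if (attrs.get("size") == "2"
                and normalize_color(attrs.get("color")) == "#0000ff"):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data


html = (
    '<FONT SIZE=2 COLOR="#0000FF">one</FONT>'
    '<FONT COLOR="#0000FF" SIZE=2>two</FONT>'
    '<font size="2" color="#0000FF">three</font>'
    '<font size="2" color="#00F">four</font>'
    '<font size="3" color="#FF0000">skipped</font>'
)

parser = FontTextExtractor()
parser.feed(html)
parser.close()
print(parser.results)  # ['one', 'two', 'three', 'four']
```

The parser takes care of case, quoting, and attribute order, so the
only normalization left to write by hand is the color shorthand.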
The path you're on now really isn't worth bothering with. Use a parser.
-- Chris Devers