Re: deleting HTML tag...but not everyone

From: James Edward Gray II (james_at_grayproductions.net)
Date: 07/29/04


Date: Thu, 29 Jul 2004 08:49:49 -0500
To: Francesco del Vecchio <f_delvecchio@yahoo.com>

On Jul 29, 2004, at 7:52 AM, Francesco del Vecchio wrote:

> Hi guys,

Hello.

> I have a problem with a Regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).

This is a complex problem. Your solution is pretty naive and will only
work on a tight set of HTML, formatted as you expect it to be.

I'm not saying that's a problem. If you know your HTML will stay
simple, it isn't.

However, if you need or even think you may someday need a more robust
approach, you should check out the HTML parsing modules on the CPAN.

> I've done this:
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;

I would add:

use warnings;

This doesn't do anything for you here, but it's a good habit to build.
It often makes finding errors much easier.

> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;

A little less naive might be:

$test =~ s/<\s*br\s*>/\n/ig;

Even that wouldn't catch the now common <br /> though. Again, use a
module if this kind of thing is important.
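If you want to push the regex one step further anyway, here is a rough sketch that also accepts an optional slash before the closing bracket. The sub name br_to_newline is made up for this illustration, and the pattern still ignores attributes such as <br clear="all">.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script):
# turn <br>, <BR >, <br/> and <br /> into newlines.
sub br_to_newline {
    my ($html) = @_;
    # \s* allows stray whitespace, /? allows the optional XHTML-style slash
    $html =~ s{<\s*br\s*/?\s*>}{\n}ig;
    return $html;
}

print br_to_newline("one<br>two<BR />three< br/>four"), "\n";
```

Using s{...}{...} instead of s/.../.../ just avoids having to escape the slash inside the pattern.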

> $test=~s/<^[DIV](.*?)>//ig;

This is currently removing zero tags. You are asking for a <, followed
by the beginning of the string (^). That is impossible, and thus never
matches. I believe you meant [^DIV]+, which means one or more
characters that are not D, I, or V, but that won't work either for
reasons you pointed out.
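To see why the character class version misbehaves, here is a small sketch. The sub name class_matches and the list of test tags are assumptions chosen for illustration. A character class looks at one character at a time, so it rejects any tag that merely contains a d, i, or v somewhere.

```perl
use strict;
use warnings;

# Hypothetical helper: does the whole tag match <[^DIV]+> ?
sub class_matches {
    my ($tag) = @_;
    # [^DIV] means "any single character except D, I, or V";
    # with /i the lowercase d, i, and v are excluded as well.
    return $tag =~ /^<[^DIV]+>$/i ? 1 : 0;
}

for my $tag ('<br>', '<body>', '<font size=2>', '<DIV section=true>') {
    my $hit = class_matches($tag) ? 'matches' : 'does not match';
    print "$tag $hit\n";
}
```

Only <br> matches here. <body> is skipped because of its d, and <font size=2> because of the i in "size", so neither would be removed.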

Here's a simple fix:

$test =~ s/<(?!\/?DIV)[^>]+>//ig;

That searches for a <, then uses a negative look-ahead assertion to
verify that a DIV or /DIV is not next, and finally grabs everything up
to the next >. It works on the example you provided.
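Here is a self-contained run of the same idea, as a sketch only. The sub name strip_non_div and the shortened sample string are assumptions for this illustration. The \b is an addition to the pattern above, so that a hypothetical tag like <divider> is not mistaken for a DIV.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every tag except <div ...> and </div>.
sub strip_non_div {
    my ($html) = @_;
    # (?!/?div\b) rejects the match when the tag name is div or /div;
    # \b lets longer names that merely start with "div" still be removed.
    $html =~ s{<(?!/?div\b)[^>]+>}{}ig;
    return $html;
}

my $sample = qq{<font face="Courier New" size=2>-product three <dIV section=true>\n}
           . qq{=====</DIV></font>};
print strip_non_div($sample), "\n";
```

The font tags disappear while <dIV section=true> and </DIV> survive with their parameters intact. A > inside an attribute value would still break it.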

I know I sound like a broken record, but I must again stress how weak
this is. If the HTML contains a < DIV> (note the space), it won't work
properly. Again, parsing HTML is painful. Use a module and benefit
from the suffering of others if you need an intelligent solution.

> print $test;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hope that helps.

James

P.S. You can use whitespace (blank lines and spaces) to pretty up
your code a little. Your eyes will thank you. Don't worry, it's free!
  ;)


