Re: deleting HTML tag...but not everyone
From: James Edward Gray II (james_at_grayproductions.net)
Date: 07/29/04
- Next message: Bolcato Chris: "Endless Loop"
- Previous message: Jenda Krynicky: "Re: deleting HTML tag...but not everyone"
- In reply to: Francesco Del Vecchio: "deleting HTML tag...but not everyone"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 29 Jul 2004 08:49:49 -0500
To: Francesco del Vecchio <f_delvecchio@yahoo.com>
On Jul 29, 2004, at 7:52 AM, Francesco del Vecchio wrote:
> Hi guys,
Hello.
> I have a problem with a Regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).
This is a complex problem. Your solution is fairly naive and will only
work on a narrow set of HTML, formatted exactly as you expect it to be.
I'm not saying that's a problem. If you know your HTML will stay
simple, it isn't.
However, if you need or even think you may someday need a more robust
approach, you should check out the HTML parsing modules on the CPAN.
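To give a feel for the module route, here is a minimal sketch using
HTML::Parser, one of the HTML parsing modules on the CPAN. The choice
of module and the helper name keep_only_div_tags are assumptions made
for illustration, not something from the original message, and the
module has to be installed separately since it is not part of core
Perl.

```perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, not part of core Perl

# Hypothetical helper: copy text and DIV tags through, drop all other tags.
sub keep_only_div_tags {
    my ($html) = @_;
    my $out = '';

    # Shared handler for start and end tags. "tagname" arrives lowercased,
    # "text" is the tag exactly as it appeared in the source.
    my $copy_div = sub {
        my ($tagname, $text) = @_;
        $out .= $text if $tagname eq 'div';
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $copy_div, 'tagname, text' ],
        end_h       => [ $copy_div, 'tagname, text' ],
        text_h      => [ sub { $out .= $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;    # flush anything still buffered
    return $out;
}

print keep_only_div_tags('<p>a<br /></p><DIV class="x">b</DIV>'), "\n";
```

Because no handlers are registered for comments or declarations, those
are simply dropped in this sketch. Whether that is what you want
depends on your input.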
> I've done this:
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;
I would add:
use warnings;
This doesn't do anything for you here, but it's a good habit to build.
It often makes finding errors much easier.
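As a small illustration of the kind of error warnings can surface, the
sketch below mistypes a hash key. The %price hash and the label_for
helper are made up for this example. Under strict alone the typo
passes silently; with warnings enabled Perl reports the use of an
uninitialized value.

```perl
use strict;
use warnings;

my %price = ( apple => 3 );    # hypothetical data

# Hypothetical helper: build a label from the price table.
sub label_for {
    my ($item) = @_;
    return "price: " . $price{$item};
}

my @caught;
{
    # Collect warnings in an array so they can be inspected afterwards.
    local $SIG{__WARN__} = sub { push @caught, @_ };
    label_for('aple');    # typo: the key does not exist, so the value is undef
}

print "warning caught: $_" for @caught;
```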
> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;
A little less naive might be:
$test =~ s/<\s*br\s*>/\n/ig;
Even that wouldn't catch the now common <br /> though. Again, use a
module if this kind of thing is important.
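If you do stay with a regex, one possible extension allows an optional
slash before the closing bracket. This is only a sketch (the helper
name br_to_newline is made up here), and it still misses a br tag that
carries attributes, such as <br clear="all">.

```perl
use strict;
use warnings;

# Hypothetical helper: turn <br>, <BR/>, <br /> and < br > into newlines.
sub br_to_newline {
    my ($html) = @_;
    # Optional whitespace around the tag name, then an optional slash.
    $html =~ s{<\s*br\s*/?\s*>}{\n}ig;
    return $html;
}

print br_to_newline("one<br>two<BR/>three<br />four< br >five"), "\n";
```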
> $test=~s/<^[DIV](.*?)>//ig;
This is currently removing zero tags. Outside a character class, ^
anchors the match to the beginning of the string, so you are asking
for a < followed by the beginning of the string. That is impossible,
and thus never matches. I believe you meant [^DIV]+, which means one
or more characters that are not D, I, or V, but that won't work either
for reasons you pointed out.
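Both behaviours are easy to check on a small string. The sample HTML
below is made up for this sketch. The anchored pattern leaves the
input untouched, while the [^DIV]+ version skips any tag whose name
contains a d, i, or v (such as body) and, being greedy, swallows the
text between <p> and </p> along with the tags.

```perl
use strict;
use warnings;

my $html = '<body><p>Hello</p><div class="x">Keep</div></body>';    # made-up sample

# Original pattern: "<" followed by start-of-string can never match.
(my $anchored = $html) =~ s/<^[DIV](.*?)>//ig;

# Negated class: one or more characters that are not D, I, or V, then ">".
(my $negated = $html) =~ s/<[^DIV]+>//ig;

print "$anchored\n";    # same as the input
print "$negated\n";     # <body><div class="x">Keep</div></body>
```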
Here's a simple fix:
$test =~ s/<(?!\/?DIV)[^>]+>//ig;
That searches for a <, then uses a negative look-ahead assertion to
verify that a DIV or /DIV is not next, and finally grabs everything up
to the next >. It works on the example you provided.
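The same pattern can be spelled out with the /x modifier so that each
piece carries its own comment. The wrapper name strip_tags_except_div
and the shortened sample string are assumptions for this sketch. The
regex itself is unchanged apart from its layout and delimiters.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub strip_tags_except_div {
    my ($html) = @_;
    $html =~ s{
        <               # start of a tag
        (?! /? DIV )    # negative look-ahead: DIV or /DIV must not come next
        [^>]+           # everything up to the closing bracket
        >               # end of the tag
    }{}xig;
    return $html;
}

my $sample = '<font size=2>-product two <br><D>-product three '
           . '<dIV section=true>====<Br></DIV></font>';

print strip_tags_except_div($sample), "\n";
# -product two -product three <dIV section=true>====</DIV>
```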
I know I sound like a broken record, but I must again stress how weak
this is. If the HTML contains a < DIV> (note the space), it won't work
properly. Again, parsing HTML is painful. If you need an intelligent
solution, use a module and benefit from the suffering of others.
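For the specific < DIV> case, a slightly more tolerant variant of the
regex might look like the sketch below. The helper name
strip_tags_except_div_loose is made up, and allowing whitespace plus a
word boundary after DIV is my assumption about what "working properly"
should mean here. It is still not a parser: a > inside an attribute
value, for instance, will cut a tag short.

```perl
use strict;
use warnings;

# Hypothetical helper: like the look-ahead fix, but tolerant of spaces
# around the slash and tag name, and strict about where "div" ends.
sub strip_tags_except_div_loose {
    my ($html) = @_;
    # \s* allows "< DIV>" and "</ div>"; \b stops "<divider>" from
    # being mistaken for a DIV tag.
    $html =~ s{<(?!\s*/?\s*div\b)[^>]+>}{}ig;
    return $html;
}

print strip_tags_except_div_loose('<p>a</p>< DIV class="x">b</ div><divider>c'), "\n";
# a< DIV class="x">b</ div>c
```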
> print $test;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hope that helps.
James
P.S. You can use whitespace (blank lines and spaces) to pretty up
your code a little. Your eyes will thank you. Don't worry, it's free!
;)