FAQ 9.4 How do I remove HTML from a string?



This message is one of several periodic postings to comp.lang.perl.misc
intended to make it easier for Perl programmers to find answers to
common questions. The core of this message represents an excerpt
from the documentation provided with Perl.

--------------------------------------------------------------------

9.4: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser
from CPAN. Another mostly correct way is to use HTML::FormatText which
not only removes HTML but also attempts to do a little simple formatting
of the resulting plain text.
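
As a rough illustration of the HTML::Parser route, here is a minimal sketch. It assumes HTML::Parser (version 3 API) is installed from CPAN; the helper name strip_html is made up for this example and is not part of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text of an HTML string with tags
# and comments dropped and character entities decoded.
sub strip_html {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip these elements together with everything inside them
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;    # tell the parser there is no more input
    return $text;
}

print strip_html('<p>1 &lt; 2 <img src="foo.gif" alt="A > B"> done</p>'), "\n";
```

Only a text handler is registered, so tags and comments are simply never reported. Whether script and style contents should be discarded is a design choice; drop the ignore_elements call if you want to keep them.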

Many folks attempt a simple-minded regular expression approach, like
"s/<.*?>//g", but that fails in many cases because the tags may continue
over line breaks, they may contain quoted angle-brackets, or HTML
comments may be present. Plus, folks forget to convert entities--like
"&lt;" for example.

Here's one "simple-minded" approach that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program
in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a
solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on
text like this:

<!-- This section commented out.
<B>You can't see me!</B>
-->
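
One way to check a candidate against cases like these is to run it over them directly. The sketch below does that for the one-line substitution shown earlier; the function name regex_strip and the %case hash are assumptions made for this example.

```perl
use strict;
use warnings;

# Function form of the substitution from the one-liner above.
sub regex_strip {
    my ($s) = @_;
    $s =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $s;
}

# A few of the tricky inputs listed above.
my %case = (
    quoted  => qq{<IMG SRC = "foo.gif"\nALT = "A > B">},
    comment => qq{<!-- <A comment> -->},
    script  => qq{<script>if (a<b && a>c)</script>},
);

# Print each result in brackets so leftover whitespace is visible.
for my $name (sort keys %case) {
    printf "%-8s => [%s]\n", $name, regex_strip($case{$name});
}
```

The quoted, multi-line tag is removed cleanly. The comment leaves a stray " -->" behind, and the script body is mangled to "if (ac)" because "<b && a>" looks like a tag.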



--------------------------------------------------------------------

Documents such as this have been called "Answers to Frequently
Asked Questions" or FAQ for short. They represent an important
part of the Usenet tradition. They serve to reduce the volume of
redundant traffic on a newsgroup by providing quality answers to
questions that keep coming up.

If you are somehow irritated by seeing these postings, you are free
to ignore them or add the sender to your killfile. If you find
errors or other problems with these postings, please send corrections
or comments to the posting email address or to the maintainers as
directed in the perlfaq manual page.

Note that the FAQ text posted by this server may have been modified
from that distributed in the stable Perl release. It may have been
edited to reflect the additions, changes and corrections provided
by respondents, reviewers, and critics to previous postings of
these FAQ. The complete text of these FAQ is available on request.

The perlfaq manual page contains the following copyright notice.

AUTHOR AND COPYRIGHT

Copyright (c) 1997-2002 Tom Christiansen and Nathan
Torkington, and other contributors as noted. All rights
reserved.

This posting is provided in the hope that it will be useful but
does not represent a commitment or contract of any kind on the part
of the contributors, authors or their agents.