Re: "negative" regexp
- From: "Petr Vileta" <stoupa@xxxxxxxxxxxxx>
- Date: Thu, 31 Jan 2008 01:54:11 +0100
Michele Dondi wrote:
> On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
> <stoupa@xxxxxxxxxxxxx> wrote:
>> I'm tending to not use HTML parsers because these construct huge
>> hashes and this is usually not needed for my purposes.
> Huh?!? Evidence?

Example of converting any basic HTML page to plain text.
# remove all except body content
$html=~s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
# remove all scripts
$html=~s/<script.+?<\/script>//sig;
# remove all images
$html=~s/<img\s+.+?>//sig;
# remove all html comments
$html=~s/<\!\-\-.+?\-\->//sig;
# replace possible table end-of-row or <br> with new line
$html=~s/(<\/tr>|<br>)/\n/sig;
# remove all remaining html tags
$html=~s/<.+?>//sg;
Now I have plain text. Yes, this way is not ideal, but it is quick and consumes little memory.
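As a rough sketch of how the substitutions above could be tried on a small page, they can be wrapped in a subroutine and applied to a sample string. The name html_to_text and the sample markup are assumptions made for illustration only; the substitutions themselves are the ones listed above, applied in the same order.

```perl
use strict;
use warnings;

# html_to_text is a hypothetical helper name, not from the original post.
# It applies the substitutions shown above, in the same order.
sub html_to_text {
    my ($html) = @_;
    $html =~ s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;  # keep only the body content
    $html =~ s/<script.+?<\/script>//sig;            # remove all scripts
    $html =~ s/<img\s+.+?>//sig;                     # remove all images
    $html =~ s/<\!\-\-.+?\-\->//sig;                 # remove all html comments
    $html =~ s/(<\/tr>|<br>)/\n/sig;                 # </tr> and <br> become newlines
    $html =~ s/<.+?>//sg;                            # remove all remaining tags
    return $html;
}

# Made-up sample page, used only to exercise the subroutine.
my $page = '<html><head><title>t</title></head>'
         . '<body><p>Hello<br>world</p><!-- note -->'
         . '<script>var x = 1;</script></body></html>';

print html_to_text($page), "\n";
```

Note that <br> is matched literally here, so a tag written as <br /> would be removed by the last substitution without producing a newline.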
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)
Please reply to <petr AT practisoft DOT cz>