Re: "negative" regexp



Michele Dondi wrote:
On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
<stoupa@xxxxxxxxxxxxx> wrote:

I tend not to use HTML parsers because they construct huge
hashes, and this is usually not needed for my purposes.

Huh?!? Evidence?


Here is an example that converts any basic HTML page to plain text.

# keep only the content of the body element
$html=~s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
# remove all scripts
$html=~s/<script.+?<\/script>//sig;
# remove all images
$html=~s/<img\s+.+?>//sig;
# remove all HTML comments
$html=~s/<!--.+?-->//sg;
# replace each table end-of-row or <br> (also <br/> and <br />) with a newline
$html=~s/<\/tr>|<br\s*\/?>/\n/ig;
# remove all remaining HTML tags
$html=~s/<.+?>//sg;

Now I have plain text. Yes, this approach is not ideal, but it is quick and uses little memory.
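
For anyone who wants to try this, here is a minimal sketch that wraps the substitutions above in a sub and runs them on a tiny page. The sub name html_to_text and the sample page are made up for illustration, and the line-break pattern is assumed to accept <br/> and <br /> as well as <br>. It only shows the idea on simple, well-formed input; it is not a general HTML-to-text converter.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this example only): applies the
# regex substitutions from this post, in the same order, to one HTML string.
sub html_to_text {
    my ($html) = @_;
    $html =~ s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;  # keep only the body content
    $html =~ s/<script.+?<\/script>//sig;            # drop scripts
    $html =~ s/<img\s+.+?>//sig;                     # drop images
    $html =~ s/<!--.+?-->//sg;                       # drop HTML comments
    $html =~ s/<\/tr>|<br\s*\/?>/\n/ig;              # row ends and line breaks become newlines
    $html =~ s/<.+?>//sg;                            # drop every remaining tag
    return $html;
}

# Made-up sample page; the single-quoted heredoc prevents interpolation.
my $page = <<'HTML';
<html><head><title>Demo</title></head>
<body><p>Hello<br>world</p><!-- note --><script>var x = 1;</script></body>
</html>
HTML

print html_to_text($page), "\n";
```

On this sample the result is "Hello" and "world" on two lines. Note that if the page has no body element, the first substitution does not match, so everything (including the title text) goes through the later steps.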
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)

Please reply to <petr AT practisoft DOT cz>
