Re: "negative" regexp



On Thu, 31 Jan 2008 01:54:11 +0100,
Petr Vileta <stoupa@xxxxxxxxxxxxx> wrote:
Michele Dondi wrote:
On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
<stoupa@xxxxxxxxxxxxx> wrote:

I tend not to use HTML parsers because they construct huge
hashes, and this is usually not needed for my purposes.

Huh?!? Evidence?


An example that converts any basic HTML page to plain text:

# remove all except body content
$html=~s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
# remove all scripts
$html=~s/<script.+?<\/script>//sig;
# remove all images
$html=~s/<img\s+.+?>//sig;
# remove all html comments
$html=~s/<\!\-\-.+?\-\->//sig;
# replace possible table end-of-row or <br> with new line
$html=~s/(<\/tr>|<br>)/\n/sig;
# remove all remaining html tags
$html=~s/<.+?>//sg;

Now I have plain text.

No, you don't. At least, you probably do, but that's only because HTML
files are plain text to start off with. What you do not have is a file
completely cleared of HTML markup. And you also possibly have removed bits
of text that you meant to leave in place.
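
To make both failure modes concrete, here is a small sketch that wraps the quoted substitutions, unchanged, in a sub and runs them on two tiny inputs. The sub name and the sample strings are made up for illustration; only the six substitutions come from the quoted post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The substitution chain from the quoted post, wrapped in a sub.
# (The sub name is hypothetical; the regexes are copied as posted.)
sub strip_with_regexes {
    my ($html) = @_;
    $html =~ s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
    $html =~ s/<script.+?<\/script>//sig;
    $html =~ s/<img\s+.+?>//sig;
    $html =~ s/<\!\-\-.+?\-\->//sig;
    $html =~ s/(<\/tr>|<br>)/\n/sig;
    $html =~ s/<.+?>//sg;
    return $html;
}

# Case 1: a '>' inside a quoted attribute value. The final substitution
# stops at that '>' and leaves the rest of the tag in the output.
my $leftover = strip_with_regexes(
    '<html><body><a title="a > b" href="x.html">link</a></body></html>'
);
print "[$leftover]\n";

# Case 2: an empty comment. '.+?' needs at least one character, so the
# comment regex runs on to the end of the NEXT comment and takes the
# text in between with it.
my $lost = strip_with_regexes(
    '<html><body><!---->keep me<!-- note --> done</body></html>'
);
print "[$lost]\n";
```

The first call leaves ` b" href="x.html">link` behind, which is markup that was supposed to be gone. The second returns only ` done`, so "keep me" has been deleted along with the comments.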

Yes, this way is not ideal, but it is quick and uses
little memory.

If by "not ideal" you mean incorrect, you're right.

You really need an HTML parser to do this correctly, and it's simply not
as trivial as you seem to think to roll one yourself.
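
For comparison, a minimal sketch of the same job done with a parser. It assumes the CPAN module HTML::Parser is installed (it is not part of core Perl), and the sample input is made up. Treat it as one possible starting point, not as a complete HTML-to-text converter.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, assumed to be installed

my $text = '';

my $parser = HTML::Parser->new(
    api_version => 3,
    # 'dtext' hands the handler text with entities already decoded.
    text_h  => [ sub { $text .= $_[0] }, 'dtext' ],
    # Turn <br> and </tr> into newlines, as the regex version tried to.
    start_h => [ sub { $text .= "\n" if $_[0] eq 'br' }, 'tagname' ],
    end_h   => [ sub { $text .= "\n" if $_[0] eq 'tr' }, 'tagname' ],
);

# Drop these elements together with everything inside them.
$parser->ignore_elements(qw(script style));

# Made-up sample input with a '>' in an attribute, a comment, an
# entity, and a script containing a bare '<'.
$parser->parse(
      '<html><body><a title="a > b" href="x.html">link</a><br>'
    . 'keep me<!-- note --> done &amp; dusted'
    . '<script>if (a < b) { go(); }</script></body></html>'
);
$parser->eof;

print $text, "\n";
```

No comment handler is registered, so comments are never reported, and no attribute fragments can leak into the output because the parser, not a regex, decides where each tag ends.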

You still haven't given any evidence for your statement that HTML
parsers construct huge hashes. I don't believe they necessarily, or
ever, do. Even if that was simply a clumsy attempt to make a more
general statement about why an HTML parser isn't going to work for you,
I'd still like to hear some clarification. What is the high performance
task that you need to perform on your memory starved machine that
doesn't allow an HTML parser?
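
As an illustration of why memory need not be a problem, here is a hedged sketch using the event-driven CPAN module HTML::Parser (assumed installed). It calls handlers as it goes and builds no document tree; the only state kept is what the handlers choose to keep, here two counters. The generated input and the 64-character chunk size are arbitrary stand-ins for reading a large file or socket.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, assumed to be installed

# The handlers keep two scalars of state, however large the input is.
my ($tags, $text_chars) = (0, 0);

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { $tags++ }, 'tagname' ],
    text_h  => [ sub { $text_chars += length $_[0] }, 'text' ],
);

# Stand-in for a large document; a real program would read chunks
# from a file handle instead of holding the whole string.
my $html = '<html><body>' . ('<p>some text</p>' x 1000) . '</body></html>';

# Feed the parser small pieces. Tags split across a chunk boundary
# are buffered by the parser until the rest arrives.
while (length $html) {
    $parser->parse(substr($html, 0, 64, ''));
}
$parser->eof;

print "start tags: $tags, text characters: $text_chars\n";
```

A module that builds a full tree, such as HTML::TreeBuilder, does use more memory, but that is a choice of module, not a property of HTML parsing as such.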

Martien
--
|
Martien Verbruggen | The Second Law of Thermodenial: In any closed
| mind the quantity of ignorance remains
| constant or increases.