Re: "negative" regexp
- From: Martien Verbruggen <mgjv@xxxxxxxxxxxxxxxxxx>
- Date: Thu, 31 Jan 2008 21:44:39 +1100
On Thu, 31 Jan 2008 01:54:11 +0100,
Petr Vileta <stoupa@xxxxxxxxxxxxx> wrote:
Michele Dondi wrote:
On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
<stoupa@xxxxxxxxxxxxx> wrote:
I tend not to use HTML parsers because they construct huge
hashes, and this is usually not needed for my purposes.
Huh?!? Evidence?
Example for converting any basic HTML page to plain text:
# remove all except body content
$html=~s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
# remove all scripts
$html=~s/<script.+?<\/script>//sig;
# remove all images
$html=~s/<img\s+.+?>//sig;
# remove all html comments
$html=~s/<\!\-\-.+?\-\->//sig;
# replace possible table end-of-row or <br> with new line
$html=~s/(<\/tr>|<br>)/\n/sig;
# remove all remaining html tags
$html=~s/<.+?>//sg;
Now I have plain text.
No, you don't. At least, you probably do, but that's only because HTML
files are plain text to start off with. What you do not have is a file
completely cleared of HTML markup. And you also possibly have removed bits
of text that you meant to leave in place.
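To make that concrete, here is a small sketch using core Perl only. It wraps the quoted substitutions, unchanged, in a sub and runs them on three inputs. The sub name strip_tags and the sample inputs are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# The substitutions from the quoted code, unchanged, wrapped in a sub.
# (strip_tags is a hypothetical name used only for this sketch.)
sub strip_tags {
    my ($html) = @_;
    $html =~ s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
    $html =~ s/<script.+?<\/script>//sig;
    $html =~ s/<img\s+.+?>//sig;
    $html =~ s/<\!\-\-.+?\-\->//sig;
    $html =~ s/(<\/tr>|<br>)/\n/sig;
    $html =~ s/<.+?>//sg;
    return $html;
}

# Case 1: a '>' inside an attribute value ends the "tag" too early,
# so part of the attribute leaks into the output.
my $leak = strip_tags('<html><body><a title="a > b">link</a></body></html>');

# Case 2: a bare '<' in running text starts a bogus "tag",
# so text that should have survived is deleted.
my $lost = strip_tags('<html><body><p>if x < y then stop. <b>Done</b></p></body></html>');

# Case 3: entities are markup too, and nothing here decodes them.
my $entity = strip_tags('<html><body>Tom &amp; Jerry</body></html>');

print "leak:   [$leak]\n";    # prints: leak:   [ b">link]
print "lost:   [$lost]\n";    # prints: lost:   [if x Done]
print "entity: [$entity]\n";  # prints: entity: [Tom &amp; Jerry]
```

The first and third cases leave markup behind; the second silently drops " y then stop." from the text.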
Yes, this way is not ideal, but it is quick and consumes
little memory.
If by "not ideal" you mean incorrect, you're right.
You really need an HTML parser to do this correctly, and it's simply not
as trivial as you seem to think to roll one yourself.
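As one possible illustration, here is a minimal sketch that uses HTML::Parser from CPAN (an assumption: it is not a core module and has to be installed). HTML::Parser calls handlers as it tokenizes the input instead of building a document tree, so the only thing that accumulates here is the output string. The helper name html_to_text and the sample input are hypothetical.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, assumed to be installed

# html_to_text is a hypothetical helper name for this sketch.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" passes the handler text with entities already decoded
        text_h  => [ sub { $text .= shift }, 'dtext' ],
        # emit a newline for <br> and for </tr>, as the regex version tried to
        start_h => [ sub { $text .= "\n" if $_[0] eq 'br' }, 'tagname' ],
        end_h   => [ sub { $text .= "\n" if $_[0] eq 'tr' }, 'tagname' ],
    );

    # Skip these elements together with everything inside them.
    $p->ignore_elements(qw(script style));

    $p->parse($html);
    $p->eof;
    return $text;
}

my $sample = '<p title="a > b">Tom &amp; Jerry<br>x'
           . '<script>var a = "<b>";</script></p>';
print html_to_text($sample), "\n";
```

Comments are dropped simply because no comment handler is registered. This sketch does not restrict output to the body, so text such as the page title would still come through; handling that is left out to keep the example short.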
You still haven't given any evidence for your statement that HTML
parsers construct huge hashes. I don't believe they necessarily, or
ever, do. Even if that was simply a clumsy attempt to make a more
general statement about why an HTML parser isn't going to work for you,
I'd still like to hear some clarification. What is the high-performance
task that you need to perform on your memory-starved machine that
doesn't allow an HTML parser?
Martien
--
|
Martien Verbruggen | The Second Law of Thermodenial: In any closed
| mind the quantity of ignorance remains
| constant or increases.