Re: "negative" regexp
- From: "Petr Vileta" <stoupa@xxxxxxxxxxxxx>
- Date: Thu, 31 Jan 2008 15:05:35 +0100
Martien Verbruggen wrote:
> On Thu, 31 Jan 2008 01:54:11 +0100,
> Petr Vileta <stoupa@xxxxxxxxxxxxx> wrote:
>> Michele Dondi wrote:
>>> On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
>>> <stoupa@xxxxxxxxxxxxx> wrote:
>>>> I tend not to use HTML parsers because they construct huge
>>>> hashes, and this is usually not needed for my purposes.
>>> Huh?!? Evidence?
>> An example that converts any basic HTML page to plain text:
>> # remove everything except the body content
>> $html=~s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
>> # remove all scripts
>> $html=~s/<script.+?<\/script>//sig;
>> # remove all images
>> $html=~s/<img\s+.+?>//sig;
>> # remove all HTML comments
>> $html=~s/<\!\-\-.+?\-\->//sig;
>> # replace a table end-of-row or <br> with a newline
>> $html=~s/(<\/tr>|<br>)/\n/sig;
>> # remove all remaining HTML tags
>> $html=~s/<.+?>//sg;
>> Now I have plain text.
> No, you don't. At least, you probably do, but that's only because HTML
> files are plain text to start off with. What you do not have is a file
> completely cleared of HTML markup. And you also possibly have removed
> bits of text that you meant to leave in place.

What texts? Image titles and alts? Links (anchors)? Form fields? Those are unimportant to me in this concrete case.
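
For illustration, here is a small sketch of the two failure modes mentioned in this exchange (markup left behind, and text removed by accident). It uses only the final tag-stripping substitution from the quoted code. The helper name `strip_tags` and both sample strings are made up for this example. The second sample is not strictly valid HTML (a bare `<` in text should be written `&lt;`), but pages like it do occur.

```perl
use strict;
use warnings;

# Hypothetical helper: applies only the last substitution from the
# quoted code, s/<.+?>//sg, to a copy of the input.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<.+?>//sg;
    return $html;
}

# Case 1: a '>' inside an attribute value ends the match too early,
# so part of the tag is left in the output.
my $leftover = strip_tags('<a title="x > y" href="#">link</a>');
print "[$leftover]\n";    # prints [ y" href="#">link]

# Case 2: a bare '<' and '>' in the text are treated as one tag,
# so the words between them are removed.
my $lost = strip_tags('<p>if a < b then c > d</p>');
print "[$lost]\n";        # prints [if a  d]
```

Whether either case matters depends on the pages being processed; for a known, fixed set of simple pages neither may ever show up.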

>> Yes, this way is not ideal, but it is quick and
>> consumes little memory.
> If by "not ideal" you mean incorrect, you're right.

No, I mean it is not ideal for universal use. I have a concrete goal and I use as few resources as possible. For example, if I want to extract clickable email addresses from HTML source, I need to extract all
/href=['"]*mailto:\s*(.+?)['"\s>]/
only.
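
A minimal sketch of this regex-only approach, assuming the page source is already in a string. This is not the poster's actual code: the function name `extract_mailto` and the sample HTML are hypothetical, and the pattern is a variant that uses a negated character class, so the capture stops at a quote, whitespace, `?` or `>`.

```perl
use strict;
use warnings;

# Hypothetical helper: return the address part of every mailto: link
# found in a string of HTML, using a regex only (no parser).
sub extract_mailto {
    my ($html) = @_;
    my @found;
    # href=, an optional quote, mailto:, then everything up to a quote,
    # whitespace, '?' (start of ?subject=...) or '>'.
    while ($html =~ /href\s*=\s*['"]?\s*mailto:\s*([^'"\s?>]+)/gi) {
        push @found, $1;
    }
    return @found;
}

my $sample = <<'HTML';
<p><a href="mailto:alice@example.com">Alice</a>
<a href='mailto:bob@example.org?subject=Hi'>Bob</a>
<a href="http://example.com/">not mail</a></p>
HTML

# Prints alice@example.com and bob@example.org, one per line.
print "$_\n" for extract_mailto($sample);
```

Like any regex-only approach, this also matches links inside HTML comments or scripts and does not decode entity-encoded addresses, which is the kind of trade-off being argued about here.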

> You really need a HTML parser to do this correctly, and it's simply
> not as trivial as you seem to think to roll one yourself.

Yes, an HTML parser knows how to parse correctly, but it sometimes fails on invalid HTML pages. For example, I have often seen pages generated by PHP from templates that contain the <head> or <body> tags twice or more ;-)

> You still haven't given any evidence for your statement that HTML
> parsers construct huge hashes. I don't believe they necessarily, or
> ever, do. Even if that was simply a clumsy attempt to make a more
> general statement about why an HTML parser isn't going to work for
> you, I'd still like to hear some clarification. What is the high
> performance task that you need to perform on your memory starved
> machine that doesn't allow a HTML parser?

HTML::Parser and WWW::Mechanize are good modules, but in many cases they are "too big a gun" :-)
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)
Please reply to <petr AT practisoft DOT cz>