Re: "negative" regexp



Martien Verbruggen wrote:
On Thu, 31 Jan 2008 01:54:11 +0100,
Petr Vileta <stoupa@xxxxxxxxxxxxx> wrote:
Michele Dondi wrote:
On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
<stoupa@xxxxxxxxxxxxx> wrote:

I tend not to use HTML parsers because they construct huge
hashes, and this is usually not needed for my purposes.

Huh?!? Evidence?


An example that converts any basic HTML page to plain text:

# remove all except body content
$html=~s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;
# remove all scripts
$html=~s/<script.+?<\/script>//sig;
# remove all images
$html=~s/<img\s+.+?>//sig;
# remove all html comments
$html=~s/<\!\-\-.+?\-\->//sig;
# replace possible table end-of-row or <br> with new line
$html=~s/(<\/tr>|<br>)/\n/sig;
# remove all remaining html tags
$html=~s/<.+?>//sg;

Now I have plain text.

No, you don't. At least, you probably do, but that's only because HTML
files are plain text to start off with. What you do not have is a file
completely cleared of HTML markup. And you also possibly have removed
bits of text that you meant to leave in place.
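
To make the objection above concrete, here is a minimal sketch that runs the quoted substitutions over a small made-up page. The helper name strip_html_with_regexes and the sample input are assumptions for illustration only; the input has a ">" inside an attribute value (valid HTML) and a bare "<" in the text (not strictly valid, but common in real pages).

```perl
use strict;
use warnings;

# Hypothetical helper: the substitutions quoted above, applied in order.
sub strip_html_with_regexes {
    my ($html) = @_;
    $html =~ s/^.+?<body.*?>(.+?)<\/body>.*$/$1/si;   # keep body content only
    $html =~ s/<script.+?<\/script>//sig;              # drop scripts
    $html =~ s/<img\s+.+?>//sig;                       # drop images
    $html =~ s/<\!\-\-.+?\-\->//sig;                   # drop comments
    $html =~ s/(<\/tr>|<br>)/\n/sig;                   # row ends and <br> become newlines
    $html =~ s/<.+?>//sg;                              # drop whatever looks like a tag
    return $html;
}

# Made-up input, not taken from the thread.
my $page = '<html><body><a title="a > b" href="#">link</a> 1 < 2 and 3 > 2</body></html>';

my $result = strip_html_with_regexes($page);
print "$result\n";
```

With this input, the result still contains the attribute fragment href="#"> because the tag pattern stops at the first ">", and the words "2 and 3" between the bare "<" and the following ">" are deleted along with it.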

Which text? Image titles and alt text? Links (anchors)? Form fields? Those are unimportant to me in this particular case.

Yes, this way is not ideal, but it is quick and
consumes little memory.

If by "not ideal" you mean incorrect, you're right.

No, I mean it is not ideal for universal use. I have a concrete goal and I use as few resources as possible. For example, if I want to extract clickable email addresses from HTML source, I need to extract all
/href=['"]*mailto:\s*(.+?)['"\s>]/
only.
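
A minimal sketch of that mailto extraction, assuming the pattern's final character class is meant to be closed as ['"\s>] and that a global, case-insensitive match is wanted. The helper name extract_mailto_addresses and the sample addresses are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the address part of every mailto: link.
# Anything after the address (such as ?subject=...) is kept as-is,
# because the capture only stops at a quote, whitespace, or ">".
sub extract_mailto_addresses {
    my ($html) = @_;
    my @addresses = $html =~ /href=['"]*mailto:\s*(.+?)['"\s>]/gi;
    return @addresses;
}

# Made-up sample covering double-quoted, single-quoted, and unquoted hrefs.
my $sample = <<'HTML';
<a href="mailto:alice@example.com">Alice</a>
<a href='mailto: bob@example.org'>Bob</a>
<a href=mailto:carol@example.net>Carol</a>
HTML

my @found = extract_mailto_addresses($sample);
print "$_\n" for @found;
```

In list context with /g, the match returns every captured group, so no explicit loop is needed to collect the addresses.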

You really need an HTML parser to do this correctly, and it's simply
not as trivial as you seem to think to roll one yourself.

Yes, an HTML parser knows how to parse correctly, but it sometimes fails on invalid HTML pages. For example, I have often seen pages generated by PHP from templates that contain <head> or <body> tags twice or more ;-)

You still haven't given any evidence for your statement that HTML
parsers construct huge hashes. I don't believe they necessarily, or
ever, do. Even if that was simply a clumsy attempt to make a more
general statement about why an HTML parser isn't going to work for
you, I'd still like to hear some clarification. What is the high
performance task that you need to perform on your memory-starved
machine that doesn't allow an HTML parser?
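
For comparison, a minimal sketch of plain-text extraction with HTML::Parser used in its event-driven mode, where handlers receive pieces of the document as they are parsed and no document tree is built. This assumes HTML::Parser is installed from CPAN (it is not a core module); the helper name html_to_text and the sample input are hypothetical.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, not part of core Perl

# Hypothetical helper: concatenate the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler each text chunk with entities decoded
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    # Skip these elements together with everything inside them
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    return $text;
}

# Made-up input: ">" inside an attribute value, an entity, and a script.
my $doc = '<html><body><a title="a > b" href="#">link</a> Tom &amp; Jerry'
        . '<script>var s = "<b>";</script></body></html>';

print html_to_text($doc), "\n";
```

The only state kept here is the accumulated text string, so memory use grows with the size of the extracted text rather than with a parsed representation of the page.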

HTML::Parser and WWW::Mechanize are good modules, but in many cases they are "too big a gun" :-)
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)

Please reply to <petr AT practisoft DOT cz>
