Re: html scraping



From: Michael Fesser <neti...@xxxxxx>
Not for parsing HTML! DOM and SimpleXML are the right tools here.
With DOM methods or even just a simple XPath expression you can get
all the elements you want in a _reliable_ way.

That's almost exactly what I would have said, but you said it
first, making it unnecessary for me to say it.

http://blog.mikeseth.com/index.php?/archives/1-For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS.html

Very nice essay about why *not* to get into the spaghetti-code
quagmire of trying to use regular expressions to navigate nested
syntax such as HTML, especially when closing tags are *optional*
for some elements (<p> and <li>, for example), making it *very*
difficult to find the end of a paragraph. DOM (for documents small
enough to fit in memory) or SAX (for streaming HTML that runs
"forever", where you want incremental results *as* it streams
through) is of course the way to go, if you have valid (or even
almost-valid) HTML as input.

However, somebody replied there that no HTML DOM parser exists
that can deal with missing end tags. I wanted to rebut that,
but the blog discriminates against me because I use lynx because
I'm disabled and can't get the money needed to buy a brand-new
computer that can view images on Web pages. For my situation here:
<http://www.rawbw.com/~rem/NewPub/mySituation.html>
What I see in that blog:
To prevent automated Bots from commentspamming, please enter the
string you see in the image below in the appropriate input box. Your
comment will only be submitted if the strings match. Please ensure
that your browser supports and accepts cookies, or your comment cannot
be verified correctly.
CAPTCHA
Enter the string from the spam-prevention image above:
_____
For my proposed alternative, see my PHP-based missing-word Turing
tests, sample with source here: <http://tinyurl.com/xspamx>

<OffTopic>What I wanted to post to the blog, which would have been
OnTopic there:

A couple years ago I wrote most of a two-pass DOM HTML parser,
using a simple trick:

- First pass: Tokenize in the usual (forward) direction, to produce
a list of tokens.
- Second pass: Nestize by traversing the list of tokens in the
*reverse* direction, using a simple stack:
-- If a close tag is seen, push it on the stack.
-- If an open tag is seen, check if the top item on the stack is a
closing tag that matches it:
--- If match, consider the pair to be properly nested (and of
course: pop the closing tag off the stack).
--- If no match, consider the open tag to be self-closing.
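
Here is a rough Python sketch of those two passes, written for this
post rather than taken from my original parser. The names TOKEN_RE,
tokenize and nestize are made up for illustration, and the tokenizer
assumes no comments, CDATA, or '>' inside attribute values. Text is
kept as plain strings and elements as (name, children) tuples.

```python
import re

# Hypothetical token pattern: a close tag, an open tag, or a run of text.
TOKEN_RE = re.compile(
    r"</\s*([A-Za-z][\w:-]*)\s*>|<([A-Za-z][\w:-]*)[^>]*>|([^<]+)"
)

def tokenize(html):
    """First pass: scan forward and produce a flat list of tokens."""
    tokens = []
    for m in TOKEN_RE.finditer(html):
        close, open_, text = m.groups()
        if close:
            tokens.append(("close", close.lower()))
        elif open_:
            tokens.append(("open", open_.lower()))
        elif text.strip():          # drop whitespace-only text
            tokens.append(("text", text))
    return tokens

def nestize(tokens):
    """Second pass: walk the tokens in reverse, pairing tags via a stack."""
    # Each frame is (pending close-tag name, children in reverse order).
    stack = [(None, [])]            # bottom frame collects top-level nodes
    for kind, value in reversed(tokens):
        name, children = stack[-1]
        if kind == "text":
            children.append(value)
        elif kind == "close":
            stack.append((value, []))
        elif name == value:         # open tag matches close tag on top
            stack.pop()
            stack[-1][1].append((value, children[::-1]))
        else:                       # no match: treat as self-closing
            children.append((value, []))
    # Close tags that never met an open tag: splice their content upward.
    while len(stack) > 1:
        _, orphaned = stack.pop()
        stack[-1][1].extend(orphaned)
    return stack[0][1][::-1]

tree = nestize(tokenize("<ul><li>a<li>b</ul>"))
# tree == [("ul", [("li", []), "a", ("li", []), "b"])]
```

Note what happens to the unclosed <li> tags: each becomes an empty
element, and the text after it ends up as a sibling rather than a
child.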

That simple algorithm (following the KISS rule) avoids the need for
an explicit list of which tags do and do not self-close, such as a
DTD provides, so you can just go ahead and parse almost any
SGML or XML document without needing to bother with getting an
appropriate DTD loaded first.
Also it protects against semi-bad HTML or XHTML that is missing
some of the "required" closing tags per the DTD it claims to be
using (or doesn't bother to declare the doctype in the first
place).
Caveat: It doesn't protect against missing open tags, or
equivalently spuriously-duplicated or left-behind close tags.
Caveat: It doesn't handle NET-enabled tags (SGML's "null end tag"
shorthand, e.g. <b/text/), a stupid feature of SGML that is
deprecated, and is incompatible with XML self-closing tags.

Has anybody else tried that simple algorithm?

It's DOM-only, but I suppose it could be adapted for SAX (stream,
i.e. continuation-style) SGML/XML parsing by the following
additional trick: Tokenize only until a close-tag is seen, then
immediately nestize backwards per the above algorithm until the
matching open tag is seen, self-closing any intervening open tags.
Leave the stack and list-of-earlier-tokens sitting in place, and
keep the DOM sub-object to build into the next higher level of
structure later, as you resume the tokenizing of the rest of the
stream.</OffTopic>
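
A rough Python sketch of that streaming variant, untested against
real-world HTML. The names close_open and stream_nestize are made up
for illustration. It assumes the tokens arrive as ("open", name),
("close", name) and ("text", string) tuples, and that a close tag
pairs with the nearest earlier unmatched open tag of the same name.
Finished elements are dicts, so they can't be confused with the
still-unmatched open tokens.

```python
def close_open(item):
    """Turn a still-unmatched open token into an empty (self-closed) node."""
    if isinstance(item, tuple):
        return {"tag": item[1], "children": []}
    return item

def stream_nestize(tokens):
    """Yield each element as soon as its close tag arrives."""
    pending = []    # earlier tokens and finished sub-objects, in order
    for kind, value in tokens:
        if kind != "close":
            # Keep open tokens as tuples, text as plain strings.
            pending.append((kind, value) if kind == "open" else value)
            continue
        # Walk backwards looking for the matching open tag.
        for i in range(len(pending) - 1, -1, -1):
            if pending[i] == ("open", value):
                node = {
                    "tag": value,
                    # Intervening unmatched open tags become self-closed.
                    "children": [close_open(x) for x in pending[i + 1:]],
                }
                del pending[i:]
                pending.append(node)    # kept for the enclosing element
                yield node
                break
        # No match found: a stray close tag, silently ignored here.

events = [("open", "div"), ("open", "b"), ("text", "x"), ("close", "b"),
          ("open", "p"), ("text", "y"), ("close", "div")]
for node in stream_nestize(events):
    print(node["tag"], len(node["children"]))
```

The inner <b> element is yielded before the enclosing <div>, so a
consumer sees results while the rest of the stream is still arriving.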