Re: What kind of tcl tools would help me parse and use html info?



Bruce Hartweg wrote:
> Larry W. Virden wrote:
>> I have a need to write a tool to do this:
>>
>> fetch an html http URL
>> parse the html
>> Look through the A tags for some specific phrases
>> [...]

> others have already mentioned htmlparse or an xml parser, but if
> you have invalid html these will puke (and there is still plenty
> of bad html out there). I have done web scraping in the past, and
> often a simple RE will work to yank all the links out:
>
> set RE {<a.*(?!href)href=['"]([^'"]+)['"][^>]*>(.*(?!</a>))</a>}
>
> foreach {tag href txt} [regexp -all -inline $RE $html] { [...] }


.... and then you have the other problem, namely that any
regexp you devise is likely to give wrong results on
valid HTML (there's actually quite a bit of valid HTML
out there, too ...).

> Note that this isn't perfect either: if someone has a URL with
> embedded quotes this will choke and miss it (although it only
> misses that particular link; it won't stop handling the rest
> of the file).

There are quite a few things wrong with the above regexp, actually.
(I can see four specific problems, including the one you've
already mentioned, without even looking at it too hard; and there
are no doubt many others.)
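
As a rough illustration of how a link regexp can go wrong on perfectly valid HTML, here is a minimal sketch. The pattern `naive` and the four sample links are made up for this example; this is not the regexp quoted above, and the cases shown are not necessarily the four problems alluded to.

```tcl
# Hypothetical naive pattern: lowercase tag, href as the first
# attribute, double-quoted value.
set naive {<a href="([^"]+)">}

# Four valid ways of writing a link (sample data for this sketch only).
set html {
<a href="one.html">One</a>
<A HREF='two.html'>Two</A>
<a class="x" href="three.html">Three</a>
<a href=four.html>Four</a>
}

# Collect the captured href from every match.
set hrefs {}
foreach {whole href} [regexp -all -inline $naive $html] {
    lappend hrefs $href
}

# Only the first link matches: the others differ in case, quoting,
# or attribute order, so the pattern silently skips them.
puts $hrefs
```

Each variation can be patched into the pattern one at a time, but every patch makes the expression harder to read and still leaves cases uncovered.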

The regexp/screen-scraping approach can be made to work reasonably
well as long as you're dealing with a known quantity --
if you only need to screen-scrape a specific set of known sites,
you can probably hack up a regexp that will handle the kind of HTML
that those particular sites happen to be producing at the time --
but if you need to handle arbitrary purported HTML fetched from
arbitrary web sites, you really need a general-purpose tag soup
parser.

The htmlparse module in tcllib and tDOM's HTML parser do a reasonably
good job on tag soup, IME. I'd still recommend using one of those
instead of regexps.
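
For what it's worth, the tDOM route might look something like the sketch below. It assumes the tdom package is installed; the helper name `extract_links` and the sample document are invented for illustration, and the matching against "specific phrases" is left out.

```tcl
package require tdom

# Hypothetical helper: return a flat list of href/text pairs for every
# <a> element that carries an href attribute.
proc extract_links {html} {
    # -html selects tDOM's forgiving HTML parser instead of the strict XML one.
    set doc [dom parse -html $html]
    set root [$doc documentElement]
    set links {}
    foreach node [$root selectNodes {//a[@href]}] {
        lappend links [$node getAttribute href] [$node asText]
    }
    $doc delete   ;# free the DOM tree once the data has been copied out
    return $links
}

# Sample input with unclosed <p> and <br> tags (made up for this sketch).
set html {<html><body>
<p>First: <a href="one.html">One</a><br>
<p>Second: <a class="x" href="two.html">Two</a>
</body></html>}

foreach {href text} [extract_links $html] {
    puts "$href -> $text"
}
```

The point of going through a DOM is that attribute order, quoting style and unclosed tags become the parser's problem; the XPath expression only has to say "every a element with an href".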


--Joe English