Re: What kind of tcl tools would help me parse and use html info?
- From: jenglish@xxxxxxxxxxxxx (Joe English)
- Date: 24 Mar 2006 23:36:41 GMT
Bruce Hartweg wrote:
Larry W. Virden wrote:
I have a need to write a tool to do this:
fetch an html http URL
parse the html
Look through the A tags for some specific phrases
[...]
Others have already mentioned htmlparse or an XML parser, but if
you have invalid HTML these will puke (and there is still plenty
of bad HTML out there). I have done web scraping in the past, and
often a simple RE will work to yank all the links out:
set RE {<a.*(?!href)href=['"]([^'"]+)['"][^>]*>(.*(?!</a>))</a>}
foreach {tag href txt} [regexp -all -inline $RE $html] { [...] }
.... and then you have the other problem, namely that any
regexp you devise is likely to give wrong results on
valid HTML (there's actually quite a bit of valid HTML
out there, too ...).
Note that this isn't perfect either: if someone has a URL with
embedded quotes, this will choke and miss it (although it only
misses that particular link; it won't stop handling the rest
of the file).
There are quite a few things wrong with the above regexp, actually.
(I can see four specific problems, including the one you've
already mentioned, without even looking at it too hard; and there
are no doubt many others.)
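To make that concrete, here is a small check of the quoted regexp against two perfectly ordinary links. The sample HTML is made up for illustration, and the problem it exposes is not necessarily one of the four counted above. As I read the pattern, `(?!href)` is followed immediately by a literal `href`, so the two conditions contradict each other at the same position and I would expect no matches at all; even with the lookaheads removed, the greedy `.*` would run across the first anchor and swallow it.

```tcl
# The regexp quoted above, unchanged.
set RE {<a.*(?!href)href=['"]([^'"]+)['"][^>]*>(.*(?!</a>))</a>}

# Made-up sample input: two well-formed links on one line.
set html {<a href="one.html">One</a> <a href="two.html">Two</a>}

set matches [regexp -all -inline $RE $html]

# Each match contributes three list elements (whole match, href, text),
# so finding both links would mean six elements in the result.
puts "links found: [expr {[llength $matches] / 3}] (expected 2)"
foreach {tag href txt} $matches {
    puts "href=$href text=$txt"
}
```

Running this against a few representative pages before relying on any such pattern is a cheap way to find out how much it actually catches.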
The regexp/screen-scraping approach can be made to work reasonably
well as long as you're dealing with a known quantity --
if you only need to screen-scrape a specific set of known sites,
you can probably hack up a regexp that will handle the kind of HTML
that those particular sites happen to be producing at the time --
but if you need to handle arbitrary purported HTML fetched from
arbitrary web sites, you really need a general-purpose tag soup
parser.
The htmlparse module in tcllib and tDOM's HTML parser do a reasonably
good job on tag soup, IME. I'd still recommend using one of those
instead of regexps.
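As a rough sketch of the parser route, something like the following pulls the href and link text out of every A element using tDOM. It assumes the tdom package is installed; `extractLinks` is a hypothetical helper name and the sample HTML is invented for the example. Filtering for the "specific phrases" from the original question would be a `string match` or `regexp` on the returned text.

```tcl
package require tdom

# Hypothetical helper: return a flat list of href/text pairs, one pair
# for every <a> element in $html that carries an href attribute.
proc extractLinks {html} {
    # -html selects tDOM's forgiving HTML parser rather than the
    # strict XML parser.
    set doc [dom parse -html $html]
    set root [$doc documentElement]
    set links {}
    foreach node [$root selectNodes {//a[@href]}] {
        # asText concatenates the text of all descendant text nodes,
        # so markup nested inside the link is flattened away.
        lappend links [$node getAttribute href] [string trim [$node asText]]
    }
    # Free the DOM tree once the data has been copied out.
    $doc delete
    return $links
}

# Made-up sample page.
set html {<html><body>
<p><a href="a.html">first</a></p>
<p><a name="x">no href here</a> <a href="b.html">second <b>link</b></a></p>
</body></html>}

foreach {href txt} [extractLinks $html] {
    puts "$href -> $txt"
}
```

The XPath expression `//a[@href]` skips named anchors that have no href, which a hand-rolled pattern would have to special-case.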
--Joe English