Re: What kind of tcl tools would help me parse and use html info?





Larry W. Virden wrote:
I have a need to write a tool to do this:

1. Fetch an html http URL.
2. Parse the html.
3. Look through the A tags for some specific phrases.
4. For each one found, check a file cache. If the URL associated with the
   tag is in the cache, see if it has been modified since it was placed
   into the cache. If not, continue.
5. If it has been modified, or if it doesn't exist in the cache, then
   fetch the URL, place it into the cache, and touch it to make the cache
   copy have the date and time from the web site.
6. For one of the specific phrases, instead of caching the file, treat it
   as the next html to parse and search.
7. When one specific term is no longer found, the application is finished.
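
The cache steps above (check, fetch if stale, touch) could be sketched roughly as follows with the http package that ships with Tcl. This is only a sketch under assumptions: the proc names (cachePath, parseHttpDate, needsFetch, refreshCache) and the flat cache-file naming scheme are made up for illustration, error handling and HTTP status checks are left out, and a URL that sends no Last-Modified header (a CGI, for instance) is simply treated as always modified.

```tcl
package require http

# Sketch only: all proc names and the cache layout are illustrative
# assumptions, not part of any Tcl package.

# Map a URL to a file name inside $cacheDir (hypothetical naming scheme).
proc cachePath {cacheDir url} {
    return [file join $cacheDir [string map {/ _ : _ ? _ & _ = _} $url]]
}

# Convert an HTTP date such as "Sun, 06 Nov 1994 08:49:37 GMT"
# to seconds since the epoch.
proc parseHttpDate {date} {
    return [clock scan $date -format {%a, %d %b %Y %H:%M:%S GMT} -gmt 1]
}

# True if the cached copy is missing or older than the remote timestamp.
proc needsFetch {cacheFile remoteMtime} {
    if {![file exists $cacheFile]} { return 1 }
    return [expr {[file mtime $cacheFile] < $remoteMtime}]
}

# Fetch $url into the cache if needed; returns 1 if fetched, 0 otherwise.
proc refreshCache {cacheDir url} {
    # -validate 1 issues a HEAD request, so only the headers come back.
    set tok [http::geturl $url -validate 1]
    set remote [clock seconds]   ;# fallback when there is no Last-Modified
    foreach {name value} [http::meta $tok] {
        if {[string equal -nocase $name Last-Modified]} {
            set remote [parseHttpDate $value]
        }
    }
    http::cleanup $tok

    set cacheFile [cachePath $cacheDir $url]
    if {![needsFetch $cacheFile $remote]} { return 0 }

    # Stale or missing: fetch the body and store it unmodified.
    set tok [http::geturl $url -binary 1]
    set out [open $cacheFile wb]
    puts -nonewline $out [http::data $tok]
    close $out
    http::cleanup $tok

    # "touch" the cache copy so it carries the server's timestamp.
    file mtime $cacheFile $remote
    return 1
}
```

A call such as `refreshCache /tmp/cache http://example.com/page.html` (both arguments are placeholders) would fetch the page only when the cached copy is missing or older than the server's copy. Keeping needsFetch separate from the network code makes the staleness rule easy to test on its own.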

The only other wrinkle in the algorithm above is that one of the URLs
points to a CGI that takes values. The other URLs are just static HTML
pages.

What are some examples using some of the Tcl tools for parsing that
fetched file and searching the A tags for phrases?

Others have already mentioned htmlparse or an XML parser, but if
you have invalid HTML these will puke (and there is still plenty
of bad HTML out there). I have done web scraping in the past, and
often a simple RE will work to yank all the links out:

set RE {<a\s[^>]*?href=['"]([^'"]+)['"][^>]*>(.*?)</a>}

foreach {tag href txt} [regexp -all -inline -nocase $RE $html] {
    # $href is the link target, $txt is the text between <a ...> and </a>
}

Note that this isn't perfect either: if someone has a URL with
embedded quotes, this will miss it (although it only misses
that particular link; it won't stop handling the rest of the
file).
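
For what it's worth, here is a small self-contained sketch of that approach, including the "search the A tags for phrases" part of the question. The proc names extractLinks and matchingLinks are made up for illustration, and the pattern is a non-greedy variant that assumes quoted href values, so unquoted attributes and URLs with embedded quotes are still missed.

```tcl
# Sketch only: extractLinks and matchingLinks are illustrative names,
# not part of any Tcl package.

# Return a flat list {href text href text ...} for every anchor with a
# quoted href found in $html.
proc extractLinks {html} {
    # Non-greedy, so each match stops at the first closing </a>.
    # The \s after "<a" keeps tags such as <abbr> from matching.
    set re {<a\s[^>]*?href=['"]([^'"]+)['"][^>]*>(.*?)</a>}
    set links {}
    foreach {tag href txt} [regexp -all -inline -nocase -- $re $html] {
        # Collapse runs of whitespace (including newlines) in the link text.
        regsub -all {\s+} $txt { } txt
        lappend links $href [string trim $txt]
    }
    return $links
}

# Return the hrefs whose link text contains $phrase (case-insensitive).
proc matchingLinks {html phrase} {
    set hits {}
    set phrase [string tolower $phrase]
    foreach {href txt} [extractLinks $html] {
        if {[string first $phrase [string tolower $txt]] >= 0} {
            lappend hits $href
        }
    }
    return $hits
}

set html {
<p><a href="a.html">First page</a> and
<A class="nav" HREF='next.cgi?page=2'>Next
page</A></p>
}
puts [matchingLinks $html "next page"]
```

With the sample text above this should print next.cgi?page=2. Relative URLs come back exactly as written in the page, so they would still need to be resolved against the page's own URL before fetching.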

Bruce