Re: What kind of tcl tools would help me parse and use html info?
- From: Bruce Hartweg <bruce-news@xxxxxxxxxx>
- Date: Fri, 24 Mar 2006 08:04:35 -0600
Larry W. Virden wrote:
> I have a need to write a tool to do this:
> fetch an html http URL
> parse the html
> Look through the A tags for some specific phrases
> For each one found, check a file cache. If the URL associated with the
> tag is in the cache, see if it has been modified since it was placed
> into the cache. If not, continue.
> If it has been modified, or if it doesn't exist in the cache, then
> fetch the URL, place into the cache, and touch to make the cache copy
> have the date and time from the web site.
> For one of the specific phrases, instead of caching the file, treat it
> as the next html to parse and search.
> When one specific term is no longer found, application is finished.
> The only other possible thing for the algorithm above is that one of
> the URLs is the URL of a CGI with values. The other URLs are just
> static HTML pages.
> What are some examples using some of the Tcl tools for parsing that
> fetched file and searching the A tags for phrases?

Others have already mentioned htmlparse or an xml parser, but if
you have invalid html these will puke (and there is still plenty
of bad html out there). I have done web scraping in the past, and
often a simple RE will work to yank all the links out:
set RE {<a\s[^>]*?href=['"]([^'"]+?)['"][^>]*?>(.*?)</a>}
foreach {tag href txt} [regexp -all -inline -nocase -- $RE $html] {
    # tag is the whole <a ...>...</a> element, href its URL, txt the link text
    puts "$href\t$txt"
}
Note that this isn't perfect either: if someone has a URL with
embedded quotes, the RE will choke on it and miss that link
(although it only misses that particular link; it won't stop
handling the rest of the file).
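
To tie this back to the "search the A tags for phrases" part of the question, here is a minimal sketch of how the RE approach could be wrapped up. The proc names extractLinks and linksWithPhrase and the sample HTML are made up for illustration, and the phrase test is a plain case-insensitive substring match, which may or may not be what you need.

```tcl
# Hypothetical helper: return a flat list {href text href text ...}
# for every <a ... href="...">...</a> element found in $html.
proc extractLinks {html} {
    # Every quantifier is non-greedy, so each match stops at the first </a>.
    set re {<a\s[^>]*?href=['"]([^'"]+?)['"][^>]*?>(.*?)</a>}
    set links {}
    foreach {tag href txt} [regexp -all -inline -nocase -- $re $html] {
        lappend links $href [string trim $txt]
    }
    return $links
}

# Hypothetical helper: return the hrefs whose link text contains $phrase
# (case-insensitive substring match).
proc linksWithPhrase {html phrase} {
    set needle [string tolower $phrase]
    set hits {}
    foreach {href txt} [extractLinks $html] {
        if {[string first $needle [string tolower $txt]] >= 0} {
            lappend hits $href
        }
    }
    return $hits
}

# Sample input, invented for this example.
set html {<p><a href="notes.html">Release notes</a>
<a class="nav" href='next.html'>Next page</a></p>}
puts [linksWithPhrase $html "next"]
```

With the sample input this should print next.html. The hrefs come back exactly as written in the page, so relative URLs would still need resolving against the page's own URL before fetching.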
Bruce