Re: What kind of tcl tools would help me parse and use html info?
- From: Michael Schlenker <schlenk@xxxxxxxxxxxxxxxx>
- Date: Fri, 24 Mar 2006 13:24:08 +0100
Larry W. Virden wrote:
> I have a need to write a tool to do this:
> fetch an html http URL
> parse the html
> Look through the A tags for some specific phrases
> For each one found, check a file cache. If the URL associated with the
> tag is in the cache, see if it has been modified since it was placed
> into the cache. If not, continue.
> If it has been modified, or if it doesn't exist in the cache, then
> fetch the URL, place into the cache, and touch to make the cache copy
> have the date and time from the web site.
> For one of the specific phrases, instead of caching the file, treat it
> as the next html to parse and search.
> When one specific term is no longer found, application is finished.
> The only other possible thing for the algorithm above is that one of
> the URLs is the URL of a CGI with values. The other URLs are just
> static HTML pages.
> What are some examples using some of the Tcl tools for parsing that
> fetched file and searching the A tags for phrases?
You could use the htmlparse or tdom packages to do the html parsing, but
both of them like their html correct, so if you might have invalid html
files they can and do fail (trash in -> trash out).
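As a rough illustration of the htmlparse route, here is a minimal sketch using the callback interface of tcllib's htmlparse. The proc names collectLink and linksFromHtml and the variable ::linkscan::links are made up for this example, and the href regexp is a simplification that will miss unusual attribute quoting.

```tcl
package require htmlparse

namespace eval ::linkscan { variable links {} }

# Hypothetical callback: htmlparse calls it once per tag with the tag name,
# a "/" for closing tags, the attribute string, and the text that follows
# the tag up to the next tag.
proc ::linkscan::collectLink {tag slash param text} {
    variable links
    if {[string tolower $tag] ne "a" || $slash ne ""} { return }
    if {[regexp -nocase {href\s*=\s*["']?([^"'\s>]+)} $param -> href]} {
        # Only the text up to the next tag is seen here, so link text that
        # contains nested markup is truncated.
        lappend links $href [string trim [::htmlparse::mapEscapes $text]]
    }
}

# Hypothetical helper: return a flat list {href text href text ...}.
proc ::linkscan::linksFromHtml {html} {
    variable links
    set links {}
    ::htmlparse::parse -cmd ::linkscan::collectLink $html
    return $links
}
```

Calling ::linkscan::linksFromHtml on the fetched page body gives pairs you can then filter for the phrases you care about.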
The tdom page on the wiki has an example of a tdom script that fetches
a URL and extracts all links, which would probably be a good start. Using the
htmlparse module from tcllib would work too.
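This is not the wiki script, just a small sketch of the same idea with tdom: parse the page with the HTML parser, select the A elements with XPath, and keep those whose text contains one of the phrases. The proc name matchingLinks is invented for the example, and the phrase match is assumed to be a case-insensitive substring test.

```tcl
package require tdom

# Hypothetical helper: return a flat list {href text href text ...} for every
# <a href=...> whose link text contains one of the given phrases.
proc matchingLinks {html phrases} {
    set doc [dom parse -html $html]
    set result {}
    foreach a [[$doc documentElement] selectNodes {//a[@href]}] {
        set text [string trim [$a asText]]
        foreach phrase $phrases {
            # Plain substring search, so glob characters in a phrase are literal.
            if {[string first [string tolower $phrase] [string tolower $text]] >= 0} {
                lappend result [$a getAttribute href ""] $text
                break
            }
        }
    }
    $doc delete
    return $result
}
```

Wrapping the call in catch lets the tool skip pages that the parser rejects instead of aborting.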
The rest sounds like a job for http::geturl; the -command and
probably the -channel options should work quite well for it.
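For the cache part, a minimal sketch might look like the following. It assumes the server sends a Last-Modified header in the usual "Fri, 24 Mar 2006 12:24:08 GMT" form, and the proc names parseHttpDate, needsRefresh and refreshCache are invented here. It uses a synchronous HEAD request (-validate 1) followed by a download straight into the cache file (-channel); adding -command would make the transfer asynchronous.

```tcl
package require http

# Hypothetical helper: convert an HTTP date string into epoch seconds.
proc parseHttpDate {str} {
    return [clock scan $str -format {%a, %d %b %Y %H:%M:%S GMT} -gmt 1]
}

# Hypothetical helper: 1 if the cache copy is missing, the server gave no
# Last-Modified value, or the server copy is newer than the cache copy.
proc needsRefresh {cacheFile lastModified} {
    if {![file exists $cacheFile] || $lastModified eq ""} { return 1 }
    return [expr {[parseHttpDate $lastModified] > [file mtime $cacheFile]}]
}

# Hypothetical helper: fetch url into cacheFile when needed, then "touch"
# the file so its mtime matches the server's timestamp. Returns 1 if fetched.
proc refreshCache {url cacheFile} {
    set tok [http::geturl $url -validate 1]   ;# HEAD request, headers only
    set lastMod ""
    foreach {name value} [http::meta $tok] {
        if {[string equal -nocase $name "Last-Modified"]} { set lastMod $value }
    }
    http::cleanup $tok
    if {![needsRefresh $cacheFile $lastMod]} { return 0 }

    set out [open $cacheFile w]
    fconfigure $out -translation binary
    set tok [http::geturl $url -channel $out]  ;# body is written to the file
    http::cleanup $tok
    close $out
    if {$lastMod ne ""} { file mtime $cacheFile [parseHttpDate $lastMod] }
    return 1
}
```

The sketch does not check the HTTP status code, so a real tool should inspect http::ncode before overwriting a good cache copy.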
Michael