Re: html parsing
- From: Ian <no@xxxxxxxxxxxxx>
- Date: Wed, 31 Oct 2007 22:13:52 +0100
Bart <bart_btob@xxxxxxxxx> writes:
> I would like to retrieve a page from one of my favorite sites. The
> section I am interested in starts with a html header (in this case
> h2), followed by a table, with all the html formatting mixed in
> (fonts, spans, etc.). Is there an easy way to pull out just the h2
> header and convert the table so each row becomes a tcl list?
Here's a snippet of what I'm using to do something similar,
looking for a table with a known string in the first row and
extracting its contents.
Hope it helps get you started!
Regards,
Ian
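package require htmlparse
package require struct   ;# for ::struct::tree

proc html2data s {
    # Parse the HTML into a tree and drop purely visual tags.
    ::struct::tree x
    ::htmlparse::2tree $s x
    ::htmlparse::removeVisualFluff x
    set data [list]
    x walk root q {
        # Look for the text node that holds the known marker string.
        if {([x get $q type] eq "PCDATA") &&
            [string match R\u00e6kke/pulje [x get $q data]]} {
            # Climb three levels up from the text node (cell, row, table).
            set p $q
            for {set i 3} {$i} {incr i -1} {set p [x parent $p]}
            # Skip the first row, which holds the marker string.
            foreach {row} [lrange [x children $p] 1 end] {
                ......
            }
            break
        }
    }
    x destroy   ;# free the tree command so the proc can be called again
    return $data
}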
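
The body of the row loop is elided above. What follows is only a minimal sketch of one way a row node could be turned into a Tcl list, not the code from the original script. It assumes the tree was built by ::htmlparse::2tree, so text nodes have type PCDATA and carry their text in the data attribute. The helper name rowToList and the sample HTML are made up for illustration.

```tcl
package require htmlparse
package require struct::tree

# Hypothetical helper: return the text of each cell below one row node
# as a Tcl list, one element per cell.
proc rowToList {tree row} {
    set cells [list]
    foreach cell [$tree children $row] {
        set text [list]
        # Visit the cell and everything below it, keeping only text nodes.
        $tree walk $cell n {
            if {[$tree get $n type] eq "PCDATA"} {
                lappend text [$tree get $n data]
            }
        }
        lappend cells [join $text " "]
    }
    return $cells
}

# Made-up sample input, for illustration only.
set html {<table><tr><th>Name</th><th>Score</th></tr><tr><td><font size="2">Ann</font></td><td>3</td></tr></table>}

::struct::tree t
::htmlparse::2tree $html t
::htmlparse::removeVisualFluff t

# Print every table row in the document as a list.
t walk root n {
    if {[t get $n type] eq "tr"} {
        puts [rowToList t $n]
    }
}
t destroy
```

Inside html2data, the same idea would amount to something like lappend data [rowToList x $row] in the loop body. Walking each cell, rather than reading only its direct children, keeps the text even when spans or other tags remain nested in the cell.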