Re: html parsing



Bart <bart_btob@xxxxxxxxx> writes:

I would like to retrieve a page from one of my favorite sites. The
section I am interested in starts with an HTML header (in this case
h2), followed by a table, with all the HTML formatting mixed in
(fonts, spans, etc.). Is there an easy way to pull out just the h2
header and convert the table so that each row becomes a Tcl list?

Here's a snippet of what I use for something similar (it relies on
tcllib's htmlparse and struct::tree modules): it looks for a table
with a known string in the first row and extracts its contents.

Hope it helps get you started!


Regards,
Ian



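package require htmlparse
package require struct::tree

# Parse the HTML in $s and collect data from the table whose first
# row contains the text "Række/pulje".
proc html2data s {
    ::struct::tree x
    ::htmlparse::2tree $s x
    # Drop purely visual tags (font, b, i, ...) from the tree.
    ::htmlparse::removeVisualFluff x

    set data [list]

    x walk root q {
        if {([x get $q type] eq "PCDATA") &&
            [string match R\u00e6kke/pulje [x get $q data]]} {

            # Climb three levels from the text node
            # (presumably td -> tr -> table).
            set p $q
            for {set i 3} {$i} {incr i -1} {set p [x parent $p]}
            # Skip the first row (the one holding the marker string).
            foreach {row} [lrange [x children $p] 1 end] {

                ......
            }
            break
        }
    }
    # Release the tree so the proc can be called more than once.
    x destroy
    return $data
}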
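
The row handling is elided in the snippet above, so here is only a rough sketch of one way a row node could be turned into a Tcl list. The helpers nodeText and rowToList are hypothetical names, not part of htmlparse or of the original code. The sketch assumes the node layout that htmlparse::2tree produces (every node has a "type" attribute, and PCDATA nodes keep their text in "data"), and it runs against a small hand-built tree instead of a real page.

```tcl
package require struct::tree

# Hypothetical helper (not from the original post): return the text
# found below node $n of tree $t, joining nested PCDATA pieces with
# single spaces. Assumes every node carries a "type" attribute.
proc nodeText {t n} {
    if {[$t get $n type] eq "PCDATA"} {
        return [$t get $n data]
    }
    set parts [list]
    foreach c [$t children $n] {
        set s [nodeText $t $c]
        if {$s ne ""} {lappend parts $s}
    }
    return [join $parts " "]
}

# Hypothetical helper: build a list with one element per cell
# (direct child) of the row node.
proc rowToList {t row} {
    set cells [list]
    foreach cell [$t children $row] {
        lappend cells [nodeText $t $cell]
    }
    return $cells
}

# Hand-built stand-in for a parsed <tr> with two <td> cells, each
# holding one PCDATA node. A real tree would come from 2tree.
::struct::tree demo
set row [demo insert root end]
demo set $row type tr
foreach txt {alpha beta} {
    set cell [demo insert $row end]
    demo set $cell type td
    set pc [demo insert $cell end]
    demo set $pc type PCDATA
    demo set $pc data $txt
}
puts [rowToList demo $row]
```

Inside the foreach loop of html2data, something along the lines of "lappend data [rowToList x $row]" might then fill the result, though that depends on what the elided part is meant to do. Because nodeText recurses through whatever sits between the cell and its text, leftover span nodes should not get in the way, and the same helper could be pointed at an h2 node to get the header text.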